Cassandra NoSQL - optimal data structure for domains and pages


I am starting out with NoSQL and have watched some great tutorials and explanations, such as https://www.youtube.com/watch?v=tg6eIht-00M. However, I am still thinking in a relational way, which is why I am asking for your help.

I have the following simple relational model, which stores domains and their pages and keeps a history of page title and description updates.

CREATE TABLE domain (
  id bigint(20) NOT NULL AUTO_INCREMENT,
  name TEXT,
  suffix TEXT,
  PRIMARY KEY (id)
) ENGINE=InnoDB;

CREATE TABLE page (
  id bigint(20) NOT NULL AUTO_INCREMENT,
  domainid bigint(20),
  url TEXT,
  PRIMARY KEY (id),
  FOREIGN KEY (domainid) REFERENCES domain(id)
) ENGINE=InnoDB;

CREATE TABLE page_update (
  id bigint(20) NOT NULL AUTO_INCREMENT,
  pageid bigint(20),
  updated TIMESTAMP,
  title TEXT,
  descr TEXT,
  PRIMARY KEY (id),
  FOREIGN KEY (pageid) REFERENCES page(id)
) ENGINE=InnoDB;

I want to transfer this model into CQL:

As I understand it, I should create a denormalized table page, partition it by domain suffix (.com, .net, .de, ...) and name, and use the update time as the clustering column.

CREATE TABLE page (
  domain_name text,
  domain_suffix text,
  page_url text,
  page_title text,
  page_descr text,
  page_updated timestamp,      
  PRIMARY KEY ((domain_suffix, domain_name), page_updated)
);

Nevertheless, I am not sure if this is optimal, because

  • I have to store the domain name, suffix and page URL with every update, so there will be a lot of duplication.
  • Each update writes a whole row of data instead of only the part that changed.
  • I want to reference domains or pages from other tables. For instance, I have another table, domain_technlogy, which assigns web technologies to domains.

What would an optimal structure look like?

Answer

With Cassandra, the right way to model is to think of the SELECT queries first and then build tables that support those queries.

What queries will you need?
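As a rough illustration of that query-first approach: suppose one of your queries were "show the update history of a single page, newest first". That query is an assumption, since the question does not list any, and the table name page_update_by_url is hypothetical. A table shaped for it might look like this sketch:

```cql
-- Hypothetical table for one assumed query:
-- "all updates of a given page within a domain, newest first".
CREATE TABLE page_update_by_url (
  domain_suffix text,
  domain_name text,
  page_url text,
  page_updated timestamp,
  page_title text,
  page_descr text,
  -- Partition by domain; cluster by URL, then by update time.
  PRIMARY KEY ((domain_suffix, domain_name), page_url, page_updated)
) WITH CLUSTERING ORDER BY (page_url ASC, page_updated DESC);

-- The query the table is shaped for (example values are made up).
SELECT page_updated, page_title, page_descr
FROM page_update_by_url
WHERE domain_suffix = 'com'
  AND domain_name = 'example'
  AND page_url = '/index.html';

-- Only the most recent update of that page.
SELECT page_updated, page_title, page_descr
FROM page_update_by_url
WHERE domain_suffix = 'com'
  AND domain_name = 'example'
  AND page_url = '/index.html'
LIMIT 1;
```

Putting page_url in the clustering key keeps every page of a domain in one partition while still letting you address a single page. Whether that is the right trade-off depends on the queries you actually need.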

The schema you've provided:

CREATE TABLE page (
  domain_name text,
  domain_suffix text,
  page_url text,
  page_title text,
  page_descr text,
  page_updated timestamp,      
  PRIMARY KEY ((domain_suffix, domain_name), page_updated)
);

will allow you to look up url+title+descr when you know the suffix and domain name, and return either a single record if you also know page_updated, or ALL recorded updates ordered by page_updated. Is that what you need to do? Is that all you need to do? What do you use url for?
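To make that concrete, here is a sketch of the lookups the page table from the question supports. The domain and timestamp values are made-up examples.

```cql
-- All recorded updates for one domain, ordered by page_updated
-- (ascending, which is the default clustering order).
SELECT page_url, page_title, page_descr, page_updated
FROM page
WHERE domain_suffix = 'com' AND domain_name = 'example';

-- A single record, when page_updated is known exactly.
SELECT page_url, page_title, page_descr
FROM page
WHERE domain_suffix = 'com' AND domain_name = 'example'
  AND page_updated = '2015-06-01 12:00:00+0000';

-- A time range within the domain's partition.
SELECT page_url, page_title, page_descr, page_updated
FROM page
WHERE domain_suffix = 'com' AND domain_name = 'example'
  AND page_updated >= '2015-01-01 00:00:00+0000'
  AND page_updated <  '2016-01-01 00:00:00+0000';

-- Restricting on page_url is rejected as written, because page_url is
-- not part of the primary key. It would need ALLOW FILTERING or a
-- secondary index:
-- SELECT * FROM page
-- WHERE domain_suffix = 'com' AND domain_name = 'example'
--   AND page_url = '/index.html';
```

Note that because page_url is not part of the primary key, two pages of the same domain written with the same page_updated value share one primary key, so the later write overwrites the earlier one.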

source: stackoverflow.com