Apache Cassandra Learning Step by Step (4): Data Modeling
22 Feb 2012, by Bright Zheng (IT进行时)写在这章前面的几点牢骚或感慨:
1. 我发现建模是比较别扭的一件事情,尤其是你的脑子里都是RDBMS的ERD的时候;
2. 本人试图通过两者的建模过程体现思考要点,但感觉在NoSQL的建模上有点“那个”——如果不在大型项目上吃亏过或者直接受教于前辈,总感觉缺那么点味道;
3. 这篇是我写的最郁闷的一篇,而且可能后面需要无数个补丁,但管不了了,有错误才有感悟
5. Data Modeling
DataModeling is one of the most important things in experiencing Cassandra,especially to those who have lots of experiences with RDBMS data modeling.
Byadmiring Twissandra project,we name it as Jtwissandra as an example. If possible, I’ll try to create andimplement it and share it in GitHub.
Thisis a simple example to showcase the NoSQL concepts by admiring the Twitter viaCassandra.
5.1. Tranditional RDBMS Data Modeling
Followingare the core Entities & Relationships if we’re modeling in RDBMS concepts.
Hereare some pseudo codes for demonstrating the business logic/requirements:
1. Adding a new user:
USER.insert(user_id, user_name, user_password,create_timestamp);
2. Following a friend:
FRIEND.insert(user_id, followed_id, create_timestamp)
as ($current_user_id,user_id, create_timestamp);
FOLLOWER.insert(user_id, follower_id,create_timestamp)
as(user_id, $current_user_id, create_timestamp);
3. Tweetting:
FRIEND.insert(user_id, followed_id,create_timestamp)
as ($current_user_id, user_id,create_timestamp);
FOLLOWER.insert(user_id, follower_id,create_timestamp)
as(user_id, $current_user_id, create_timestamp);
4. Getting Tweets (that are twitted byself and friends):
select * from TWEET t
where
t.user_id= $current_user_id
ort.user_id in (
selectfollowed_id from FRIEND
whereuser_id = $current_user_id
)
Comment:: What a bottleneck is here!! That’s also the mostimportant reason why Twitter has to migrate to NoSQL solutions.
5.2. NoSQL Data Modeling
Beforewe go deeper of NoSQL data modeling with Cassandre, we must understand the keydesign points of it.
1. Cassandra is akey-value based model
2. Cassandrasupports more complex modeling by importing the concept of Super Column
3. The data can bestored in two ways: as column names or as values (it’s really confusing for thebeginners sometimes, but you will be free if you understand more especially onthe indexing)
4. The Columns,normal Columns or Super ones, in the Column Family is sorted by Column Names,not values
Solet’s get started.
Weneed to create the Keyspace first.
create keyspace JTWISSANDRA
with placement_strategy ='org.apache.cassandra.locator.SimpleStrategy'
and strategy_options = [{replication_factor:1}];
Underthis Keyspace, we’ll be working on the data modeling one by one.
5.2.1. User
Thekey points should be under consideration:
- The key we can simplyuse Time UUID
- The user_namemust be (secondary) indexed because we may use it for search
- Thecreate_timestamp should be (secondary) indexed because we may use it for searchor some kinds of partitioning
- The passwordmust be encoded as base64. No more CSDN story please.
Sothe sample data model will be as following:
ColumnFamily: USER
Key
Columns
550e8400-e29b-41d4-a716-446655440000
name
value
“user_name”
“itstarting”
“password”
"******"
“create_timestamp”
1329836819890000
550e8400-e29b-41d4-a716-446655440001
name
value
“user_name”
“test1”
“password”
"******"
“create_timestamp”
1329836819890001
Hereis the create script:
create column family USER
with comparator = UTF8Type
and key_validation_class = UTF8Type
and default_validation_class = UTF8Type
and column_metadata = [
{column_name: user_name, validation_class: UTF8Type,
index_name:user_name_idx, index_type:KEYS }
{column_name: user_password, validation_class:UTF8Type}
{column_name: create_timestamp, validation_class: LongType,
index_name:create_timestamp_idx, index_type:KEYS}
];
Andthe insert script/CLI for showcase only:
// insert user 550e8400-e29b-41d4-a716-446655440000
set USER[‘550e8400-e29b-41d4-a716-446655440000’][‘user_name’] = ‘itstarting’;
set USER[‘550e8400-e29b-41d4-a716-446655440000’][‘password’] = ‘111222’;
set USER[‘550e8400-e29b-41d4-a716-446655440000’][‘create_timestamp’] = 1329836819890000;
// insert user 550e8400-e29b-41d4-a716-446655440001
set USER[‘550e8400-e29b-41d4-a716-446655440001’][‘user_name’] = ‘test1;
set USER[‘550e8400-e29b-41d4-a716-446655440001’][‘password’] = ‘222111’;
set USER[‘550e8400-e29b-41d4-a716-446655440001’][‘create_timestamp’] = 1329836819890001;
5.2.2. Friend
Thefriends mean: who are the user X following?
Thekey points should be under consideration:
- The key shouldbe the uuid of the user X
- The timestampwhen the relationship is built is the column (for friend sorting) and thefriend’s uuid is the value. Wow again here. Right?
Let’ssay the two users we created are friends each other.
Sothe sample data model will be as following:
ColumnFamily: FRIEND
Key
Columns
550e8400-e29b-41d4-a716-446655440000
name
value
“1329836819859000”
“550e8400-e29b-41d4-a716-446655440001”
If the guy has more friends,insert colums here
550e8400-e29b-41d4-a716-446655440001
name
value
“1329836819781000”
“550e8400-e29b-41d4-a716-446655440000”
If the guy has more friends,insert colums here
Thefirst record means the user X is 550e8400-e29b-41d4-a716-446655440000 andhis/her friend is 550e8400-e29b-41d4-a716-446655440001 and the relationship isestablished at timestamp of 1329836819859000.
Hereis the create script:
create column family FRIEND
with comparator = UTF8Type LongType
and key_validation_class = UTF8Type
and default_validation_class = UTF8Type;
Nomore column name definitions here? Yes, Cassandra is a so-called schema-freedata store. Wow!
Andthe insert script/CLI for showcase only:
set FRIEND[‘550e8400-e29b-41d4-a716-446655440000’][‘1329836819859000’]
= ‘550e8400-e29b-41d4-a716-446655440001;
set FRIEND[‘550e8400-e29b-41d4-a716-446655440001’][‘1329836819781000’]
= ‘550e8400-e29b-41d4-a716-446655440000;
5.2.3. Follower
TheFollower is a reversed concept compared to Friend: Who are following user X?
Thekey points should be under consideration:
- The key shouldbe the uuid of the user X
- The timestampwhen the relationship is built is the column (for follower sorting) and the follower’suser uuid is the value.
Actuallythe logic should be within the same transaction of friend creation. So we’dlike to follow the sample in Friend chapter.
Sothe sample data model will be as following:
ColumnFamily: FOLLOWER
Key
Columns
550e8400-e29b-41d4-a716-446655440001
name
value
“1329836819859000”
“550e8400-e29b-41d4-a716-446655440000”
If the guy has more friends,insert colums here
550e8400-e29b-41d4-a716-446655440000
name
value
“1329836819781000”
“550e8400-e29b-41d4-a716-446655440001”
If the guy has more friends,insert colums here
Hereis the create script:
create column family FOLLOWER
with comparator = UTF8Type LongType
and key_validation_class = UTF8Type
and default_validation_class = UTF8Type;
Andthe insert script/CLI for showcase only:
set FOLLOWER[‘550e8400-e29b-41d4-a716-446655440001’][‘1329836819859000’’]
= ‘550e8400-e29b-41d4-a716-446655440000;
set FOLLOWER[‘550e8400-e29b-41d4-a716-446655440000’][‘329836819781000’]
= ‘550e8400-e29b-41d4-a716-446655440001;
5.2.4. Tweet &Timeline
Thetweets are the soul of Twitter.
Thekey points should be under consideration:
- How to get mytweets?
- How to get myfriends’ tweets without join?
- How to sort alltweets including mine and my friends’.
That’swhy Twitter imported the concept of Timeline.
Let’simagine something like this (please correct me if I’mwrong on following discussions):
<!--
Copied from: http://user.services.openoffice.org/en/forum/viewtopic.php?f=9&t=32185
Allthe events (tweets) are going along the time.
TheTimeline means the line with the specified user’s all related events including
- The events(tweets) I sent
- The events(tweets) my friends sent
Sothe tweets are inserted to CF of Tweet but need to add one more CF: Timeline.
CAUTION: The following learning experiences/exercises might be notcorrect, please take your own risks if you still want to read on. But ofcourse, any feedback is welcome.
5.2.4.1. Tweet
ColumnFamily: TWEET
Key
Columns
550e8400-e29b-41d4-a716-446655440011
name
value
“user_uuid”
“550e8400-e29b-41d4-a716-446655440000”
“tweet_content”
“Hello world: 11”
550e8400-e29b-41d4-a716-446655440012
name
value
“user_uuid”
“550e8400-e29b-41d4-a716-446655440000”
“tweet_content”
“Hello world: 12”
550e8400-e29b-41d4-a716-446655440021
name
value
“user_uuid”
“550e8400-e29b-41d4-a716-446655440001”
“tweet_content”
“Hello world: 21”
550e8400-e29b-41d4-a716-446655440022
name
value
“user_uuid”
“550e8400-e29b-41d4-a716-446655440001”
“tweet_content”
“Hello world: 22”
Hereis the create script:
create column family TWEET
withcomparator = UTF8Type
andkey_validation_class = UTF8Type
anddefault_validation_class = UTF8Type
andcolumn_metadata = [
{column_name:user_uuid, validation_class: UTF8Type}
{column_name:tweet_content, validation_class: UTF8Type}
];
Andthe insert script/CLI for showcase only:
setTWEET['550e8400-e29b-41d4-a716-446655440011']['user_uuid']
='550e8400-e29b-41d4-a716-446655440000';
setTWEET['550e8400-e29b-41d4-a716-446655440011']['tweet_content'] = 'Helloworld: 11';
setTWEET['550e8400-e29b-41d4-a716-446655440012']['user_uuid']
='550e8400-e29b-41d4-a716-446655440000';
setTWEET['550e8400-e29b-41d4-a716-446655440012']['tweet_content'] = 'Helloworld: 12';
setTWEET['550e8400-e29b-41d4-a716-446655440021']['user_uuid']
='550e8400-e29b-41d4-a716-446655440001';
setTWEET['550e8400-e29b-41d4-a716-446655440021']['tweet_content'] = 'Helloworld: 21';
setTWEET['550e8400-e29b-41d4-a716-446655440022']['user_uuid']
='550e8400-e29b-41d4-a716-446655440001';
set TWEET['550e8400-e29b-41d4-a716-446655440022']['tweet_content']= 'Hello world: 22';
5.2.4.2. Timeline
ColumnFamily: TIMELINE
Key
Columns
550e8400-e29b-41d4-a716-446655440000
name
value
“1329883039824000”
“550e8400-e29b-41d4-a716-446655440011”
“1329883039825000”
“550e8400-e29b-41d4-a716-446655440021”
“1329883039934000”
“550e8400-e29b-41d4-a716-446655440012”
“1329883039935000”
“550e8400-e29b-41d4-a716-446655440022”
550e8400-e29b-41d4-a716-446655440001
name
value
“1329883039824000”
“550e8400-e29b-41d4-a716-446655440011”
“1329883039825000”
“550e8400-e29b-41d4-a716-446655440021”
“1329883039934000”
“550e8400-e29b-41d4-a716-446655440012”
“1329883039935000”
“550e8400-e29b-41d4-a716-446655440022”
Hereis the create script:
create column family TIMELINE
with comparator = UTF8Type LongType
and key_validation_class = UTF8Type
and default_validation_class = UTF8Type;
Andthe insert script/CLI for showcase only:
set TIMELINE['550e8400-e29b-41d4-a716-446655440000']['1329883039824000']
= '550e8400-e29b-41d4-a716-446655440011';
set TIMELINE['550e8400-e29b-41d4-a716-446655440000']['1329883039825000']
='550e8400-e29b-41d4-a716-446655440021';
set TIMELINE['550e8400-e29b-41d4-a716-446655440000']['1329883039834000']
='550e8400-e29b-41d4-a716-446655440012';
set TIMELINE['550e8400-e29b-41d4-a716-446655440000']['1329883039835000']
='550e8400-e29b-41d4-a716-446655440022';
set TIMELINE['550e8400-e29b-41d4-a716-446655440001']['1329883039824000']
='550e8400-e29b-41d4-a716-446655440011';
set TIMELINE['550e8400-e29b-41d4-a716-446655440001']['1329883039825000']
='550e8400-e29b-41d4-a716-446655440021';
set TIMELINE['550e8400-e29b-41d4-a716-446655440001']['1329883039834000']
='550e8400-e29b-41d4-a716-446655440012';
set TIMELINE['550e8400-e29b-41d4-a716-446655440001']['1329883039835000']
='550e8400-e29b-41d4-a716-446655440022';
页:
[1]