Auto Complete with Redis
http://oldblog.antirez.com/post/autocomplete-with-redis.htmlHello! This isn't just a blog post, it is actually the Redis weekly update number eight, but we can't tell this to Google: naming posts like Redis Weekly update #... will not play well with Google indexing, so I'm starting to use blog post titles more predictive of what people will actually type into a search engine, in this specific case autocomplete redis, probably.
Too bad that I'm forced to play the game of the SE0 L00z3r, but it's for a good reason. I guess this will also work well when articles are submitted in places like Hacker News, Programming Reddit and so forth. Readers will have a simpler way to tell if they want to actually click or not on the link. Ok... end of the off topic.
I hope the topic of this post will be very interesting for many Redis users, and for many non Redis users that experimented the pain of modeling auto complete for web or mobile applications using a database that is not designed to make this a simple task.
Auto complete is a cool and useful feature for users, but it can be non trivial to implement it with good performances when the set of completions is big.
Simple auto complete
In the simplest auto completion schema you have:
[*]A set of N words, that are in some way notable words that you want to auto complete. This can be for instance the names of the biggest cities in the world in a subscription form, or the most frequent search terms in a search service. We'll try to complete a list of female names. You can find the raw file name here: female-names.txt.
[*]A prefix string, that is, what your user is writing in the input field.
Given the prefix, you want to extract a few words that start with the specified prefix.
But in what order? In many applications (like the list of cities in a web form) this order can just be lexicographic, and this is more simple and memory efficient to model in Redis, so we'll start with it, trying to evolve our approach to reach more interesting results.
So for instance if the prefix is mar, what I want to see in the completion list is mara, marabel, marcela, and so forth.
Since we want a scalable schema, we also require that this entries are returned in O(log(N)) time in the average case. A O(N) approach is not viable for us. We'll see the space complexity, that is another very important meter, in a moment.
A little known feature of sorted sets
The order of elements into a sorted set is, of course, given by the score. However this is not the only ordering parameter since sorted set elements may have the same score. Elements with the same score are sorted lexicographically, as you can see in this example:
redis> zadd zset 0 foo
(integer) 1
redis> zadd zset 0 bar
(integer) 1
redis> zadd zset 0 x
(integer) 1
redis> zadd zset 0 z
(integer) 1
redis> zadd zset 0 a
(integer) 1
redis> zrange zset 0 -1
1. "a"
2. "bar"
3. "foo"
4. "x"
5. "z"
This is a very important feature of sorted sets, for instance it guarantees O(log(N)) operations even when large clusters of elements have the same score. And this will be the foundation for our code completion strategy.
Well... that's not enough, but fortunately there is another cool feature of sorted sets, that is the ZRANK command. With this command given an element you can know the position (that is, the index) of an element in the sorted set. So what should I do if I want to know what words follow "bar" in my sorted set? (note that indexes are zero based)
redis> zrank zset bar
(integer) 1
redis> zrange zset 2 -1
1. "foo"
2. "x"
3. "z"
Hey that seems pretty close to what we want in our first completion strategy... but we need a final trick.
Completion in lexicographical ordering
So what we are able to do is looking up a word in the sorted set and obtain a list of words lexicographically following this word. This is a bit different from what we actually need, as we want to complete words just having a prefix of the word (or sentence, and in general any string).
We can have all this paying a price, that is: more space.
What we do is the following: instead of adding only full words to the sorted set, we add also all the possible prefixes of any given word. We also need a way to distinguish actual words from prefixes. So this is what we do:
[*]For every word, like "bar", we add all the prefixes: "b", "ba", "bar".
[*]For every word we finally add the word itself but with a trailing "*" character,
so that we can tell this is an actual word (of course you can use a different marker, even binary to avoid collisions). So we also add "bar*". If I add the word "foo", "bar", and "foobar" using the above rules, this is what I obtain:
redis> zrange zset 0 -1
1. "b"
2. "ba"
3. "bar"
4. "bar*"
5. "f"
6. "fo"
7. "foo"
8. "foo*"
9. "foob"
10. "fooba"
11. "foobar"
12. "foobar*"
Now we are very close to perform the completion!
If the user is writing "fo" we do the following:
redis> zrank zset fo
(integer) 5
redis> zrange zset 6 -1
1. "foo"
2. "foo*"
3. "foob"
4. "fooba"
5. "foobar"
6. "foobar*"
We just need to get all the items marked with the trailing "*', so we'll be able to show "foo" and "foobar" as possible completions.
Let's try it with a bigger dictionary
We'll use a list of 4960 female names, female-names.txt and a Ruby script to check how well our approach is working.
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051
# compl1.rb - Redis autocomplete example
# download female-names.txt from http://antirez.com/misc/female-names.txt
require 'rubygems'
require 'redis'
r = Redis.new
# Create the completion sorted set
if !r.exists(:compl)
puts "Loading entries in the Redis DB\n"
File.new('female-names.txt').each_line{|n|
n.strip!
(1..(n.length)).each{|l|
prefix = n
r.zadd(:compl,0,prefix)
}
r.zadd(:compl,0,n+"*")
}
else
puts "NOT loading entries, there is already a 'compl' key\n"
end
# Complete the string "mar"
def complete(r,prefix,count)
results = []
rangelen = 50 # This is not random, try to get replies < MTU size
start = r.zrank(:compl,prefix)
return [] if !start
while results.length != count
range = r.zrange(:compl,start,start+rangelen-1)
start += rangelen
break if !range or range.length == 0
range.each {|entry|
minlen = .min
if entry != prefix
count = results.count
break
end
if entry[-1..-1] == "*" and results.length != count
results
页:
[1]