redis数据类型改进和补充：zip2list和uintset

pangxia75 · 发表于 2016-12-21 08:10:06

ziplist和intset是redis的两种value格式。顾名思义，ziplist是压缩的list，intset是存放int的set。元素在里面都是被编成bytes。

ziplist的大概思路是如果存放的字符串可以被解析成int，则以int的编码保存，节省内存。编码成int时，head包含一个字节encoding，代表int的长度，分别是2bytes，4bytes，8bytes。具体的编码格式：

* |00pppppp| - 1 byte

* String value with length less than or equal to 63 bytes (6 bits).

* |01pppppp|qqqqqqqq| - 2 bytes

* String value with length less than or equal to 16383 bytes (14 bits).

* String value with length greater than or equal to 16384 bytes.

* |1100____| - 1 byte

* Integer encoded as int16_t (2 bytes).

* |1101____| - 1 byte

* Integer encoded as int32_t (4 bytes).

* |1110____| - 1 byte

* Integer encoded as int64_t (8 bytes).

intset以固定的长度编码int，有3种长度，int16，int32，int64。整个set的元素都用同一种编码，不需要元素head。这样节省一个head，当put一个int时，如果set的编码不足以容纳该元素，那么需要向上提升set的长度编码。并把存在的int重新编码拓展。这样就存在一个问题，如果set的元素大小参差不齐，那么浪费的空间会较多。

对于这两种类型的详细实现，有兴趣的同学去抠下代码或g下，网上很多分析，就不写重复的东西了。

最近的空闲时间在尝试做些redis的改进，一些想法见 http://christmaslin.iyunv.com/blog/1273189。实现一个ziplist的改进版zip2list，另外实现一个新的类型，uintset,在较多的场景里，都是使用uint。为uint专门优化还是值得的。

ziplist对int编码时，只是使用了encoding的4bit，另外的4bit浪费了。在zip2list，充分使用encoding的bit。具体的编码如下：

* |1100 0000|- 0 byte

* Integer 0.

* |1100 0001| - 1 byte

* +Integer encoded 1 bytes.(1~255)

* |1100 0010| - 2 byte

* +Integer encoded 2 bytes.

* |1100 0011| - 3 byte

* +Integer encoded 3 bytes.

* |1100 0100| - 4 byte

* +Integer encoded 4 bytes.

* |1100 0101| - 5 byte

* +Integer encoded 5 bytes.

* |1100 0110| - 6 byte

* +Integer encoded 6 bytes.

* |1100 0111| - 7 byte

* +Integer encoded 7 bytes.

* |1100 1000| - 8 byte

* +Integer encoded 8 bytes.

*

* |1110 0001| - 1 byte

* -Integer encoded 1 bytes.-(1-255)

* |1110 0010| - 2 byte

* -Integer encoded 2 bytes.

* |1110 0011| - 3 byte

* -Integer encoded 3 bytes.

* |1110 0100| - 4 byte

* -Integer encoded 4 bytes.

* |1110 0101| - 5 byte

* -Integer encoded 5 bytes.

* |1110 0110| - 6 byte

* -Integer encoded 6 bytes.

* |1110 0111| - 7 byte

* -Integer encoded 7 bytes.

* |1110 1000| - 8 byte

* -Integer encoded 8 bytes.

|1100 ----| ,value>= 0，|1110 ----| value 0,所有的value都编成uint，encoding已经带符号位了。例如 value{1} 编成 [|1100 0001| 0000 0001],value{-1} 编成 [|1110 0001| 0000 0001]. 这种编码方式一个字节都不会浪费。

对于uintset里的uint，则使用可变长编码，同样不需要使用head。可变长编码，一个int64的长度是1~10bytes。可变长编码的大概思路是一个byte只使用7bit，最后1bit用来代表是否编码结束，如果0，则结束，否则为1.这个小技巧在leveldb和Tokyo Cabinet都可以见到。（题外话，leveldb的思路和实现都值得一看，通过数据的版本号来解决读写的并发，精简的mvcc。此外，还像个单机版的hbase）。这个方案避免参差的int造成的浪费。当然，它可能也需要付出额外的字节。但在大部分情况下，对内存的利用应该比定长编码更优。当然，在性能上比intset，可能会有所损耗。对set的插入，删除，读取都需要查找，为了增加查找速度，int在里面是排序的，intet和uintset都是这样。但是intset的int编码长度固定，容易定位元素，2分查找是非常方便。而uintset的uint长度未知。在二分时不是针对uint的数目二分，而是针对set编码后的一个大byte array做2分。有2点可能造成性能损失：不是准确的二分和定位到一个byte时，要向前扫描找到uint的起始byte，虽然最多扫描不超过10byte。此外，要获取指定序号的uint，只能从起点向后或终点向前扫描（视乎序号更接近起点还是终点）。

在内存紧缺的情况下，uintset还是有存在的价值的。

这两种类型已经实现，相关的想法也提到官方社区。

账号		自动登录	找回密码
密码			立即注册

大疆运维招人啦，

C++ :try 语句块和异常处理

C++的多态

Red Hat RHCE 8 (EX294) Cert Guide

Java/C++ 区别：看完这一篇，就够用！

别再用过时库了！这 13 个顶级 C++ 库才是

c++ size_t 和 int 的区别

[经验分享] redis数据类型改进和补充：zip2list和uintset

浏览过的版块

扫码加入运维网微信交流群