Python是如何实现dictionary

wxin 发表于 2015-10-26 12:51:32

原文地址：http://www.laurentluce.com/posts/python-dictionary-implementation/　　
　　本文描述了Python是如何实现dictionary。dictionary是由主键key索引，可以被看作是关联数组，类似于STL的map。有如下的基本操作。
　　1
　　>>> d = {'a': 1, 'b': 2}
　　2
　　>>> d['c'] = 3
　　
　　3
　　>>> d
　　4
　　{'a': 1, 'b': 2, 'c': 3}
　　元素的访问方式如下：
　　01
　　>>> d['a']
　　02
　　1
　　　　03
　　>>> d['b']
　　04
　　2

　　
　　05
　　>>> d['c']
　　06
　　3
　　
　　07
　　>>> d['d']
　　08
　　Traceback (most recent call last):
　　
　　09
　　File "<stdin>", line 1, in <module>
　　10
　　KeyError: 'd'
　　如果key不存在，将会返回KeyError的异常。
Hash tables
　　Python中dictionary是以hash表实现，而hash表以数组表示，数组的索引作为主键key。Hash函数的目标是key均匀的分布于数组中，良好的hash函数能够最小化减少冲突的次数。
　　在本文中，我们采用string作为key为例子。
　　01
　　arguments: string object
　　02
　　returns: hash
　　
　　03
　　function string_hash:
　　04
　　 if hash cached:
　　
　　05
　　    return it
　　06
　　 set len to string's length
　　　　07
　　 initialize var p pointing to 1st char of string object
　　08
　　 set x to value pointed by p left shifted by 7 bits

　　
　　09
　　 while len >= 0:
　　10
　　    set var x to (1000003 * x) xor value pointed by p
　　
　　11
　　    increment pointer p
　　12
　　 set x to x xor length of string object
　　
　　13
　　 cache x as the hash so we don't need to calculate it again
　　14
　　 return x as the hash
　　如果你执行hash(‘a’)，函数将会执行string_hash()返回12416037344，在此我们假设采用的是64位系统。
　　假设数组的大小是x，该数组用来存储key/value，我们用掩码x-1计算这个数组的索引。例如，如果数组的大小是8，‘a’的索引是hash(‘a’)&7=0， ‘b’的索引是3， ‘c’的索引是2。’z’的索引是3与’b’的索引一致，由此导致冲突。

　　我们看到Python中的hash函数在key是连续的时候表现很好的性能，原因是它对于处理这种类型的数据有通用性。但是，一旦我们加入了’z’，由于数据的不连贯性产生了冲突。
　　当产生冲突的时候，我们可以采用链表来存储这对key/value，但是这样增加了查找时间，不再是O(1)的复杂度。在下一节中，我们将要具体描述一下Python中dictionary解决这种冲突的方法。
Open addressing
　　散列地址方法是解决hash冲突的探测策略。对于’z’，由于索引3已经被占用，所以我们需要寻找一个没有使用的索引，这样添加key/value的时候肯能花费更多的时间，但是查找时间依然是O(1)，这是我们所期望的结果。
　　这里采用quadratic探测序列来寻找可用的索引，代码如下：
　　　　1
　　i is the current slot index
　　2
　　set perturb to hash

　　
　　　　3
　　forever loop:
　　4
　　set i to i << 2 + i + perturb + 1

　　
　　　　5
　　set slot index to i & mask
　　6
　　if slot is free:

　　
　　　　7
　　    return it
　　8
　　right shift perturb by 5 bits

　　让我们来看看它是如何工作的，令i=3
3 -> 3 -> 5 -> 5 -> 6 -> 0…
　　索引5将被用作’z’的索引值，这不是一个很好的例子，因为我们采用了一个大小是8的数组。对于大数组，算法显示它的优势。
　　出于好奇，我们来看看当数组大小是32，mask是31时，探测序列为
3 -> 11 -> 19 -> 29 -> 5 -> 6 -> 16 -> 31 -> 28 -> 13 -> 2…
　　如果你想了解更多，可以察看dictobject.c的源代码。

　　接下来，让我们看看Python内部的代码是如何实现的？
Dictionary C structures
　　如下的C语言结构体用来存储一个dictionary条目：key/value对，结构体内有Hash值，key值和value值。PyObject是Python对象的基类。
　　　　1
　　typedef struct {
　　2
　　 Py_ssize_t me_hash;

　　
　　　　3
　　 PyObject *me_key;
　　4
　　 PyObject *me_value;

　　
　　　　5
　　} PyDictEntry;

　　如下的结构体代表了一个dictionary。ma_fill是已使用slot与未使用slot之和。当一对key/value被删除了，该slot被标记为未使用。ma_used是已使用slot的数目。ma_mask等于数组大小减一，用来计算slot的索引。ma_table是该数组，ma_smalltable是初始数组大小为8.
　　　　01
　　typedef struct _dictobject PyDictObject;
　　02
　　struct _dictobject {

　　
　　　　03
　　 PyObject_HEAD
　　04
　　 Py_ssize_t ma_fill;

　　
　　　　05
　　 Py_ssize_t ma_used;
　　06
　　 Py_ssize_t ma_mask;

　　
　　　　07
　　 PyDictEntry *ma_table;
　　08
　　 PyDictEntry *(*ma_lookup)(PyDictObject *mp, PyObject *key, long hash);

　　
　　　　09
　　 PyDictEntry ma_smalltable;
　　10
　　};

Dictionary initialization
　　当我们第一次创建一个dictionary的时候，将会调用PyDict_New()。我们以伪码表示该函数的主要功能。
　　　　1
　　returns new dictionary object
　　2
　　function PyDict_New:

　　
　　　　3
　　 allocate new dictionary object
　　4
　　 clear dictionary's table

　　
　　　　5
　　 set dictionary's number of used slots + dummy slots (ma_fill) to 0
　　6
　　 set dictionary's number of active slots (ma_used) to 0

　　
　　　　7
　　 set dictionary's mask (ma_value) to dictionary size - 1 = 7
　　8
　　 set dictionary's lookup function to lookdict_string

　　
　　　　9
　　 return allocated dictionary object

Adding items
　　当增加一对新的key/value时，将调用PyDict_SetItem()。该函数的参数有dictionary对象的指针和key/value对。它检测key是否为字符串以及计算hash值或者重复使用已经存在的index。Insertdict()用来增加新的key/value，在dictionary的大小被重新调整之后，而且已使用slot的数量超过数组大小的2/3.
　　为什么是2/3？这样能够保证probing序列能够快速地找到空闲的slot。稍后我们将看看调整数组大小的函数。
　　　　01
　　arguments: dictionary, key, value
　　02
　　returns: 0 if OK or -1

　　
　　　　03
　　function PyDict_SetItem:
　　04
　　 set mp to point to dictionary object

　　
　　　　05
　　 if key's hash cached:
　　06
　　    use hash

　　
　　　　07
　　 else:
　　08
　　    calculate hash

　　
　　　　09
　　 set n_used to dictionary's number of active slots (ma_used)
　　10
　　 call insertdict with dictionary object, key, hash and value

　　
　　　　11
　　 if key/value pair added successfully and capacity over 2/3:
　　12
　　    call dictresize to resize dictionary's table

　　Insertdict()使用查找函数来找到一个未使用的slot。接下来我们来看看这个函数，lookdict_string()利用hash值和掩码来计算slot的索引。如果它找不到slot索引的key等于hash&mask，它会使用扰码来探测。
　　　　01
　　arguments: dictionary object, key, hash
　　02
　　returns: dictionary entry

　　
　　　　03
　　function lookdict_string:
　　04
　　 calculate slot index based on hash and mask

　　
　　　　05
　　 if slot's key matches or slot's key is not set:
　　06
　　    returns slot's entry

　　
　　　　07
　　 if slot's key marked as dummy (was active):
　　08
　　    set freeslot to this slot's entry

　　
　　　　09
　　 else:
　　10
　　    if slot's hash equals to hash and slot's key equals to key:

　　
　　　　11
　　          return slot's entry
　　12
　　    set var freeslot to null

　　
　　　　13
　　 we are here because we couldn't find the key so we start probing
　　14
　　 set perturb to hash

　　
　　　　15
　　 forever loop:
　　16
　　    set i to i << 2 + i + perturb + 1

　　
　　　　17
　　    calculate slot index based on i and mask
　　18
　　    if slot's key is null:

　　
　　　　19
　　          if freeslot is null:
　　20
　　             return slot's entry

　　
　　　　21
　　          else:
　　22
　　             return freeslot

　　
　　　　23
　　    if slot's key equals to key or slot's hash equals to hash
　　24
　　          and slot is not marked as dummy:

　　
　　　　25
　　          return slot's entry
　　26
　　    if slot marked as dummy and freeslot is null:

　　
　　　　27
　　          set freeslot to slot's entry
　　28
　　    right shift perturb by 5 bits

　　我们在dictionary中增加如下的key/value：{‘a’: 1, ‘b’: 2′, ‘z’: 26, ‘y’: 25, ‘c’: 5, ‘x’: 24}。流程如下：
　　A dictionary structure is allocated with internal table size of 8.

[*]PyDict_SetItem: key = ‘a’, value = 1

[*]hash = hash(‘a’) = 12416037344
[*]insertdict

[*]lookdict_string

[*]slot index = hash & mask = 12416037344 & 7 = 0
[*]slot 0 is not used so return it

[*]init entry at index 0 with key, value and hash
[*]ma_used = 1, ma_fill = 1

[*]PyDict_SetItem: key = ‘b’, value = 2

[*]hash = hash(‘b’) = 12544037731
[*]insertdict

[*]lookdict_string

[*]slot index = hash & mask = 12544037731 & 7 = 3
[*]slot 3 is not used so return it

[*]init entry at index 3 with key, value and hash
[*]ma_used = 2, ma_fill = 2

[*]PyDict_SetItem: key = ‘z’, value = 26

[*]hash = hash(‘z’) = 15616046971
[*]insertdict

[*]lookdict_string

[*]slot index = hash & mask = 15616046971 & 7 = 3
[*]slot 3 is used so probe for a different slot: 5 is free

[*]init entry at index 5 with key, value and hash
[*]ma_used = 3, ma_fill = 3

[*]PyDict_SetItem: key = ‘y’, value = 25

[*]hash = hash(‘y’) = 15488046584
[*]insertdict

[*]lookdict_string

[*]slot index = hash & mask = 15488046584 & 7 = 0
[*]slot 0 is used so probe for a different slot: 1 is free

[*]init entry at index 1 with key, value and hash
[*]ma_used = 4, ma_fill = 4

[*]PyDict_SetItem: key = ‘c’, value = 3

[*]hash = hash(‘c’) = 12672038114
[*]insertdict

[*]lookdict_string

[*]slot index = hash & mask = 12672038114 & 7 = 2
[*]slot 2 is free so return it

[*]init entry at index 2 with key, value and hash
[*]ma_used = 5, ma_fill = 5

[*]PyDict_SetItem: key = ‘x’, value = 24

[*]hash = hash(‘x’) = 15360046201
[*]insertdict

[*]lookdict_string

[*]slot index = hash & mask = 15360046201 & 7 = 1
[*]slot 1 is used so probe for a different slot: 7 is free

[*]init entry at index 7 with key, value and hash
[*]ma_used = 6, ma_fill = 6

　　This is what we have so far:

　　由于8个slot中的6已经被使用了即超过了数组容量的2/3，dictresize()被调用来分配更大的内存，该函数同时拷贝旧数据表的数据进入新数据表。
　　在这个例子中，dictresize()调用时，minused是24也就是4×ma_used。当已使用slot的数目很大的时候(超过50000)minused是2×ma_used。为什么是4倍？这样能够减少调整步骤的数目而且增加稀疏度。
　　v新数据表的大少应该大于24，它是通过把当前数据表的大小左移一位实现的，直到她大于24.例如 8 -> 16 -> 32.
　　　　01
　　arguments: dictionary object, (2 or 4) * active slots
　　02
　　returns: 0 if OK, -1 otherwise

　　
　　　　03
　　function dictresize:
　　04
　　 calculate new dictionary size:

　　
　　　　05
　　    set var newsize to dictionary size
　　06
　　    while newsize less or equal than (2 or 4) * active slots:

　　
　　　　07
　　          set newsize to newsize left shifted by 1 bit
　　08
　　 set oldtable to dictionary's table

　　
　　　　09
　　 allocate new dictionary table
　　10
　　 set dictionary's mask to newsize - 1

　　
　　　　11
　　 clear dictionary's table
　　12
　　 set dictionary's active slots (ma_used) to 0

　　
　　　　13
　　 set var i to dictionary's active + dummy slots (ma_fill)
　　14
　　 set dictionary's active + dummy slots (ma_fill) to 0

　　
　　　　15
　　 copy oldtable entries to dictionary's table using new mask
　　16
　　 return 0

　　
　　　　17
　　}

　　当数据表大小进行调整的时候所发生的：分配了一个大小是32的新数据表

　　
Removing items
　　函数PyDict_DelItem()用来删除一个条目。计算key的hash值，然后调用查找函数返回这个条目。该条目的key设置为哑元。这个哑元包含了过去使用过的key值但现在还是未使用的状态。
　　01
　　arguments: dictionary object, key
　　02
　　returns 0 if OK, -1 otherwise
　　
　　03
　　function PyDict_DelItem:
　　04
　　 if key's hash cached:
　　
　　05
　　    use hash
　　06
　　 else:
　　
　　07
　　    calculate hash
　　08
　　 look for key in dictionary using hash
　　
　　09
　　 if slot not found:
　　10
　　    return -1
　　
　　11
　　 set slot's key to dummy
　　12
　　 set slot's value to null
　　
　　13
　　 decrement dictionary active slots
　　14
　　 return 0
　　如果我要从dictionary中删除key ’c’，将以如下的数组结束。

　　注意：删除操作并没有改变数组的大小，即使已使用slot的数目远小于slot的总数目。
　　
　　Thanks

页: [1]

运维网's Archiver

Python是如何实现dictionary