Python是如何实现dictionary
原文地址:http://www.laurentluce.com/posts/python-dictionary-implementation/本文描述了Python是如何实现dictionary。dictionary是由主键key索引,可以被看作是关联数组,类似于STL的map。有如下的基本操作。
1
>>> d = {'a': 1, 'b': 2}
2
>>> d['c'] = 3
3
>>> d
4
{'a': 1, 'b': 2, 'c': 3}
元素的访问方式如下:
01
>>> d['a']
02
1
03
>>> d['b']
04
2
05
>>> d['c']
06
3
07
>>> d['d']
08
Traceback (most recent call last):
09
File "<stdin>", line 1, in <module>
10
KeyError: 'd'
如果key不存在,将会返回KeyError的异常。
Hash tables
Python中dictionary是以hash表实现,而hash表以数组表示,数组的索引作为主键key。Hash函数的目标是key均匀的分布于数组中,良好的hash函数能够最小化减少冲突的次数。
在本文中,我们采用string作为key为例子。
01
arguments: string object
02
returns: hash
03
function string_hash:
04
if hash cached:
05
return it
06
set len to string's length
07
initialize var p pointing to 1st char of string object
08
set x to value pointed by p left shifted by 7 bits
09
while len >= 0:
10
set var x to (1000003 * x) xor value pointed by p
11
increment pointer p
12
set x to x xor length of string object
13
cache x as the hash so we don't need to calculate it again
14
return x as the hash
如果你执行hash(‘a’),函数将会执行string_hash()返回12416037344,在此我们假设采用的是64位系统。
假设数组的大小是x, 该数组用来存储key/value, 我们用掩码x-1计算这个数组的索引。例如,如果数组的大小是8,‘a’的索引是hash(‘a’)&7=0, ‘b’的索引是3, ‘c’的索引是2。’z’的索引是3与’b’的索引一致,由此导致冲突。
我们看到Python中的hash函数在key是连续的时候表现很好的性能,原因是它对于处理这种类型的数据有通用性。但是,一旦我们加入了’z’,由于数据的不连贯性产生了冲突。
当产生冲突的时候,我们可以采用链表来存储这对key/value,但是这样增加了查找时间,不再是O(1)的复杂度。在下一节中,我们将要具体描述一下Python中dictionary解决这种冲突的方法。
Open addressing
散列地址方法是解决hash冲突的探测策略。对于’z’,由于索引3已经被占用,所以我们需要寻找一个没有使用的索引,这样添加key/value的时候肯能花费更多的时间,但是查找时间依然是O(1),这是我们所期望的结果。
这里采用quadratic探测序列来寻找可用的索引,代码如下:
1
i is the current slot index
2
set perturb to hash
3
forever loop:
4
set i to i << 2 + i + perturb + 1
5
set slot index to i & mask
6
if slot is free:
7
return it
8
right shift perturb by 5 bits
让我们来看看它是如何工作的,令i=3
3 -> 3 -> 5 -> 5 -> 6 -> 0…
索引5将被用作’z’的索引值,这不是一个很好的例子,因为我们采用了一个大小是8的数组。对于大数组,算法显示它的优势。
出于好奇,我们来看看当数组大小是32,mask是31时,探测序列为
3 -> 11 -> 19 -> 29 -> 5 -> 6 -> 16 -> 31 -> 28 -> 13 -> 2…
如果你想了解更多,可以察看dictobject.c的源代码。
接下来,让我们看看Python内部的代码是如何实现的?
Dictionary C structures
如下的C语言结构体用来存储一个dictionary条目:key/value对,结构体内有Hash值,key值和value值。PyObject是Python对象的基类。
1
typedef struct {
2
Py_ssize_t me_hash;
3
PyObject *me_key;
4
PyObject *me_value;
5
} PyDictEntry;
如下的结构体代表了一个dictionary。ma_fill是已使用slot与未使用slot之和。当一对key/value被删除了,该slot被标记为未使用。ma_used是已使用slot的数目。ma_mask等于数组大小减一,用来计算slot的索引。ma_table是该数组,ma_smalltable是初始数组大小为8.
01
typedef struct _dictobject PyDictObject;
02
struct _dictobject {
03
PyObject_HEAD
04
Py_ssize_t ma_fill;
05
Py_ssize_t ma_used;
06
Py_ssize_t ma_mask;
07
PyDictEntry *ma_table;
08
PyDictEntry *(*ma_lookup)(PyDictObject *mp, PyObject *key, long hash);
09
PyDictEntry ma_smalltable;
10
};
Dictionary initialization
当我们第一次创建一个dictionary的时候,将会调用PyDict_New()。我们以伪码表示该函数的主要功能。
1
returns new dictionary object
2
function PyDict_New:
3
allocate new dictionary object
4
clear dictionary's table
5
set dictionary's number of used slots + dummy slots (ma_fill) to 0
6
set dictionary's number of active slots (ma_used) to 0
7
set dictionary's mask (ma_value) to dictionary size - 1 = 7
8
set dictionary's lookup function to lookdict_string
9
return allocated dictionary object
Adding items
当增加一对新的key/value时,将调用PyDict_SetItem()。该函数的参数有dictionary对象的指针和key/value对。它检测key是否为字符串以及计算hash值或者重复使用已经存在的index。Insertdict()用来增加新的key/value,在dictionary的大小被重新调整之后,而且已使用slot的数量超过数组大小的2/3.
为什么是2/3?这样能够保证probing序列能够快速地找到空闲的slot。稍后我们将看看调整数组大小的函数。
01
arguments: dictionary, key, value
02
returns: 0 if OK or -1
03
function PyDict_SetItem:
04
set mp to point to dictionary object
05
if key's hash cached:
06
use hash
07
else:
08
calculate hash
09
set n_used to dictionary's number of active slots (ma_used)
10
call insertdict with dictionary object, key, hash and value
11
if key/value pair added successfully and capacity over 2/3:
12
call dictresize to resize dictionary's table
Insertdict()使用查找函数来找到一个未使用的slot。接下来我们来看看这个函数,lookdict_string()利用hash值和掩码来计算slot的索引。如果它找不到slot索引的key等于hash&mask,它会使用扰码来探测。
01
arguments: dictionary object, key, hash
02
returns: dictionary entry
03
function lookdict_string:
04
calculate slot index based on hash and mask
05
if slot's key matches or slot's key is not set:
06
returns slot's entry
07
if slot's key marked as dummy (was active):
08
set freeslot to this slot's entry
09
else:
10
if slot's hash equals to hash and slot's key equals to key:
11
return slot's entry
12
set var freeslot to null
13
we are here because we couldn't find the key so we start probing
14
set perturb to hash
15
forever loop:
16
set i to i << 2 + i + perturb + 1
17
calculate slot index based on i and mask
18
if slot's key is null:
19
if freeslot is null:
20
return slot's entry
21
else:
22
return freeslot
23
if slot's key equals to key or slot's hash equals to hash
24
and slot is not marked as dummy:
25
return slot's entry
26
if slot marked as dummy and freeslot is null:
27
set freeslot to slot's entry
28
right shift perturb by 5 bits
我们在dictionary中增加如下的key/value:{‘a’: 1, ‘b’: 2′, ‘z’: 26, ‘y’: 25, ‘c’: 5, ‘x’: 24}。流程如下:
A dictionary structure is allocated with internal table size of 8.
[*]PyDict_SetItem: key = ‘a’, value = 1
[*]hash = hash(‘a’) = 12416037344
[*]insertdict
[*]lookdict_string
[*]slot index = hash & mask = 12416037344 & 7 = 0
[*]slot 0 is not used so return it
[*]init entry at index 0 with key, value and hash
[*]ma_used = 1, ma_fill = 1
[*]PyDict_SetItem: key = ‘b’, value = 2
[*]hash = hash(‘b’) = 12544037731
[*]insertdict
[*]lookdict_string
[*]slot index = hash & mask = 12544037731 & 7 = 3
[*]slot 3 is not used so return it
[*]init entry at index 3 with key, value and hash
[*]ma_used = 2, ma_fill = 2
[*]PyDict_SetItem: key = ‘z’, value = 26
[*]hash = hash(‘z’) = 15616046971
[*]insertdict
[*]lookdict_string
[*]slot index = hash & mask = 15616046971 & 7 = 3
[*]slot 3 is used so probe for a different slot: 5 is free
[*]init entry at index 5 with key, value and hash
[*]ma_used = 3, ma_fill = 3
[*]PyDict_SetItem: key = ‘y’, value = 25
[*]hash = hash(‘y’) = 15488046584
[*]insertdict
[*]lookdict_string
[*]slot index = hash & mask = 15488046584 & 7 = 0
[*]slot 0 is used so probe for a different slot: 1 is free
[*]init entry at index 1 with key, value and hash
[*]ma_used = 4, ma_fill = 4
[*]PyDict_SetItem: key = ‘c’, value = 3
[*]hash = hash(‘c’) = 12672038114
[*]insertdict
[*]lookdict_string
[*]slot index = hash & mask = 12672038114 & 7 = 2
[*]slot 2 is free so return it
[*]init entry at index 2 with key, value and hash
[*]ma_used = 5, ma_fill = 5
[*]PyDict_SetItem: key = ‘x’, value = 24
[*]hash = hash(‘x’) = 15360046201
[*]insertdict
[*]lookdict_string
[*]slot index = hash & mask = 15360046201 & 7 = 1
[*]slot 1 is used so probe for a different slot: 7 is free
[*]init entry at index 7 with key, value and hash
[*]ma_used = 6, ma_fill = 6
This is what we have so far:
由于8个slot中的6已经被使用了即超过了数组容量的2/3,dictresize()被调用来分配更大的内存,该函数同时拷贝旧数据表的数据进入新数据表。
在这个例子中,dictresize()调用时,minused是24也就是4×ma_used。当已使用slot的数目很大的时候(超过50000)minused是2×ma_used。为什么是4倍?这样能够减少调整步骤的数目而且增加稀疏度。
v新数据表的大少应该大于24,它是通过把当前数据表的大小左移一位实现的,直到她大于24.例如 8 -> 16 -> 32.
01
arguments: dictionary object, (2 or 4) * active slots
02
returns: 0 if OK, -1 otherwise
03
function dictresize:
04
calculate new dictionary size:
05
set var newsize to dictionary size
06
while newsize less or equal than (2 or 4) * active slots:
07
set newsize to newsize left shifted by 1 bit
08
set oldtable to dictionary's table
09
allocate new dictionary table
10
set dictionary's mask to newsize - 1
11
clear dictionary's table
12
set dictionary's active slots (ma_used) to 0
13
set var i to dictionary's active + dummy slots (ma_fill)
14
set dictionary's active + dummy slots (ma_fill) to 0
15
copy oldtable entries to dictionary's table using new mask
16
return 0
17
}
当数据表大小进行调整的时候所发生的:分配了一个大小是32的新数据表
Removing items
函数PyDict_DelItem()用来删除一个条目。计算key的hash值,然后调用查找函数返回这个条目。该条目的key设置为哑元。这个哑元包含了过去使用过的key值但现在还是未使用的状态。
01
arguments: dictionary object, key
02
returns 0 if OK, -1 otherwise
03
function PyDict_DelItem:
04
if key's hash cached:
05
use hash
06
else:
07
calculate hash
08
look for key in dictionary using hash
09
if slot not found:
10
return -1
11
set slot's key to dummy
12
set slot's value to null
13
decrement dictionary active slots
14
return 0
如果我要从dictionary中删除key ’c’,将以如下的数组结束。
注意:删除操作并没有改变数组的大小,即使已使用slot的数目远小于slot的总数目。
Thanks
页:
[1]