05
return it
06
set len to string's length
07
initialize var p pointing to 1st char of string object
08
set x to value pointed by p left shifted by 7 bits
09
while len >= 0:
10
set var x to (1000003 * x) xor value pointed by p
11
increment pointer p
12
set x to x xor length of string object
13
cache x as the hash so we don't need to calculate it again
14
return x as the hash
如果你执行hash(‘a’),函数将会执行string_hash()返回12416037344,在此我们假设采用的是64位系统。
假设数组的大小是x, 该数组用来存储key/value, 我们用掩码x-1计算这个数组的索引。例如,如果数组的大小是8,‘a’的索引是hash(‘a’)&7=0, ‘b’的索引是3, ‘c’的索引是2。’z’的索引是3与’b’的索引一致,由此导致冲突。
我们看到Python中的hash函数在key是连续的时候表现很好的性能,原因是它对于处理这种类型的数据有通用性。但是,一旦我们加入了’z’,由于数据的不连贯性产生了冲突。
当产生冲突的时候,我们可以采用链表来存储这对key/value,但是这样增加了查找时间,不再是O(1)的复杂度。在下一节中,我们将要具体描述一下Python中dictionary解决这种冲突的方法。 Open addressing
散列地址方法是解决hash冲突的探测策略。对于’z’,由于索引3已经被占用,所以我们需要寻找一个没有使用的索引,这样添加key/value的时候肯能花费更多的时间,但是查找时间依然是O(1),这是我们所期望的结果。
这里采用quadratic探测序列来寻找可用的索引,代码如下:
1
i is the current slot index
2
set perturb to hash
3
forever loop:
4
set i to i << 2 + i + perturb + 1
Dictionary initialization
当我们第一次创建一个dictionary的时候,将会调用PyDict_New()。我们以伪码表示该函数的主要功能。
1
returns new dictionary object
2
function PyDict_New:
3
allocate new dictionary object
4
clear dictionary's table
5
set dictionary's number of used slots + dummy slots (ma_fill) to 0
6
set dictionary's number of active slots (ma_used) to 0
7
set dictionary's mask (ma_value) to dictionary size - 1 = 7
8
set dictionary's lookup function to lookdict_string
9
return allocated dictionary object
Adding items
当增加一对新的key/value时,将调用PyDict_SetItem()。该函数的参数有dictionary对象的指针和key/value对。它检测key是否为字符串以及计算hash值或者重复使用已经存在的index。Insertdict()用来增加新的key/value,在dictionary的大小被重新调整之后,而且已使用slot的数量超过数组大小的2/3.
为什么是2/3?这样能够保证probing序列能够快速地找到空闲的slot。稍后我们将看看调整数组大小的函数。
01
arguments: dictionary, key, value
02
returns: 0 if OK or -1
03
function PyDict_SetItem:
04
set mp to point to dictionary object
05
if key's hash cached:
06
use hash
07
else:
08
calculate hash
09
set n_used to dictionary's number of active slots (ma_used)
10
call insertdict with dictionary object, key, hash and value
11
if key/value pair added successfully and capacity over 2/3:
12
call dictresize to resize dictionary's table
03
function lookdict_string:
04
calculate slot index based on hash and mask
05
if slot's key matches or slot's key is not set:
06
returns slot's entry
07
if slot's key marked as dummy (was active):
08
set freeslot to this slot's entry
09
else:
10
if slot's hash equals to hash and slot's key equals to key:
11
return slot's entry
12
set var freeslot to null
13
we are here because we couldn't find the key so we start probing
14
set perturb to hash
15
forever loop:
16
set i to i << 2 + i + perturb + 1
17
calculate slot index based on i and mask
18
if slot's key is null:
19
if freeslot is null:
20
return slot's entry
21
else:
22
return freeslot
23
if slot's key equals to key or slot's hash equals to hash
24
and slot is not marked as dummy:
25
return slot's entry
26
if slot marked as dummy and freeslot is null:
27
set freeslot to slot's entry
28
right shift perturb by 5 bits
我们在dictionary中增加如下的key/value:{‘a’: 1, ‘b’: 2′, ‘z’: 26, ‘y’: 25, ‘c’: 5, ‘x’: 24}。流程如下:
A dictionary structure is allocated with internal table size of 8.
PyDict_SetItem: key = ‘a’, value = 1
hash = hash(‘a’) = 12416037344
insertdict
lookdict_string
slot index = hash & mask = 12416037344 & 7 = 0
slot 0 is not used so return it
init entry at index 0 with key, value and hash
ma_used = 1, ma_fill = 1
PyDict_SetItem: key = ‘b’, value = 2
hash = hash(‘b’) = 12544037731
insertdict
lookdict_string
slot index = hash & mask = 12544037731 & 7 = 3
slot 3 is not used so return it
init entry at index 3 with key, value and hash
ma_used = 2, ma_fill = 2
PyDict_SetItem: key = ‘z’, value = 26
hash = hash(‘z’) = 15616046971
insertdict
lookdict_string
slot index = hash & mask = 15616046971 & 7 = 3
slot 3 is used so probe for a different slot: 5 is free
init entry at index 5 with key, value and hash
ma_used = 3, ma_fill = 3
PyDict_SetItem: key = ‘y’, value = 25
hash = hash(‘y’) = 15488046584
insertdict
lookdict_string
slot index = hash & mask = 15488046584 & 7 = 0
slot 0 is used so probe for a different slot: 1 is free
init entry at index 1 with key, value and hash
ma_used = 4, ma_fill = 4
PyDict_SetItem: key = ‘c’, value = 3
hash = hash(‘c’) = 12672038114
insertdict
lookdict_string
slot index = hash & mask = 12672038114 & 7 = 2
slot 2 is free so return it
init entry at index 2 with key, value and hash
ma_used = 5, ma_fill = 5
PyDict_SetItem: key = ‘x’, value = 24
hash = hash(‘x’) = 15360046201
insertdict
lookdict_string
slot index = hash & mask = 15360046201 & 7 = 1
slot 1 is used so probe for a different slot: 7 is free
init entry at index 7 with key, value and hash
ma_used = 6, ma_fill = 6
This is what we have so far:
由于8个slot中的6已经被使用了即超过了数组容量的2/3,dictresize()被调用来分配更大的内存,该函数同时拷贝旧数据表的数据进入新数据表。
在这个例子中,dictresize()调用时,minused是24也就是4×ma_used。当已使用slot的数目很大的时候(超过50000)minused是2×ma_used。为什么是4倍?这样能够减少调整步骤的数目而且增加稀疏度。
v新数据表的大少应该大于24,它是通过把当前数据表的大小左移一位实现的,直到她大于24.例如 8 -> 16 -> 32.
01
arguments: dictionary object, (2 or 4) * active slots
02
returns: 0 if OK, -1 otherwise
03
function dictresize:
04
calculate new dictionary size:
05
set var newsize to dictionary size
06
while newsize less or equal than (2 or 4) * active slots:
07
set newsize to newsize left shifted by 1 bit
08
set oldtable to dictionary's table
09
allocate new dictionary table
10
set dictionary's mask to newsize - 1
11
clear dictionary's table
12
set dictionary's active slots (ma_used) to 0
13
set var i to dictionary's active + dummy slots (ma_fill)
14
set dictionary's active + dummy slots (ma_fill) to 0
15
copy oldtable entries to dictionary's table using new mask
16
return 0