Natural Language Processing with Python, Study Notes (42): 5.3 Mapping Words to Properties Using Python Dictionaries

kidys · Posted on 2015-4-27 10:21:35

　　
5.3 Mapping Words to Properties Using Python Dictionaries
　　
As we have seen, a tagged word of the form (word, tag) is an association between a word and a part-of-speech tag. Once we start doing part-of-speech tagging, we will be creating programs that assign a tag to a word, the tag which is most likely in a given context. We can think of this process as mapping from words to tags. The most natural way to store mappings in Python uses the so-called dictionary data type (also known as an associative array or hash array in other programming languages). In this section, we look at dictionaries and see how they can represent a variety of language information, including parts-of-speech.
　　
　　
Indexing Lists Versus Dictionaries
　　
A text, as we have seen, is treated in Python as a list of words. An important property of lists is that we can “look up” a particular item by giving its index, e.g., text1[100]. Notice how we specify a number and get back a word. We can think of a list as a simple kind of table, as shown in Figure 5-2.
　　
　　

Figure 5-2. List lookup: We access the contents of a Python list with the help of an integer index.
　　
　　
Contrast this situation with frequency distributions (Section 1.3), where we specify a word and get back a number, e.g., fdist['monstrous'], which tells us the number of times a given word has occurred in a text. Lookup using words is familiar to anyone who has used a dictionary. Some more examples are shown in Figure 5-3.
　　

　　
Figure 5-3. Dictionary lookup: we access the entry of a dictionary using a key such as someone’s name, a web domain, or an English word; other names for dictionary are map, hashmap, hash, and associative array.
　　

　　
In the case of a phonebook, we look up an entry using a name and get back a number. When we type a domain name in a web browser, the computer looks this up to get back an IP address. A word frequency table allows us to look up a word and find its frequency in a text collection. In all these cases, we are mapping from names to numbers, rather than the other way around as with a list. In general, we would like to be able to map between arbitrary types of information. Table 5-4 lists a variety of linguistic objects, along with what they map.
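The contrast between the two kinds of lookup can be sketched in a few lines. This is a minimal Python 3 illustration; the word list and the phonebook entries are invented for the example and do not come from the book.

```python
# A list maps integer positions to items.
words = ['Call', 'me', 'Ishmael', '.']
third_word = words[2]              # look up by position

# A dictionary maps arbitrary keys (here, names) to values.
phonebook = {'Alice': '555-0100', 'Bob': '555-0199'}   # invented entries
alice_number = phonebook['Alice']  # look up by name

print(third_word)
print(alice_number)
```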
　　

　　
Table 5-4. Linguistic objects as mappings from keys to values

Linguistic object      Maps from       Maps to
Document Index         Word            List of pages (where word is found)
Thesaurus              Word sense      List of synonyms
Dictionary             Headword        Entry (part-of-speech, sense definitions, etymology)
Comparative Wordlist   Gloss term      Cognates (list of words, one per language)
Morph Analyzer         Surface form    Morphological analysis (list of component morphemes)
　　

　　
Most often, we are mapping from a “word” to some structured object. For example, a document index maps from a word (which we can represent as a string) to a list of pages (represented as a list of integers). In this section, we will see how to represent such mappings in Python.
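As a rough sketch of the document-index row of Table 5-4, a word can be mapped to a list of integers. The words and page numbers below are made up purely for illustration.

```python
# Hypothetical document index: each word maps to the pages where it occurs.
doc_index = {
    'dictionary': [189, 190, 194],
    'tagger': [179, 198],
}

tagger_pages = doc_index['tagger']            # the value is itself a list
first_page = doc_index['dictionary'][0]       # so it can be indexed in turn
print(tagger_pages, first_page)
```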
　　

　　
Dictionaries in Python
　　
Python provides a dictionary data type that can be used for mapping between arbitrary types. It is like a conventional dictionary, in that it gives you an efficient way to look things up. However, as we see from Table 5-4, it has a much wider range of uses.
　　
To illustrate, we define pos to be an empty dictionary and then add four entries to it, specifying the part-of-speech of some words. We add entries to a dictionary using the familiar square bracket notation:
　　
  >>> pos = {}
　　
  >>> pos
　　
  {}
　　
  >>> pos['colorless'] = 'ADJ' ①
　　
  >>> pos
　　
  {'colorless': 'ADJ'}
　　
  >>> pos['ideas'] = 'N'
　　
  >>> pos['sleep'] = 'V'
　　
  >>> pos['furiously'] = 'ADV'
　　
  >>> pos ②
　　
  {'furiously': 'ADV', 'ideas': 'N', 'colorless': 'ADJ', 'sleep': 'V'}
　　
So, for example, ① says that the part-of-speech of colorless is adjective, or more specifically, that the key 'colorless' is assigned the value 'ADJ' in dictionary pos. When we inspect the value of pos ② we see a set of key-value pairs. Once we have populated the dictionary in this way, we can employ the keys to retrieve values:
　　
  >>> pos['ideas']
　　
  'N'
　　
  >>> pos['colorless']
　　
  'ADJ'
　　
Of course, we might accidentally use a key that hasn’t been assigned a value.
　　
  >>> pos['green']
　　
  Traceback (most recent call last):
　　
    File "<stdin>", line 1, in ?
　　
  KeyError: 'green'
　　

　　
This raises an important question. Unlike lists and strings, where we can use len() to work out which integers will be legal indexes, how do we work out the legal keys for a dictionary? If the dictionary is not too big, we can simply inspect its contents by evaluating the variable pos. As we saw earlier in line ②, this gives us the key-value pairs. Notice that they are not in the same order they were originally entered; this is because dictionaries are not sequences but mappings (see Figure 5-3), and the keys are not inherently ordered.
　　

　　
Alternatively, to just find the keys, we can either convert the dictionary to a list ① or use the dictionary in a context where a list is expected, as the parameter of sorted() ② or in a for loop ③.
　　
  >>> list(pos) ①
  ['ideas', 'furiously', 'colorless', 'sleep']
  >>> sorted(pos) ②
  ['colorless', 'furiously', 'ideas', 'sleep']
  >>> [w for w in pos if w.endswith('s')] ③
  ['colorless', 'ideas']
　　

　　
As well as iterating over all keys in the dictionary with a for loop, we can use the for loop as we did for printing lists:
　　
  >>> for word in sorted(pos):
  ...     print word + ":", pos[word]  # unlike +, the comma inserts a space between the printed items
  ...
  colorless: ADJ
  furiously: ADV
  ideas: N
  sleep: V
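The KeyError shown earlier can be avoided by checking for a key before using it. The following is a minimal Python 3 sketch using the built-in `in` operator and the `get()` method; the fallback string 'UNKNOWN' is an arbitrary placeholder chosen for this example.

```python
pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'}

# Membership test: checks the keys and never raises KeyError.
has_green = 'green' in pos

# get() returns a fallback instead of raising when the key is absent.
green_tag = pos.get('green', 'UNKNOWN')   # 'UNKNOWN' is an arbitrary placeholder
ideas_tag = pos.get('ideas', 'UNKNOWN')

print(has_green, green_tag, ideas_tag)
```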
　　
Finally, the dictionary methods keys(), values(), and items() allow us to access the keys, values, and key-value pairs as separate lists. We can even sort tuples ①, which orders them according to their first element (and if the first elements are the same, it uses their second elements).
　　
  >>> pos.keys()
  ['colorless', 'furiously', 'sleep', 'ideas']
  >>> pos.values()
  ['ADJ', 'ADV', 'V', 'N']
  >>> pos.items()
  [('colorless', 'ADJ'), ('furiously', 'ADV'), ('sleep', 'V'), ('ideas', 'N')]
  >>> for key, val in sorted(pos.items()): ①
  ...     print key + ":", val
  ...
  colorless: ADJ
  furiously: ADV
  ideas: N
  sleep: V
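The transcripts in these notes are Python 2. For readers on Python 3, here is a hedged sketch of the same calls. There, keys(), values(), and items() return view objects rather than lists, so they are wrapped in list() here; and from Python 3.7 onward a plain dict keeps insertion order, which the key order below relies on.

```python
pos = {}
pos['colorless'] = 'ADJ'
pos['ideas'] = 'N'
pos['sleep'] = 'V'
pos['furiously'] = 'ADV'

keys = list(pos.keys())       # wrap the view in list() to get a real list
values = list(pos.values())
pairs = sorted(pos.items())   # tuples sort by first element, then by second

for key, val in pairs:
    print(key + ":", val)
```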
　　
We want to be sure that when we look something up in a dictionary, we get only one value for each key. Now suppose we try to use a dictionary to store the fact that the word sleep can be used as both a verb and a noun:
　　
  >>> pos['sleep'] = 'V'
　　
  >>> pos['sleep']
　　
  'V'
　　
  >>> pos['sleep'] = 'N'
　　
  >>> pos['sleep']
　　
  'N'
　　
Initially, pos['sleep'] is given the value 'V'. But this is immediately overwritten with the new value, 'N'. In other words, there can be only one entry in the dictionary for 'sleep'. However, there is a way of storing multiple values in that entry: we use a list value, e.g., pos['sleep'] = ['N', 'V']. In fact, this is what we saw in Section 2.4 for the CMU Pronouncing Dictionary, which stores multiple pronunciations for a single word.
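One possible way to build such list values incrementally with a plain dictionary is the built-in setdefault() method. This is a sketch in Python 3; the short list of (word, tag) pairs is invented for the example.

```python
pos = {}
tagged = [('sleep', 'V'), ('ideas', 'N'), ('sleep', 'N')]   # invented sample

for word, tag in tagged:
    # setdefault() returns the existing list for word, or stores and
    # returns a new empty list if word has no entry yet.
    pos.setdefault(word, []).append(tag)

print(pos['sleep'])
```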
　　

　　
Defining Dictionaries
　　
We can use the same key-value pair format to create a dictionary. There are a couple of ways to do this, and we will normally use the first:
　　
  >>> pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'}
  >>> pos = dict(colorless='ADJ', ideas='N', sleep='V', furiously='ADV')
　　
Note that dictionary keys must be immutable types, such as strings and tuples. If we try to define a dictionary using a mutable key, we get a TypeError:
　　
  >>> pos = {['ideas', 'blogs', 'adventures']: 'N'}
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  TypeError: list objects are unhashable
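To make the contrast concrete, the same three words can serve as a key once they are packed into a tuple. This is a minimal Python 3 sketch; note that the wording of the error message differs between Python versions, so the code only captures it rather than assuming the exact text.

```python
pos = {}
pos[('ideas', 'blogs', 'adventures')] = 'N'   # a tuple is immutable, so this works

try:
    pos[['ideas', 'blogs', 'adventures']] = 'N'   # a list is mutable
except TypeError as err:
    error_message = str(err)   # wording varies across Python versions

print(error_message)
```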
　　

　　
　　
Default Dictionaries
　　
If we try to access a key that is not in a dictionary, we get an error. However, it’s often useful if a dictionary can automatically create an entry for this new key and give it a default value, such as zero or the empty list. Since Python 2.5, a special kind of dictionary called a defaultdict has been available. (It is provided as nltk.defaultdict for the benefit of readers who are using Python 2.4.) In order to use it, we have to supply a parameter which can be used to create the default value, e.g., int, float, str, list, dict, tuple. (Note: we first specify a default type, which then supplies the default value for any key that has not been assigned one.)
　　
  >>> frequency = nltk.defaultdict(int)
  >>> frequency['colorless'] = 4
  >>> frequency['ideas']
  0
  >>> pos = nltk.defaultdict(list)
  >>> pos['sleep'] = ['N', 'V']
  >>> pos['ideas']
  []
　　

　　
These default values are actually functions that convert other objects to the specified type (e.g., int("2"), list("2")). When they are called with no parameter (say, int() or list()) they return 0 and [] respectively.
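This behaviour is easy to check directly. The sketch below is Python 3 and imports defaultdict from the standard collections module instead of going through nltk, which is an assumption made here so that the example runs without NLTK installed.

```python
from collections import defaultdict

zero = int()         # called with no argument, int returns 0
empty = list()       # called with no argument, list returns []
two = int("2")       # called with an argument, int converts it
chars = list("2")    # list splits the string into its characters

frequency = defaultdict(int)
frequency['colorless'] = 4
missing = frequency['ideas']   # key absent, so int() supplies the value

print(zero, empty, two, chars, missing)
```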
　　

　　
The preceding examples specified the default value of a dictionary entry to be the default value of a particular data type. However, we can specify any default value we like, simply by providing the name of a function that can be called with no arguments to create the required value. Let’s return to our part-of-speech example, and create a dictionary whose default value for any entry is 'N' ①. When we access a non-existent entry ②, it is automatically added to the dictionary ③.
　　
  >>> pos = nltk.defaultdict(lambda: 'N') ①
  >>> pos['colorless'] = 'ADJ'
  >>> pos['blog'] ②
  'N'
  >>> pos.items()
  [('blog', 'N'), ('colorless', 'ADJ')] ③
　　

　　
　　
This example used a lambda expression, introduced in Section 4.4. This lambda expression specifies no parameters, so we call it using parentheses with no arguments. Thus, the following definitions of f and g are equivalent:
　　
  >>> f = lambda: 'N'
　　
  >>> f()
　　
  'N'
　　
  >>> def g():
　　
  ...     return 'N'
　　
  >>> g()
　　
  'N'
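The point that a lookup on a default dictionary adds the missing entry is worth checking carefully, because not every operation does so. The following Python 3 sketch uses collections.defaultdict (assumed here in place of nltk.defaultdict) and the made-up word 'vlog' to show which operations store a new entry.

```python
from collections import defaultdict

pos = defaultdict(lambda: 'N')
pos['colorless'] = 'ADJ'

before = 'blog' in pos        # membership test does not create an entry
tag = pos['blog']             # square-bracket lookup calls the lambda and stores 'N'
after = 'blog' in pos

fallback = pos.get('vlog')    # get() does not call the lambda and stores nothing

print(before, tag, after, fallback, len(pos))
```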
　　

　　
Let’s see how default dictionaries could be used in a more substantial language processing task. Many language processing tasks, including tagging, struggle to correctly process the hapaxes (short for hapax legomena: words that occur only once) of a text. They can perform better with a fixed vocabulary and a guarantee that no new words will appear. We can preprocess a text to replace low-frequency words with a special “out of vocabulary” token, UNK, with the help of a default dictionary. (Can you work out how to do this without reading on? My answer: make UNK the default value, use FreqDist to take the n most frequent words and map each of them to itself in the dictionary, then run a list comprehension over the words of the text, looking each one up in the dictionary; any word that is not there comes back as UNK.)
　　
We need to create a default dictionary that maps each word to its replacement. The most frequent n words will be mapped to themselves. Everything else will be mapped to UNK.
　　

　　
  >>> alice = nltk.corpus.gutenberg.words('carroll-alice.txt')
  >>> vocab = nltk.FreqDist(alice)
  >>> v1000 = list(vocab)[:1000]
  >>> mapping = nltk.defaultdict(lambda: 'UNK')
  >>> for v in v1000:
  ...     mapping[v] = v
  ...
  >>> alice2 = [mapping[v] for v in alice]
  >>> alice2[:100]
  ['UNK', 'Alice', "'", 's', 'Adventures', 'in', 'Wonderland', 'by', 'UNK', 'UNK',
  'UNK', 'UNK', 'CHAPTER', 'I', '.', 'UNK', 'the', 'Rabbit', '-', 'UNK', 'Alice',
  'was', 'beginning', 'to', 'get', 'very', 'tired', 'of', 'sitting', 'by', 'her',
  'sister', 'on', 'the', 'bank', ',', 'and', 'of', 'having', 'nothing', 'to', 'do',
  ':', 'once', 'or', 'twice', 'she', 'had', 'UNK', 'into', 'the', 'book', 'her',
  'sister', 'was', 'UNK', ',', 'but', 'it', 'had', 'no', 'pictures', 'or', 'UNK',
  'in', 'it', ',', "'", 'and', 'what', 'is', 'the', 'use', 'of', 'a', 'book', ",'",
  'thought', 'Alice', "'", 'without', 'pictures', 'or', 'conversation', "?'", ...]
  >>> len(set(alice2))
  1001  # 1000 words plus the single UNK token (the low-frequency words)
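The same idea can be tried without downloading any corpus. The sketch below is Python 3 and makes several assumptions that are not in the original: it uses collections.Counter in place of nltk.FreqDist, picks the top words with most_common() rather than by slicing list(vocab) (which depends on the frequency distribution iterating in frequency order), and runs on a one-sentence toy text with n set to 3.

```python
from collections import Counter, defaultdict

# Toy text invented for this sketch; the transcript above uses all of Alice.
text = "the cat sat on the mat and the dog sat on the log".split()

vocab = Counter(text)
top_n = [word for word, count in vocab.most_common(3)]   # n = 3 for the toy text

mapping = defaultdict(lambda: 'UNK')
for word in top_n:
    mapping[word] = word          # frequent words map to themselves

text2 = [mapping[word] for word in text]   # everything else falls back to 'UNK'
print(text2)
print(len(set(text2)))
```

As in the original, the size of the resulting vocabulary is n plus one for the UNK token.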
　　

　　
Incrementally Updating a Dictionary
　　
We can employ dictionaries to count occurrences, emulating the method for tallying words shown in Figure 1-3. We begin by initializing an empty defaultdict, then process each part-of-speech tag in the text. If the tag hasn’t been seen before, it will have a zero count by default. Each time we encounter a tag, we increment its count using the += operator (see Example 5-3).
　　

　　
Example 5-3. Incrementally updating a dictionary, and sorting by value.
　　
  >>> counts = nltk.defaultdict(int)
  >>> from nltk.corpus import brown
  >>> for (word, tag) in brown.tagged_words(categories='news'):
  ...     counts[tag] += 1
  ...
  >>> counts['N']
  22226
  >>> list(counts)
  ['FW', 'DET', 'WH', "''", 'VBZ', 'VB+PPO', "'", ')', 'ADJ', 'PRO', '*', '-', ...]
  >>> from operator import itemgetter
  >>> sorted(counts.items(), key=itemgetter(1), reverse=True)
  [('N', 22226), ('P', 10845), ('DET', 10648), ('NP', 8336), ('V', 7313), ...]
  >>> [t for t, c in sorted(counts.items(), key=itemgetter(1), reverse=True)]
  ['N', 'P', 'DET', 'NP', 'V', 'ADJ', ',', '.', 'CNJ', 'PRO', 'ADV', 'VD', ...]
　　
The listing in Example 5-3 illustrates an important idiom for sorting a dictionary by its values, to show words in decreasing order of frequency. The first parameter of sorted() is the items to sort, which is a list of tuples consisting of a POS tag and a frequency. The second parameter specifies the sort key using a function itemgetter(). In general, itemgetter(n) returns a function that can be called on some other sequence object to obtain the nth element:
　　
  >>> pair = ('NP', 8336)
　　
  >>> pair[1]
　　
  8336
　　
  >>> itemgetter(1)(pair)
　　
  8336
　　

　　
The last parameter of sorted() specifies that the items should be returned in reverse order, i.e., decreasing values of frequency.
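The sort-by-value idiom can be tried on a small hand-made dictionary. This is a Python 3 sketch; the tag counts are invented and are not the Brown corpus figures shown above.

```python
from operator import itemgetter

counts = {'N': 5, 'V': 3, 'DET': 4, 'ADJ': 1}   # invented tag counts

# Sort the (tag, count) pairs by element 1 of each pair, largest first.
by_count = sorted(counts.items(), key=itemgetter(1), reverse=True)

# Keep only the tags, now in decreasing order of frequency.
tags_only = [tag for tag, count in by_count]

print(by_count)
print(tags_only)
```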
　　
There’s a second useful programming idiom at the beginning of Example 5-3, where we initialize a defaultdict and then use a for loop to update its values. Here’s a schematic version:
　　
  >>> my_dictionary = nltk.defaultdict(function to create default value)
  >>> for item in sequence:
  ...     my_dictionary[item_key] is updated with information about item
　　
Here’s another instance of this pattern, where we index words according to their last two letters:
　　
  >>> last_letters = nltk.defaultdict(list)  # values default to an empty list
  >>> words = nltk.corpus.words.words('en')
  >>> for word in words:
  ...     key = word[-2:]
  ...     last_letters[key].append(word)  # the last two letters are the key; the word is appended to that key's list
  ...
  >>> last_letters['ly']
  ['abactinally', 'abandonedly', 'abasedly', 'abashedly', 'abashlessly', 'abbreviately',
  'abdominally', 'abhorrently', 'abidingly', 'abiogenetically', 'abiologically', ...]
  >>> last_letters['zy']
  ['blazy', 'bleezy', 'blowzy', 'boozy', 'breezy', 'bronzy', 'buzzy', 'Chazy', ...]
　　

　　
The following example uses the same pattern to create an anagram dictionary. (You might experiment with the third line to get an idea of why this program works.)
　　
  >>> anagrams = nltk.defaultdict(list)
  >>> for word in words:
  ...     key = ''.join(sorted(word))  # sort the letters of the word alphabetically
  ...     anagrams[key].append(word)
  ...
  >>> anagrams['aeilnrt']
  ['entrail', 'latrine', 'ratline', 'reliant', 'retinal', 'trenail']
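To see why the key works, it helps to run the pattern on a handful of words. The following Python 3 sketch uses collections.defaultdict and a short word list chosen for this example instead of the NLTK words corpus.

```python
from collections import defaultdict

words = ['entrail', 'latrine', 'reliant', 'retinal', 'listen', 'silent', 'cat']

anagrams = defaultdict(list)
for word in words:
    # Words that are anagrams of each other contain the same letters,
    # so sorting the letters gives them the same key.
    key = ''.join(sorted(word))
    anagrams[key].append(word)

print(''.join(sorted('listen')))
print(anagrams['aeilnrt'])
```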
　　
Since accumulating words like this is such a common task, NLTK provides a more convenient way of creating a defaultdict(list), in the form of nltk.Index():
　　
  >>> anagrams = nltk.Index((''.join(sorted(w)), w) for w in words)  # each item is an (x, y) pair
  >>> anagrams['aeilnrt']
  ['entrail', 'latrine', 'ratline', 'reliant', 'retinal', 'trenail']
　　

　　
nltk.Index is a defaultdict(list) with extra support for initialization. Similarly, nltk.FreqDist is essentially a defaultdict(int) with extra support for initialization (along with sorting and plotting methods).
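The description of nltk.Index suggests how such a convenience could be written by hand. The helper below, build_index, is a hypothetical name invented for this sketch; it is not NLTK's implementation, only a plain Python 3 illustration of a defaultdict(list) initialized from (key, value) pairs.

```python
from collections import defaultdict

def build_index(pairs):
    """Hypothetical helper: group the values of (key, value) pairs by key."""
    index = defaultdict(list)
    for key, value in pairs:
        index[key].append(value)
    return index

words = ['listen', 'silent', 'cat', 'act']   # small invented word list
anagrams = build_index((''.join(sorted(w)), w) for w in words)

print(anagrams['eilnst'])
print(anagrams['act'])
```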
　　

　　
Complex Keys and Values
　　
We can use default dictionaries with complex keys and values. Let’s study the range of possible tags for a word, given the word itself and the tag of the previous word. We will see how this information can be used by a POS tagger.
　　
  >>> pos = nltk.defaultdict(lambda: nltk.defaultdict(int))
  >>> brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True)
  >>> for ((w1, t1), (w2, t2)) in nltk.ibigrams(brown_news_tagged): ①
  ...     pos[(t1, w2)][t2] += 1 ②
  ...
  >>> pos[('DET', 'right')] ③
  defaultdict(<type 'int'>, {'ADV': 3, 'ADJ': 9, 'N': 3})
　　

　　
This example uses a dictionary whose default value for an entry is a dictionary (whose default value is int(), i.e., zero). Notice how we iterated over the bigrams of the tagged corpus, processing a pair of word-tag pairs for each iteration ①. Each time through the loop we updated our pos dictionary’s entry for (t1, w2), a tag and its following word ②. When we look up an item in pos we must specify a compound key ③, and we get back a dictionary object. A POS tagger could use such information to decide that the word right, when preceded by a determiner, should be tagged as ADJ. (This is a little hard to follow: pos is a nesting of two dictionaries. The pair (t1, w2) is a key of pos and its value is an inner dictionary, in which t2 is the key and the count n is the value, i.e., {(t1, w2): {t2: n}}.)
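The nested structure is easier to follow on a few tokens. The sketch below is Python 3 and rests on assumptions not found in the original: the tagged text is invented, collections.defaultdict stands in for nltk.defaultdict, and zip(tagged, tagged[1:]) is used to form the bigrams in place of nltk.ibigrams.

```python
from collections import defaultdict

# Invented tagged text; the tags imitate the simplified tagset used above.
tagged = [('the', 'DET'), ('right', 'ADJ'), ('answer', 'N'),
          ('the', 'DET'), ('right', 'N'), ('to', 'P'),
          ('the', 'DET'), ('right', 'ADJ'), ('side', 'N')]

pos = defaultdict(lambda: defaultdict(int))

# zip(tagged, tagged[1:]) pairs each item with its successor (the bigrams).
for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
    pos[(t1, w2)][t2] += 1     # outer key: (previous tag, word); inner key: tag

right_after_det = dict(pos[('DET', 'right')])
best_tag = max(right_after_det, key=right_after_det.get)   # most frequent tag

print(right_after_det, best_tag)
```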
　　

　　
Inverting a Dictionary
　　
Dictionaries support efficient lookup, so long as you want to get the value for any key. If d is a dictionary and k is a key, we type d[k] and immediately obtain the value. Finding a key given a value is slower and more cumbersome:
　　
  >>> counts = nltk.defaultdict(int)
  >>> for word in nltk.corpus.gutenberg.words('milton-paradise.txt'):
  ...     counts[word] += 1
  ...
  >>> [key for (key, value) in counts.items() if value == 32]
  ['brought', 'Him', 'virtue', 'Against', 'There', 'thine', 'King', 'mortal', 'every', 'been']
　　
If we expect to do this kind of “reverse lookup” often, it helps to construct a dictionary that maps values to keys. In the case that no two keys have the same value, this is an easy thing to do. We just get all the key-value pairs in the dictionary, and create a new dictionary of value-key pairs. The next example also illustrates another way of initializing a dictionary pos with key-value pairs.
　　
  >>> pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'}
  >>> pos2 = dict((value, key) for (key, value) in pos.items())  # just swap each pair
  >>> pos2['N']
  'ideas'
　　
Let’s first make our part-of-speech dictionary a bit more realistic and add some more words to pos using the dictionary update() method, to create the situation where multiple keys have the same value. Then the technique just shown for reverse lookup will no longer work (why not? Because a key can hold only one value, so each later value overwrites the earlier one). Instead, we have to use append() to accumulate the words for each part-of-speech, as follows:
　　
  >>> pos.update({'cats': 'N', 'scratch': 'V', 'peacefully': 'ADV', 'old': 'ADJ'})
  >>> pos2 = nltk.defaultdict(list)  # each value is a list
  >>> for key, value in pos.items():
  ...     pos2[value].append(key)
  ...
  >>> pos2['ADV']
  ['peacefully', 'furiously']
　　
Now we have inverted the pos dictionary, and can look up any part-of-speech and find all words having that part-of-speech. We can do the same thing even more simply using NLTK’s support for indexing, as follows (once again, already written for us...):
　　
  >>> pos2 = nltk.Index((value, key) for (key, value) in pos.items())
　　
  >>> pos2['ADV']
　　
  ['peacefully', 'furiously']
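The difference between the two inversion techniques can be checked side by side. This is a Python 3 sketch using collections.defaultdict instead of nltk.defaultdict; the order of the words inside each list depends on the dictionary's iteration order, so the example only relies on the sorted contents.

```python
from collections import defaultdict

pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV',
       'cats': 'N', 'scratch': 'V', 'peacefully': 'ADV', 'old': 'ADJ'}

# Naive inversion: keys that share a value overwrite one another.
naive = {value: key for key, value in pos.items()}

# Accumulating inversion: every key is kept in a list under its value.
pos2 = defaultdict(list)
for key, value in pos.items():
    pos2[value].append(key)

print(len(pos), len(naive))       # eight words, but only four tags survive
print(sorted(pos2['ADV']))
```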
　　

　　
A summary of Python’s dictionary methods is given in Table 5-5.
　　

　　
Table 5-5. Python’s dictionary methods: A summary of commonly used methods and idioms involving dictionaries
　　

　　
Example                        Description
d = {}                         Create an empty dictionary and assign it to d
d[key] = value                 Assign a value to a given dictionary key
d.keys()                       The list of keys of the dictionary
list(d)                        The list of keys of the dictionary
sorted(d)                      The keys of the dictionary, sorted
key in d                       Test whether a particular key is in the dictionary
for key in d                   Iterate over the keys of the dictionary
d.values()                     The list of values in the dictionary
dict([(k1,v1), (k2,v2), ...])  Create a dictionary from a list of key-value pairs
d1.update(d2)                  Add all items from d2 to d1
defaultdict(int)               A dictionary whose default value is zero
　　
