设为首页 收藏本站
查看: 1109|回复: 0

[经验分享] Python自然语言处理学习笔记(40):5.1 使用词性标注器

[复制链接]

尚未签到

发表于 2015-4-28 09:01:30 | 显示全部楼层 |阅读模式
  CHAPTER 5
  
  Categorizing and Tagging Words
  分类和标注单词
  Back in elementary school you learned the difference between nouns, verbs, adjectives, and adverbs. These “word classes” are not just the idle invention of grammarians(文法家), but are useful categories for many language processing tasks. As we will see, they arise from(由...引起) simple analysis of the distribution of words in text. The goal of this chapter is to answer the following questions:
  
    What are lexical categories, and how are they used in natural language processing?
    什么是词汇分类,并且如何将它们用在自然语言处理中?
     
    What is a good Python data structure for storing words and their categories?
    一个良好的用于存储单词和分类的数据结构是什么?
     
    How can we automatically tag each word of a text with its word class?
   在文本中,我们如何根据词类来自动添加单词标记?
  
  Along the way, we’ll cover some fundamental techniques in NLP, including sequence labeling, n-gram models, backoff, and evaluation. These techniques are useful in many areas, and tagging gives us a simple context in which to present them. We will also see how tagging is the second step in the typical NLP pipeline, following tokenization.
  The process of classifying words into their parts-of-speech and labeling them accordingly is known as part-of-speech tagging, POS tagging(词性标注), or simply tagging. Parts-of-speech are also known as word classes or lexical categories. The collection of tags used for a particular task is known as a tagset(码集). Our emphasis in this chapter is on exploiting tags, and tagging text automatically.
  我们本章的重点在开发标记,并且自动地标记文本。
  
  5.1 Using a Tagger   使用词性标注器
  A part-of-speech tagger, or POS tagger, processes a sequence of words, and attaches a part of speech tag to each word (don’t forget to import nltk):
  


text = nltk.word_tokenize("And now for something completely different")
nltk.pos_tag(text)
  [('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
  ('completely', 'RB'), ('different', 'JJ')]
  
  Here we see that and is CC, a coordinating conjunction(并列连词); now and completely are RB, or adverbs(副词); for is IN, a preposition(介词); something is NN, a noun(名词); and different is JJ, an adjective(形容词).
  
  NLTK provides documentation for each tag, which can be queried using the tag, e.g., nltk.help.upenn_tagset('RB') or a regular expression, e.g., nltk.help.upenn_brown_tagset('NN.*'). Some corpora have README files with tagset documentation; see nltk.name.readme(), substituting in the name of the corpus.
  
  Let’s look at another example, this time including some homonyms:
  


  >>>> text = nltk.word_tokenize("And now for something completely different")
  >>>> nltk.pos_tag(text)
  [('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
  ('completely', 'RB'), ('different', 'JJ')]

     
  Notice that refuse and permit both appear as a present tense verb (VBP现在时态动词) and a noun (NN). E.g., refUSE is a verb meaning “deny,” while REFuse is a noun meaning “trash” (i.e., they are not homophones它们不是同音字). Thus, we need to know which word is being used in order to pronounce the text correctly. (For this reason, text-to-speech(文本转语音)systems usually perform POS tagging.)
  
   DSC0000.jpg Your Turn: Many words, like ski and race, can be used as nouns or verbs with no difference in pronunciation. Can you think of others? Hint: think of a commonplace(常见的) object and try to put the word to before it to see if it can also be a verb, or think of an action and try to put the before it to see if it can also be a noun. Now make up a sentence with both uses of this word, and run the POS tagger on this sentence.
  我来造句:
  


'There is a lot of food to eat. '
[('There', 'EX'), ('is', 'VBZ'), ('a', 'DT'), ('lot', 'NN'), ('of', 'IN'), ('food', 'NN'), ('to', 'TO'), ('eat', 'VB'), ('.', '.')]

'The ate food is so delicious! '
[('The', 'DT'), ('ate', 'NN'), ('food', 'NN'), ('is', 'VBZ'), ('so', 'RB'), ('delicious', 'JJ'), ('!', '.')]
  
  
  Lexical categories like “noun” and part-of-speech tags like NN seem to have their uses, but the details will be obscure(晦涩的) to many readers. You might wonder what justification there is for introducing this extra level of information. Many of these categories arise from superficial analysis of the distribution of words in text. Consider the following analysis involving woman (a noun), bought (a verb), over (a preposition), and the (a determiner 限定词). The text.similar() method takes a word w, finds all contexts w1w w2, then finds all words w' that appear in the same context, i.e. w1w'w2.
  


  >>> text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())
  >>> text.similar('woman')
  Building word-context index...
  man time day year car moment world family house country child boy
  state job way war girl place room word
  >>> text.similar('bought')
  made said put done seen had found left given heard brought got been
  was set told took in felt that
  >>> text.similar('over')
  in on to of and for with from at by that into as up out down through
  is all about
  >>> text.similar('the')
  a his this their its her an that our any all one these my in your no
  some other and
  
  
  Observe that searching for woman finds nouns; searching for bought mostly finds verbs; searching for over generally finds prepositions; searching for the finds several determiners. A tagger can correctly identify the tags on these words in the context of a sentence, e.g., The woman bought over $150,000 worth of clothes.
  A tagger can also model our knowledge of unknown words(可以对我们未知的单词建模); for example, we can guess that scrobbling is probably a verb, with the root scrobble, and likely to occur in contexts like he was scrobbling.

运维网声明 1、欢迎大家加入本站运维交流群:群②:261659950 群⑤:202807635 群⑦870801961 群⑧679858003
2、本站所有主题由该帖子作者发表,该帖子作者与运维网享有帖子相关版权
3、所有作品的著作权均归原作者享有,请您和我们一样尊重他人的著作权等合法权益。如果您对作品感到满意,请购买正版
4、禁止制作、复制、发布和传播具有反动、淫秽、色情、暴力、凶杀等内容的信息,一经发现立即删除。若您因此触犯法律,一切后果自负,我们对此不承担任何责任
5、所有资源均系网友上传或者通过网络收集,我们仅提供一个展示、介绍、观摩学习的平台,我们不对其内容的准确性、可靠性、正当性、安全性、合法性等负责,亦不承担任何法律责任
6、所有作品仅供您个人学习、研究或欣赏,不得用于商业或者其他用途,否则,一切后果均由您自己承担,我们对此不承担任何法律责任
7、如涉及侵犯版权等问题,请您及时通知我们,我们将立即采取措施予以解决
8、联系人Email:admin@iyunv.com 网址:www.yunweiku.com

所有资源均系网友上传或者通过网络收集,我们仅提供一个展示、介绍、观摩学习的平台,我们不对其承担任何法律责任,如涉及侵犯版权等问题,请您及时通知我们,我们将立即处理,联系人Email:kefu@iyunv.com,QQ:1061981298 本贴地址:https://www.yunweiku.com/thread-61419-1-1.html 上篇帖子: 深入Python(3): and、or以及and-or 下篇帖子: python 关于命令参数
您需要登录后才可以回帖 登录 | 立即注册

本版积分规则

扫码加入运维网微信交流群X

扫码加入运维网微信交流群

扫描二维码加入运维网微信交流群,最新一手资源尽在官方微信交流群!快快加入我们吧...

扫描微信二维码查看详情

客服E-mail:kefu@iyunv.com 客服QQ:1061981298


QQ群⑦:运维网交流群⑦ QQ群⑧:运维网交流群⑧ k8s群:运维网kubernetes交流群


提醒:禁止发布任何违反国家法律、法规的言论与图片等内容;本站内容均来自个人观点与网络等信息,非本站认同之观点.


本站大部分资源是网友从网上搜集分享而来,其版权均归原作者及其网站所有,我们尊重他人的合法权益,如有内容侵犯您的合法权益,请及时与我们联系进行核实删除!



合作伙伴: 青云cloud

快速回复 返回顶部 返回列表