
[Experience Sharing] Natural Language Processing with Python study notes (17): 3.1 Accessing Text from the Web and from Disk


Posted on 2015-4-24 09:16:58
CHAPTER 3
Processing Raw Text

The most important source of texts is undoubtedly the Web. It’s convenient to have existing text collections to explore, such as the corpora we saw in the previous chapters. However, you probably have your own text sources in mind, and need to learn how to access them.
The goal of this chapter is to answer the following questions:

1. How can we write programs to access text from local files and from the Web, in order to get hold of an unlimited range of language material?
2. How can we split documents up into individual words and punctuation symbols, so we can carry out the same kinds of analysis we did with text corpora in earlier chapters?
3. How can we write programs to produce formatted output and save it in a file?

In order to address these questions, we will be covering key concepts in NLP, including tokenization and stemming. Along the way you will consolidate your Python knowledge and learn about strings, files, and regular expressions. Since so much text on the Web is in HTML format, we will also see how to dispense with markup.

Important: From this chapter onwards, our program samples will assume you begin your interactive session or your program with the following import statements:



  >>> from __future__ import division
  >>> import nltk, re, pprint
3.1 Accessing Text from the Web and from Disk
Electronic Books
A small sample of texts from Project Gutenberg appears in the NLTK corpus collection. However, you may be interested in analyzing other texts from Project Gutenberg. You can browse the catalog of 25,000 free online books at http://www.gutenberg.org/catalog/ , and obtain a URL to an ASCII text file. Although 90% of the texts in Project Gutenberg are in English, it includes material in over 50 other languages, including Catalan, Chinese, Dutch, Finnish, French, German, Italian, Portuguese, and Spanish (with more than 100 texts each).
Text number 2554 is an English translation of Crime and Punishment, and we can access it as follows.



  >>> from urllib import urlopen
  >>> url = "http://www.gutenberg.org/files/2554/2554.txt"
  >>> raw = urlopen(url).read()
  >>> type(raw)
  <type 'str'>
  >>> len(raw)
  1176831
  >>> raw[:75]
  'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n'
The read() process will take a few seconds as it downloads this large book. If you’re using an Internet proxy that is not correctly detected by Python, you may need to specify the proxy manually as follows:


  >>> proxies = {'http': 'http://www.someproxy.com:3128'}
  >>> raw = urlopen(url, proxies=proxies).read()
The variable raw contains a string with 1,176,831 characters. (We can see that it is a string, using type(raw).) This is the raw content of the book, including many details we are not interested in, such as whitespace, line breaks, and blank lines. Notice the \r and \n in the opening line of the file, which is how Python displays the special carriage return and line-feed characters (the file must have been created on a Windows machine; see Note 1 below). For our language processing, we want to break up the string into words and punctuation, as we saw in Chapter 1. This step is called tokenization, and it produces our familiar structure, a list of words and punctuation. (Tokenizing English really is easy...)
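For readers on Python 3, where urlopen lives in urllib.request and read() returns bytes rather than a string, the download step might look like the sketch below. The function name fetch_text, the default UTF-8 encoding, and the optional proxies argument are assumptions made for illustration; they are not part of the book's session.

```python
from urllib import request


def fetch_text(url, encoding="utf-8", proxies=None):
    """Download url and return its body decoded to a str (Python 3 sketch)."""
    if proxies:
        # Explicit proxy mapping, e.g. {'http': 'http://www.someproxy.com:3128'}
        opener = request.build_opener(request.ProxyHandler(proxies))
    else:
        # Default opener: proxy settings are auto-detected from the environment
        opener = request.build_opener()
    with opener.open(url) as response:
        # read() returns bytes in Python 3, so decode before text processing
        return response.read().decode(encoding)
```

Calling fetch_text(url) with the Gutenberg URL above should then give a str that can be sliced and tokenized in the same way as raw, assuming the file is still served at that address and is UTF-8 compatible.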

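Note 1: About carriage return and line feed
Adapted from: http://blog.iyunv.com/zhezhelin/article/details/2703382
In the early days of computing there was a device called a teleprinter (the Teletype Model 33), which could print 10 characters per second. It had one problem: moving to a new line after finishing a line took 0.2 seconds, exactly the time needed to print two characters. If a new character arrived during those 0.2 seconds, it was lost.
So the designers came up with a fix: add two end-of-line characters after every line. One, called “carriage return”, told the machine to move the print head back to the left margin; the other, called “line feed”, told it to advance the paper by one line.
That is the origin of “line feed” and “carriage return”, as their English names suggest. The two concepts were later carried over to computers. Storage was expensive at the time, and some engineers felt that two characters at the end of every line was wasteful and that one would do. The conventions therefore diverged: on Unix, a line ends with a line feed only, “\n”; on Windows, it ends with a carriage return plus a line feed, “\r\n”; on classic Mac OS, it ends with a carriage return only, “\r” (modern macOS follows the Unix convention). One direct consequence is that a Unix or Mac file opened on Windows may appear as a single long line, while a Windows file opened on Unix or Mac may show an extra ^M at the end of each line.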
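The three line-ending conventions can be folded into one before further processing. The following is a minimal sketch in Python 3; the helper name normalize_newlines is hypothetical, not something from the book or from NLTK.

```python
def normalize_newlines(text):
    """Convert Windows (CR LF) and classic Mac (CR) line endings to LF."""
    # Replace "\r\n" first, so that a Windows line ending does not
    # turn into two newlines when the lone "\r" pass runs.
    return text.replace("\r\n", "\n").replace("\r", "\n")


sample = "The Project Gutenberg EBook\r\nof Crime and Punishment\r\n"
print(repr(normalize_newlines(sample)))
```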



  >>> tokens = nltk.word_tokenize(raw)
  >>> type(tokens)
  <type 'list'>
  >>> len(tokens)
  255809  (my run gave 241137)
  >>> tokens[:10]
  ['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment', ',', 'by']
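Token counts depend on the tokenizer and on the exact file downloaded, which is one plausible reason the two counts above differ. To make the idea of tokenization concrete, here is a rough Python 3 sketch using only the re module. It is not NLTK's algorithm: the name simple_tokenize and the pattern are assumptions for illustration, and it will split contractions and abbreviations differently from nltk.word_tokenize.

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+       one or more letters, digits, or underscores (a "word")
    # [^\w\s]   any single character that is neither word nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)


print(simple_tokenize("he lodged in S. Place, towards K. bridge."))
```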
Notice that NLTK was needed for tokenization, but not for any of the earlier tasks of opening a URL and reading it into a string. If we now take the further step of creating an NLTK text from this list, we can carry out all of the other linguistic processing we saw in Chapter 1, along with the regular list operations, such as slicing:



  >>> text = nltk.Text(tokens)
  >>> type(text)
  <type 'nltk.text.Text'>
  >>> text[1020:1060]
  ['CHAPTER', 'I', 'On', 'an', 'exceptionally', 'hot', 'evening', 'early', 'in',
  'July', 'a', 'young', 'man', 'came', 'out', 'of', 'the', 'garret', 'in',
  'which', 'he', 'lodged', 'in', 'S', '.', 'Place', 'and', 'walked', 'slowly',
  ',', 'as', 'though', 'in', 'hesitation', ',', 'towards', 'K', '.', 'bridge', '.']
  >>> text.collocations()
  Katerina Ivanovna; Pulcheria Alexandrovna; Avdotya Romanovna; Pyotr
  Petrovitch; Project Gutenberg; Marfa Petrovna; Rodion Romanovitch;
  Sofya Semyonovna; Nikodim Fomitch; did not; Hay Market; Andrey
  Semyonovitch; old woman; Literary Archive; Dmitri Prokofitch; great
  deal; United States; Praskovya Pavlovna; Porfiry Petrovitch; ear rings

Notice that Project Gutenberg appears as a collocation. This is because each text downloaded from Project Gutenberg contains a header with the name of the text, the author, the names of people who scanned and corrected the text, a license, and so on. Sometimes this information appears in a footer at the end of the file. We cannot reliably detect where the content begins and ends, and so have to resort to manual inspection of the file, to discover unique strings that mark the beginning and the end, before trimming raw to be just the content and nothing else:



  >>> raw.find("PART I")
  5303
  >>> raw.rfind("End of Project Gutenberg's Crime")
  1157681
  >>> raw = raw[5303:1157681] ①
  >>> raw.find("PART I")
  0
The find() and rfind() (“reverse find”) methods help us get the right index values to use for slicing the string ①. We overwrite raw with this slice, so now it begins with “PART I” and goes up to (but not including) the phrase that marks the end of the content.
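The same trimming step can be wrapped in a small function so that the index values do not have to be copied by hand. This is a sketch in Python 3; trim_between is a hypothetical helper name, and the marker strings still have to be found by inspecting the file manually.

```python
def trim_between(raw, start_marker, end_marker):
    """Return the part of raw from the first start_marker up to
    (but not including) the last end_marker."""
    start = raw.find(start_marker)   # first occurrence, or -1 if absent
    end = raw.rfind(end_marker)      # last occurrence, or -1 if absent
    if start == -1 or end == -1:
        raise ValueError("marker not found")
    return raw[start:end]


sample = "HEADER\nPART I\nbody text\nEnd of Project Gutenberg's Crime\nlicense"
print(repr(trim_between(sample, "PART I", "End of Project Gutenberg's Crime")))
```

Raising an error when a marker is missing avoids silently slicing with -1, which would otherwise drop the last character or return the wrong span.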

This was our first brush with the reality of the Web: texts found on the Web may contain unwanted material, and there may not be an automatic way to remove it. But with a small amount of extra work we can extract the material we need.

Dealing with HTML

Much of the text on the Web is in the form of HTML documents. You can use a web browser to save a page as text to a local file, then access this as described in the later section on files. However, if you’re going to do this often, it’s easiest to get Python to do the work directly. The first step is the same as before, using urlopen. For fun we’ll pick a BBC News story called “Blondes to die out in 200 years,” an urban legend passed along by the BBC as established scientific fact:



  >>> url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
  >>> html = urlopen(url).read()
  >>> html[:60]
  [...]

  >>> import feedparser
  >>> llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
  >>> llog['feed']['title']
  u'Language Log'
  >>> len(llog.entries)
  15
  >>> post = llog.entries[2]
  >>> post.title
  u"He's My BF"
  >>> content = post.content[0].value
  >>> content[:70]
  u'Today I was chatting with three of our visiting graduate students f'
  >>> # The book prints nltk.html_clean(content) here, which looks like a typo for clean_html.
  >>> nltk.word_tokenize(nltk.clean_html(llog.entries[2].content[0].value))
  [u'Today', u'I', u'was', u'chatting', u'with', u'three', u'of', u'our', u'visiting',
  u'graduate', u'students', u'from', u'the', u'PRC', u'.', u'Thinking', u'that', u'I',
  u'was', u'being', u'au', u'courant', u',', u'I', u'mentioned', u'the', u'expression',
  u'DUI4XIANG4', u'\u5c0d\u8c61', u'("', u'boy', u'/', u'girl', u'friend', u'"', ...]
   
Note that the resulting strings have a u prefix to indicate that they are Unicode strings (see Section 3.3). With some further work, we can write programs to create a small corpus of blog posts, and use this as the basis for our NLP work.
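The session above relies on nltk.clean_html from the NLTK version the book was written for; as far as I know, NLTK 3 no longer implements it and points users to BeautifulSoup instead. As a rough stand-in that needs only the Python 3 standard library, tag stripping can be sketched as below. The names TextExtractor and strip_tags are hypothetical, and the sketch keeps all text content, including anything inside script or style elements.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text between tags and discard the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text that is not part of a tag
        self.parts.append(data)


def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()   # flush any text still buffered by the parser
    return "".join(parser.parts)


print(strip_tags("<p>Today <b>I</b> was chatting</p>"))
```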

Reading Local Files

In order to read a local file, we need to use Python’s built-in open() function, followed by the read() method. Supposing you have a file document.txt, you can load its contents like this:



  >>> f = open('document.txt')
  >>> raw = f.read()

Your Turn: Create a file called document.txt using a text editor, type in a few lines of text, and save it as plain text. If you are using IDLE, select the New Window command in the File menu, type the required text into this window, and then save the file as document.txt inside the directory that IDLE offers in the pop-up dialogue box. Next, in the Python interpreter, open the file using f = open('document.txt'), then inspect its contents using print f.read().

Various things might have gone wrong when you tried this. If the interpreter couldn’t find your file, you would have seen an error like this:


  >>> f = open('document.txt')
  Traceback (most recent call last):
  File "", line 1, in -toplevel-
  f = open('document.txt')
  IOError: [Errno 2] No such file or directory: 'document.txt'
To check that the file that you are trying to open is really in the right directory, use IDLE’s Open command in the File menu; this will display a list of all the files in the directory where IDLE is running. An alternative is to examine the current directory from within Python:


  >>> import os
  >>> os.listdir('.')
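Rather than letting the traceback end the session, a program can catch the error and report where Python was looking. This is a small Python 3 sketch; read_text_or_none is a hypothetical helper name, and returning None on failure is just one possible design.

```python
import os


def read_text_or_none(path):
    """Return the file's contents, or None (after printing a hint)
    if the file cannot be opened."""
    try:
        with open(path) as f:
            return f.read()
    except IOError as err:   # in Python 3, IOError is an alias of OSError
        print("Could not open %r: %s" % (path, err.strerror))
        print("Current directory:", os.getcwd())
        return None
```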
Another possible problem you might have encountered when accessing a text file is the newline conventions, which are different for different operating systems. The built-in open() function has a second parameter for controlling how the file is opened: open('document.txt', 'rU'). 'r' means to open the file for reading (the default), and 'U' stands for “Universal”, which lets us ignore the different conventions used for marking newlines.
Assuming that you can open the file, there are several methods for reading it. The read() method creates a string with the contents of the entire file:


  >>> f.read()
  'Time flies like an arrow.\nFruit flies like a banana.\n'
Recall that the '\n' characters are newlines; this is equivalent to pressing Enter on a keyboard and starting a new line.
We can also read a file one line at a time using a for loop:


  >>> f = open('document.txt', 'rU')
  >>> for line in f:
  ...     print line.strip()
  Time flies like an arrow.
  Fruit flies like a banana.
Here we use the strip() method to remove the newline character at the end of the input line.
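In Python 3 the same loop needs no 'U' flag: text mode already translates "\r\n" and "\r" to "\n" by default, and the 'U' mode was removed in Python 3.11. A minimal sketch follows, with the helper name read_lines and the UTF-8 default being assumptions for illustration.

```python
def read_lines(path, encoding="utf-8"):
    """Return the lines of a text file with surrounding whitespace removed."""
    # The with-statement closes the file even if reading fails part-way.
    with open(path, encoding=encoding) as f:
        # Iterating over the file yields one line at a time,
        # each still ending in "\n" until strip() removes it.
        return [line.strip() for line in f]
```

For the two-line document.txt shown above, read_lines('document.txt') should return a list of the two sentences without their trailing newlines.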
NLTK’s corpus files can also be accessed using these methods. We simply have to use nltk.data.find() to get the filename for any corpus item. Then we can open and read it in the way we just demonstrated:


  >>> path = nltk.data.find('corpora/gutenberg/melville-moby_dick.txt')
  >>> raw = open(path, 'rU').read()
Extracting Text from PDF, MSWord, and Other Binary Formats

ASCII text and HTML text are human-readable formats. Text often comes in binary formats, such as PDF and MSWord, that can only be opened using specialized software. Third-party libraries such as pypdf and pywin32 provide access to these formats. Extracting text from multicolumn documents is particularly challenging. For one-off conversion of a few documents, it is simpler to open the document with a suitable application, then save it as text to your local drive, and access it as described above. If the document is already on the Web, you can enter its URL in Google’s search box. The search result often includes a link to an HTML version of the document, which you can save as text.

Capturing User Input
Sometimes we want to capture the text that a user inputs when she is interacting with our program. To prompt the user to type a line of input, call the Python function raw_input(). After saving the input to a variable, we can manipulate it just as we have done for other strings.


  >>> s = raw_input("Enter some text: ")
  Enter some text: On an exceptionally hot evening early in July
  >>> print "You typed", len(nltk.word_tokenize(s)), "words."
  You typed 8 words.
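In Python 3, raw_input() was renamed to input(). The sketch below is an assumption-laden variant of the session above: prompt_and_count is a hypothetical name, the reader function is passed in so the logic can be exercised without a keyboard, and whitespace splitting stands in for nltk.word_tokenize, so punctuation stays attached to words.

```python
def prompt_and_count(read=input):
    """Prompt for a line of text and return the number of
    whitespace-separated words in it."""
    s = read("Enter some text: ")
    return len(s.split())


# Simulate a user typing the sentence from the session above
typed = "On an exceptionally hot evening early in July"
print("You typed", prompt_and_count(lambda prompt: typed), "words.")
```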

The NLP Pipeline

Figure 3-1 summarizes what we have covered in this section, including the process of building a vocabulary that we saw in Chapter 1. (One step, normalization, will be discussed in Section 3.6.)



Figure 3-1. The processing pipeline: We open a URL and read its HTML content, remove the markup and select a slice of characters; this is then tokenized and optionally converted into an nltk.Text object; we can also lowercase all the words and extract the vocabulary.
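The later stages of the pipeline in the figure can be sketched as a single Python 3 function. This leaves out the download and markup-removal steps, and it uses a simple regular expression in place of nltk.word_tokenize; the name build_vocab and its start and end parameters are assumptions for illustration.

```python
import re


def build_vocab(raw, start=0, end=None):
    """Slice, tokenize, lowercase, and collect the sorted vocabulary."""
    text = raw[start:end]                        # select a slice of characters (str)
    tokens = re.findall(r"\w+|[^\w\s]", text)    # tokenize into words and punctuation (list)
    words = [w.lower() for w in tokens]          # normalize case (list)
    return sorted(set(words))                    # unique items in sorted order (list)


vocab = build_vocab("Time flies like an arrow.\nFruit flies like a banana.\n")
print(vocab)
```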

There’s a lot going on in this pipeline. To understand it properly, it helps to be clear about the type of each variable that it mentions. We find out the type of any Python object x using type(x); e.g., type(1) is <type 'int'> since 1 is an integer.
When we load the contents of a URL or file, and when we strip out HTML markup, we are dealing with strings, Python’s <str> data type (we will learn more about strings in Section 3.2):


  >>> raw = open('document.txt').read()
  >>> type(raw)
  <type 'str'>
When we tokenize a string we produce a list (of words), and this is Python’s <list> type. Normalizing and sorting lists produces other lists:


  >>> tokens = nltk.word_tokenize(raw)
  >>> type(tokens)
  <type 'list'>
  >>> words = [w.lower() for w in tokens]
  >>> type(words)
  <type 'list'>
  >>> vocab = sorted(set(words))
  >>> type(vocab)
  <type 'list'>
The type of an object determines what operations you can perform on it. So, for example, we can append to a list but not to a string:


  >>> vocab.append('blog')
  >>> raw.append('blog')
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  AttributeError: 'str' object has no attribute 'append'
Similarly, we can concatenate strings with strings, and lists with lists, but we cannot concatenate strings with lists:



  >>> query = 'Who knows?'
  >>> beatles = ['john', 'paul', 'george', 'ringo']
  >>> query + beatles
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  TypeError: cannot concatenate 'str' and 'list' objects
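The error goes away once both operands have the same type. Two ways to get there are sketched below in Python 3; the variable names as_list and as_string are just for illustration.

```python
query = 'Who knows?'
beatles = ['john', 'paul', 'george', 'ringo']

# Wrap the string in a list, then concatenate list with list
as_list = [query] + beatles

# Or join the list into one string, then concatenate string with string
as_string = query + ' ' + ' '.join(beatles)

print(as_list)
print(as_string)
```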
In the next section, we examine strings more closely and further explore the relationship between strings and lists.
