Python自然语言处理学习笔记(17)：3.1 从Web和Disk上访问文本

13432878738 · 发表于 2015-4-24 09:16:58

CHAPTER 3
Processing Raw Text 处理原始文本

The most important source of texts is undoubtedly the Web. It’s convenient to have existing text collections to explore, such as the corpora we saw in the previous chapters. However, you probably have your own text sources in mind, and need to learn how to access them.
The goal of this chapter is to answer the following questions:

1. How can we write programs to access text from local files and from the Web, in order to get hold of an unlimited range of language material?
我们如何编写程序去访问来自本地文件和Web的文本，从而得到无限范围的语言材料？
2.How can we split documents up into individual words and punctuation symbols, so we can carry out the same kinds of analysis we did with text corpora in earlier chapters?
我们如何把文档分割成单独的文字和标点符号，这样我们可以执行与上一章中文本预料库相同的分析？
3. How can we write programs to produce formatted output and save it in a file?
我们应该如何编程程序来产生有格式的数和并把它保存在文件中？

In order to address(处理) these questions, we will be covering key concepts in NLP, including tokenization and stemming（包括分词和提取词干）. Along the way you will consolidate your Python knowledge and learn about strings, files, and regular expressions（本章学习：字符串，文件和正则表达式）. Since so much text on the Web is in HTML format, we will also see how to dispense（去除） with markup.

Important: From this chapter onwards（向前的，从这一章开始）, our program samples will assume you begin your interactive session or your program with the following import statements:

  >>> from __future__ import division
  >>> import nltk, re, pprint
3.1 Accessing Text from the Web and from Disk 从Web和Disk上访问文本
Electronic Books 电子书
A small sample of texts from Project Gutenberg appears in the NLTK corpus collection. However, you may be interested in analyzing other texts from Project Gutenberg. You can browse the catalog of 25,000 free online books at http://www.gutenberg.org/catalog/ , and obtain a URL to an ASCII text file. Although 90% of the texts in Project Gutenberg are in English, it includes material in over 50 other languages, including Catalan, Chinese, Dutch, Finnish, French, German, Italian, Portuguese, and Spanish (with more than 100 texts each).
Text number 2554 is an English translation of Crime and Punishment（罪与罚）, and we can access it as follows.

  >>> from urllib import urlopen
  >>> url = "http://www.gutenberg.org/files/2554/2554.txt"
  >>> raw = urlopen(url).read()
  >>> type(raw)

  >>> len(raw)
  1176831
  >>> raw[:75]
  'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n'
手动设置代理：The read() process will take a few seconds as it downloads this large book. If you’re using an Internet proxy（代理） that is not correctly detected by Python, you may need to specify the proxy manually as follows:

  >>> proxies = {'http': 'http://www.someproxy.com:3128'}
  >>> raw = urlopen(url, proxies=proxies).read()
The variable raw contains a string with 1,176,831 characters. (We can see that it is a string, using type(raw).) This is the raw content of the book, including many details we are not interested in, such as whitespace, line breaks（换行）, and blank lines. Notice the \r and \n in the opening line of the file, which is how Python displays the special carriage return（回车） and line-feed（换行） characters (the file must have been created on a Windows machine，注1给出了解释). For our language processing, we want to break up the string into words and punctuation, as we saw in Chapter 1. This step is called tokenization, and it produces our familiar structure, a list of words and punctuation.（英文的分词真是容易那...）

注1 关于carriage return（回车） and line-feed（换行）
参考自：http://blog.iyunv.com/zhezhelin/article/details/2703382
在计算机还没有出现之前，有一种叫做电传打字机（Teletype Model 33）的玩意，每秒钟可以打10个字符。但是它有一个问题，就是打完一行换行的时候，要用去0.2秒，正好可以打两个字符。要是在这0.2秒里面，又有新的字符传过来，那么这个字符将丢失。
于是，研制人员想了个办法解决这个问题，就是在每行后面加两个表示结束的字符。一个叫做“回车”，告诉打字机把打印头定位在左边界；另一个叫做“换行”，告诉打字机把纸向下移一行。
这就是“换行”和“回车”的来历，从它们的英语名字上也可以看出一二。后来，计算机发明了，这两个概念也就被般到了计算机上。那时，存储器很贵，一些科学家认为在每行结尾加两个字符太浪费了，加一个就可以。于是，就出现了分歧。Unix系统里，每行结尾只有“”，即“/n”；Windows系统里面，每行结尾是“”，即“/r/n”；Mac系统里，每行结尾是“”/r。一个直接后果是，Unix/Mac系统下的文件在Windows里打开的话，所有文字会变成一行；而Windows里的文件在Unix/Mac下打开的话，在每行的结尾可能会多出一个^M符号。

  >>> tokens = nltk.word_tokenize(raw)
  >>> type(tokens)

  >>> len(tokens)
  255809 (我这是241137)
  >>> tokens[:10]
  ['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment', ',', 'by']
Notice that NLTK was needed for tokenization, but not for any of the earlier tasks of opening a URL and reading it into a string. If we now take the further step of creating an NLTK text from this list, we can carry out all of the other linguistic processing we saw in Chapter 1, along with the regular list operations, such as slicing:

  >>> text = nltk.Text(tokens)
  >>> type(text)

  >>> text[1020:1060]
  ['CHAPTER', 'I', 'On', 'an', 'exceptionally', 'hot', 'evening', 'early', 'in',
  'July', 'a', 'young', 'man', 'came', 'out', 'of', 'the', 'garret', 'in',
  'which', 'he', 'lodged', 'in', 'S', '.', 'Place', 'and', 'walked', 'slowly',
  ',', 'as', 'though', 'in', 'hesitation', ',', 'towards', 'K', '.', 'bridge', '.']
  >>> text.collocations()　　
　　

Katerina Ivanovna; Pulcheria Alexandrovna; Avdotya Romanovna;  PyotrPetrovitch; Project Gutenberg; Marfa Petrovna; Rodion Romanovitch;  Sofya Semyonovna; Nikodim Fomitch; did not; Hay Market; Andrey  Semyonovitch; old woman; Literary Archive; Dmitri Prokofitch; great  deal; United States; Praskovya Pavlovna; Porfiry Petrovitch; ear rings 　　

Notice that Project Gutenberg appears as a collocation. This is because each text downloaded from Project Gutenberg contains a header with the name of the text, the author, the names of people who scanned and corrected the text, a license, and so on. Sometimes this information appears in a footer(页脚) at the end of the file. We cannot reliably（可靠地） detect where the content begins and ends, and so（因此） have to resort（凭借） to manual inspection of the file, to discover unique strings that mark the beginning and the end, before trimming（修剪） raw to be just the content and nothing else:

  >>> raw.find("PART I")
  5303
  >>> raw.rfind("End of Project Gutenberg's Crime")
  1157681
  >>> raw = raw[5303:1157681] ①
  >>> raw.find("PART I")
  0
The find() and rfind() (“reverse find 反转寻找”) methods help us get the right index values to use for slicing the string ①. We overwrite raw with this slice, so now it begins with “PART I” and goes up to (but not including) the phrase that marks the end of the content.

This was our first brush with the reality of the Web: texts found on the Web may contain unwanted material, and there may not be an automatic way to remove it.（这可能是我们第一次接触到实际的Web:Web上的文本可能包含了不想要的资料，并且可能没有自动的方法来移除它） But with a small amount of extra work we can extract the material we need.

Dealing with HTML 处理HTML

Much of the text on the Web is in the form of HTML documents. You can use a web browser to save a page as text to a local file, then access this as described in the later section on files. However, if you’re going to do this often, it’s easiest to get Python to do the work directly. The first step is the same as before, using urlopen. For fun we’ll pick a BBC News story called “Blondes to die out in 200 years,” an urban legend (都市传奇)passed along by the BBC as established scientific fact:

  >>> url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
  >>> html = urlopen(url).read()
  >>> html[:60]
  '>> import feedparser
  >>> llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
  >>> llog['feed']['title']
  u'Language Log'
  >>> len(llog.entries)
  15
  >>> post = llog.entries[2]
  >>> post.title
  u"He's My BF"
  >>> content = post.content[0].value
  >>> content[:70]
  u'Today I was chatting with three of our visiting graduate students f'
  >>> nltk.word_tokenize(nltk.html_clean(content)) （html_clean貌似打印错误？）
  >>> nltk.word_tokenize(nltk.clean_html(llog.entries[2].content[0].value))  [u'Today', u'I', u'was', u'chatting', u'with', u'three', u'of', u'our', u'visiting',
  u'graduate', u'students', u'from', u'the', u'PRC', u'.', u'Thinking', u'that', u'I',
  u'was', u'being', u'au', u'courant', u',', u'I', u'mentioned', u'the', u'expression',
  u'DUI4XIANG4', u'\u5c0d\u8c61', u'("', u'boy', u'/', u'girl', u'friend', u'"', ...]

Note that the resulting strings have a u prefix to indicate that they are Unicode strings (see Section 3.3). With some further work, we can write programs to create a small corpus of blog posts（文章）, and use this as the basis for our NLP work.

Reading Local Files 读取本地文件

In order to read a local file, we need to use Python’s built-in open() function, followed by the read() method. Supposing you have a file document.txt, you can load its contents like this:

  >>> f = open('document.txt')
  >>> raw = f.read()

Your Turn: Create a file called document.txt using a text editor, and type in a few lines of text, and save it as plain text（纯文本）. If you are using IDLE, select the New Window command in the File menu, typing the required text into this window, and then saving the file as document.txt inside the directory that IDLE offers in the pop-up（弹出式） dialogue box. Next, in the Python interpreter, open the file using f = open('document.txt'), then inspect its contents using print f.read().

Various things might have gone wrong when you tried this. If the interpreter couldn’t find your file, you would have seen an error like this:

  >>> f = open('document.txt')
  Traceback (most recent call last):
  File "", line 1, in -toplevel-
  f = open('document.txt')
  IOError: [Errno 2] No such file or directory: 'document.txt'
To check that the file that you are trying to open is really in the right directory, use IDLE’s Open command in the File menu; this will display a list of all the files in the
directory where IDLE is running. An alternative is to examine the current directory from within Python:

  >>> import os
  >>> os.listdir('.')
Another possible problem you might have encountered when accessing a text file is the newline conventions, which are different for different operating systems. The built-in open() function has a second parameter for controlling how the file is opened: open('document.txt', 'rU'). 'r' means to open the file for reading (the default), and 'U' stands for “Universal”, which lets us ignore the different conventions used for marking new-lines.
Assuming that you can open the file, there are several methods for reading it. The read() method creates a string with the contents of the entire file:

  >>> f.read()
  'Time flies like an arrow.\nFruit flies like a banana.\n'
Recall that the '\n' characters are newlines（换行符）; this is equivalent to pressing Enter on a keyboard and starting a new line.
We can also read a file one line at a time using a for loop:

  >>> f = open('document.txt', 'rU')
  >>> for line in f:
  ...    print line.strip()
  Time flies like an arrow.
  Fruit flies like a banana.
Here we use the strip() method to remove the newline character at the end of the input line.
NLTK’s corpus files can also be accessed using these methods. We simply have to use nltk.data.find() to get the filename for any corpus item. Then we can open and read it in the way we just demonstrated:

  >>> path = nltk.data.find('corpora/gutenberg/melville-moby_dick.txt')
  >>> raw = open(path, 'rU').read()
Extracting Text from PDF, MSWord, and Other Binary Formats
从PDF,MSWord和其他二进制格式中提取文本

ASCII text and HTML text are human-readable formats. Text often comes in binary formats—such as PDF and MSWord—that can only be opened using specialized software. Third-party libraries such as pypdf and pywin32 provide access to these formats. Extracting text from multicolumn（多列） documents is particularly challenging. For one-off（一次性的） conversion of a few documents, it is simpler to open the document with a suitable application, then save it as text to your local drive, and access it as described below. If the document is already on the Web, you can enter its URL in Google’s search box. The search result often includes a link to an HTML version of the document, which you can save as text.

Capturing User Input 捕捉用户输入
Sometimes we want to capture the text that a user inputs when she is interacting with our program. To prompt the user to type a line of input, call the Python function raw_input(). After saving the input to a variable, we can manipulate it just as we have done for other strings.

  >>> s = raw_input("Enter some text: ")
  Enter some text: On an exceptionally hot evening early in July
  >>> print "You typed", len(nltk.word_tokenize(s)), "words."
  You typed 8 words.

The NLP Pipeline NLP流水线

Figure 3-1 summarizes what we have covered in this section, including the process of building a vocabulary that we saw in Chapter 1. (One step, normalization, will be discussed in Section 3.6.)

　　

Figure 3-1. The processing pipeline: We open a URL and read its HTML content, remove the markup and select a slice of characters; this is then tokenized and optionally converted into an nltk.Text object; we can also lowercase all the words and extract the vocabulary.

There’s a lot going on in this pipeline. To understand it properly, it helps to be clear about the type of each variable that it mentions. We find out the type of any Python object x using type(x); e.g., type(1) is  since 1 is an integer.
When we load the contents of a URL or file, and when we strip out（删掉） HTML markup, we are dealing with strings, Python’s  data type (we will learn more about strings in Section 3.2):

  >>> raw = open('document.txt').read()
  >>> type(raw)

When we tokenize a string we produce a list (of words), and this is Python’s  type. Normalizing and sorting lists produces other lists:

  >>> tokens = nltk.word_tokenize(raw)
  >>> type(tokens)

  >>> words = [w.lower() for w in tokens]
  >>> type(words)

  >>> vocab = sorted(set(words))
  >>> type(vocab)

The type of an object determines what operations you can perform on it. So, for example, we can append to a list but not to a string:

  >>> vocab.append('blog')
  >>> raw.append('blog')
  Traceback (most recent call last):
File "", line 1, in
  AttributeError: 'str' object has no attribute 'append' Similarly, we can concatenate strings with strings, and lists with lists, but we cannot concatenate strings with lists:

  >>> query = 'Who knows?'
  >>> beatles = ['john', 'paul', 'george', 'ringo']
  >>> query + beatles
  Traceback (most recent call last):
File "", line 1, in
  TypeError: cannot concatenate 'str' and 'list' objects
In the next section, we examine strings more closely and further explore the relationship between strings and lists.

账号		自动登录	找回密码
密码			立即注册

大疆运维招人啦，

C++ :try 语句块和异常处理

C++的多态

Red Hat RHCE 8 (EX294) Cert Guide

Java/C++ 区别：看完这一篇，就够用！

别再用过时库了！这 13 个顶级 C++ 库才是

c++ size_t 和 int 的区别

[经验分享] Python自然语言处理学习笔记(17)：3.1 从Web和Disk上访问文本

浏览过的版块

扫码加入运维网微信交流群