设为首页 收藏本站
查看: 547|回复: 0

[经验分享] Python自然语言处理学习笔记(25):3.9 格式化:从列表到字符串

[复制链接]

尚未签到

发表于 2015-4-28 09:15:47 | 显示全部楼层 |阅读模式
3.9 Formatting: From Lists to Strings 格式化:从列表到字符串

  

  Often we write a program to report a single data item, such as a particular element in a corpus that meets some complicated criterion, or a single summary statistic such as a word-count or the performance of a tagger. More often, we write a program to produce a structured result; for example, a tabulation of numbers or linguistic forms, or a reformatting(格式变换) of the original data. When the results to be presented are linguistic, textual output is usually the most natural choice. However, when the results are numerical, it may be preferable to produce graphical output. In this section, you will learn about a variety of ways to present program output.(在这一节,你将会学习各种呈现程序输出的方式。)

   

  From Lists to Strings 从列表到字符串

  

  The simplest kind of structured object we use for text processing is lists of words. When we want to output these to a display or a file, we must convert these lists into strings. To do this in Python we use the join() method, and specify the string to be used as the“glue”:
  

  

  >>> silly = ['We', 'called', 'him', 'Tortoise', 'because', 'he', 'taught', 'us', '.']
  >>> ' '.join(silly)
  'We called him Tortoise because he taught us .'
  >>> ';'.join(silly)
  'We;called;him;Tortoise;because;he;taught;us;.'
  >>> ''.join(silly)
  'WecalledhimTortoisebecausehetaughtus.'  

  So ' '.join(silly) means: take all the items in silly and concatenate them as one big string, using ' ' as a spacer between the items. I.e., join() is a method of the string that you want to use as the glue. (Many people find this notation for join() counter-intuitive.) The join() method only works on a list of strings—what we have been calling a text—a complex type that enjoys(享有) some privileges(权限) in Python.

   

  Strings and Formats 字符串和格式

  

  We have seen that there are two ways to display the contents of an object:
  

  

  >>> word = 'cat'
  >>> sentence = """hello
  ... world"""
  >>> print word
  cat
  >>> print sentence
  hello
  world
  >>> word
  'cat'
  >>> sentence
  'hello\nworld'
  The print command yields Python’s attempt to produce the most human-readable form of an object. The second method—naming the variable at a prompt—shows us a string that can be used to recreate this object. It is important to keep in mind that both of these are just strings, displayed for the benefit of you, the user. They do not give us any clue as to the actual internal representation of the object.

   

  There are many other useful ways to display an object as a string of characters. This may be for the benefit of a human reader, or because we want to export our data to a particular file format for use in an external program.

   

  Formatted output typically contains a combination of variables and pre-specified strings. For example, given a frequency distribution fdist, we could do:
  

  

  >>> fdist = nltk.FreqDist(['dog', 'cat', 'dog', 'cat', 'dog', 'snake', 'dog', 'cat'])
  >>> for word in fdist:
  ...     print word, '->', fdist[word], ';',
  dog -> 4 ; cat -> 3 ; snake -> 1 ;  

  Apart from the problem of unwanted whitespace, print statements that contain alternating variables and constants can be difficult to read and maintain. A better solution is to use string formatting expressions(字符串格式表达式).

  

  >>> for word in fdist:
  ...    print '%s->%d;' % (word, fdist[word]),
  dog->4; cat->3; snake->1;
   

  To understand what is going on here, let’s test out the string formatting expression on its own. (By now this will be your usual method of exploring new syntax.)

  

  >>> '%s->%d;' % ('cat', 3)
  'cat->3;'
  >>> '%s->%d;' % 'cat'
  Traceback (most recent call last):
    File "", line 1, in
  TypeError: not enough arguments for format string
   

  The special symbols %s and %d are placeholders(占位符) for strings and (decimal) integers. We can embed these inside a string, then use the % operator to combine them. Let’s unpack this code further, in order to see this behavior up close(近距离):
  

  

  >>> '%s->' % 'cat'
  'cat->'
  >>> '%d' % 3
  '3'
  >>> 'I want a %s right now' % 'coffee'
  'I want a coffee right now'
   

  We can have a number of placeholders, but following the % operator we need to specify a tuple with exactly the same number of values:

  

  >>> "%s wants a %s %s" % ("Lee", "sandwich", "for lunch")
  'Lee wants a sandwich for lunch'
   
  We can also provide the values for the placeholders indirectly. Here’s an example using a for loop:

  

  >>> template = 'Lee wants a %s right now'
  >>> menu = ['sandwich', 'spam fritter', 'pancake']
  >>> for snack in menu:
  ...     print template % snack
  ...
  Lee wants a sandwich right now
  Lee wants a spam fritter right now
  Lee wants a pancake right now  

  The %s and %d symbols are called conversion specifiers(转换说明符). They start with the % character and end with a conversion character such as s (for string) or d (for decimal integer) The string containing conversion specifiers is called a format string(格式字符串). We combine a format string with the % operator and a tuple of values to create a complete string formatting expression.

   

  Lining Things Up    排列

  

  So far our formatting strings generated output of arbitrary width on the page (or screen), such as %s and %d. We can specify a width as well, such as %6s, producing a string that is padded(填补) to width 6. It is right-justified by default①, but we can include a minus sign to make it left-justified②. In case we don’t know in advance(事前) how wide a displayed value should be, the width value can be replaced with a star in the formatting string(可以用*表示宽度值), then specified using a variable③.
  

  


  >>> '%6s' % 'dog' ①
  '   dog'
  >>> '%-6s' % 'dog' ②
  'dog   '
  >>> width = 6
  >>> '%-*s' % (width, 'dog') ③
  'dog   '
   

  Other control characters are used for decimal integers and floating-point numbers. Since the percent character % has a special interpretation in formatting strings, we have to precede(在前面) it with another % to get it in the output.
  

  

  >>> count, total = 3205, 9375
  >>> "accuracy for %d words: %2.4f%%" % (total, 100 * count / total)
  'accuracy for 9375 words: 34.1867%'  

  An important use of formatting strings is for tabulating data. Recall that in Section 2.1 we saw data being tabulated from a conditional frequency distribution. Let’s perform the tabulation ourselves, exercising full control of headings and column widths, as shown in Example 3-5. Note the clear separation between the language processing work, and the tabulation of results.

   

  Example 3-5. Frequency of modals in different sections of the Brown Corpus.
  


  def tabulate(cfdist, words, categories):
      print '%-16s' % 'Category',
      for word in words:                             # column headings

          print '%6s' % word,
      print
      for category in categories:
          print '%-16s' % category,                     # row heading

          for word in words:                          # for each word

              print '%6d' % cfdist[category][word],       # print table cell

          print                                           # end the row

  >>> from nltk.corpus import brown
  >>> cfd = nltk.ConditionalFreqDist(
  ...           (genre, word)
  ...           for genre in brown.categories()
  ...           for word in brown.words(categories=genre))
  >>> genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
  >>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
  >>> tabulate(cfd, modals, genres)
  Category            can could    may might   must   will
  news               93     86     66     38     50    389
  religion           82     59     78     12     54     71
  hobbies           268     58    131     22     83    264
  science_fiction    16     49      4     12      8     16
  romance            74    193     11     51     45     43
  humor              16     30      8      8      9     13
  
   

  
  Recall from the listing in Example 3-1 that we used a formatting string "%*s". This allows us to specify the width of a field using a variable.

  


  >>> '%*s' % (15, "Monty Python")
  '   Monty Python'
  We could use this to automatically customize the column to be just wide enough to accommodate all the words, using width = max(len(w) for w in words). Remember that the comma at the end of print statements adds an extra space, and this is sufficient

  to prevent the column headings from running into each other(记得在print语句后面的逗号会增加额外的空间,避免了列标题的相互影响).

   

  Writing Results to a File   把结果写入文件
  

  We have seen how to read text from files (Section 3.1). It is often useful to write output to files as well. The following code opens a file output.txt for writing, and saves the program output to the file.
  

  


  >>> output_file = open('output.txt', 'w')
  >>> words = set(nltk.corpus.genesis.words('english-kjv.txt'))
  >>> for word in sorted(words):
  ...     output_file.write(word + "\n")
   

   DSC0000.jpg Your Turn: What is the effect of appending \n to each string before we write it to the file? If you’re using a Windows machine, you may want to use word + "\r\n" instead. What happens if we do output_file.write(word)

   

  When we write non-text data to a file, we must convert it to a string first. We can do this conversion using formatting strings, as we saw earlier. Let’s write the total number of words to our file, before closing it.
  

  

  >>> len(words)
  2789
  >>> str(len(words))
  '2789'
  >>> output_file.write(str(len(words)) + "\n")
  >>> output_file.close()
   

   DSC0001.jpg Caution!

  You should avoid filenames that contain space characters, such as output file.txt, or that are identical except for case distinctions, e.g., Output.txt and output.TXT

   

  Text Wrapping 文本换行
  

  When the output of our program is text-like, instead of tabular, it will usually be necessary to wrap it so that it can be displayed conveniently. Consider the following output, which overflows its line, and which uses a complicated print statement:




>>> saying = ['After', 'all', 'is', 'said', 'and', 'done', ',',
...           'more', 'is', 'said', 'than', 'done', '.']
>>> for word in saying:
...     print word, '(' + str(len(word)) + '),',
After (5), all (3), is (2), said (4), and (3), done (4), , (1), more (4), is (2), said (4), than (4), done (4), . (1),  

  We can take care of line wrapping with the help of Python’s textwrap module. For maximum clarity(清楚) we will separate each step onto its own line:
  


  >>> from textwrap import fill
  >>> format = '%s (%d),'
  >>> pieces = [format % (word, len(word)) for word in saying]
  >>> output = ' '.join(pieces)
  >>> wrapped = fill(output)
  >>> print wrapped
  After (5), all (3), is (2), said (4), and (3), done (4), , (1), more
  (4), is (2), said (4), than (4), done (4), . (1),
  

  

  Notice that there is a linebreak between more and its following number. If we wanted to avoid this, we could redefine the formatting string so that it contained no spaces (e.g., '%s_(%d),'), then instead of printing the value of wrapped, we could print wrapped.replace('_', ' ').

运维网声明 1、欢迎大家加入本站运维交流群:群②:261659950 群⑤:202807635 群⑦870801961 群⑧679858003
2、本站所有主题由该帖子作者发表,该帖子作者与运维网享有帖子相关版权
3、所有作品的著作权均归原作者享有,请您和我们一样尊重他人的著作权等合法权益。如果您对作品感到满意,请购买正版
4、禁止制作、复制、发布和传播具有反动、淫秽、色情、暴力、凶杀等内容的信息,一经发现立即删除。若您因此触犯法律,一切后果自负,我们对此不承担任何责任
5、所有资源均系网友上传或者通过网络收集,我们仅提供一个展示、介绍、观摩学习的平台,我们不对其内容的准确性、可靠性、正当性、安全性、合法性等负责,亦不承担任何法律责任
6、所有作品仅供您个人学习、研究或欣赏,不得用于商业或者其他用途,否则,一切后果均由您自己承担,我们对此不承担任何法律责任
7、如涉及侵犯版权等问题,请您及时通知我们,我们将立即采取措施予以解决
8、联系人Email:admin@iyunv.com 网址:www.yunweiku.com

所有资源均系网友上传或者通过网络收集,我们仅提供一个展示、介绍、观摩学习的平台,我们不对其承担任何法律责任,如涉及侵犯版权等问题,请您及时通知我们,我们将立即处理,联系人Email:kefu@iyunv.com,QQ:1061981298 本贴地址:https://www.yunweiku.com/thread-61427-1-1.html 上篇帖子: Python入门笔记(17):错误、异常 下篇帖子: Python自然语言处理学习笔记(18):3.2 字符串:最底层的文本处理
您需要登录后才可以回帖 登录 | 立即注册

本版积分规则

扫码加入运维网微信交流群X

扫码加入运维网微信交流群

扫描二维码加入运维网微信交流群,最新一手资源尽在官方微信交流群!快快加入我们吧...

扫描微信二维码查看详情

客服E-mail:kefu@iyunv.com 客服QQ:1061981298


QQ群⑦:运维网交流群⑦ QQ群⑧:运维网交流群⑧ k8s群:运维网kubernetes交流群


提醒:禁止发布任何违反国家法律、法规的言论与图片等内容;本站内容均来自个人观点与网络等信息,非本站认同之观点.


本站大部分资源是网友从网上搜集分享而来,其版权均归原作者及其网站所有,我们尊重他人的合法权益,如有内容侵犯您的合法权益,请及时与我们联系进行核实删除!



合作伙伴: 青云cloud

快速回复 返回顶部 返回列表