Character Encoding and Chinese Text Handling in Python

xuyangus · posted 2015-4-26 05:10:00

Strings
Python 2 has two kinds of strings: byte strings (str) and Unicode strings (unicode)
　　byteString = "hello world! (in my default locale)"
　　unicodeString = u"hello Unicode world!"
Converting between them
　　s = "hello normal string"
　　u = unicode( s, "utf-8" )
　　backToBytes = u.encode( "utf-8" )
　　backToUtf8 = backToBytes.decode( 'utf-8' ) # same effect as the unicode( s, "utf-8" ) call above
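As a minimal sketch of the round trip above, the snippet below encodes one Unicode string into two different byte encodings and decodes it back. The sample text and the variable names are illustrative assumptions, and the literal forms are chosen so that the snippet also runs unchanged on Python 3.

```python
# Illustrative sample: the two characters "中文", written with escapes
# so the snippet does not depend on the source file's encoding.
text = u"\u4e2d\u6587"

utf8_bytes = text.encode("utf-8")   # 3 bytes per character for this text
gbk_bytes = text.encode("gbk")      # 2 bytes per character for this text

# Decoding must use the same encoding that produced the bytes.
from_utf8 = utf8_bytes.decode("utf-8")
from_gbk = gbk_bytes.decode("gbk")

print("utf-8: %d bytes, gbk: %d bytes" % (len(utf8_bytes), len(gbk_bytes)))
```

The same Unicode string has different byte representations, which is why a byte string alone does not tell you which encoding to decode it with.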
How to tell them apart
　　isinstance( s, str ) # False for Unicode strings
　　isinstance( s, unicode ) # True for Unicode strings
　　isinstance( s, basestring ) # True for both kinds of string
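One common use of these checks is to normalize input to Unicode at the boundary of a program. The helper below is a sketch under that assumption; the name `to_text` and the default encoding are hypothetical, and `type(u"")` is used instead of `unicode` so the code also runs on Python 3.

```python
TEXT_TYPE = type(u"")   # unicode on Python 2, str on Python 3


def to_text(value, encoding="utf-8"):
    """Return value as a Unicode string, decoding byte strings if needed."""
    if isinstance(value, TEXT_TYPE):
        return value
    if isinstance(value, bytes):      # bytes is an alias of str on Python 2
        return value.decode(encoding)
    raise TypeError("expected a byte or Unicode string, got %r" % type(value))


greeting = to_text(b"hello")                               # decoded from bytes
chinese = to_text(u"\u4e2d\u6587".encode("gbk"), "gbk")   # decoded from GBK
```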
An experiment

Example:
# -*- coding: utf-8 -*-
import sys
print 'default encoding: ' , sys.getdefaultencoding()
print 'file system encoding: ' , sys.getfilesystemencoding()
print 'stdout encoding: ' , sys.stdout.encoding
print u'u"中文" is unicode: ', isinstance(u'中文',unicode)
print u'"中文" is unicode: ', isinstance('中文',unicode)

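　　Look at the output and note the following facts:
　　Python's default encoding is ASCII. This default is used whenever Python converts between the two string types implicitly. Two examples:
　　1. a = "abc" + u"bcd": Python performs the conversion "abc".decode(sys.getdefaultencoding()) and then concatenates the two Unicode strings.
　　2. print unicode('中文'): this statement fails with "UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 …" because Python tries to decode the byte string with the default encoding, and the string is not ASCII. The encoding therefore has to be given explicitly. If your source file is saved as utf-8, write: print unicode('中文','utf-8')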
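The failure in the second example can be reproduced without relying on the interpreter's default encoding by decoding with "ascii" explicitly. This is only a sketch: the variable names are illustrative, and the byte string is built with escapes so the snippet does not depend on how the file is saved.

```python
raw = u"\u4e2d\u6587".encode("utf-8")   # the UTF-8 bytes of "中文"

try:
    raw.decode("ascii")                 # what the implicit conversion attempts
    ascii_failed = False
except UnicodeDecodeError as exc:
    ascii_failed = True
    print("ascii decode failed: %s" % exc)

decoded = raw.decode("utf-8")           # naming the real encoding succeeds
```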
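　　On Windows, getfilesystemencoding returns mbcs (multi-byte character set; the Windows mbcs, also known as ANSI, maps to a different encoding on each language version of Windows; on Chinese Windows it is one of the GB family of encodings).
　　On Windows the console encoding is cp936, and Python converts automatically when you print to the console. This leads to an interesting problem. Try this simple example, test.py: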

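Example:
# -*- coding: utf-8 -*-
s = u'中文'
print s

　　Run both python test.py and python test.py > 1.txt in the console.
　　You will find that the latter fails. When printing to the console, Python automatically converts to sys.stdout.encoding. When output is redirected to a file, sys.stdout.encoding is None, so Python does not perform that conversion in the write call; it falls back to the default ASCII encoding and raises UnicodeEncodeError. The PrintFails page explains this problem in more detail.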
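One way around the redirect failure is to encode explicitly before writing. The sketch below assumes a fallback of utf-8 when the stream reports no encoding; the function name `encode_for_stream` and the stand-in class are hypothetical, not part of any library.

```python
def encode_for_stream(text, stream, fallback="utf-8"):
    """Encode text with the stream's encoding, or with fallback if it has none."""
    encoding = getattr(stream, "encoding", None) or fallback
    # "replace" substitutes "?" for characters the target encoding lacks.
    return text.encode(encoding, "replace")


class RedirectedStream(object):
    """Hypothetical stand-in for a stdout that was redirected to a file."""
    encoding = None


class ConsoleStream(object):
    """Hypothetical stand-in for a cp936 (GBK) console."""
    encoding = "cp936"


to_file = encode_for_stream(u"\u4e2d\u6587", RedirectedStream())
to_console = encode_for_stream(u"\u4e2d\u6587", ConsoleStream())
```

In a Python 2 script this could be used as sys.stdout.write(encode_for_stream(s, sys.stdout)), so that the same code works both on the console and when redirected.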

UTF-8 encoding

Reading a UTF-8 file
　　import codecs
　　fileObj = codecs.open( "someFile", "r", "utf-8" )
　　u = fileObj.read() # Returns a Unicode string from the UTF-8 bytes in the file
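For completeness, here is a sketch of the full round trip, writing a file with codecs.open and reading it back. The temporary path and the sample text are assumptions made for illustration only.

```python
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "someFile.txt")  # illustrative path
original = u"\u4e2d\u6587 hello"

# codecs.open encodes Unicode to UTF-8 on write ...
with codecs.open(path, "w", "utf-8") as file_obj:
    file_obj.write(original)

# ... and decodes UTF-8 back to Unicode on read.
with codecs.open(path, "r", "utf-8") as file_obj:
    restored = file_obj.read()

# Reading in binary mode shows the raw UTF-8 bytes stored on disk.
with open(path, "rb") as raw_file:
    raw_bytes = raw_file.read()
```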

Writing the BOM yourself
　　out = file( "someFile", "w" )
　　out.write( codecs.BOM_UTF8 )
　　out.write( unicodeString.encode( "utf-8" ) )
　　out.close()

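Removing the BOM yourself
　　For UTF-16, Python decodes the BOM to an empty string. For UTF-8, however, the BOM is decoded to a character, as shown here:

Example:
　　>>> codecs.BOM_UTF16.decode( "utf16" )
　　u''
　　>>> codecs.BOM_UTF8.decode( "utf8" )
　　u'\ufeff'

　　The difference arises because the utf16 codec consumes the BOM to determine the byte order, while the plain utf8 codec treats it as an ordinary character (U+FEFF). You therefore need to remove the BOM yourself when reading a file: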

Removing the BOM:
import codecs
if s.startswith( codecs.BOM_UTF8 ):
    # The byte string s begins with the BOM: Do something.
    # For example, decode the string as UTF-8
    pass
if u.startswith( unicode( codecs.BOM_UTF8, "utf8" ) ):
    # The unicode string begins with the BOM: Do something.
    # For example, remove the character.
    pass
# Strip the BOM from the beginning of the Unicode string, if it exists
u = u.lstrip( unicode( codecs.BOM_UTF8, "utf8" ) )
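An alternative to stripping the BOM by hand is the standard "utf-8-sig" codec, which drops a leading BOM while decoding. The snippet below is a small sketch of the difference; the sample data is an assumption for illustration.

```python
import codecs

# Byte string as it would be read from a UTF-8 file that starts with a BOM.
data = codecs.BOM_UTF8 + u"\u4e2d\u6587".encode("utf-8")

with_bom = data.decode("utf-8")         # keeps the BOM as u"\ufeff"
without_bom = data.decode("utf-8-sig")  # drops a leading BOM if present

# utf-8-sig also accepts input that has no BOM at all.
plain = u"\u4e2d\u6587".encode("utf-8").decode("utf-8-sig")
```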
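Source file encoding
　　PEP0263 explains clearly how Python handles the encoding of source files. The key points are excerpted here:
　　By default, Python assumes that a source file is ASCII encoded.
　　An encoding declaration can be added on the first or second line of the source file to tell Python the file's encoding, for example:
# -*- coding: utf-8 -*- # mind your editor: make sure the file is actually saved in this encoding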

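On platforms such as Windows, a BOM (the first three bytes of the file, \xef\xbb\xbf) is used to mark a file as utf-8 encoded. In that case:

If the file has no encoding declaration, Python treats it as utf8
If there is an encoding declaration but it is not utf-8, Python reports an error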
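
The rules above can be sketched as a small function. This is a simplified illustration, not Python's actual parser: the name `guess_source_encoding` is hypothetical, the regular expression is a reduced form of the pattern described in PEP0263, only the spellings "utf-8" and "utf8" are recognized as UTF-8, and the "ascii" default reflects the Python 2 behavior described in this post.

```python
import codecs
import re

# Simplified form of the PEP 263 declaration pattern (an assumption).
CODING_RE = re.compile(br"coding[:=]\s*([-\w.]+)")


def guess_source_encoding(data):
    """Guess the encoding of Python source given as a byte string."""
    has_bom = data.startswith(codecs.BOM_UTF8)
    if has_bom:
        data = data[len(codecs.BOM_UTF8):]

    declared = None
    for line in data.splitlines()[:2]:      # only lines 1 and 2 count
        match = CODING_RE.search(line)
        if match and line.lstrip().startswith(b"#"):
            declared = match.group(1).decode("ascii").lower()
            break

    if has_bom:
        # A BOM means UTF-8; any other declaration is a conflict.
        if declared not in (None, "utf-8", "utf8"):
            raise SyntaxError("BOM present but declaration is %s" % declared)
        return "utf-8"
    return declared or "ascii"
```

For example, guess_source_encoding(b"# -*- coding: gbk -*-\nx = 1\n") returns "gbk", while the same source preceded by codecs.BOM_UTF8 raises SyntaxError.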
