Garbled text and missing data when parsing XML files
---------------------------------------------------
The feature is simple: fetch the content of www.baidu.com remotely, that is, the same thing you see under View -> Source in a browser.
The initial code:
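// Text view that will display the fetched page source
UITextView *web = [[UITextView alloc] initWithFrame:bounds];
NSURL *url = [NSURL URLWithString:@"http://www.baidu.com"];
// Assumed call: the deprecated variant that takes no encoding argument
NSString *pageSource = [NSString stringWithContentsOfURL:url];
if (pageSource == nil) {
    NSLog(@"nil page source");
}
else {
    NSLog(@"not nil");
    // Pass the text as an argument, never as the format string itself
    NSLog(@"%@", pageSource);
}
web.text = pageSource;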
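It compiles and runs, and the page content does come down, but it shows up as garbled text both in the console and in the text view.
Since the text is garbled, the obvious step is to change the encoding. First I changed the call to:
NSString *pageSource = [NSString stringWithContentsOfURL:url encoding:NSUTF8StringEncoding error:nil];
Result: nil page source, which means the bytes could not be decoded as UTF-8.
Trying the other encodings defined alongside NSUTF8StringEncoding did not help either.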
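One way to avoid re-downloading the page for every encoding you try is to fetch the raw bytes once and decode them in memory. The following is only a sketch under assumptions: `DecodePageData` is a hypothetical helper name, the candidate list (UTF-8, then GB 18030) is my own choice, and the code assumes ARC. UTF-8 is tried first because it rejects invalid byte sequences, whereas a permissive legacy encoding may "succeed" on the wrong input.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the original post): returns the first
// successful decoding of `data`, trying UTF-8 before GB 18030.
static NSString *DecodePageData(NSData *data) {
    if (data == nil) {
        return nil;
    }
    NSStringEncoding gb18030 =
        CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingGB_18030_2000);
    NSStringEncoding candidates[] = { NSUTF8StringEncoding, gb18030 };
    size_t count = sizeof(candidates) / sizeof(candidates[0]);

    for (size_t i = 0; i < count; i++) {
        // initWithData:encoding: returns nil when the bytes are not valid
        // in the given encoding, so nil means "try the next candidate".
        NSString *text = [[NSString alloc] initWithData:data
                                               encoding:candidates[i]];
        if (text != nil) {
            return text;
        }
    }
    return nil;
}
```

A call might look like `NSString *pageSource = DecodePageData([NSData dataWithContentsOfURL:url]);`, with a nil result still meaning that none of the candidates matched.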
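Next, print the list of available encodings, which takes only a few lines of code:
// Zero-terminated list of every encoding NSString supports on this system
const NSStringEncoding *encodings = [NSString availableStringEncodings];
NSStringEncoding encoding;
int i = 0;
while ((encoding = *encodings++) != 0) {
    // Cast so the format specifier matches on both 32-bit and 64-bit builds
    NSLog(@"%d: %@ == 0x%lx\n", i++, [NSString localizedNameOfStringEncoding:encoding], (unsigned long)encoding);
}
The output is as follows: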
2009-06-08 23:12:44.420 MoviePre 0: Western (Mac OS Roman) == 0x1e
2009-06-08 23:12:44.421 MoviePre 1: Japanese (Mac OS) == 0x80000001
2009-06-08 23:12:44.421 MoviePre 2: Traditional Chinese (Mac OS) == 0x80000002
2009-06-08 23:12:44.422 MoviePre 3: Korean (Mac OS) == 0x80000003
2009-06-08 23:12:44.422 MoviePre 4: Arabic (Mac OS) == 0x80000004
2009-06-08 23:12:44.432 MoviePre 5: Hebrew (Mac OS) == 0x80000005
2009-06-08 23:12:44.433 MoviePre 6: Greek (Mac OS) == 0x80000006
2009-06-08 23:12:44.433 MoviePre 7: Cyrillic (Mac OS) == 0x80000007
2009-06-08 23:12:44.436 MoviePre 8: Devanagari (Mac OS) == 0x80000009
2009-06-08 23:12:44.447 MoviePre 9: Gurmukhi (Mac OS) == 0x8000000a
2009-06-08 23:12:44.447 MoviePre 10: Gujarati (Mac OS) == 0x8000000b
2009-06-08 23:12:44.447 MoviePre 11: Thai (Mac OS) == 0x80000015
2009-06-08 23:12:44.448 MoviePre 12: Simplified Chinese (Mac OS) == 0x80000019
2009-06-08 23:12:44.448 MoviePre 13: Tibetan (Mac OS) == 0x8000001a
2009-06-08 23:12:44.452 MoviePre 14: Central European (Mac OS) == 0x8000001d
2009-06-08 23:12:44.453 MoviePre 15: Symbol (Mac OS) == 0x6
2009-06-08 23:12:44.455 MoviePre 16: Dingbats (Mac OS) == 0x80000022
2009-06-08 23:12:44.455 MoviePre 17: Turkish (Mac OS) == 0x80000023
2009-06-08 23:12:44.456 MoviePre 18: Croatian (Mac OS) == 0x80000024
2009-06-08 23:12:44.464 MoviePre 19: Icelandic (Mac OS) == 0x80000025
2009-06-08 23:12:44.467 MoviePre 20: Romanian (Mac OS) == 0x80000026
2009-06-08 23:12:44.467 MoviePre 21: Celtic (Mac OS) == 0x80000027
2009-06-08 23:12:44.468 MoviePre 22: Gaelic (Mac OS) == 0x80000028
2009-06-08 23:12:44.469 MoviePre 23: Keyboard Symbols (Mac OS) == 0x80000029
2009-06-08 23:12:44.469 MoviePre 24: Farsi (Mac OS) == 0x8000008c
2009-06-08 23:12:44.470 MoviePre 25: Cyrillic (Mac OS Ukrainian) == 0x80000098
2009-06-08 23:12:44.470 MoviePre 26: Inuit (Mac OS) == 0x800000ec
2009-06-08 23:12:44.471 MoviePre 27: Unicode (UTF-32LE) == 0x9c000100
2009-06-08 23:12:44.471 MoviePre 28: Unicode (UTF-8) == 0x4
2009-06-08 23:12:44.472 MoviePre 29: Unicode (UTF-16) == 0xa
2009-06-08 23:12:44.473 MoviePre 30: Unicode (UTF-16BE) == 0x90000100
2009-06-08 23:12:44.473 MoviePre 31: Unicode (UTF-16LE) == 0x94000100
2009-06-08 23:12:44.480 MoviePre 32: Unicode (UTF-32) == 0x8c000100
2009-06-08 23:12:44.480 MoviePre 33: Unicode (UTF-32BE) == 0x98000100
2009-06-08 23:12:44.481 MoviePre 34: Western (ISO Latin 1) == 0x5
2009-06-08 23:12:44.481 MoviePre 35: Central European (ISO Latin 2) == 0x9
2009-06-08 23:12:44.481 MoviePre 36: Western (ISO Latin 3) == 0x80000203
2009-06-08 23:12:44.482 MoviePre 37: Central European (ISO Latin 4) == 0x80000204
2009-06-08 23:12:44.493 MoviePre 38: Cyrillic (ISO 8859-5) == 0x80000205
2009-06-08 23:12:44.493 MoviePre 39: Arabic (ISO 8859-6) == 0x80000206
2009-06-08 23:12:44.494 MoviePre 40: Greek (ISO 8859-7) == 0x80000207
2009-06-08 23:12:44.494 MoviePre 41: Hebrew (ISO 8859-8) == 0x80000208
2009-06-08 23:12:44.495 MoviePre 42: Turkish (ISO Latin 5) == 0x80000209
2009-06-08 23:12:44.495 MoviePre 43: Nordic (ISO Latin 6) == 0x8000020a
2009-06-08 23:12:44.506 MoviePre 44: Thai (ISO 8859-11) == 0x8000020b
2009-06-08 23:12:44.507 MoviePre 45: Baltic Rim (ISO Latin 7) == 0x8000020d
2009-06-08 23:12:44.510 MoviePre 46: Celtic (ISO Latin 8) == 0x8000020e
2009-06-08 23:12:44.511 MoviePre 47: Western (ISO Latin 9) == 0x8000020f
2009-06-08 23:12:44.511 MoviePre 48: Romanian (ISO Latin 10) == 0x80000210
2009-06-08 23:12:44.512 MoviePre 49: Latin-US (DOS) == 0x80000400
2009-06-08 23:12:44.512 MoviePre 50: Greek (DOS) == 0x80000405
2009-06-08 23:12:44.513 MoviePre 51: Baltic Rim (DOS) == 0x80000406
2009-06-08 23:12:44.513 MoviePre 52: Western (DOS Latin 1) == 0x80000410
2009-06-08 23:12:44.513 MoviePre 53: Greek (DOS Greek 1) == 0x80000411
2009-06-08 23:12:44.514 MoviePre 54: Central European (DOS Latin 2) == 0x80000412
2009-06-08 23:12:44.514 MoviePre 55: Cyrillic (DOS) == 0x80000413
2009-06-08 23:12:44.514 MoviePre 56: Turkish (DOS) == 0x80000414
2009-06-08 23:12:44.515 MoviePre 57: Portuguese (DOS) == 0x80000415
2009-06-08 23:12:44.516 MoviePre 58: Icelandic (DOS) == 0x80000416
2009-06-08 23:12:44.517 MoviePre 59: Hebrew (DOS) == 0x80000417
2009-06-08 23:12:44.517 MoviePre 60: Canadian French (DOS) == 0x80000418
2009-06-08 23:12:44.517 MoviePre 61: Arabic (DOS) == 0x80000419
2009-06-08 23:12:44.518 MoviePre 62: Nordic (DOS) == 0x8000041a
2009-06-08 23:12:44.518 MoviePre 63: Russian (DOS) == 0x8000041b
2009-06-08 23:12:44.519 MoviePre 64: Greek (DOS Greek 2) == 0x8000041c
2009-06-08 23:12:44.519 MoviePre 65: Thai (Windows, DOS) == 0x8000041d
2009-06-08 23:12:44.520 MoviePre 66: Japanese (Windows, DOS) == 0x8
2009-06-08 23:12:44.522 MoviePre 67: Simplified Chinese (Windows, DOS) == 0x80000421
2009-06-08 23:12:44.522 MoviePre 68: Korean (Windows, DOS) == 0x80000422
2009-06-08 23:12:44.524 MoviePre 69: Traditional Chinese (Windows, DOS) == 0x80000423
2009-06-08 23:12:44.524 MoviePre 70: Western (Windows Latin 1) == 0xc
2009-06-08 23:12:44.525 MoviePre 71: Central European (Windows Latin 2) == 0xf
2009-06-08 23:12:44.525 MoviePre 72: Cyrillic (Windows) == 0xb
2009-06-08 23:12:44.525 MoviePre 73: Greek (Windows) == 0xd
2009-06-08 23:12:44.526 MoviePre 74: Turkish (Windows Latin 5) == 0xe
2009-06-08 23:12:44.526 MoviePre 75: Hebrew (Windows) == 0x80000505
2009-06-08 23:12:44.526 MoviePre 76: Arabic (Windows) == 0x80000506
2009-06-08 23:12:44.527 MoviePre 77: Baltic Rim (Windows) == 0x80000507
2009-06-08 23:12:44.529 MoviePre 78: Vietnamese (Windows) == 0x80000508
2009-06-08 23:12:44.531 MoviePre 79: Western (ASCII) == 0x1
2009-06-08 23:12:44.532 MoviePre 80: Japanese (Shift JIS X0213) == 0x80000628
2009-06-08 23:12:44.533 MoviePre 81: Chinese (GBK) == 0x80000631
2009-06-08 23:12:44.534 MoviePre 82: Chinese (GB 18030) == 0x80000632
2009-06-08 23:12:44.534 MoviePre 83: Japanese (ISO 2022-JP) == 0x15
2009-06-08 23:12:44.535 MoviePre 84: Korean (ISO 2022-KR) == 0x80000840
2009-06-08 23:12:44.536 MoviePre 85: Japanese (EUC) == 0x3
2009-06-08 23:12:44.536 MoviePre 86: Simplified Chinese (EUC) == 0x80000930
2009-06-08 23:12:44.538 MoviePre 87: Traditional Chinese (EUC) == 0x80000931
2009-06-08 23:12:44.539 MoviePre 88: Korean (EUC) == 0x80000940
2009-06-08 23:12:44.540 MoviePre 89: Japanese (Shift JIS) == 0x80000a01
2009-06-08 23:12:44.541 MoviePre 90: Cyrillic (KOI8-R) == 0x80000a02
2009-06-08 23:12:44.541 MoviePre 91: Traditional Chinese (Big 5) == 0x80000a03
2009-06-08 23:12:44.541 MoviePre 92: Western (Mac Mail) == 0x80000a04
2009-06-08 23:12:44.542 MoviePre 93: Simplified Chinese (HZ GB 2312) == 0x80000a05
2009-06-08 23:12:44.542 MoviePre 94: Traditional Chinese (Big 5 HKSCS) == 0x80000a06
2009-06-08 23:12:44.542 MoviePre 95: Ukrainian (KOI8-U) == 0x80000a08
2009-06-08 23:12:44.546 MoviePre 96: Traditional Chinese (Big 5-E) == 0x80000a09
2009-06-08 23:12:44.547 MoviePre 97: Western (NextStep) == 0x2
2009-06-08 23:12:44.547 MoviePre 98: Non-lossy ASCII == 0x7
2009-06-08 23:12:44.548 MoviePre 99: Western (EBCDIC US) == 0x80000c01
2009-06-08 23:12:44.548 MoviePre 100: Western (EBCDIC Latin 1) == 0x80000c02
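There they are. Now try the few Chinese encodings in the list.
In the end I used entry 81, Chinese (GBK). The code is:
NSString *pageSource = [NSString stringWithContentsOfURL:url encoding:0x80000631 error:nil];
Both the log and the simulator now display the text correctly.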
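The literal 0x80000631 works but is hard to read. As a sketch, the same value can be derived from the CoreFoundation constant for GBK; `GBKStringEncoding` is a hypothetical helper name, and the equivalence relies on `kCFStringEncodingGBK_95` mapping to the value printed at entry 81 of the list above.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the original post): the NSStringEncoding
// for GBK, built from the CoreFoundation constant instead of a magic number.
static NSStringEncoding GBKStringEncoding(void) {
    return CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingGBK_95);
}
```

It would then be passed as `encoding:GBKStringEncoding()` in place of the hexadecimal literal.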
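I have not tested this on a real device yet.
Once you have the page content, a few regular expressions are enough to extract whatever you want :)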
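As an illustration of the regular-expression step, here is a minimal sketch that pulls the page title out of the decoded source. It assumes `NSRegularExpression` is available (iOS 4 or later) and ARC; `ExtractTitle` and the `<title>` pattern are my own examples, not something the original post specifies.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the original post): returns the text
// between <title> and </title>, or nil when no title element is found.
static NSString *ExtractTitle(NSString *html) {
    if (html == nil) {
        return nil;
    }
    NSError *error = nil;
    NSRegularExpressionOptions options =
        NSRegularExpressionCaseInsensitive |
        NSRegularExpressionDotMatchesLineSeparators;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"<title>(.*?)</title>"
                                                  options:options
                                                    error:&error];
    if (regex == nil) {
        return nil;
    }
    NSTextCheckingResult *match =
        [regex firstMatchInString:html
                          options:0
                            range:NSMakeRange(0, html.length)];
    if (match == nil) {
        return nil;
    }
    // Capture group 1 holds the text between the two tags
    return [html substringWithRange:[match rangeAtIndex:1]];
}
```

Regular expressions are fine for pulling one or two fields out of a known page, though they tend to break when the markup changes.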
---------------------------------------------------
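Garbled GBK-encoded data fetched from an RSS feed can be handled as follows:
NSURL *url = [NSURL URLWithString:feedURLString]; // feedURLString: placeholder, the feed address is not given here
NSData *data = [NSData dataWithContentsOfURL:url]; // fetch the raw bytes first
// GB 18030 is a superset of GBK, so it also decodes GBK data
NSStringEncoding enc = CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingGB_18030_2000);
NSString *retStr = [[NSString alloc] initWithData:data encoding:enc];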
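If the XML parser still rejects or misreads the feed because its XML declaration names a GB encoding, one common workaround is to re-encode the bytes as UTF-8 and rewrite the declaration before parsing. This is a sketch under assumptions: `UTF8DataFromGBData` is a hypothetical helper, it assumes ARC, it only handles a double-quoted `encoding="..."` attribute, and it only looks at the first 200 characters of the document.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the original post): decodes GBK / GB 18030
// bytes and returns the same document as UTF-8 data, with the encoding
// named in the XML declaration rewritten to match.
static NSData *UTF8DataFromGBData(NSData *gbData) {
    if (gbData == nil) {
        return nil;
    }
    NSStringEncoding gb =
        CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingGB_18030_2000);
    NSString *text = [[NSString alloc] initWithData:gbData encoding:gb];
    if (text == nil) {
        return nil; // the bytes were not valid GB 18030
    }
    // Only search the start of the document, where the declaration lives
    NSRange head = NSMakeRange(0, MIN((NSUInteger)200, text.length));
    NSString *fixed =
        [text stringByReplacingOccurrencesOfString:@"encoding=\"[^\"]*\""
                                        withString:@"encoding=\"UTF-8\""
                                           options:NSRegularExpressionSearch
                                             range:head];
    return [fixed dataUsingEncoding:NSUTF8StringEncoding];
}
```

The returned data could then be handed to `[[NSXMLParser alloc] initWithData:]` in place of the original bytes.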