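Source: http://hmilyzhangl.iteye.com/blog/1407214
I. Cause of the crash
This was a Hadoop test cluster, so the replication factor was set to dfs.replication=1, which means that if a single datanode fails, its data is lost. Unfortunately, one machine did corrupt its data under excessive load. The whole Hadoop cluster then had to be restarted, and after the restart the namenode would not start. It reported the following error: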
- FSNamesystem initialization failed saveLeases found path /tmp/xxx/aaa.txt but no matching entry in namespace.
II. Repairing the namenode
The Hadoop cluster had crashed, leaving the namenode unable to start.
1. Delete the metadata directory on the namenode master
rm -fr /data/hadoop-tmp/hadoop-hadoop/dfs/name
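Since rm -fr cannot be undone, a more cautious variant of this step is to rename the directory rather than delete it. The following is a minimal sketch, not the procedure used in this post; the function name set_aside_name_dir and the .bak-<timestamp> suffix are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: rename the namenode metadata directory instead of
# deleting it, so the old image and edit log can still be inspected later.
set_aside_name_dir() {
    name_dir=$1
    if [ ! -d "$name_dir" ]; then
        echo "no such directory: $name_dir" >&2
        return 1
    fi
    # The timestamp suffix is an illustrative naming choice.
    backup="${name_dir}.bak-$(date +%Y%m%d%H%M%S)"
    mv "$name_dir" "$backup" && echo "$backup"
}

# Usage (path taken from the step above):
#   set_aside_name_dir /data/hadoop-tmp/hadoop-hadoop/dfs/name
```

The function prints the new location on success, so the directory can be removed for good once the restore has been verified.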
2. Start the secondary namenode
Run start-all.sh to start the secondary namenode; ignore the fact that the namenode itself fails to start.
3. Restore from the secondary namenode
Run: hadoop namenode -importCheckpoint
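According to the Hadoop documentation for this option, the namenode loads the image from the directory set in fs.checkpoint.dir, saves it into dfs.name.dir, and refuses to run if dfs.name.dir already contains a valid image, which is why the name directory is cleared in step 1. A pre-flight check could be sketched as follows. The function name can_import_checkpoint is hypothetical, and the current/fsimage layout is an assumption that holds for 0.20/1.x-era releases.

```shell
#!/bin/sh
# Hypothetical pre-flight check before `hadoop namenode -importCheckpoint`.
# Assumes an image is stored at <dir>/current/fsimage (0.20/1.x layout).
can_import_checkpoint() {
    name_dir=$1        # value of dfs.name.dir
    checkpoint_dir=$2  # value of fs.checkpoint.dir
    if [ -f "$name_dir/current/fsimage" ]; then
        echo "name dir already holds an image: $name_dir" >&2
        return 1
    fi
    if [ ! -f "$checkpoint_dir/current/fsimage" ]; then
        echo "no checkpoint image under: $checkpoint_dir" >&2
        return 1
    fi
    return 0
}

# Usage (the second path is a placeholder, not taken from the post):
#   can_import_checkpoint /data/hadoop-tmp/hadoop-hadoop/dfs/name /path/to/checkpoint \
#       && hadoop namenode -importCheckpoint
```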
During the restore, some data files turned out to be damaged already (because dfs.replication=1), so the namenode could never leave safe mode and kept reporting:
- The ratio of reported blocks 0.8866 has not reached the threshold 0.9990. Safe mode will be turned off automatically.
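The threshold in this message is the dfs.safemode.threshold.pct setting, which defaults to 0.999. With blocks permanently lost, the reported ratio can never reach it, so safe mode will not end by itself. As an illustration, the two numbers can be pulled out of such a message and compared; the function name safemode_gap and its output format are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read a safe mode status message on stdin and print
# the reported ratio, the threshold, and whether the threshold is met.
safemode_gap() {
    # Capture the two decimal numbers; each capture must end in a digit,
    # so a trailing sentence period is not swallowed.
    sed -n 's/.*reported blocks \([0-9.]*[0-9]\) has not reached the threshold \([0-9.]*[0-9]\).*/\1 \2/p' |
    awk '{ printf "ratio=%s threshold=%s met=%s\n", $1, $2, (($1 + 0 >= $2 + 0) ? "yes" : "no") }'
}

# Usage, with the message quoted above:
#   echo 'The ratio of reported blocks 0.8866 has not reached the threshold 0.9990. Safe mode will be turned off automatically.' | safemode_gap
```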
4. Force the namenode out of safe mode
- hadoop dfsadmin -safemode leave
The namenode finally started, but the HDFS web page showed a warning:
- WARNING : There are about 257 missing blocks. Please check the log or run fsck.
5. Check the list of damaged HDFS files
Running fsck prints the list of damaged files. Output:
- /user/hive/warehouse/pay_consume_orgi/dt=2011-06-28/consume_2011-06-28.sql: MISSING 1 blocks of total size 1250990 B..
- /user/hive/warehouse/pay_consume_orgi/dt=2011-06-29/consume_2011-06-29.sql: CORRUPT block blk_977550919055291594
-
- /user/hive/warehouse/pay_consume_orgi/dt=2011-06-29/consume_2011-06-29.sql: MISSING 1 blocks of total size 1307147 B..................
- Status: CORRUPT
- Total size: 235982871209 B
- Total dirs: 1213
- Total files: 1422
- Total blocks (validated): 4550 (avg. block size 51864367 B)
- ********************************
- CORRUPT FILES: 277
- MISSING BLOCKS: 509
- MISSING SIZE: 21857003415 B
- CORRUPT BLOCKS: 509
- ********************************
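The fsck invocation itself is not shown here. As a sketch, assuming it was run against the root path with hadoop fsck /, the affected file paths can be reduced to a unique list. The function name corrupt_paths is hypothetical, and it assumes raw fsck output without the leading "- " markers seen in this listing.

```shell
#!/bin/sh
# Hypothetical helper: read fsck output on stdin and print each file path
# that is reported with MISSING or CORRUPT blocks, once per file.
corrupt_paths() {
    # Report lines look like "<path>: MISSING ..." or "<path>: CORRUPT ...".
    # Paths that themselves contain ": " are not handled by this sketch.
    awk -F': ' '/^\// && ($2 ~ /^MISSING / || $2 ~ /^CORRUPT /) { print $1 }' | sort -u
}

# Usage (the root path "/" is an assumption):
#   hadoop fsck / | corrupt_paths
```

A file with several damaged blocks appears on several report lines, which is why the list is deduplicated.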
With no redundant replicas, the damaged files cannot be recovered; the only option is to delete them.
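The delete command is not shown here. fsck does provide two options for this situation: -move, which moves corrupted files to /lost+found, and -delete, which removes them. The following sketch wraps both, assuming the root path as the default; the function name handle_corrupt_files and the DRY_RUN variable are hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper around fsck's -move and -delete options.
# With DRY_RUN=1 (the default here) it only prints the command it would run.
handle_corrupt_files() {
    action=$1      # "move" or "delete"
    path=${2:-/}   # defaulting to the root path is an assumption
    case $action in
        move|delete) ;;
        *) echo "usage: handle_corrupt_files move|delete [path]" >&2; return 2 ;;
    esac
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "hadoop fsck $path -$action"
    else
        hadoop fsck "$path" "-$action"
    fi
}

# Usage:
#   handle_corrupt_files delete              # prints the command only
#   DRY_RUN=0 handle_corrupt_files delete    # actually deletes corrupted files
```

Defaulting to a dry run is a deliberate choice in this sketch, because -delete permanently removes every file that fsck considers corrupted.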
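III. Summary
Be sure to run the secondary namenode and the namenode on two separate machines. This improves the namenode's fault tolerance, so that when the cluster crashes the data can be restored from the secondary namenode.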