[Experience Sharing] Hadoop error: Too many open files
http://www.iyunv.com/wycg1984/archive/2010/04/27/1722431.html

Below is a summary of common errors encountered when using Hadoop, along with their solutions.

1. "Too many open files" error
Sometimes, after a MapReduce job has been running for a while, you find that the datanodes have all suddenly died. Looking at the logs, you see many "Too many open files" errors:
2008-09-11 20:20:22,836 ERROR org.apache.hadoop.dfs.DataNode: 192.168.1.34:50010:DataXceiver: java.io.IOException: Too many open files   
at sun.nio.ch.EPollArrayWrapper.epollCreate(Native Method)   
at sun.nio.ch.EPollArrayWrapper.<init>(EPollArrayWrapper.java:68)
at sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:52)
at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:18)   
at sun.nio.ch.Util.getTemporarySelector(Util.java:123)   
at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:92)   
at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1150)   
at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:994)   
at java.lang.Thread.run(Thread.java:619)   
This happens because many clients are requesting data from the datanode at the same time, which uses up a large number of file descriptors. Since the Linux systems I was using allow a single process to open only 1,024 files by default, that limit was quickly exceeded.
The fix is to add this line to /etc/security/limits.conf:



* - nofile 8192
This lets a single process open up to 8,192 files at the same time. After making the change, restart the datanode and the problem goes away.
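To confirm that the new limit has actually taken effect (limits.conf is applied at login, so the datanode must be restarted from a fresh session), a minimal check might look like this; <datanode-pid> is a placeholder for the real process ID:

ulimit -n                                      # open-file limit for the current shell
grep "open files" /proc/<datanode-pid>/limits  # limit applied to the running datanode process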

2. What to do when something goes wrong: check the logs, try single-node mode


  • If you are having problems, check the logs in the logs directory to see if there are any Hadoop errors or Java Exceptions.
  • Logs are named by machine and job they carry out in the cluster, and this can help you figure out which part of your configuration is giving you trouble.
  • Even if you were very careful, the problem is probably with your configuration. Try running the grep example from the QuickStart (see the sketch after this list). If it doesn't run, then you need to check your configuration.
  • If you can't get it to work on a real cluster, try it on a single node.
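A minimal sketch of both checks, assuming the default logs/ directory and the examples jar that ships with your Hadoop release (the jar name varies by version):

grep -iE "exception|error" logs/*.log                                 # scan daemon logs for errors and Java exceptions
bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'   # smoke-test with the QuickStart grep example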

3. Common problems and their solutions
Each entry below lists the symptom, the possible problem, and a possible solution.

Symptom: You get an error that your cluster is in "safe mode".
Possible problem: Your cluster enters safe mode when it hasn't been able to verify that all the data nodes necessary to replicate your data are up and responding. Check the documentation to learn more about safe mode.
Possible solution:
  • First, wait a minute or two and then retry your command. If you just started your cluster, it's possible that it isn't fully initialized yet.
  • If waiting a few minutes didn't help and you still get a "safe mode" error, check your logs to see if any of your data nodes didn't start correctly (either they have Java exceptions in their logs or they have messages stating that they are unable to contact some other node in your cluster). If this is the case you need to resolve the configuration issue (or possibly pick some new nodes) before you can continue.
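To check whether the namenode is still in safe mode, and to leave it manually once you are sure the data nodes are healthy, the dfsadmin command can help (a small sketch using the old bin/hadoop dfsadmin syntax that matches the other commands in this post):

bin/hadoop dfsadmin -safemode get     # report whether safe mode is ON or OFF
bin/hadoop dfsadmin -safemode wait    # block until the namenode leaves safe mode on its own
bin/hadoop dfsadmin -safemode leave   # force the namenode out of safe mode (use with care)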
Symptom: You get a NoRouteToHostException in your logs or in stderr output from a command.
Possible problem: One of your nodes cannot be reached correctly. This may be a firewall issue, so you should report it to me.
Possible solution: The only workaround is to pick a new node to replace the unreachable one. Currently, I think that creusa is unreachable, but all other Linux boxes should be okay. None of the Macs will currently work in a cluster.
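Before swapping nodes, it can help to confirm the host really is unreachable from the machine reporting the exception (a rough sketch; the hostname is a placeholder for your own node, and 50010 is the default datanode data-transfer port seen in the log above):

ping some-datanode          # basic reachability
telnet some-datanode 50010  # is the datanode's data-transfer port accepting connections?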
  You get an error that "remote host identification has changed" when you try to ssh to localhost.
You have moved your single node cluster from one machine in the Berry Patch to another. The name localhost thus is pointing to a new machine, and your ssh client thinks that it might be a man-in-the-middle attack.
You can ask your login to skip checking the validity of localhost. You do this by setting NoHostAuthenticationForLocalhost to yes in ~/.ssh/config. You can accomplish this with the following command:

echo "NoHostAuthenticationForLocalhost yes" >>~/.ssh/config
Symptom: Your DataNode is started and you can create directories with bin/hadoop dfs -mkdir, but you get an error message when you try to put files into the HDFS (e.g., when you run a command like bin/hadoop dfs -put).
Possible problem: Creating directories is only a function of the NameNode, so your DataNode is not exercised until you actually want to put some bytes into a file. If you are sure that the DataNode is started, then it could be that your DataNodes are out of disk space.
Possible solution:
  • Go to the HDFS info web page (open your web browser and go to http://namenode:dfs_info_port, where namenode is the hostname of your NameNode and dfs_info_port is the port you chose for dfs.info.port; if you followed the QuickStart on your personal computer then this URL will be http://localhost:50070). Once at that page, click on the number telling you how many DataNodes you have, to see a list of the DataNodes in your cluster.
  • If it says you have used 100% of your space, then you need to free up room on local disk(s) of the DataNode(s).
  • If you are on Windows then this number will not be accurate (there is some kind of bug either in Cygwin's df.exe or in Windows). Just free up some more space and you should be okay. On one Windows machine we tried the disk had 1GB free but Hadoop reported that it was 100% full. Then we freed up another 1GB and then it said that the disk was 99.15% full and started writing data into the HDFS again. We encountered this bug on Windows XP SP2.
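A command-line way to get the same information (a small sketch; bin/hadoop dfsadmin -report prints capacity and usage for every DataNode, while df -h must be run on the DataNode hosts themselves):

bin/hadoop dfsadmin -report   # DFS capacity, used and remaining space, per DataNode
df -h                         # local disk usage on each DataNode machine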
Symptom: You try to run the grep example from the QuickStart but you get an error message like this:

java.io.IOException: Not a file:
  hdfs://localhost:9000/user/ross/input/conf
  
Possible problem: You may have created a directory inside the input directory in the HDFS. For example, this might happen if you run bin/hadoop dfs -put conf input twice in a row (this would create a subdirectory in input... why?).
Possible solution: The easiest way to get the example running is to just start over and make the input anew:

bin/hadoop dfs -rmr input
bin/hadoop dfs -put conf input
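If you want to confirm that a nested directory really is the cause before deleting anything, a quick recursive listing (a sketch, using the old dfs -lsr option) will show it:

bin/hadoop dfs -lsr input   # a nested conf directory inside input confirms the problem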
Symptom: Your DataNodes won't start, and you see something like this in logs/*datanode*:

Incompatible namespaceIDs in /tmp/hadoop-ross/dfs/data
  
Possible problem: Your Hadoop namespaceID became corrupted. Unfortunately, the easiest thing to do is to reformat the HDFS.
Possible solution: You need to do something like this:

bin/stop-all.sh
rm -Rf /tmp/hadoop-your-username/*
bin/hadoop namenode -format
  Be VERY careful with rm -Rf
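If the cluster already holds data you cannot afford to lose, a commonly used alternative (a sketch only; the paths assume the default /tmp/hadoop-your-username layout shown above) is to make the DataNode's stored namespaceID match the NameNode's instead of reformatting:

grep namespaceID /tmp/hadoop-your-username/dfs/name/current/VERSION   # on the NameNode: find the current ID
# on each failing DataNode: stop it, set the namespaceID line in
# /tmp/hadoop-your-username/dfs/data/current/VERSION to the same value, then restart it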
Symptom: When you try the grep example in the QuickStart, you get an error like the following:

org.apache.hadoop.mapred.InvalidInputException:
  Input path doesnt exist : /user/ross/input
  
Possible problem: You haven't created an input directory containing one or more text files.
Possible solution: Create one, for example:
bin/hadoop dfs -put conf input
Symptom: When you try the grep example in the QuickStart, you get an error like the following:

org.apache.hadoop.mapred.FileAlreadyExistsException:
  Output directory /user/ross/output already exists
  
Possible problem: You might have already run the example once, creating an output directory. Hadoop doesn't like to overwrite files.
Possible solution: Remove the output directory before rerunning the example:

bin/hadoop dfs -rmr output
  Alternatively you can change the output directory of the grep example, something like this:

bin/hadoop jar hadoop-*-examples.jar \
grep input output2 'dfs[a-z.]+'
Symptom: You can run Hadoop jobs written in Java (like the grep example), but your HadoopStreaming jobs (such as the Python example that fetches web page titles) won't work.
Possible problem: You might have given only a relative path to the mapper and reducer programs. The tutorial originally just specified relative paths, but absolute paths are required if you are running in a real cluster.
Possible solution: Use absolute paths like this from the tutorial:

bin/hadoop jar contrib/hadoop-0.15.2-streaming.jar \
  -mapper  $HOME/proj/hadoop/multifetch.py         \
  -reducer $HOME/proj/hadoop/reducer.py            \
  -input   urls/*                                  \
  -output  titles
