
[Experience Sharing] Miscellaneous Hadoop Notes

  (1).
  When running the following command:
    % echo "Text" | hadoop StreamCompressor org.apache.hadoop.io.compress.GzipCodec | gunzip -
  Text
  Because this file was compiled in Eclipse, it carries its custom package path by default: ch04/StreamCompressor. If you cd into the ch04 directory and run the command there, you get this error:
  Exception in thread "main" java.lang.NoClassDefFoundError: StreamCompressor (wrong name: ch04/StreamCompressor)
  You have to move to the parent directory of ch04 and use the fully qualified class name for the command to succeed:
    % echo "Text" | hadoop ch04.StreamCompressor org.apache.hadoop.io.compress.GzipCodec | gunzip -    
    At the same time, don't forget to re-export HADOOP_CLASSPATH as the parent directory of ch04.
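  The whole sequence looks like this (the /home/user/workspace path is made up for illustration; use wherever your compiled classes actually live):
    % export HADOOP_CLASSPATH=/home/user/workspace        # the parent directory of ch04
    % cd /home/user/workspace
    % echo "Text" | hadoop ch04.StreamCompressor org.apache.hadoop.io.compress.GzipCodec | gunzip -
  Text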
  (2).
  Whether Hadoop runs in standalone (local) mode or pseudo-distributed mode is decided by the three configuration files in the conf directory: core-site.xml, hdfs-site.xml, and mapred-site.xml. To switch modes, edit these three files and then run hadoop namenode -format (in pseudo-distributed mode you additionally run start-dfs.sh and start-mapred.sh, or start-all.sh; in standalone mode none of these steps are needed, and running them anyway produces errors like "does not contain a valid host:port authority: localhost" in logs/*.log). When learning Hadoop or developing an application, it is recommended to run in standalone mode first and only move to a full cluster once the application works there; otherwise copying files off HDFS every time is a hassle.
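  For reference, a minimal pseudo-distributed configuration might look like the sketch below (these are the classic Hadoop 1.x property names used throughout this post; the ports match the defaults mentioned in (4)):
    <!-- core-site.xml -->
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:8020</value>
      </property>
    </configuration>
    <!-- hdfs-site.xml -->
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>
    <!-- mapred-site.xml -->
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:8021</value>
      </property>
    </configuration>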
  (3).
  I just found that running start-all.sh did not start the datanodes on the slaves. Checking logs/***.log on a slave showed: java.io.IOException: Incompatible namespaceIDs in /home/admin/joe.wangh/hadoop/data/dfs.data.dir: namenode namespaceID = 898136669; datanode namespaceID = 2127444065. The cause: every hadoop namenode -format creates a new namespaceID, but tmp/dfs/data (or whatever directory the dfs.data.dir property in hdfs-site.xml points to) still holds the ID from the previous format. Formatting clears the namenode's data but not the datanodes', so startup fails. The routine fix is to clear all the directories under tmp before each format (and run stop-all.sh first to stop the distributed filesystem). Alternatively, the cluster can be rescued as follows:
Updating namespaceID of problematic datanodes


Big thanks to Jared Stehler for the following suggestion. I have not tested it myself yet, but feel free to try it out and send me your feedback. This workaround is "minimally invasive" as you only have to edit one file on the problematic datanodes:
1. stop the datanode
2. edit the value of namespaceID in /current/VERSION to match the value of the current namenode
3. restart the datanode
If you followed the instructions in my tutorials, the full path of the relevant file is /usr/local/hadoop-datastore/hadoop-hadoop/dfs/data/current/VERSION (background: dfs.data.dir is by default set to ${hadoop.tmp.dir}/dfs/data, and we set hadoop.tmp.dir to /usr/local/hadoop-datastore/hadoop-hadoop).
If you wonder what the contents of VERSION look like, here's one of mine:
#contents of /current/VERSION
namespaceID=393514426
storageID=DS-1706792599-10.10.10.1-50010-1204306713481
cTime=1215607609074
storageType=DATA_NODE
layoutVersion=-13
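  In shell terms, the workaround amounts to something like this (the data directory below matches the error message above; substitute your own dfs.data.dir):
    % bin/hadoop-daemon.sh stop datanode
    % vi /home/admin/joe.wangh/hadoop/data/dfs.data.dir/current/VERSION
        # change namespaceID=2127444065 to namespaceID=898136669 (the namenode's value)
    % bin/hadoop-daemon.sh start datanode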
  
(4).
  In Eclipse, the default ports in a Map/Reduce Location for DFS Master and Map/Reduce Master are 50040 and 50020. These differ from the defaults in Hadoop's own configuration files, core-site.xml (fs.default.name) and mapred-site.xml (mapred.job.tracker), which are 8020 and 8021 respectively, so take care to change them to match.
  (5).
  When attaching Hadoop's source code in Eclipse, if you have no source JAR you can still attach it via "add external folder"; just remember that the folder you add must be the directory that contains org.
  (6).
  The local job runner uses a single JVM to run a job, so as long as all the classes that your job needs are on its classpath, then things will just work.
  In a distributed setting, things are a little more complex. For a start, a job’s classes must be packaged into a job JAR file to send to the cluster. Hadoop will find the job JAR automatically by searching for the JAR on the driver’s classpath that contains the class set in the setJarByClass() method (on JobConf or Job). Alternatively, if you want to set an explicit JAR file by its file path, you can use the setJar() method.
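  A minimal driver sketch of the two approaches (the class name MyDriver is made up for illustration):
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class MyDriver {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "example");
            job.setJarByClass(MyDriver.class); // Hadoop searches the driver's classpath for the JAR containing this class
            // job.setJar("/path/to/job.jar"); // or name the job JAR explicitly by file path
            // ... configure mapper, reducer, input/output paths, then job.waitForCompletion(true)
        }
    }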
  (7). Classpath
  The client classpath
The user’s client-side classpath set by hadoop jar <jar> is made up of:
· The job JAR file
· Any JAR files in the lib directory of the job JAR file, and the classes directory (if present)
· The classpath defined by HADOOP_CLASSPATH, if set
Incidentally, this explains why you have to set HADOOP_CLASSPATH to point to dependent classes and libraries if you are running using the local job runner without a job JAR (hadoop CLASSNAME).
  The task classpath
On a cluster (and this includes pseudodistributed mode), map and reduce tasks run in separate JVMs, and their classpaths are not controlled by HADOOP_CLASSPATH.
HADOOP_CLASSPATH is a client-side setting and only sets the classpath for the driver JVM, which submits the job.
Instead, the user’s task classpath is comprised of the following:
· The job JAR file
· Any JAR files contained in the lib directory of the job JAR file, and the classes directory (if present)
· Any files added to the distributed cache, using the -libjars option, or the addFileToClassPath() method on DistributedCache (old API), or Job (new API)

Packaging dependencies
  Given these different ways of controlling what is on the client and task classpaths, there are corresponding options for including library dependencies for a job.
· Unpack the libraries and repackage them in the job JAR.
· Package the libraries in the lib directory of the job JAR.
· Keep the libraries separate from the job JAR, and add them to the client classpath via HADOOP_CLASSPATH and to the task classpath via -libjars.
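  On the command line, the third option might look like this (JAR and class names are illustrative):
    % export HADOOP_CLASSPATH=lib/dependency.jar
    % hadoop jar job.jar MyDriver -libjars lib/dependency.jar input/ output/
  Note that -libjars is interpreted by GenericOptionsParser, so the driver has to run through ToolRunner for it to take effect.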
  (8).
  Every Hadoop task occupies one slot on a tasktracker. The total numbers of map and reduce slots in the system are computed as:
total map slots = number of cluster nodes × mapred.tasktracker.map.tasks.maximum
total reduce slots = number of cluster nodes × mapred.tasktracker.reduce.tasks.maximum
Setting the number of reducers slightly below the total number of reduce slots in the cluster lets all the reduce tasks run at the same time, while leaving room to tolerate a few failed reduce tasks.
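  A quick worked example under assumed numbers: with 10 tasktracker nodes and mapred.tasktracker.reduce.tasks.maximum set to 2, the cluster has 10 × 2 = 20 reduce slots, so a job configured as below runs all its reducers in a single wave with a little slack:
    job.setNumReduceTasks(18); // slightly fewer than the 20 reduce slots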
  (9).
  If the namenode sometimes fails to start, you can stop Hadoop first (stop-all.sh), reformat HDFS with hadoop namenode -format (beware: you will lose all data on HDFS), and then start Hadoop again (start-all.sh).
  (10).
  When formatting the namenode on a cluster (hadoop namenode -format), the command has no effect if Hadoop is still running. If Hadoop is stopped and you run the command on the master only, the master's namespaceID changes while the slaves' namespaceIDs do not, so the next cluster startup fails.
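  A sketch of a safe full reformat, assuming dfs.data.dir is left at its default of ${hadoop.tmp.dir}/dfs/data and hadoop.tmp.dir at /tmp/hadoop-${user.name} (substitute your own directories otherwise):
    % stop-all.sh
    % bin/slaves.sh rm -rf /tmp/hadoop-$USER/dfs/data   # clear stale datanode data on every slave
    % hadoop namenode -format
    % start-all.sh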
  
