Deploying a distributed Hadoop cluster, and some pitfalls encountered along the way

hitl · posted on 2018-10-28 15:12:45

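When learning Hadoop, the first step is to deploy a pseudo-distributed setup and then a fully distributed cluster.
For the cluster deployment I used this blog post as a reference: http://www.powerxing.com/install-hadoop-cluster/
I ran into a few problems along the way.
For example, to run a simple MapReduce job written in Python, you first need to download the Hadoop Streaming JAR. Put simply, this JAR wraps the commonly used interfaces: Hadoop passes records to the Python script on standard input and reads its results from standard output, while the job itself is still driven by the Java implementation inside Hadoop.
Download location: http://www.java2s.com/Code/JarDownload/hadoop-streaming/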
The Python program is mapper.py:

#!/usr/bin/env python
import sys

# Emit "<word>\t1" for every word read from standard input.
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print("%s\t%s" % (word, 1))

and reducer.py:

#!/usr/bin/env python
import sys

current_word = None
current_count = 0

# Input arrives sorted by key, so identical words are adjacent.
for line in sys.stdin:
    line = line.strip()
    try:
        word, count = line.split('\t', 1)
        count = int(count)
    except ValueError:  # skip lines with no tab or a non-numeric count
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word is not None:
            print("%s\t%s" % (current_word, current_count))
        current_count = count
        current_word = word

if current_word is not None:  # do not forget to emit the last word
    print("%s\t%s" % (current_word, current_count))
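
Before submitting the job to the cluster, it can help to check the word-count logic locally. The following is a minimal sketch, not part of the original scripts: it reproduces the map, sort, and reduce stages in a single Python process, with `sorted()` standing in for Hadoop's shuffle/sort phase. The function names `map_words`, `reduce_counts`, and `word_count` are hypothetical and chosen only for illustration.

```python
from itertools import groupby


def map_words(lines):
    """Emit (word, 1) pairs, mirroring what mapper.py writes to stdout."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1


def reduce_counts(sorted_pairs):
    """Sum the counts of adjacent equal keys, as reducer.py does on sorted input."""
    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


def word_count(lines):
    # sorted() stands in for Hadoop's shuffle/sort phase between map and reduce.
    return dict(reduce_counts(sorted(map_words(lines))))


if __name__ == "__main__":
    sample = ["hello hadoop", "hello streaming"]
    for word, count in sorted(word_count(sample).items()):
        print("%s\t%s" % (word, count))
```

The reducer only works because its input is sorted by key, which is why the sketch sorts the mapper output before grouping.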
　　

How to run it:
hadoop jar ./hadoop-streaming-2.6.0.jar -file ./mapper.py -mapper mapper.py -file ./reducer.py -reducer reducer.py -input /input -output /output
　　

　　
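Note that /input must be located on the Hadoop file system (HDFS), not on the local disk:

hadoop fs -put input /input

/output must not exist yet. If it does, delete it first, for example with hadoop fs -rm -r /output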
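
The preparation steps and the job submission can be collected in a small driver script. This is only a sketch under assumptions: the helper names `build_job_commands` and `run_job` are hypothetical, the paths are the ones used in this post, and the `-f` flag on `hadoop fs -rm` is used so that removing a missing /output does not count as an error. The upload step will still fail if /input already exists.

```python
import subprocess


def build_job_commands(jar, local_input, hdfs_input, hdfs_output,
                       mapper="mapper.py", reducer="reducer.py"):
    """Return the three commands as argument lists, in execution order."""
    upload = ["hadoop", "fs", "-put", local_input, hdfs_input]
    # -f keeps -rm from failing when the output directory does not exist yet.
    cleanup = ["hadoop", "fs", "-rm", "-r", "-f", hdfs_output]
    job = ["hadoop", "jar", jar,
           "-file", mapper, "-mapper", mapper,
           "-file", reducer, "-reducer", reducer,
           "-input", hdfs_input, "-output", hdfs_output]
    return [upload, cleanup, job]


def run_job(commands):
    """Run each command in order and stop at the first failure."""
    for command in commands:
        # check_call raises CalledProcessError on a non-zero exit status.
        subprocess.check_call(command)
```

A call such as `run_job(build_job_commands("./hadoop-streaming-2.6.0.jar", "input", "/input", "/output"))` would then run all three steps on a machine where the `hadoop` command is on the PATH.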
　　

　　
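Also, the first line of each Python script must be #!/usr/bin/env python
Otherwise the job may fail with an error. The reason is explained in this blog post: http://andylue2008.iteye.com/blog/1622260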
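
A quick pre-flight check can catch a missing first line before the job is submitted. The sketch below is illustrative only: `check_streaming_script` is a hypothetical helper, and the additional test for the executable bit is an assumption about a common related cause of failure, not something taken from the linked post.

```python
import os
import stat


def check_streaming_script(path):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    with open(path, "rb") as handle:
        first_line = handle.readline()
    # The script is launched as a command, so the kernel needs a shebang line.
    if not first_line.startswith(b"#!"):
        problems.append("missing shebang line")
    # Assumption: the script should also be executable by its owner.
    if not os.stat(path).st_mode & stat.S_IXUSR:
        problems.append("not executable (try chmod +x)")
    return problems
```

Running it on mapper.py and reducer.py and fixing anything it reports is cheaper than waiting for a failed job on the cluster.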
　　

　　
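Finally, if a command such as hadoop fs -ls fails with an error saying the directory cannot be found, the cause is that the user's home directory on HDFS has not been created yet. Without a path argument, the command operates on /user/<username>. Create it first (here for the user hadoop):

hadoop fs -mkdir -p /user/hadoop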
