Python + Hadoop (A Worked Example)

luoson1 · Posted 2017-12-17 07:43:21

How do you connect Python to Hadoop and make use of Hadoop's resources? This post walks through a simple example.

I. The Python map/reduce code
This post assumes you already have a working knowledge of Hadoop. We need a mapper and a reducer; the code for each is shown below.
1. mapper.py
　　

#!/usr/bin/env python
import sys

# Read lines from standard input and emit "word<TAB>1" for every word.
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print('%s\t%s' % (word, 1))

2. reducer.py
　　

#!/usr/bin/env python
import sys

current_word = None
current_count = 0

# Input arrives sorted by word, so equal words are on consecutive lines.
for line in sys.stdin:
    try:
        # Each line is "word<TAB>count"; skip lines that do not parse.
        word, count = line.strip().split('\t', 1)
        count = int(count)
    except ValueError:
        continue

    if current_word == word:
        current_count += count
    else:
        # A new word begins: emit the total for the previous one.
        if current_word is not None:
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# Emit the total for the last word, if there was any input.
if current_word is not None:
    print('%s\t%s' % (current_word, current_count))
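The reducer only compares each key with the previous one, so it depends on its input arriving sorted by key (which is what `sort -k 1,1` does in a local test and what Hadoop's shuffle phase does on the cluster). A minimal sketch of that dependency follows. The function `reduce_counts` is a hypothetical wrapper around the same logic, introduced here for illustration; the actual script reads from sys.stdin.

```python
def reduce_counts(lines):
    """Sum counts of consecutive identical keys, as reducer.py does.

    `lines` is an iterable of "word<TAB>count" strings. The function name
    and list-based form are illustrative, not part of the original script.
    """
    results = []
    current_word = None
    current_count = 0
    for line in lines:
        try:
            word, count = line.strip().split('\t', 1)
            count = int(count)
        except ValueError:
            continue  # skip lines that are not "word<TAB>count"
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                results.append((current_word, current_count))
            current_word, current_count = word, count
    if current_word is not None:
        results.append((current_word, current_count))
    return results


pairs = ["hadoop\t1", "python\t1", "hadoop\t1"]

# Unsorted input: "hadoop" is emitted twice, once per run of equal keys.
unsorted_result = reduce_counts(pairs)

# Sorted input: equal keys are adjacent, so each word is emitted once.
sorted_result = reduce_counts(sorted(pairs))

print(unsorted_result)
print(sorted_result)
```

This is why the local test pipes the mapper output through sort before handing it to the reducer.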
　　

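With both scripts in place (and made executable, for example with chmod +x mapper.py reducer.py), test them locally: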
　　

[qiu.li@l-tdata5.tkt.cn6 /export/python]$ echo "I like python hadoop , hadoop very good" | ./mapper.py | sort -k 1,1 | ./reducer.py
,	1
good	1
hadoop	2
I	1
like	1
python	1
very	1
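As a sanity check on the local test, the same counts can be reproduced in-process. This is only a verification sketch: `collections.Counter` is a swapped-in shortcut, not how Hadoop streaming computes the result, and the variable names are illustrative.

```python
from collections import Counter

sentence = "I like python hadoop , hadoop very good"

# Same tokenization as mapper.py: split on whitespace.
counts = Counter(sentence.split())

# Print in "word<TAB>count" form. sorted() orders by code point here,
# which may differ from the locale-dependent order of the shell's sort.
for word in sorted(counts):
    print('%s\t%s' % (word, counts[word]))
```

The counts should agree with the pipeline output even if the line order does not.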

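II. Uploading the files
The local test works, so we are halfway there. Next, upload a few files to Hadoop for a larger test. I found some files online and downloaded them with the following commands: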
　　

wget http://www.gutenberg.org/ebooks/20417.txt.utf-8
wget http://www.gutenberg.org/files/5000/5000-8.txt
wget http://www.gutenberg.org/ebooks/4300.txt.utf-8

List the downloaded files:

[qiu.li@l-tdata5.tkt.cn6 /export/python]$ ls
20417.txt.utf-8 4300.txt.utf-8 5000-8.txt mapper.py reducer.py run.sh

Upload the files to Hadoop with: hadoop dfs -put ./*.txt /user/ticketdev/tmp (Hadoop is already configured and the target directory already exists).
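One thing worth checking before uploading: the pattern ./*.txt only matches names that end in .txt. A small sketch using Python's `fnmatch` module, which applies shell-style wildcard rules, against the file names from the directory listing:

```python
import fnmatch

# File names as shown in the directory listing.
names = ["20417.txt.utf-8", "4300.txt.utf-8", "5000-8.txt",
         "mapper.py", "reducer.py", "run.sh"]

# Names selected by the pattern used in the upload command.
matched = [n for n in names if fnmatch.fnmatchcase(n, "*.txt")]

# Downloaded texts whose names continue after ".txt" and are skipped.
skipped = [n for n in names if fnmatch.fnmatchcase(n, "*.txt.*")]

print(matched)
print(skipped)
```

If all three books are meant to be processed, the two .txt.utf-8 files would need to be renamed to end in .txt (or the pattern widened) before the upload. The post does not say whether this was done.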
Create run.sh:
　　

# $STREAM must be set to the path of the Hadoop streaming jar.
hadoop jar $STREAM \
-files ./mapper.py,./reducer.py \
-mapper ./mapper.py \
-reducer ./reducer.py \
-input /user/ticketdev/tmp/*.txt \
-output /user/ticketdev/tmp/output
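The same invocation can be assembled from Python, which makes the individual arguments easier to inspect. This is a sketch under assumptions: `build_streaming_command` is a hypothetical helper, and the jar path is read from a `STREAM` environment variable in the same way run.sh does, since the post does not say where the streaming jar lives. The command is only built and printed here, not executed.

```python
import os


def build_streaming_command(stream_jar, mapper, reducer, input_path, output_path):
    """Return the argument list for the Hadoop streaming job in run.sh."""
    return [
        "hadoop", "jar", stream_jar,
        "-files", "%s,%s" % (mapper, reducer),  # ship both scripts with the job
        "-mapper", mapper,
        "-reducer", reducer,
        "-input", input_path,
        "-output", output_path,
    ]


cmd = build_streaming_command(
    os.environ.get("STREAM", "$STREAM"),  # literal placeholder if STREAM is unset
    "./mapper.py",
    "./reducer.py",
    "/user/ticketdev/tmp/*.txt",
    "/user/ticketdev/tmp/output",
)
print(" ".join(cmd))
```

On a machine where hadoop is configured, passing the list to `subprocess.run(cmd, check=True)` would launch the job.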

Check the result:
　　

[qiu.li@l-tdata5.tkt.cn6 /export/python]$ hadoop dfs -cat /user/ticketdev/tmp/output/part-00000 | sort -nk 2 | tail
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

it	2387
which	2387
that	2668
a	3797
is	4097
to	5079
in	5226
and	7611
of	10388
the	20583
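The sort -nk 2 | tail step can also be done in Python once the part file has been fetched. A minimal sketch, assuming the lines have the word<TAB>count shape produced by reducer.py; `top_words` is an illustrative name, and the sample lines are taken from the output shown here.

```python
import heapq


def top_words(lines, n=10):
    """Return the n (word, count) pairs with the largest counts, ascending."""
    pairs = []
    for line in lines:
        try:
            word, count = line.rstrip('\n').split('\t', 1)
            pairs.append((word, int(count)))
        except ValueError:
            continue  # ignore lines that are not "word<TAB>count"
    largest = heapq.nlargest(n, pairs, key=lambda p: p[1])
    return largest[::-1]  # smallest of the top n first, like `tail`


sample = ["of\t10388\n", "a\t3797\n", "the\t20583\n", "and\t7611\n"]
top = top_words(sample, n=3)
for word, count in top:
    print('%s\t%s' % (word, count))
```

Using `heapq.nlargest` avoids sorting the whole file when only the last few entries are wanted.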
　　

III. References:
　　http://www.cnblogs.com/wing1995/p/hadoop.html?utm_source=tuicool&utm_medium=referral
