Hadoop InputFormats and OutputFormats

359025439 · Posted on 2015-7-12 07:48:36

InputFormat
  The InputFormat class produces the InputSplits for a job and divides each split into records.

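public interface InputFormat<K, V> {
    // Computes the logical splits of the job's input; numSplits is only a hint.
    InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;

    // Returns the reader that turns one split into key/value records.
    RecordReader<K, V> getRecordReader(InputSplit split, JobConf job, Reporter reporter) throws IOException;
}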
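  Sometimes a file is larger than one block, yet you do not want it to be split. One option is to raise mapred.min.split.size above the length of the file; the other is to write a subclass of FileInputFormat whose isSplitable() method returns false.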
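  Why raising the minimum split size keeps a file whole can be seen from the split size rule max(minSize, min(maxSize, blockSize)) used by FileInputFormat (the old API uses a goal size derived from the requested number of splits in place of maxSize). The following is a minimal sketch of that planning step, not Hadoop's source: the class and method names are hypothetical, the 64 MB block and 200 MB file are made-up figures, and empty files are simply ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of file split planning; names are illustrative only.
class SplitPlanSketch {
    // The last split may be up to 10% larger than splitSize before another split is cut.
    static final double SPLIT_SLOP = 1.1;

    // Split size rule: max(minSize, min(maxSize, blockSize)).
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Returns {start, length} pairs covering a file of fileLength bytes.
    static List<long[]> planSplits(long fileLength, long splitSize, boolean splitable) {
        List<long[]> splits = new ArrayList<>();
        if (fileLength == 0) {
            return splits; // assumption of this sketch: empty files yield no split
        }
        if (!splitable) {
            // Models isSplitable() returning false: one split for the whole file.
            splits.add(new long[] {0, fileLength});
            return splits;
        }
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits.add(new long[] {fileLength - remaining, splitSize});
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits.add(new long[] {fileLength - remaining, remaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long block = 64 * mb;
        long file = 200 * mb;

        long defaultSize = computeSplitSize(block, 1, Long.MAX_VALUE);
        System.out.println(planSplits(file, defaultSize, true).size());  // prints 4

        // Minimum split size raised above the file length.
        long raisedSize = computeSplitSize(block, 201 * mb, Long.MAX_VALUE);
        System.out.println(planSplits(file, raisedSize, true).size());   // prints 1

        // Non-splittable input format.
        System.out.println(planSplits(file, defaultSize, false).size()); // prints 1
    }
}
```

  Both approaches end with a single split, but overriding isSplitable() does not depend on knowing the file length in advance.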
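  InputSplits are computed by the client and submitted to the JobTracker; the split information is then made available through the Configuration, where a mapper can read it from the following properties:
  Property          Type    Description
  map.input.file    String  Path of the input file being processed by the map
  map.input.start   long    Byte offset of the start of the split from the beginning of the file
  map.input.length  long    Length of the split in bytes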
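  A split's start and length are byte positions, so they rarely fall on record boundaries; it is the RecordReader that maps them onto whole records. The sketch below is a simplified model of the convention followed by line-oriented readers, assuming newline-terminated UTF-8 text: a split that does not begin at offset 0 discards everything up to the first newline, and every split keeps reading until it has finished the line that crosses its end. The class and method names are hypothetical and this is not Hadoop's LineRecordReader code.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Simplified model of reading line records from a byte-range split.
class LineSplitSketch {
    // Index just past the next '\n' at or after 'from', capped at the data length.
    static int nextLineStart(byte[] data, int from) {
        int i = from;
        while (i < data.length && data[i] != '\n') {
            i++;
        }
        return Math.min(i + 1, data.length);
    }

    // Returns the lines owned by the split [start, start + length).
    static List<String> readSplit(byte[] data, long start, long length) {
        long end = start + length;
        int pos = (int) start;
        if (start != 0) {
            // The previous split reads the line that reaches into this one, so skip it.
            pos = nextLineStart(data, pos);
        }
        List<String> lines = new ArrayList<>();
        // A line that begins at or before 'end' still belongs to this split.
        while (pos <= end && pos < data.length) {
            int next = nextLineStart(data, pos);
            int lineEnd = (data[next - 1] == '\n') ? next - 1 : next;
            lines.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = next;
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbeta\ngamma\n".getBytes(StandardCharsets.UTF_8);
        // The boundary at byte 8 falls in the middle of "beta".
        System.out.println(readSplit(data, 0, 8)); // prints [alpha, beta]
        System.out.println(readSplit(data, 8, 9)); // prints [gamma]
    }
}
```

  Under this convention every line is read exactly once, no matter where the boundaries fall.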

FileInputFormat
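  Classes that implement the InputFormat interface include DBInputFormat, DelegatingInputFormat, and FileInputFormat. FileInputFormat is the base class for every InputFormat whose data comes from files.
  FileInputFormat lets you specify which files make up the input, and it implements the getSplits() method of InputFormat; getRecordReader() is left to the subclasses of FileInputFormat.
  A Path can denote a single file, a directory, or, by using a glob, a collection of files and directories.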
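  A directory passed to FileInputFormat.addInputPath() is not processed recursively: subdirectories inside it are not used as input sources, and their presence causes an error. The fix is to use a glob or a filter to select only the files in the directory. The following static method of FileInputFormat sets a filter: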

static void setInputPathFilter(Job job, Class<? extends PathFilter> filter)
    Sets a PathFilter to be applied to the input paths of the job.
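
  The combined effect of a glob and a filter can be tried out without a cluster. The sketch below is only a model: java.nio's glob matcher stands in for Hadoop's glob syntax (similar, not identical), a Predicate stands in for the accept() method of a PathFilter, it assumes a Unix-style default file system, and the class name, method name, and sample paths are all made up for illustration.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Models input selection: a path is used only if the glob matches it
// and the filter accepts it.
class InputSelectionSketch {
    static List<String> select(List<String> candidates, String glob, Predicate<String> filter) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        List<String> selected = new ArrayList<>();
        for (String candidate : candidates) {
            if (matcher.matches(Paths.get(candidate)) && filter.test(candidate)) {
                selected.add(candidate);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> entries = Arrays.asList(
                "/logs/2015/07/part-00000",
                "/logs/2015/07/part-00001",
                "/logs/2015/07/part-00002.tmp",
                "/logs/2015/07/archive"); // a subdirectory in this made-up listing

        // The glob keeps the subdirectory out; the filter rejects temporary files.
        List<String> inputs = select(entries, "/logs/2015/07/part-*", p -> !p.endsWith(".tmp"));
        System.out.println(inputs); // prints [/logs/2015/07/part-00000, /logs/2015/07/part-00001]
    }
}
```

  A glob is enough when the wanted files share a naming pattern; a filter is the better fit when the rule is an exclusion that a glob cannot express.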
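
MultipleOutputs
static void addNamedOutput(Job job, String namedOutput, Class<? extends OutputFormat> outputFormatClass, Class<?> keyClass, Class<?> valueClass)
    Adds a named output for the job.
<K, V> void write(String namedOutput, K key, V value)
    Writes key and value to the namedOutput.
<K, V> void write(String namedOutput, K key, V value, String baseOutputPath)
    Writes key and value to baseOutputPath using the namedOutput.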
