Workflow:
Step1: Mapper
The Mapper processes one line of text per call to its map method, splits the line into tokens with a StringTokenizer, and emits a <word, 1> key-value pair for each token. These pairs become the input of the Combine step.
----------------------------
The first map emits:
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>
The second map emits:
< Hello, 1>
< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>
-----------------------------
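A minimal sketch of such a mapper, modeled on the classic WordCount example (old org.apache.hadoop.mapred API, which matches the JobTracker-era classes discussed below; the class name WordCountMapper is illustrative):
----------------------------------
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  // Called once per input line; the key is the line's byte offset in the split.
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    StringTokenizer tokenizer = new StringTokenizer(value.toString());
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, ONE); // emit <word, 1>
    }
  }
}
----------------------------------
Reusing the word and ONE Writables avoids allocating a new object for every token.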
Step2: Combine
In the WordCount example the Combiner is identical to the Reducer: the Combiner class merges the values that share the same key, summing each word's counts locally within a single map's output.
----------------------------------
The output of the first map:
< Bye, 1>
< Hello, 1>
< World, 2>
The output of the second map:
< Goodbye, 1>
< Hadoop, 2>
< Hello, 1>
----------------------------------
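Registering the combiner takes one call on the job configuration; a sketch assuming the old JobConf API (WordCount and WordCountReducer are the illustrative class names used in the other sketches here):
----------------------------------
JobConf conf = new JobConf(WordCount.class);
// Run the reducer locally over each map's output before the shuffle...
conf.setCombinerClass(WordCountReducer.class);
// ...and again cluster-wide for the final aggregation.
conf.setReducerClass(WordCountReducer.class);
----------------------------------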
Step3: Reduce
The Reducer class, in its reduce method, sums the counts for each word to produce the final output.
-----------------------------------
Thus the output of the job is:
< Bye, 1>
< Goodbye, 1>
< Hadoop, 2>
< Hello, 2>
< World, 2>
-----------------------------------
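A matching reducer sketch (again the classic WordCount shape; the same class doubles as the combiner registered above):
----------------------------------
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  // Called once per key, with an iterator over all values for that key.
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum)); // emit <word, total>
  }
}
----------------------------------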
>> JobTracker
Scheduling is handled by two services: a single master JobTracker and a TaskTracker slave service running on each of many nodes. The master schedules every sub-task of a job onto the slaves, monitors them, and re-runs any task that fails; the slaves simply execute the tasks they are handed. Every TaskTracker must run on an HDFS DataNode, while the JobTracker need not; normally the JobTracker should be deployed on a dedicated machine.
JobTracker is a daemon, one per cluster, that administers all aspects of mapred activity.
JobTracker keeps track of all current jobs by holding an instance of JobInProgress for each.
Methods:
JobTracker.submitJob(): creates a JobInProgress for the submitted job and adds it to the jobs and jobsByArrival collections
JobTracker.pollForNewTask(): invoked on behalf of a TaskTracker to hand out a new task for that tracker to run
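A client reaches JobTracker.submitJob() indirectly through JobClient; a minimal driver sketch tying the pieces above together (paths come from the command line; all class names are the illustrative ones used earlier):
----------------------------------
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCount {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(WordCountMapper.class);
    conf.setCombinerClass(WordCountReducer.class);
    conf.setReducerClass(WordCountReducer.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // JobClient hands the job to the JobTracker, which creates a
    // JobInProgress for it via submitJob(), then blocks until completion.
    JobClient.runJob(conf);
  }
}
----------------------------------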
>> JobInProgress/TaskInProgress
JobInProgress represents a job as it is being tracked by JobTracker.
TaskInProgress represents a set of tasks for a given unique input, where the input is a split for a map task or a partition for a reduce task.
>> MapTask/ReduceTask:
MapTask offers a run() method that calls MapRunner.run(), which in turn calls the user-supplied Mapper.map().
ReduceTask offers a run() method that sorts the input files using SequenceFile.Sorter.sort() and then calls the user-supplied Reducer.reduce().
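The heart of MapRunner.run() is a read-map loop over the task's input split; a simplified sketch of that logic (not the exact Hadoop source, which also handles configuration and error reporting):
----------------------------------
import java.io.IOException;

import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Simplified shape of MapRunner: pull records from the split's
// RecordReader and hand each one to the user-supplied Mapper.
public class SimpleMapRunner<K1, V1, K2, V2> {
  private final Mapper<K1, V1, K2, V2> mapper;

  public SimpleMapRunner(Mapper<K1, V1, K2, V2> mapper) {
    this.mapper = mapper;
  }

  public void run(RecordReader<K1, V1> input,
                  OutputCollector<K2, V2> output, Reporter reporter)
      throws IOException {
    K1 key = input.createKey();
    V1 value = input.createValue();
    while (input.next(key, value)) {  // read one record from the split
      mapper.map(key, value, output, reporter);
    }
    mapper.close();                   // let the mapper flush per-task state
  }
}
----------------------------------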