[Experience Sharing] Important Classes and Interfaces in Hadoop

  The following are notes on commonly used classes in hadoop (taken from the hadoop API docs). They are listed simply in the order they came up while studying; no other ranking is implied:

1. Configuration

public class Configuration extends Object implements Iterable<Map.Entry<String,String>>, Writable
  Provides access to configuration parameters.

Resources
  Configurations are specified by resources. A resource contains a set of name/value pairs as XML data. Each resource is named by either a String or by a Path. If named by a String, then the classpath is examined for a file with that name. If named by a Path, then the local filesystem is examined directly, without referring to the classpath.
  Unless explicitly turned off, Hadoop by default specifies two resources, loaded in-order from the classpath:


  • core-default.xml : Read-only defaults for hadoop.
  • core-site.xml: Site-specific configuration for a given hadoop installation.
  Applications may add additional resources, which are loaded subsequent to these resources in the order they are added.

Final Parameters
  Configuration parameters may be declared final. Once a resource declares a value final, no subsequently-loaded resource can alter that value. For example, one might define a final parameter with:

  
  <property>
    <name>dfs.client.buffer.dir</name>
    <value>/tmp/hadoop/dfs/client</value>
    <final>true</final>
  </property>

  Administrators typically define parameters as final in core-site.xml for values that user applications may not alter.
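
  A minimal sketch of typical Configuration usage; the /etc/hadoop/conf/my-site.xml path and the my.app.timeout.ms key below are hypothetical, used only for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;

    public class ConfDemo {
        public static void main(String[] args) {
            // core-default.xml and core-site.xml are loaded from the classpath by default.
            Configuration conf = new Configuration();

            // A hypothetical extra resource; it is loaded after the defaults, so its values
            // override them unless a default was declared final.
            conf.addResource(new Path("/etc/hadoop/conf/my-site.xml"));

            // Read values back, supplying fall-back defaults for unset keys.
            String bufferDir = conf.get("dfs.client.buffer.dir", "/tmp/hadoop/dfs/client");
            int timeoutMs = conf.getInt("my.app.timeout.ms", 30000);
            System.out.println(bufferDir + ", timeout=" + timeoutMs);
        }
    }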

2. Serialization

public interface Serialization<T>
All Known Implementing Classes: JavaSerialization, WritableSerialization
  Encapsulates a Serializer/Deserializer pair.
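
  As a hedged illustration of where Serialization implementations plug in: the io.serializations property lists the implementations the framework may use. The sketch below assumes the stock WritableSerialization and JavaSerialization classes in org.apache.hadoop.io.serializer:

    import org.apache.hadoop.conf.Configuration;

    public class SerializationConfigDemo {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // io.serializations lists the Serialization implementations, in order of preference.
            // Adding JavaSerialization lets plain java.io.Serializable keys/values be used
            // alongside the default Writable-based serialization.
            conf.set("io.serializations",
                     "org.apache.hadoop.io.serializer.WritableSerialization,"
                   + "org.apache.hadoop.io.serializer.JavaSerialization");
        }
    }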

3. Tool

public interface Tool extends Configurable
  A tool interface that supports handling of generic command-line options.
  Tool is the standard interface for any Map-Reduce tool/application. The tool/application should delegate the handling of standard command-line options to ToolRunner.run(Tool, String[]) and only handle its custom arguments.

4. ToolRunner

public class ToolRunner extends Object
  A utility to help run Tools.
  ToolRunner can be used to run classes implementing the Tool interface. It works in conjunction with GenericOptionsParser to parse the generic hadoop command line arguments and modifies the Configuration of the Tool. The application-specific options are passed along without being modified.
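
  A minimal sketch of the usual Tool/ToolRunner pattern; the WordCountDriver name is just a placeholder:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    // Extending Configured supplies the getConf()/setConf() methods required by Configurable.
    public class WordCountDriver extends Configured implements Tool {

        @Override
        public int run(String[] args) throws Exception {
            // By the time run() is called, ToolRunner/GenericOptionsParser has already applied
            // the generic options (-D, -conf, -fs, ...) to this Configuration; args holds the rest.
            Configuration conf = getConf();
            System.out.println("custom arguments: " + args.length);
            // ... build and submit a Job here ...
            return 0;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new Configuration(), new WordCountDriver(), args));
        }
    }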

5.1 Job (newer than JobConf)

public class Job extends JobContext
  The job submitter's view of the Job. It allows the user to configure the job, submit it, control its execution, and query the state. The set methods only work until the job is submitted, afterwards they will throw an IllegalStateException.
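
  A minimal job-submission sketch. It assumes the stock TokenCounterMapper and IntSumReducer helpers from the mapreduce.lib packages are available; in newer releases Job.getInstance(conf, name) replaces the constructor used here:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

    public class SubmitDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "word count");   // Job.getInstance(conf, "word count") in newer APIs
            job.setJarByClass(SubmitDemo.class);
            job.setMapperClass(TokenCounterMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            // The set* calls above only work before submission; waitForCompletion submits and blocks.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }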

5.2 JobConf (has more methods than Job)
  A map/reduce job configuration.
  JobConf is the primary interface for a user to describe a map-reduce job to the Hadoop framework for execution. The framework tries to faithfully execute the job as-is described by JobConf, however:


  • Some configuration parameters might have been marked as final by administrators and hence cannot be altered.
  • While some job parameters are straight-forward to set (e.g. setNumReduceTasks(int)), some parameters interact subtly with the rest of the framework and/or job-configuration and are relatively more complex for the user to control finely (e.g. setNumMapTasks(int)).
  JobConf typically specifies the Mapper, combiner (if any), Partitioner, Reducer, InputFormat and OutputFormat implementations to be used etc.
  Optionally JobConf is used to specify other advanced facets of the job such as the Comparators to be used, files to be put in the DistributedCache, whether or not intermediate and/or job outputs are to be compressed (and how), and debuggability via user-provided scripts (setMapDebugScript(String)/setReduceDebugScript(String)) for doing post-processing on task logs, task's stdout, stderr, and syslog.
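
  For comparison, a sketch of the equivalent setup with the older mapred API; the Mapper/Combiner/Reducer lines are commented out because they would refer to hypothetical user classes written against that API:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.TextOutputFormat;

    public class OldApiDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(OldApiDriver.class);
            conf.setJobName("wordcount-old-api");

            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            // Hypothetical user classes written against the old org.apache.hadoop.mapred API:
            // conf.setMapperClass(MyOldMapper.class);
            // conf.setCombinerClass(MyOldReducer.class);
            // conf.setReducerClass(MyOldReducer.class);

            conf.setInputFormat(TextInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class);
            conf.setNumReduceTasks(2);            // straightforward to set
            conf.setCompressMapOutput(true);      // compress the intermediate map output

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);
        }
    }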

6. GenericOptionsParser

public class GenericOptionsParser extends Object
  GenericOptionsParser is a utility to parse command line arguments generic to the Hadoop framework. GenericOptionsParser recognizes several standard command line arguments, enabling applications to easily specify a namenode, a jobtracker, additional configuration resources etc.
  Generic Options
  The supported generic options are:


     -conf <configuration file>                   specify a configuration file
     -D <property=value>                          use value for given property
     -fs <local|namenode:port>                    specify a namenode
     -jt <local|jobtracker:port>                  specify a job tracker
     -files <comma separated list of files>       specify comma separated
                                                  files to be copied to the map reduce cluster
     -libjars <comma separated list of jars>      specify comma separated
                                                  jar files to include in the classpath.
     -archives <comma separated list of archives> specify comma
                                                  separated archives to be unarchived on the compute machines.
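
  A small sketch of using GenericOptionsParser directly (ToolRunner does this on your behalf); mapred.reduce.tasks is just an illustrative property name:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.util.GenericOptionsParser;

    public class ParseDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Consumes the generic options (-D, -conf, -fs, -jt, -files, -libjars, -archives),
            // applies them to conf, and returns whatever is left for the application.
            String[] remainingArgs = new GenericOptionsParser(conf, args).getRemainingArgs();

            System.out.println("mapred.reduce.tasks = " + conf.get("mapred.reduce.tasks"));
            for (String a : remainingArgs) {
                System.out.println("application argument: " + a);
            }
        }
    }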

7. Mapper

public class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT> extends Object
  Maps input key/value pairs to a set of intermediate key/value pairs.
  Maps are the individual tasks which transform input records into intermediate records. The transformed intermediate records need not be of the same type as the input records. A given input pair may map to zero or many output pairs.
  The Hadoop Map-Reduce framework spawns one map task for each InputSplit generated by the InputFormat for the job. Mapper implementations can access the Configuration for the job via JobContext.getConfiguration().
  The framework first calls setup(org.apache.hadoop.mapreduce.Mapper.Context), followed by map(Object, Object, Context) for each key/value pair in the InputSplit. Finally cleanup(Context) is called.
  All intermediate values associated with a given output key are subsequently grouped by the framework, and passed to a Reducer to determine the final output. Users can control the sorting and grouping by specifying two key RawComparator classes.
  The Mapper outputs are partitioned per Reducer. Users can control which keys (and hence records) go to which Reducer by implementing a custom Partitioner.
  Users can optionally specify a combiner, via Job.setCombinerClass(Class), to perform local aggregation of the intermediate outputs, which helps to cut down the amount of data transferred from the Mapper to the Reducer.
  Applications can specify if and how the intermediate outputs are to be compressed and which CompressionCodecs are to be used via the Configuration.
  If the job has zero reduces then the output of the Mapper is directly written to the OutputFormat without sorting by keys.
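
  A typical word-count style Mapper as a sketch of the setup/map lifecycle described above; the TokenMapper name is a placeholder:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Input is <byte offset, line of text>; output is one <word, 1> pair per token.
    public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void setup(Context context) {
            // Called once per task before any map() call; the job Configuration is
            // available through context.getConfiguration() if needed.
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // emit an intermediate <word, 1> pair
            }
        }
    }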

8. Reducer

public class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT> extends Object
  Reduces a set of intermediate values which share a key to a smaller set of values.
  Reducer implementations can access the Configuration for the job via the JobContext.getConfiguration() method.
  Reducer has 3 primary phases:



  • Shuffle
      The Reducer copies the sorted output from each Mapper using HTTP across the network.


  • Sort
      The framework merge sorts Reducer inputs by keys (since different Mappers may have output the same key).
      The shuffle and sort phases occur simultaneously i.e. while outputs are being fetched they are merged.

    SecondarySort
      To achieve a secondary sort on the values returned by the value iterator, the application should extend the key with the secondary key and define a grouping comparator. The keys will be sorted using the entire key, but will be grouped using the grouping comparator to decide which keys and values are sent in the same call to reduce. The grouping comparator is specified via Job.setGroupingComparatorClass(Class). The sort order is controlled by Job.setSortComparatorClass(Class).

    For example, say that you want to find duplicate web pages and tag them all with the url of the "best" known example. You would set up the job like:

    • Map Input Key: url
    • Map Input Value: document
    • Map Output Key: document checksum, url pagerank
    • Map Output Value: url
    • Partitioner: by checksum
    • OutputKeyComparator: by checksum and then decreasing pagerank
    • OutputValueGroupingComparator: by checksum



  • Reduce
      In this phase the reduce(Object, Iterable, Context) method is called for each <key, (collection of values)> in the sorted inputs.
      The output of the reduce task is typically written to a RecordWriter via TaskInputOutputContext.write(Object, Object).

  The output of the Reducer is not re-sorted.
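
  A matching Reducer sketch that sums the <word, 1> pairs produced by the Mapper sketch above; the SumReducer name is a placeholder. For a secondary sort, the comparators mentioned above would additionally be registered via Job.setSortComparatorClass(Class) and Job.setGroupingComparatorClass(Class).

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Input is <word, [1, 1, ...]> after shuffle and sort; output is <word, total>.
    public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            result.set(sum);
            context.write(key, result);   // one output record per distinct key
        }
    }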

9. RawComparator

public interface RawComparator<T> extends Comparator<T>

All Known Implementing Classes: BooleanWritable.Comparator, BytesWritable.Comparator, ByteWritable.Comparator, DeserializerComparator, DoubleWritable.Comparator, FloatWritable.Comparator, IntWritable.Comparator, JavaSerializationComparator, KeyFieldBasedComparator, KeyFieldBasedComparator, LongWritable.Comparator, LongWritable.DecreasingComparator, MD5Hash.Comparator, NullWritable.Comparator, RecordComparator, SecondarySort.FirstGroupingComparator, SecondarySort.IntPair.Comparator, Text.Comparator, UTF8.Comparator, WritableComparator
  A Comparator that operates directly on byte representations of objects.
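
  A hedged sketch of a raw comparator for serialized IntWritable keys: it compares the 4-byte big-endian representation directly via WritableComparator.readInt, and could be registered with Job.setSortComparatorClass(Class):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.RawComparator;
    import org.apache.hadoop.io.WritableComparator;

    // Compares serialized IntWritable keys straight from their byte representation,
    // avoiding deserialization during the sort phase.
    public class IntRawComparator implements RawComparator<IntWritable> {

        @Override
        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
            // An IntWritable is serialized as 4 big-endian bytes.
            int left = WritableComparator.readInt(b1, s1);
            int right = WritableComparator.readInt(b2, s2);
            return Integer.compare(left, right);
        }

        @Override
        public int compare(IntWritable a, IntWritable b) {
            return a.compareTo(b);
        }
    }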
  
