[Experience Sharing] Important Concepts in Hadoop

Posted on 2015-7-13 11:24:32
  After spending these past few days learning Hadoop, I found it essential to get the main concepts straight first. Below are some very good explanations from the Apache wiki, collected and organized here.
  Source: http://wiki.apache.org/hadoop/FrontPage

1. NameNode  

The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.
Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives.
The NameNode is a Single Point of Failure for the HDFS cluster. HDFS is not currently a High Availability system. When the NameNode goes down, the file system goes offline. There is an optional SecondaryNameNode that can be hosted on a separate machine. It only creates checkpoints of the namespace by merging the edits file into the fsimage file and does not provide any real redundancy. Hadoop 0.21+ has a BackupNameNode that is part of a plan to have an HA name service, but it needs active contributions from the people who want it (i.e. you) to make it Highly Available.
It is essential to look after the NameNode. Here are some recommendations from production use:


  • Use a good server with lots of RAM. The more RAM you have, the bigger the file system, or the smaller the block size.
  • Use ECC RAM.
  • On Java6u15 or later, run the server VM with compressed pointers -XX:+UseCompressedOops to cut the JVM heap size down.
  • List more than one name node directory in the configuration, so that multiple copies of the file system meta-data will be stored. As long as the directories are on separate disks, a single disk failure will not corrupt the meta-data.
  • Configure the NameNode to store one set of transaction logs on a separate disk from the image.
  • Configure the NameNode to store another set of transaction logs to a network mounted disk.
  • Monitor the disk space available to the NameNode. If free space is getting low, add more storage.
  • Do not host DataNode, JobTracker or TaskTracker services on the same system.

If a NameNode does not start up, look at the TroubleShooting page.
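The multiple-directory recommendation above maps to the dfs.name.dir property in hdfs-site.xml (the 0.20-era property name); the mount points below are illustrative, not prescribed:

```xml
<!-- hdfs-site.xml: illustrative paths, adjust to your own mount points -->
<property>
  <name>dfs.name.dir</name>
  <!-- two local disks plus an NFS mount, comma-separated;
       each directory holds a full copy of the file system meta-data -->
  <value>/disk1/hdfs/name,/disk2/hdfs/name,/mnt/nfs/hdfs/name</value>
</property>
```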

2. DataNode

A DataNode stores data in the Hadoop file system (HDFS). A functional filesystem has more than one DataNode, with data replicated across them.
On startup, a DataNode connects to the NameNode; spinning until that service comes up. It then responds to requests from the NameNode for filesystem operations.
Client applications can talk directly to a DataNode, once the NameNode has provided the location of the data. Similarly, MapReduce operations farmed out to TaskTracker instances near a DataNode talk directly to the DataNode to access the files. TaskTracker instances can, indeed should, be deployed on the same servers that host DataNode instances, so that MapReduce operations are performed close to the data.
DataNode instances can talk to each other, which is what they do when they are replicating data.



  • There is usually no need to use RAID storage for DataNode data, because data is designed to be replicated across multiple servers, rather than multiple disks on the same server.


  • An ideal configuration is for a server to host a DataNode and a TaskTracker, with one TaskTracker slot per CPU and enough physical disks to go with them. This will allow every TaskTracker 100% of a CPU, and separate disks to read and write data.


  • Avoid using NFS for data storage in a production system.


3. JobTracker

The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that have the data, or at least are in the same rack.


  • Client applications submit jobs to the JobTracker.
  • The JobTracker talks to the NameNode to determine the location of the data.
  • The JobTracker locates TaskTracker nodes with available slots at or near the data.
  • The JobTracker submits the work to the chosen TaskTracker nodes.
  • The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.
  • A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable.
  • When the work is completed, the JobTracker updates its status.
  • Client applications can poll the JobTracker for information.
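The heartbeat-monitoring step can be illustrated with a small Python sketch. The function name and timeout constant here are mine, not Hadoop's; Hadoop's tracker-expiry interval defaults to around 10 minutes:

```python
import time

HEARTBEAT_TIMEOUT = 600.0  # seconds; illustrative, roughly Hadoop's default expiry

def find_expired_trackers(last_heartbeat, now):
    """Return trackers whose last heartbeat is older than the timeout,
    mirroring how the JobTracker marks silent TaskTrackers as failed."""
    return [tracker for tracker, ts in last_heartbeat.items()
            if now - ts > HEARTBEAT_TIMEOUT]

heartbeats = {"tracker-a": 1000.0, "tracker-b": 1550.0}
print(find_expired_trackers(heartbeats, now=1700.0))  # tracker-a is 700s silent
```

Work previously assigned to any tracker returned here would then be rescheduled elsewhere.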

The JobTracker is a point of failure for the Hadoop MapReduce service. If it goes down, all running jobs are halted.

4. TaskTracker

A TaskTracker is a node in the cluster that accepts tasks - Map, Reduce and Shuffle operations - from a JobTracker.
Every TaskTracker is configured with a set of slots; these indicate the number of tasks that it can accept. When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data, and if not, it looks for an empty slot on a machine in the same rack.
The TaskTracker spawns a separate JVM process to do the actual work; this is to ensure that process failure does not take down the TaskTracker. The TaskTracker monitors these spawned processes, capturing the output and exit codes. When a process finishes, successfully or not, the tracker notifies the JobTracker. The TaskTrackers also send out heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated.
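The spawn-and-monitor pattern is easy to see with Python's standard subprocess module. This is an analogy, not Hadoop code: spawn a child process, capture its output and exit code, and a crash in the child cannot take down the parent:

```python
import subprocess
import sys

# Run a trivial "task" in its own process, like a TaskTracker launching a child JVM.
result = subprocess.run(
    [sys.executable, "-c", "print('task output')"],
    capture_output=True, text=True,
)
print(result.returncode)      # 0 on success; nonzero would mean the task failed
print(result.stdout.strip())  # the captured task output
```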

5. MapReduce

MapReduce is the key algorithm that the Hadoop MapReduce engine uses to distribute work around a cluster.

5.1 The Map

A map transform is provided to transform an input data row of key and value into a list of output key/value pairs:



  • map(key1, value1) -> list<key2, value2>


That is, for an input it returns a list containing zero or more (key,value) pairs:


  • The output can be a different key from the input
  • The output can have multiple entries with the same key
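As a sketch in Python (Hadoop itself is written in Java; this just shows the contract, using the classic word-count example):

```python
def wordcount_map(key, value):
    """map(key1, value1) -> list of (key2, value2).
    Here key1 is a line offset (ignored) and value1 is a line of text;
    emit (word, 1) for every word, so a word may appear multiple times."""
    return [(word, 1) for word in value.split()]

print(wordcount_map(0, "to be or not to be"))
# [('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]
```

Note both properties above: the output keys (words) differ from the input key (an offset), and the same key can appear more than once.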

5.2 The Reduce

A reduce transform is provided to take all values for a specific key, and generate a new list of the reduced output.



  • reduce(key2, list<value2>) -> list<value3>
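Continuing the word-count sketch, a reducer receives one key together with all of its values and sums the counts:

```python
def wordcount_reduce(key, values):
    """reduce(key2, list<value2>) -> list of reduced output:
    sum all the counts emitted by the mappers for this word."""
    return [(key, sum(values))]

print(wordcount_reduce("be", [1, 1]))  # [('be', 2)]
```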


5.3 The MapReduce Engine
  The key aspect of the MapReduce algorithm is that if every Map and Reduce is independent of all other ongoing Maps and Reduces, then the operation can be run in parallel on different keys and lists of data. On a large cluster of machines, you can go one step further, and run the Map operations on servers where the data lives. Rather than copy the data over the network to the program, you push out the program to the machines. The output list can then be saved to the distributed filesystem, and the reducers run to merge the results. Again, it may be possible to run these in parallel, each reducing different keys.


  • A distributed filesystem spreads multiple copies of the data across different machines. This not only offers reliability without the need for RAID-controlled disks, it offers multiple locations to run the mapping. If a machine with one copy of the data is busy or offline, another machine can be used.

  • A job scheduler (in Hadoop, the JobTracker), keeps track of which MR jobs are executing, schedules individual Maps, Reduces or intermediate merging operations to specific machines, monitors the success and failures of these individual Tasks, and works to complete the entire batch job.

  • The filesystem and Job scheduler can somehow be accessed by the people and programs that wish to read and write data, and to submit and monitor MR jobs.

Apache Hadoop is such a MapReduce engine. It provides its own distributed filesystem and runs MapReduce jobs on servers near the data stored on that filesystem - or any other supported filesystem, of which there is more than one.
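The whole map-shuffle-reduce cycle described above can be sketched as a single-process toy. The function names are mine, not Hadoop's, and a real engine distributes each step across machines instead of looping in memory:

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Tiny single-process MapReduce: map every record, shuffle the
    intermediate pairs by key, then reduce each key's value list."""
    shuffle = defaultdict(list)
    for key, value in records:
        for k2, v2 in map_fn(key, value):   # map phase
            shuffle[k2].append(v2)          # shuffle: group values by key
    out = []
    for k2 in sorted(shuffle):              # reduce phase, one key at a time
        out.extend(reduce_fn(k2, shuffle[k2]))
    return out

mapper = lambda key, line: [(w, 1) for w in line.split()]
reducer = lambda key, values: [(key, sum(values))]
print(run_mapreduce([(0, "a b a")], mapper, reducer))  # [('a', 2), ('b', 1)]
```

Because each map call and each reduce call is independent, the two loops are exactly the places a real engine parallelizes across the cluster.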

5.4 Limitations



  • For maximum parallelism, you need the Maps and Reduces to be stateless, to not depend on any data generated in the same MapReduce job. You cannot control the order in which the maps run, or the reductions.

  • It is very inefficient if you are repeating similar searches again and again. A database with an index will always be faster than running an MR job over unindexed data. However, if that index needs to be regenerated whenever data is added, and data is being added continually, MR jobs may have an edge. That inefficiency can be measured in both CPU time and power consumed.
  • In the Hadoop implementation Reduce operations do not take place until all the Maps are complete (or have failed and been skipped). As a result, you do not get any data back until the entire mapping has finished.
  • There is a general assumption that the output of the reduce is smaller than the input to the Map. That is, you are taking a large datasource and generating smaller final values.

5.5 Will MapReduce/Hadoop solve my problems?

If you can rewrite your algorithms as Maps and Reduces, then yes. If not, then no.
It is not a silver bullet for all problems of scale, just a good technique for working on large sets of data when you can work on small pieces of that dataset in parallel.

6. Pseudo Distributed Hadoop

Pseudo Distributed Hadoop is where Hadoop runs as a set of independent JVMs, but only on a single host. It has much lower performance than a real Hadoop cluster, due to the smaller number of hard disks limiting IO bandwidth. It is, however, a good way to play with new MR algorithms on very small datasets, and to learn how to use Hadoop. Developers working in the Hadoop codebase usually test their code in this mode before deploying their build of Hadoop to a local test cluster.
If you are running in this mode (and don't have a proxy server fielding HTTP requests), and have not changed the default port values, then both the NameNode and JobTracker can be reached at the addresses listed below.

6.1 Ports in Use

These are the standard ports; if the configuration files are changed, these URLs will no longer be valid.



  • NameNode: http://localhost:50070/dfshealth.jsp


  • HDFS filesystem browser http://localhost:50075/browseDirectory.jsp?namenodeInfoPort=50070&dir=/


  • Server logs http://localhost:50070/logs/


  • JobTracker: http://localhost:50030/jobtracker.jsp


  • TaskTracker: http://localhost:50060/tasktracker.jsp


6.2 Recommended Configuration Parameters for Pseudo-Distributed Hadoop

With only a single HDFS DataNode, the replication factor should be set to 1; the same goes for the replication factor of submitted jars. You also need to tell the JobTracker not to try handing a failing task to another TaskTracker, or to blacklist a tracker that appears to fail a lot. While those options are essential in large clusters with many machines - some of which will start to fail - on a single-node cluster they do more harm than good.

mapred.submit.replication=1
mapred.skip.attempts.to.start.skipping=1
mapred.max.tracker.failures=10000
mapred.max.tracker.blacklists=10000
mapred.map.tasks.speculative.execution=false
mapred.reduce.tasks.speculative.execution=false
tasktracker.http.threads=5
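In a 0.20-era installation, each of these key=value pairs goes into mapred-site.xml as a <property> element; for example, the first and fifth settings from the list above would appear as:

```xml
<!-- mapred-site.xml fragment for a single-node (pseudo-distributed) setup -->
<property>
  <name>mapred.submit.replication</name>
  <value>1</value>
</property>
<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>false</value>
</property>
```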
