[Experience sharing] CDH configuration guidelines for Hadoop, including THP

Posted on 2016-12-11 06:54:08
Tips and Guidelines

 

Selecting Appropriate JAR files for your MRv1 and YARN Jobs

Each implementation of the CDH4 MapReduce framework (MRv1 and YARN) consists of the artifacts (JAR files) that provide MapReduce functionality, plus auxiliary utility artifacts that are used while a MapReduce job runs. Whether you submit a job explicitly (using the Hadoop launcher script) or implicitly (through Java implementations), make sure that you reference the utility artifacts that ship with the same version of the MapReduce implementation that is running on your cluster. The following table summarizes the names and locations of these artifacts:

Name | MRv1 location | YARN location
streaming | /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh<version>.jar | /usr/lib/hadoop-mapreduce/hadoop-streaming.jar
rumen | N/A | /usr/lib/hadoop-mapreduce/hadoop-rumen.jar
hadoop examples | /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar | /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
distcp v1 | /usr/lib/hadoop-0.20-mapreduce/hadoop-tools.jar | /usr/lib/hadoop-mapreduce/hadoop-extras.jar
distcp v2 | N/A | /usr/lib/hadoop-mapreduce/hadoop-distcp.jar
hadoop archives | /usr/lib/hadoop-0.20-mapreduce/hadoop-tools.jar | /usr/lib/hadoop-mapreduce/hadoop-archives.jar
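
As a rough illustration of keeping the JAR location tied to the framework in use, the sketch below maps a framework name to the streaming JAR directory listed above. The function name streaming_jar_dir and the FRAMEWORK variable are hypothetical and not part of CDH; only the two directory paths come from the table.

```shell
# Hypothetical helper: map a framework name to the directory that holds
# its streaming JAR, using the locations listed in the table.
streaming_jar_dir() {
  case "$1" in
    mrv1) echo "/usr/lib/hadoop-0.20-mapreduce/contrib/streaming" ;;
    yarn) echo "/usr/lib/hadoop-mapreduce" ;;
    *)    echo "unknown framework: $1" >&2; return 1 ;;
  esac
}

# FRAMEWORK is an assumed variable; set it to match what the cluster runs.
FRAMEWORK="${FRAMEWORK:-yarn}"
echo "streaming JAR directory for $FRAMEWORK: $(streaming_jar_dir "$FRAMEWORK")"
```

Resolving the directory in one place makes it harder to submit a YARN job with an MRv1 utility JAR, or the reverse, by accident.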








Improving Performance

This section provides solutions to some performance problems, and describes configuration best practices.

Important:
If you are running CDH over 10Gbps Ethernet, improperly set network configuration or improperly applied NIC firmware or drivers can noticeably degrade performance. Work with your network engineers and hardware vendors to make sure that you have the proper NIC firmware, drivers, and configurations in place and that your network performs properly. Cloudera recognizes that network setup and upgrade are challenging problems, and will make best efforts to share any helpful experiences.



Disabling Transparent Hugepage Compaction

Most Linux platforms supported by CDH4 include a feature called transparent hugepage compaction which interacts poorly with Hadoop workloads and can seriously degrade performance.
Symptom: top and other system monitoring tools show a large percentage of the CPU usage classified as "system CPU". If system CPU usage is 30% or more of the total CPU usage, your system may be experiencing this issue.
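
The 30% threshold can be checked with a little arithmetic. The sketch below assumes that "total CPU usage" means user plus system time as reported by top; that reading, the function name system_cpu_share, and the sample values are all assumptions.

```shell
# Hypothetical helper: system CPU as a whole-number percentage of busy
# CPU time, assuming "total CPU usage" means user + system.
system_cpu_share() {
  awk -v us="$1" -v sy="$2" 'BEGIN {
    busy = us + sy
    if (busy == 0) { print 0; exit }
    printf "%.0f\n", 100 * sy / busy
  }'
}

share=$(system_cpu_share 40 20)   # sample %us and %sy values read from top
if [ "$share" -ge 30 ]; then
  echo "system CPU is ${share}% of CPU usage: check hugepage compaction"
fi
```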

What to do:
Note: In the following instructions, defrag_file_pathname depends on your operating system:

  • Red Hat/CentOS: /sys/kernel/mm/redhat_transparent_hugepage/defrag
  • Ubuntu/Debian, OEL, SLES: /sys/kernel/mm/transparent_hugepage/defrag




  • To see whether transparent hugepage compaction is enabled, run the following command and check the output:
    $ cat defrag_file_pathname


    • [always] never means that transparent hugepage compaction is enabled.

    • always [never] means that transparent hugepage compaction is disabled.

  • To disable transparent hugepage compaction, add the following command to /etc/rc.local:
    echo never > defrag_file_pathname



You can also disable transparent hugepage compaction interactively (but remember this will not survive a reboot).

To disable transparent hugepage compaction temporarily as root:
# echo 'never' > defrag_file_pathname
To disable transparent hugepage compaction temporarily using sudo:
$ sudo sh -c "echo 'never' > defrag_file_pathname"
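
The check described above can be scripted. This is a minimal sketch: it looks for whichever of the two documented paths exists and interprets the bracketed value. The function names are hypothetical, and treating any bracketed value other than never as enabled is an assumption.

```shell
# Hypothetical helper: interpret the contents of the defrag file.
# The value in square brackets is the active setting.
thp_defrag_state() {
  case "$1" in
    *"[never]"*) echo disabled ;;
    *"["*"]"*)   echo enabled ;;    # assumption: any other bracketed value counts as enabled
    *)           echo unknown ;;
  esac
}

# Return the first of the two documented paths that is readable on this host.
find_defrag_file() {
  for f in /sys/kernel/mm/redhat_transparent_hugepage/defrag \
           /sys/kernel/mm/transparent_hugepage/defrag; do
    if [ -r "$f" ]; then echo "$f"; return 0; fi
  done
  return 1
}

if defrag_file=$(find_defrag_file); then
  echo "$defrag_file: $(thp_defrag_state "$(cat "$defrag_file")")"
fi
```

The script only reads the setting; changing it still requires root, as in the commands above.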




Setting the vm.swappiness Linux Kernel Parameter

vm.swappiness is a Linux kernel parameter that controls how aggressively memory pages are swapped to disk. It can be set to a value from 0 to 100; the higher the value, the more aggressively the kernel seeks out inactive memory pages and swaps them to disk.
You can see the current value of vm.swappiness by reading /proc/sys/vm/swappiness; for example:

cat /proc/sys/vm/swappiness
On most systems, it is set to 60 by default. This is not suitable for Hadoop cluster nodes, because it can cause processes to be swapped out even when free memory is available. This can affect stability and performance, and may cause problems such as lengthy garbage collection pauses for important system daemons. Cloudera recommends that you set this parameter to 0; for example:

# sysctl -w vm.swappiness=0
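
A small check along these lines can confirm the setting on a node. The function name check_swappiness is hypothetical. The closing comment about /etc/sysctl.conf describes the usual way to make a sysctl value persistent and is not something the text above covers.

```shell
# Hypothetical helper: compare a swappiness value with the recommended 0.
check_swappiness() {
  if [ "$1" -eq 0 ]; then
    echo "ok"
  else
    echo "too high: $1"
  fi
}

# Only read the live value where /proc exposes it (Linux).
if [ -r /proc/sys/vm/swappiness ]; then
  check_swappiness "$(cat /proc/sys/vm/swappiness)"
fi

# A value set with sysctl -w is lost at reboot. The usual way to keep it
# is a line in /etc/sysctl.conf:
#   vm.swappiness = 0
```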


Performance Enhancements in Shuffle Handler and IFile Reader

As of CDH4.1, the MapReduce shuffle handler and IFile reader use native Linux calls (posix_fadvise(2) and sync_file_range(2)) on Linux systems with Hadoop native libraries installed. The subsections that follow provide details.
Shuffle Handler
You can improve MapReduce shuffle handler performance by enabling shuffle readahead. This causes the TaskTracker or NodeManager to prefetch map output before sending it over the socket to the reducer.


  • To enable this feature for YARN, set the mapreduce.shuffle.manage.os.cache property to true (default). To further tune performance, adjust the value of the mapreduce.shuffle.readahead.bytes property. The default value is 4MB.


  • To enable this feature for MRv1, set the mapred.tasktracker.shuffle.fadvise property to true (default). To further tune performance, adjust the value of the mapred.tasktracker.shuffle.readahead.bytes property. The default value is 4MB.

IFile Reader
Enabling IFile readahead increases the performance of merge operations. To enable this feature for either MRv1 or YARN, set the mapreduce.ifile.readahead property to true (default). To further tune performance, adjust the value of the mapreduce.ifile.readahead.bytes property. The default value is 4MB.
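
The readahead properties named above can be written out as configuration entries. The sketch below prints them for the YARN shuffle handler and the IFile reader. The helper hadoop_property is hypothetical, and two things are assumed rather than stated in the text: that the entries belong in mapred-site.xml, and that the readahead sizes are given in bytes.

```shell
# Hypothetical helper: print one Hadoop <property> element.
hadoop_property() {
  printf '<property>\n  <name>%s</name>\n  <value>%s</value>\n</property>\n' "$1" "$2"
}

# 4 MB, the documented default (assumption: the value is expressed in bytes).
READAHEAD_BYTES=$((4 * 1024 * 1024))

# YARN shuffle handler
hadoop_property mapreduce.shuffle.manage.os.cache true
hadoop_property mapreduce.shuffle.readahead.bytes "$READAHEAD_BYTES"
# IFile reader (MRv1 or YARN)
hadoop_property mapreduce.ifile.readahead true
hadoop_property mapreduce.ifile.readahead.bytes "$READAHEAD_BYTES"
```

For MRv1, the same helper could be called with mapred.tasktracker.shuffle.fadvise and mapred.tasktracker.shuffle.readahead.bytes instead of the two mapreduce.shuffle.* names.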



Best Practices for MapReduce Configuration

The configuration settings described below can reduce inherent latencies in MapReduce execution. You set these values in mapred-site.xml.
Send a heartbeat as soon as a task finishes
Set the mapreduce.tasktracker.outofband.heartbeat property to true to let the TaskTracker send an out-of-band heartbeat on task completion to reduce latency; the default value is false:

<property>
<name>mapreduce.tasktracker.outofband.heartbeat</name>
<value>true</value>
</property>
Reduce the interval for JobClient status reports on single node systems
The jobclient.progress.monitor.poll.interval property defines the interval (in milliseconds) at which JobClient reports status to the console and checks for job completion. The default value is 1000 milliseconds; you may want to set this to a lower value to make tests run faster on a single-node cluster. Adjusting this value on a large production cluster may lead to unwanted client-server traffic.

<property>
<name>jobclient.progress.monitor.poll.interval</name>
<value>10</value>
</property>
Tune the JobTracker heartbeat interval
Tuning the minimum interval for the TaskTracker-to-JobTracker heartbeat to a smaller value may improve MapReduce performance on small clusters.

<property>
<name>mapreduce.jobtracker.heartbeat.interval.min</name>
<value>10</value>
</property>
Start MapReduce JVMs immediately
The mapred.reduce.slowstart.completed.maps property specifies the proportion of Map tasks in a job that must be completed before any Reduce tasks are scheduled. For small jobs that require fast turnaround, setting this value to 0 can improve performance; larger values (as high as 50%) may be appropriate for larger jobs.

<property>
<name>mapred.reduce.slowstart.completed.maps</name>
<value>0</value>
</property>
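
To get a feel for the slowstart setting, the sketch below computes how many map tasks would have to finish before reducers are scheduled. It assumes the value is written as a fraction (0.5 for 50%) and that the threshold is rounded up; both are assumptions about the scheduler, and maps_before_reduce is a hypothetical name.

```shell
# Hypothetical helper: number of map tasks that must complete before
# reducers are scheduled, assuming ceil(total_maps * fraction).
maps_before_reduce() {
  awk -v total="$1" -v frac="$2" 'BEGIN {
    x = total * frac
    n = int(x)
    if (n < x) n++        # round up to a whole task
    print n
  }'
}

maps_before_reduce 200 0.5    # larger job, 50%
maps_before_reduce 200 0      # small job, reducers may start immediately
```

Under those assumptions the two calls print 100 and 0.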


Best Practices for HDFS Configuration

This section indicates changes you may want to make in hdfs-site.xml.
Improve Performance for Local Reads

  Note:
Also known as short-circuit local reads, this capability is particularly useful for HBase and Cloudera Impala™. It improves performance by giving node-local reads a fast path. It requires libhadoop.so (the Hadoop Native Library) to be accessible to both the server and the client.
libhadoop.so is not available if you have installed from a tarball. You must install from an .rpm, .deb, or parcel in order to use short-circuit local reads.



Configure the following properties in hdfs-site.xml as shown:

<property>
<name>dfs.client.read.shortcircuit</name>
<value>true</value>
</property>
<property>
<name>dfs.client.read.shortcircuit.streams.cache.size</name>
<value>1000</value>
</property>

<property>
<name>dfs.client.read.shortcircuit.streams.cache.expiry.ms</name>
<value>1000</value>
</property>
<property>
<name>dfs.domain.socket.path</name>
<value>/var/run/hadoop-hdfs/dn._PORT</value>
</property>
  Note:
The text _PORT appears just as shown; you do not need to substitute a number.



If /var/run/hadoop-hdfs/ is group-writable, make sure its group is root.
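
The ownership rule above can be checked with a short script. This is a sketch only: socket_dir_ok is a hypothetical helper, and reading the mode and group from ls -ld output is one possible approach, not something the text prescribes.

```shell
# Hypothetical helper: given an ls -ld style mode string and a group name,
# succeed if the directory satisfies the rule above.
socket_dir_ok() {
  mode="$1"; group="$2"
  case "$mode" in
    ?????w*) [ "$group" = "root" ] ;;   # group-writable: group must be root
    *)       return 0 ;;                # not group-writable: nothing to check
  esac
}

DIR=/var/run/hadoop-hdfs
if [ -d "$DIR" ]; then
  # Fields 1 and 4 of ls -ld are the mode string and the group.
  dir_mode=$(ls -ld "$DIR" | awk '{print $1}')
  dir_group=$(ls -ld "$DIR" | awk '{print $4}')
  if socket_dir_ok "$dir_mode" "$dir_group"; then
    echo "$DIR: ok"
  else
    echo "$DIR: group-writable but group is $dir_group"
  fi
fi
```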



Tips and Best Practices for Jobs

This section describes changes you can make at the job level.
Use the Distributed Cache to Transfer the Job JAR
Use the distributed cache to transfer the job JAR rather than using the JobConf(Class) constructor and the JobConf.setJar() and JobConf.setJarByClass() methods.
To add JARs to the classpath, use -libjars <jar1>,<jar2>, which will copy the local JAR files to HDFS and then use the distributed cache mechanism to make sure they are available on the task nodes and are added to the task classpath.
The advantage of this over JobConf.setJar is that if the JAR is on a task node it won't need to be copied again if a second task from the same job runs on that node, though it will still need to be copied from the launch machine to HDFS.

  Note:
-libjars works only if your MapReduce driver uses ToolRunner. If it doesn't, you would need to use the DistributedCache APIs (Cloudera does not recommend this).
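
A command line using -libjars might be assembled as in the sketch below. The names myjob.jar, com.example.MyDriver, the lib/ paths, and the join_jars helper are all placeholders, and the command is printed rather than executed. It also assumes the driver uses ToolRunner, as the note above requires.

```shell
# Hypothetical helper: join JAR paths with commas, the form -libjars expects.
join_jars() {
  old_ifs=$IFS
  IFS=,
  printf '%s\n' "$*"      # "$*" joins the arguments with the first character of IFS
  IFS=$old_ifs
}

# Placeholder JAR paths.
LIBJARS=$(join_jars lib/a.jar lib/b.jar)

# Generic options such as -libjars go after the driver class and before
# the job's own arguments. The command is only printed here.
echo hadoop jar myjob.jar com.example.MyDriver -libjars "$LIBJARS" input output
```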



For more information, see item 1 in the blog post "How to Include Third-Party Libraries in Your MapReduce Job".
Changing the Logging Level on a Job (MRv1)
You can change the logging level for an individual job. You do this by setting the following properties in the job configuration (JobConf):


  • mapreduce.map.log.level
  • mapreduce.reduce.log.level

Valid values are NONE, INFO, WARN, DEBUG, TRACE, and ALL.
Example:

import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf();
...
// Per-job log levels for map tasks and reduce tasks
conf.set("mapreduce.map.log.level", "DEBUG");
conf.set("mapreduce.reduce.log.level", "TRACE");
...


Thread URL: https://www.yunweiku.com/thread-312429-1-1.html