
[经验分享] Hadoop & Hive Performance Tuning

Reposted from: http://cloud-computation.blogspot.com/2011/09/hadoop-performance-tuning-hadoop-hive.html

Hadoop cluster performance tuning is a little hectic, because the Hadoop framework uses every type of resource to process and analyze data, so there is no single static set of good parameter values. Parameter values should be chosen based on the following characteristics of the cluster:


  • Operating system
  • Processor and its number of cores
  • Memory (RAM)
  • Number of nodes in the cluster
  • Storage capacity of each node
  • Network bandwidth
  • Amount of input data
  • Number of jobs in the business logic
  

The recommended OS for Hadoop clusters is Linux: Windows and other GUI-based operating systems run a lot of graphical (GUI) processes that occupy a large share of memory.


Storage capacity of each node should have at least 5 GB to spare after the distributed HDFS input data is stored. For example, with 1 TB of input data on a 1000-node cluster, (1024 GB x 3 (replication factor)) / 1000 nodes = approximately 3 GB of distributed data per node, so it is recommended to have at least 8 GB of storage on each node, because each DataNode also writes logs and needs some space for memory swapping.



Network bandwidth of at least 100 Mbps is recommended: as is well known, Hadoop moves a lot of data over the network while loading data into HDFS and while processing it. A lower-bandwidth channel degrades the performance of the whole cluster.


Number of nodes required for a cluster depends on the amount of data to be processed and the capacity of each node. For example, a node with 2 GB of memory and a 2-core processor can process 1 GB of data in average time, handling 2 data blocks (of 256 MB or 512 MB) simultaneously. To process 5 TB of data and produce results in a few minutes, it is recommended to have 1000 nodes, each with a 4-to-8-core processor and 8-to-10 GB of memory.







Hadoop Parameters:

Data block size (Chunk size):
       The dfs.block.size parameter lives in the hdfs-site.xml file; its value is given in bytes. Block size should be chosen based on each node's memory capacity: if memory is small, set a smaller block size, because a TaskTracker brings a whole block of data into memory while processing. So with 512 MB of RAM, it is advisable to set the block size to 64 MB or 128 MB. If the node has a dual-core processor, the TaskTracker can process 2 blocks at the same time, so two data blocks will be brought into memory at once; plan for that, and set the TaskTracker's concurrent-task parameter accordingly.
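As a minimal sketch (assuming the pre-YARN parameter names this post uses, with the value in bytes), a 128 MB block size would look like this in hdfs-site.xml:

    <property>
        <name>dfs.block.size</name>
        <!-- 128 MB in bytes: 128 x 1024 x 1024 -->
        <value>134217728</value>
    </property>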


Number of Maps and Reducers:
          The mapred.reduce.tasks and mapred.map.tasks parameters live in the mapred-site.xml file. By default, the number of maps equals the number of data blocks: for example, with 2 GB of input data and a 256 MB block size, 8 maps will run, regardless of memory capacity and number of processors. So we need to tune this parameter to number of nodes x number of cores per node.


Number of Maps = Total number of processor cores available in the cluster.


In the example above, 8 maps run; if the cluster has only 4 processor cores, multiple threads will start running and keep swapping memory, which degrades cluster performance. Set the number of reducers to the number of cores in the cluster in the same way: after the map phase is over, most nodes go idle while a few nodes work to finish the reduce phase, so to make the reduce phase finish fast, set its value to the number of nodes or the number of processor cores.
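For illustration only, assuming a hypothetical cluster of 4 nodes with 2 cores each (8 cores total), mapred-site.xml could set the task counts as below; note that mapred.map.tasks is treated by the framework as a hint rather than a hard limit:

    <property>
        <name>mapred.map.tasks</name>
        <!-- nodes x cores per node = 4 x 2 = 8 -->
        <value>8</value>
    </property>
    <property>
        <name>mapred.reduce.tasks</name>
        <value>8</value>
    </property>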


Logging Level:  Set HADOOP_ROOT_LOGGER=ERROR in the hadoop script file. By default it is set to INFO; in that mode Hadoop logs all information, including every event, completed jobs and tasks, I/O info, warnings, and errors. Switching to ERROR will not bring a huge performance improvement, but it reduces the number of log-file I/Os and gives a small gain.
  

I am testing these parameters with the Hadoop and Hive frameworks using SQL-style queries. To check the performance improvement from each configuration parameter, I use sample data of 100 million records and run some complex queries through the Hive interface on top of Hadoop. In this second part we will look at a few more Hadoop configuration parameters to get the maximum performance improvement out of a Hadoop cluster.


Map output compression (mapred.compress.map.output)
By default this value is set to false. It is recommended to set it to true for clusters with a large amount of input data to be processed, because compression makes data transfer between nodes faster. Map output does not move directly to the reducers; it is first written to disk, so this setting also saves disk space and speeds up disk reads/writes. It is not recommended to set this parameter to true for small amounts of input data, because compressing and decompressing then adds more processing time than it saves; for big data, the compression and decompression time is small compared to the time saved in transfer and disk I/O.
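A minimal sketch of enabling this in mapred-site.xml:

    <property>
        <name>mapred.compress.map.output</name>
        <!-- compress intermediate map output before it is spilled and shuffled -->
        <value>true</value>
    </property>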



Once we set the above configuration parameter to true, other dependent parameters become active, namely the compression technique (codec) and the compression type.


Compression method/technique/codec (mapred.map.output.compression.codec)
The default value of this parameter is org.apache.hadoop.io.compress.DefaultCodec; another available codec is org.apache.hadoop.io.compress.GzipCodec. DefaultCodec takes more time but compresses more; the LZO method takes less time but compresses less. Your own codec can also be added: add whichever codec or compression library is best suited for your input data type.


The mapred.map.output.compression.type parameter identifies on what basis data should be compressed; it can be set to either RECORD or BLOCK. RECORD is the default type, in which each individual value is compressed on its own. BLOCK is the recommended type, in which a whole block of key-value pairs is compressed together, which also helps with sorting on the reducer side. In Cloudera's Hadoop distribution the default type is set to BLOCK for better performance.
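A sketch of both settings, following the parameter names used in this post (GzipCodec ships with Hadoop; an LZO codec requires installing a separate library such as hadoop-lzo, and in some Hadoop versions the map-output compression type is not separately configurable, so verify against your version):

    <property>
        <name>mapred.map.output.compression.codec</name>
        <value>org.apache.hadoop.io.compress.GzipCodec</value>
    </property>
    <property>
        <name>mapred.map.output.compression.type</name>
        <!-- BLOCK compresses groups of key-value pairs together -->
        <value>BLOCK</value>
    </property>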


There are three more related configuration parameters:


1. mapred.output.compress
2. mapred.output.compression.type
3. mapred.output.compression.codec


The same rules as above apply here, but these parameters are meant for the MapReduce job output: the earlier parameters controlled compression of the map output alone, while these three specify whether the final job output should be compressed, and with which type and codec.
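A hedged sketch of compressing the final job output (the type setting applies when the job writes SequenceFiles):

    <property>
        <name>mapred.output.compress</name>
        <value>true</value>
    </property>
    <property>
        <name>mapred.output.compression.type</name>
        <!-- RECORD or BLOCK -->
        <value>BLOCK</value>
    </property>
    <property>
        <name>mapred.output.compression.codec</name>
        <value>org.apache.hadoop.io.compress.GzipCodec</value>
    </property>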


More configuration parameters regarding Hadoop and Hive performance tuning will be discussed in upcoming posts.

Before looking at more configuration parameters for performance tuning, I would like to ask you a question: have you ever observed the JobTracker and TaskTracker web UI? There you can see a lot of jobs being killed a few seconds or minutes before completion. Why? Have you ever thought about it? Of course, a few of you already know; if so, please skip the next paragraph.

[NOTE: To reach the web UI of a Hadoop cluster, open a browser and go to http://<masternode-ip or localhost>:<port>. The port can be changed by setting the corresponding configuration parameter to the port number you want.]



Name            Port     Configuration parameter
JobTracker      50030    mapred.job.tracker.http.address
TaskTrackers    50060    mapred.task.tracker.http.address
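For example, to move the JobTracker UI to a hypothetical port 50031, mapred-site.xml could contain:

    <property>
        <name>mapred.job.tracker.http.address</name>
        <!-- bind address and port of the JobTracker web UI -->
        <value>0.0.0.0:50031</value>
    </property>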




As you know, the data to be processed is replicated on multiple nodes, so while processing, Hadoop may also start processing the same (replicated) data chunk on several nodes. The node that finishes first kills the duplicate tasks processing the same data on the other nodes. The advantage is that the job completes sooner: whichever copy finishes first wins and its result is the one used, and a node whose task was killed moves on to the next task (next data chunk). Hadoop works like this by default, because the following parameter values are set to true:


mapred.map.tasks.speculative.execution
mapred.reduce.tasks.speculative.execution

Is this behavior suitable for all types of applications? No.

If the data to be processed involves complex calculations, each task takes a long time, and executing it on multiple nodes wastes time that could be used to process unique data chunks. Hence, for applications that perform more complex operations, it is recommended to set these parameter values to false.
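A minimal sketch of turning speculative execution off in mapred-site.xml:

    <property>
        <name>mapred.map.tasks.speculative.execution</name>
        <value>false</value>
    </property>
    <property>
        <name>mapred.reduce.tasks.speculative.execution</name>
        <value>false</value>
    </property>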

io.sort.mb (buffer size for sorting)

The default value is 100, meaning 100 MB of buffer memory for sorting. After map tasks finish processing, Hadoop sorts the map outputs for the reducers; if the map outputs are large, it is recommended to increase this value. Also consider the node's memory size when increasing it, since the buffer lives in RAM. This parameter can give a good performance improvement, because a larger buffer means less data spilled to disk, which reduces read/write operations on disk spills.

io.sort.factor (stream merging factor)
The default value of this parameter is 10. As above, it is recommended to increase it for jobs with larger output. The value says how many streams can be merged at once while sorting; increasing it improves performance because it reduces the number of intermediate merge passes.
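A sketch raising both values for a job with large map output; the 200 MB and 50 are illustrative numbers, and the sort buffer lives inside each task JVM's heap (mapred.child.java.opts), which must be sized to hold it:

    <property>
        <name>io.sort.mb</name>
        <!-- in-memory sort buffer per task, in MB (default 100) -->
        <value>200</value>
    </property>
    <property>
        <name>io.sort.factor</name>
        <!-- number of spill streams merged at once (default 10) -->
        <value>50</value>
    </property>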
The suggestions above were observed on a Hadoop cluster with Hive querying. If any information discussed here is misinterpreted, please leave a suggestion in the comments.
