Reposted from: http://cloud-computation.blogspot.com/2011/09/hadoop-performance-tuning-hadoop-hive.html
Hadoop cluster performance tuning can be a little tricky, because the Hadoop framework uses every kind of resource while processing and analyzing data, so there is no single static set of good parameter values. Parameter values should be chosen based on the following characteristics of the cluster:
- Operating system
- Processor and its number of cores
- Memory (RAM)
- Number of nodes in the cluster
- Storage capacity of each node
- Network bandwidth
- Amount of input data
- Number of jobs in the business logic
The recommended OS for Hadoop clusters is Linux, because Windows and other GUI-based operating systems run many graphical (GUI) processes that occupy a large share of memory.
The storage capacity of each node should have at least 5 GB of headroom after storing its share of the distributed HDFS input data. For example, with 1 TB of input data on a 1000-node cluster, (1024 GB × 3 (replication factor)) / 1000 nodes ≈ 3 GB of distributed data per node, so it is recommended to have at least 8 GB of storage on each node, because each DataNode also writes logs and needs some space for memory swapping.
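In general form (this formula is my restatement of the arithmetic above, not from the original post):

$$ \text{data per node} \approx \frac{\text{input size} \times \text{replication factor}}{\text{number of nodes}} = \frac{1024\ \text{GB} \times 3}{1000} \approx 3\ \text{GB} $$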
Network bandwidth of at least 100 Mbps is recommended: as is well known, Hadoop moves a lot of data over the network while loading data into HDFS and while processing it, and a low-bandwidth channel degrades the performance of the whole cluster.
The number of nodes required for a cluster depends on the amount of data to be processed and the capacity of each node. For example, a node with 2 GB of memory and a 2-core processor can process 1 GB of data in average time, and it can process two data blocks (of 256 MB or 512 MB) simultaneously. To process 5 TB of data and produce results in a few minutes, it is recommended to have on the order of 1000 nodes with a 4-to-8-core processor and 8-to-10 GB of memory in each node.
Hadoop Parameters:
Data block size (Chunk size):
The dfs.block.size parameter lives in the hdfs-site.xml file; its value is given in bytes. Block size should be chosen based on each node's memory capacity: if memory is small, set a smaller block size, because the TaskTracker brings a whole block of data into memory while processing it. So with 512 MB of RAM, it is advisable to set the block size to 64 MB or 128 MB. On a dual-core processor the TaskTracker can process two blocks of data at the same time, which means two blocks will be brought into memory at once; plan the block size accordingly, and set the concurrent-task TaskTracker parameter to match.
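For instance, a 128 MB block size could be configured like this (a minimal sketch; 134217728 bytes = 128 × 1024 × 1024):

```xml
<!-- hdfs-site.xml: set the HDFS block (chunk) size to 128 MB -->
<property>
  <name>dfs.block.size</name>
  <value>134217728</value> <!-- value is in bytes: 128 * 1024 * 1024 -->
</property>
```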
Number of Maps and Reducers:
The mapred.reduce.tasks and mapred.map.tasks parameters live in the mapred-site.xml file. By default, the number of maps equals the number of data blocks: for example, with 2 GB of input data and a 256 MB block size, 8 maps will run, regardless of memory capacity and the number of processors. So we need to tune this parameter to (number of nodes) × (number of cores per node):
Number of Maps = total number of processor cores available in the cluster.
In the example above, 8 maps are launched; if the cluster has only 4 processor cores, multiple threads will run concurrently and keep swapping data in memory, which degrades the performance of the Hadoop cluster. In the same way, set the number of reducers to the number of cores in the cluster. After the map phase finishes, most nodes would otherwise sit idle while a few nodes finish the reduce work; to make the reduce phase complete quickly, set its value to the number of nodes or the number of processor cores, as in the sketch below.
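As a sketch, for a hypothetical cluster of 10 nodes with 4 cores each (these numbers are illustrative, not from the post), mapred-site.xml would get:

```xml
<!-- mapred-site.xml: 10 nodes x 4 cores = 40 total cores (hypothetical cluster) -->
<property>
  <name>mapred.map.tasks</name>
  <value>40</value> <!-- number of maps = total processor cores in the cluster -->
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>40</value> <!-- same rule applied to the reduce side -->
</property>
```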
Logging level: set HADOOP_ROOT_LOGGER=ERROR in the hadoop launch script. By default it is set to INFO mode, in which Hadoop logs everything: events, completed jobs and tasks, I/O info, warnings, and errors. Changing it won't yield a huge performance improvement, but it reduces the number of log-file I/Os and gives a small gain.
I am testing these parameters with the Hadoop and Hive frameworks using SQL-style queries. To check the performance improvement from each configuration parameter, I use a sample of 100 million records and run some complex queries through the Hive interface on top of Hadoop. In this part 2 we will look at a few more Hadoop configuration parameters for getting the maximum performance improvement out of a Hadoop cluster.
Map output compression (mapred.compress.map.output)
By default this value is set to false; it is recommended to set it to true for clusters with a large amount of input data to be processed, because compression makes data transfer between nodes faster. Map output does not move directly to the reducers; it is first written to disk, so this setting also saves disk space and speeds up disk reads and writes. It is not recommended to set this parameter to true for small amounts of input data, because the extra time spent compressing and decompressing will outweigh the savings; for big data, however, compression and decompression time is small compared to the time saved in transfer and disk read/write.
Once this parameter is set to true, the dependent parameters become active: the compression technique (codec) and the compression type.
Compression method, technique, or codec (mapred.map.output.compression.codec)
The default value for this parameter is org.apache.hadoop.io.compress.DefaultCodec; another available codec is org.apache.hadoop.io.compress.GzipCodec. DefaultCodec takes more time but gives more compression, while the LZO method takes less time but the amount of compression is smaller. Your own codec can also be added; use the codec or compression library best suited to your input data type.
The mapred.map.output.compression.type parameter determines the basis on which data is compressed; it can be set to RECORD or BLOCK. RECORD, the default, compresses each individual value on its own. BLOCK is the recommended type: it compresses batches of key-value pairs together, which compresses better and helps when sorting data on the reducer side. In Cloudera's Hadoop distribution, the default type is set to BLOCK for better performance.
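Putting the three map-output settings together (a sketch using the parameter names given above, which are the older pre-YARN names; the Gzip codec is just an example choice):

```xml
<!-- mapred-site.xml: compress intermediate map output -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value> <!-- example codec -->
</property>
<property>
  <name>mapred.map.output.compression.type</name>
  <value>BLOCK</value> <!-- compress batches of key-value pairs together -->
</property>
```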
There are three more related configuration parameters:
1. mapred.output.compress
2. mapred.output.compression.type
3. mapred.output.compression.codec
The same rules apply here, but these parameters are meant for the final MapReduce job output: the three parameters above control compression for the intermediate map output alone, while these three specify whether the overall job output should be compressed, and with which type and codec, as in the sketch below.
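A matching sketch for the final job output (again with an example codec):

```xml
<!-- mapred-site.xml: compress the final MapReduce job output -->
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value> <!-- RECORD or BLOCK -->
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value> <!-- example codec -->
</property>
```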
More configuration parameters for Hadoop and Hive performance tuning will be discussed in upcoming posts.
Before looking at some more configuration parameters for performance tuning, I'd like to ask you a question: have you ever watched the JobTracker and TaskTracker WebUIs? There you can see a lot of tasks being killed a few seconds or minutes before completion. Why so? Have you ever thought about it? Of course, a few of you already know; if so, please skip the next paragraph.
[NOTE: To check the WebUI of a Hadoop cluster, open a browser and go to http://masternode-machine-ip (or) localhost:portnumber. The port number can be changed by setting the corresponding configuration parameter to the port number we want.]
| Name | Port | Configuration parameter |
|------|------|-------------------------|
| JobTracker | 50030 | mapred.job.tracker.http.address |
| TaskTrackers | 50060 | mapred.task.tracker.http.address |
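For example, to move the JobTracker WebUI from 50030 to another port (a sketch; 0.0.0.0 means bind on all interfaces):

```xml
<!-- mapred-site.xml: serve the JobTracker WebUI on port 50031 instead of 50030 -->
<property>
  <name>mapred.job.tracker.http.address</name>
  <value>0.0.0.0:50031</value>
</property>
```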
As you know, the data to be processed is replicated on multiple nodes, so while processing, Hadoop will also start processing the same (replicated) data chunk on multiple nodes. The node that completes first kills the other tasks that are processing the same data. The advantage is that the job completes sooner: whichever copy finishes first wins, its output is the one used, and once a task is killed on a node, that node starts processing the next task (the next data chunk). Hadoop works like this by default, because the following parameter values are set to true:
mapred.map.tasks.speculative.execution
mapred.reduce.tasks.speculative.execution
Is this behavior suitable for every type of application? No. If the data to be processed involves complex calculations, each task takes more time precisely because it is also being executed on multiple nodes, and that time could instead be used to process unique data chunks. Hence for applications that perform more complex operations, it is recommended to set these parameter values to false, as in the sketch below.
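For such compute-heavy workloads, both flags can be switched off (a minimal sketch):

```xml
<!-- mapred-site.xml: disable speculative execution for map and reduce tasks -->
<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>false</value>
</property>
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
</property>
```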
io.sort.mb (buffer size for sorting)
The default value is 100, meaning 100 MB of buffer memory for sorting. After map tasks finish processing, Hadoop sorts the map outputs for the reducers. If map outputs are large, it is recommended to increase this value, while also considering the node's memory size, since the buffer lives in RAM. This parameter can give a good performance improvement because a larger buffer means a smaller amount of data spilled to disk, which reduces the read/write operations for those spills.
io.sort.factor (stream merging factor)
The default value for this parameter is 10; it is recommended to increase it for jobs with larger output, similar to the above. This value tells how many streams can be merged at once while sorting, and raising it improves performance by reducing the number of intermediate merge passes.
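A combined sketch of both sort-side settings (the values shown are illustrative, not recommendations; size io.sort.mb against the available RAM):

```xml
<!-- job configuration (e.g. mapred-site.xml): bigger sort buffer and merge factor -->
<property>
  <name>io.sort.mb</name>
  <value>200</value> <!-- MB of in-memory buffer for sorting map output (default 100) -->
</property>
<property>
  <name>io.sort.factor</name>
  <value>50</value> <!-- number of streams merged at once (default 10) -->
</property>
```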
The suggestions above were observed on a Hadoop cluster with Hive querying. If any information discussed here is misinterpreted, please leave a suggestion in the comments.