设为首页 收藏本站
查看: 744|回复: 0

[经验分享] hadoop中的一次集群任务执行超时问题查找过程

[复制链接]

尚未签到

发表于 2016-12-10 09:39:45 | 显示全部楼层 |阅读模式
 
 

问题背景

 
本次进行一个项目的重构,在某些活动数据量比较大的情况下,会偶尔出现1200s超时的情况,如下:
 

AttemptID:attempt_1410771599055_11709_m_000033_0 Timed out after 1200 secs

  
而hadoop会不断启动备份任务进行重试,重试也许成功,但失败的概率还是比较大:
 
DSC0000.png
 
 
经过分析,hadoop的任务都有个超时时间,使用下面的参数设置,表示1200s后如果没有进展,就会任务该任务超时,将其状态设置为FAILED。
 

-Dmapreduce.task.timeout=1200000
 
到底因为什么原因导致超时?为了继续分析这个问题,我们将这个参数设置得非常之大。
 

调整超时参数

 
将超时时间设置为24小时之后,发现任务不会FAILED,但是其执行了大概40多个小时,仍然还没有执行完成。
 
还好我们在任务执行过程中打了不少的log,以帮助分析问题。经过日志的分析,我们发现有下面的现象:

2014-09-22 00:17:29,005 INFO [main] com.xxx.yo.phase1.Phase1Mapper: history(caid: 2000037, superid: 77477) is collected!
2014-09-22 00:17:29,005 INFO [main] com.xxx.yo.phase1.Phase1Mapper: history(caid: 2000037, superid: 77477) is collected!
2014-09-22 00:17:29,005 INFO [main] com.xxx.yo.phase1.Phase1Mapper: history(caid: 2000037, superid: 77477) is collected!
2014-09-22 01:17:29,054 INFO [main] com.xxx.yo.phase1.Phase1Mapper: history(caid: 2000037, superid: 120096) is collected!
2014-09-22 01:17:29,064 INFO [main] com.xxx.yo.phase1.Phase1Mapper: history(caid: 2000037, superid: 120096) is collected!
2014-09-22 01:17:29,064 INFO [main] com.xxx.yo.phase1.Phase1Mapper: history(caid: 2000037, superid: 120096) is collected!
2014-09-22 01:17:29,064 INFO [main] com.xxx.yo.phase1.Phase1Mapper: history(caid: 2000037, superid: 120096) is collected!
...
2014-09-22 01:17:36,590 INFO [main] com.xxx.yo.phase1.Phase1Mapper: history(caid: 2000037, superid: 164747) is collected!
2014-09-22 01:17:36,590 INFO [main] com.xxx.yo.phase1.Phase1Mapper: history(caid: 2000037, superid: 164747) is collected!
2014-09-22 01:17:36,590 INFO [main] com.xxx.yo.phase1.Phase1Mapper: history(caid: 2000037, superid: 164747) is collected!
2014-09-22 01:17:36,590 INFO [main] com.xxx.yo.phase1.Phase1Mapper: history(caid: 2000037, superid: 164747) is collected!
2014-09-22 02:17:36,674 INFO [main] com.xxx.yo.phase1.Phase1Mapper: history(caid: 2000037, superid: 158198) is collected!
2014-09-22 02:17:36,683 INFO [main] com.xxx.yo.phase1.Phase1Mapper: history(caid: 2000037, superid: 158198) is collected!
2014-09-22 02:17:36,683 INFO [main] com.xxx.yo.phase1.Phase1Mapper: history(caid: 2000037, superid: 158198) is collected!
2014-09-22 02:17:36,683 INFO [main] com.xxx.yo.phase1.Phase1Mapper: history(caid: 2000037, superid: 158198) is collected!
....
2014-09-22 02:17:40,888 INFO [main] com.xxx.yo.phase1.Phase1Mapper: history(caid: 2000037, superid: 203233) is collected!
2014-09-22 02:17:40,888 INFO [main] com.xxx.yo.phase1.Phase1Mapper: history(caid: 2000037, superid: 203233) is collected!
2014-09-22 02:17:40,888 INFO [main] com.xxx.yo.phase1.Phase1Mapper: history(caid: 2000037, superid: 203233) is collected!
2014-09-22 03:17:40,925 INFO [main] com.xxx.yo.phase1.Phase1Mapper: history(caid: 2000037, superid: 79188) is collected!
2014-09-22 03:17:40,934 INFO [main] com.xxx.yo.phase1.Phase1Mapper: history(caid: 2000037, superid: 79188) is collected!
2014-09-22 03:17:40,934 INFO [main] com.xxx.yo.phase1.Phase1Mapper: history(caid: 2000037, superid: 79188) is collected!
2014-09-22 03:17:40,934 INFO [main] com.xxx.yo.phase1.Phase1Mapper: history(caid: 2000037, superid: 79188) is collected!
 
日志分析得出的结论便是,程序总会在某个时间点休息3600秒(大概1个小时),然后再执行一会儿,便又休息3600秒。
 

hadoop configuration中得出初步结论

 
我们在这1个小时对该java进程进行监控,发现该进程在此期间(jstack命令查看其日志),一直在一个点等待:

"main" prio=10 tid=0x000000000293f000 nid=0x1e06 runnable [0x0000000041b20000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:228)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:81)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
- locked <0x00000006e243c3f0> (a sun.nio.ch.Util$2)
- locked <0x00000006e243c3e0> (a java.util.Collections$UnmodifiableSet)
- locked <0x00000006e243c1a0> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.readChannelFully(PacketReceiver.java:258)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:209)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:102)
at org.apache.hadoop.hdfs.RemoteBlockReader2.readNextPacket(RemoteBlockReader2.java:170)
at org.apache.hadoop.hdfs.RemoteBlockReader2.read(RemoteBlockReader2.java:135)
- locked <0x00000006e12dcc78> (a org.apache.hadoop.hdfs.RemoteBlockReader2)
at org.apache.hadoop.hdfs.DFSInputStream$ByteArrayStrategy.doRead(DFSInputStream.java:642)
at org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:698)
- eliminated <0x00000006e12dcc18> (a org.apache.hadoop.hdfs.DFSInputStream)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:752)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:793)
- locked <0x00000006e12dcc18> (a org.apache.hadoop.hdfs.DFSInputStream)
at java.io.DataInputStream.read(DataInputStream.java:149)
at com.xxx.app.MzSequenceFile$PartInputStream.read(MzSequenceFile.java:451)
at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:159)
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:143)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
at java.io.DataInputStream.readFully(DataInputStream.java:195)
at org.apache.hadoop.io.Text.readFields(Text.java:292)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:71)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:42)
at com.xxx.app.MzSequenceFile$Reader.deserializeValue(MzSequenceFile.java:672)
at com.xxx.app.MzSequenceFile$Reader.next(MzSequenceFile.java:684)
at com.xxx.app.MzSequenceFile$Reader.next(MzSequenceFile.java:692)
at com.xxx.yo.io.CombineFileRawLogReader.streamNext(CombineFileRawLogReader.java:284)
at com.xxx.yo.io.CombineFileRawLogReader.next(CombineFileRawLogReader.java:342)
at com.xxx.yo.io.CampaignRawLogReader.next(CampaignRawLogReader.java:73)
at com.xxx.yo.io.CampaignRawLogReader.next(CampaignRawLogReader.java:23)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:197)
- locked <0x00000006e01dd3e0> (a org.apache.hadoop.mapred.MapTask$TrackedRecordReader)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:183)
- locked <0x00000006e01dd3e0> (a org.apache.hadoop.mapred.MapTask$TrackedRecordReader)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
 
当时初步怀疑是JDK版本问题,Java NIO也确实存在着臭名昭著的epoll空轮询导致CPU100%问题,但这个bug在JDK6中的高版本就已经解决了,更何况我们使用的是1.7。而且我们通过top -p 进程id的方式查看其CPU占用率为0,排除了该bug。
 
经过日志的初步分析,发现3600s这个线索,从job的configuration中,初步查找出参数dfs.client.socket-timeout,单位毫秒。

-Ddfs.client.socket-timeout=3600000
 
试验性地将这个参数修改为60ms,可以看出出现超时的概率非常大,但会不断重试以继续:
 

2014-09-26 12:53:03,184 WARN [main] org.apache.hadoop.hdfs.DFSClient: Failed to connect to /192.168.7.22:50010 for block, add to deadNodes and continue.
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/192.168.7.17:22051 remote=/192.168.7.22:50010]
java.net.SocketTimeoutException: 60 millis timeout while waiting for channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/192.168.7.17:22051 remote=/192.168.7.22:50010]
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
at java.io.FilterInputStream.read(FilterInputStream.java:83)
at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1490)
at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:392)
at org.apache.hadoop.hdfs.BlockReaderFactory.newBlockReader(BlockReaderFactory.java:131)
at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:1108)
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:533)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:749)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:793)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:601)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at com.xxx.app.MzSequenceFile$Reader.init(MzSequenceFile.java:521)
at com.xxx.app.MzSequenceFile$Reader.<init>(MzSequenceFile.java:515)
at com.xxx.app.MzSequenceFile$Reader.<init>(MzSequenceFile.java:505)
at com.xxx.yo.io.CombineFileRawLogReader.<init>(CombineFileRawLogReader.java:146)
at com.xxx.yo.io.CampaignRawLogReader.next(CampaignRawLogReader.java:64)
at com.xxx.yo.io.CampaignRawLogReader.next(CampaignRawLogReader.java:22)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:197)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:183)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
 
于是,最终将这个参数设置为60s,这其实也是集群最终默认的超时时间,由于之前一次不明就里的优化,导致了后续这些问题的发生,因此在调整参数时,一定要注意了解清楚该参数造成的影响。
 

简要分析的结论

 
在Mapper端读取HDFS上的文件时,可能由于网络原因(由于我们的Split切分地比较大,因此不可能做到完全数据本地化)导致读取数据超时,原来居然设置成1个小时,而任务的超时时间仅设置为20分钟,因此只要发生读取数据超时,就必然会引起任务超时。
 
通过这次分析过程,学到了很多查找问题的方式,包括通过现象分析规律,得到线索,最终查找问题的原因。快速测试,不能忽略哪怕一个小的Exception,不行就是分析hadoop的源码,掌握其运行时行为。
 
  对于悬挂的task,如果task tracker在一段时间(默认是10min,可以通过mapred.task.timeout属性值来设置,单位是毫秒)内一直没有收到它的进度报告,则把它标记为失效,这时候就会出现前面所说的超时错误。
  TaskTracker通过心跳包告知JobTracker某个task attempt失败了,则JobTracker把该task尽量分配给另外一个task tracker来执行。如果同一个task连续4次(该值可以通过mapred.map.max.attempts和mapred.reduce.max.attempts属性值来设置)都执行失败,那JobTracker就不会再做更多的尝试了,本次job也就宣告失败了。
 

运维网声明 1、欢迎大家加入本站运维交流群:群②:261659950 群⑤:202807635 群⑦870801961 群⑧679858003
2、本站所有主题由该帖子作者发表,该帖子作者与运维网享有帖子相关版权
3、所有作品的著作权均归原作者享有,请您和我们一样尊重他人的著作权等合法权益。如果您对作品感到满意,请购买正版
4、禁止制作、复制、发布和传播具有反动、淫秽、色情、暴力、凶杀等内容的信息,一经发现立即删除。若您因此触犯法律,一切后果自负,我们对此不承担任何责任
5、所有资源均系网友上传或者通过网络收集,我们仅提供一个展示、介绍、观摩学习的平台,我们不对其内容的准确性、可靠性、正当性、安全性、合法性等负责,亦不承担任何法律责任
6、所有作品仅供您个人学习、研究或欣赏,不得用于商业或者其他用途,否则,一切后果均由您自己承担,我们对此不承担任何法律责任
7、如涉及侵犯版权等问题,请您及时通知我们,我们将立即采取措施予以解决
8、联系人Email:admin@iyunv.com 网址:www.yunweiku.com

所有资源均系网友上传或者通过网络收集,我们仅提供一个展示、介绍、观摩学习的平台,我们不对其承担任何法律责任,如涉及侵犯版权等问题,请您及时通知我们,我们将立即处理,联系人Email:kefu@iyunv.com,QQ:1061981298 本贴地址:https://www.yunweiku.com/thread-312205-1-1.html 上篇帖子: hadoop+hbase+hive+zookeeper集群版本升级配置整理 下篇帖子: Hadoop的mapred TaskTracker端源码概览
您需要登录后才可以回帖 登录 | 立即注册

本版积分规则

扫码加入运维网微信交流群X

扫码加入运维网微信交流群

扫描二维码加入运维网微信交流群,最新一手资源尽在官方微信交流群!快快加入我们吧...

扫描微信二维码查看详情

客服E-mail:kefu@iyunv.com 客服QQ:1061981298


QQ群⑦:运维网交流群⑦ QQ群⑧:运维网交流群⑧ k8s群:运维网kubernetes交流群


提醒:禁止发布任何违反国家法律、法规的言论与图片等内容;本站内容均来自个人观点与网络等信息,非本站认同之观点.


本站大部分资源是网友从网上搜集分享而来,其版权均归原作者及其网站所有,我们尊重他人的合法权益,如有内容侵犯您的合法权益,请及时与我们联系进行核实删除!



合作伙伴: 青云cloud

快速回复 返回顶部 返回列表