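I recently resolved a Hadoop startup hang and am writing it down here.
After running start-all.sh, the NameNode's HTTP port was unreachable and Hadoop failed to start, yet the process list showed that all of the Hadoop Java processes were alive. Running jstack against the NameNode produced the following stack trace: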
"main" prio=10 tid=0x00000000419e0000 nid=0x5031 runnable [0x00007fa79e3e0000]
   java.lang.Thread.State: RUNNABLE
    at java.io.FileInputStream.readBytes(Native Method)
    at java.io.FileInputStream.read(FileInputStream.java:199)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
    - locked <0x00007fa796cb5000> (a java.io.BufferedInputStream)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
    - locked <0x00007fa796cb4cc8> (a java.io.BufferedInputStream)
    at sun.security.provider.SeedGenerator$URLSeedGenerator.getSeedByte(SeedGenerator.java:453)
    at sun.security.provider.SeedGenerator.getSeedBytes(SeedGenerator.java:123)
    at sun.security.provider.SeedGenerator.generateSeed(SeedGenerator.java:118)
    at sun.security.provider.SecureRandom.engineGenerateSeed(SecureRandom.java:114)
    at sun.security.provider.SecureRandom.engineNextBytes(SecureRandom.java:171)
    - locked <0x00007fa796cb47d0> (a sun.security.provider.SecureRandom)
    at java.security.SecureRandom.nextBytes(SecureRandom.java:433)
    - locked <0x00007fa796cb4b00> (a java.security.SecureRandom)
    at java.security.SecureRandom.next(SecureRandom.java:455)
    at java.util.Random.nextLong(Random.java:284)
    at org.mortbay.jetty.servlet.HashSessionIdManager.doStart(HashSessionIdManager.java:139)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
    - locked <0x00007fa796cb4490> (a java.lang.Object)
    at org.mortbay.jetty.servlet.AbstractSessionManager.doStart(AbstractSessionManager.java:168)
    at org.mortbay.jetty.servlet.HashSessionManager.doStart(HashSessionManager.java:67)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
    - locked <0x00007fa7965c8108> (a java.lang.Object)
    at org.mortbay.jetty.servlet.SessionHandler.doStart(SessionHandler.java:115)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
    - locked <0x00007fa7965c81c8> (a java.lang.Object)
    at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
    at org.mortbay.jetty.handler.ContextHandler.startContext(ContextHandler.java:537)
    at org.mortbay.jetty.servlet.Context.startContext(Context.java:136)
    at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1234)
    at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:517)
    at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:460)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
    - locked <0x00007fa7966327e0> (a java.lang.Object)
    at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
    at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:156)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
    - locked <0x00007fa796538ed0> (a java.lang.Object)
    at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
    at org.mortbay.jetty.Server.doStart(Server.java:222)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
    - locked <0x00007fa796441098> (a java.lang.Object)
    at org.apache.hadoop.http.HttpServer.start(HttpServer.java:461)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.startHttpServer(NameNode.java:246)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:202)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)
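The process was stuck obtaining a random number while the NameNode was starting its embedded Jetty server.
Since random number generation was the suspect, I wrote a small test program to check it: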
import java.security.SecureRandom;
import java.util.Random;

/**
 * created by haitao.yao @ Apr 1, 2011
 */
public class TestRandom {

    public static void main(String[] args) {
        int count = 10;
        while (count-- > 0) {
            // A fresh SecureRandom per iteration: each new instance seeds itself on first use.
            Random random = new SecureRandom();
            long value = random.nextLong();
            System.out.println(value);
        }
    }
}
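On Linux, java.security.SecureRandom depends on /dev/random (see the linked write-up on how Linux generates random numbers). So I started the test program on the affected server and, while it was running, used jps and lsof to see which process had /dev/random open. The result: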
haitao-yao@haitaoyao-laptop:/data/develop/java/jre/lib/security$ jps && lsof /dev/random
7399 Jps
7382 TestRandom
COMMAND  PID       USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
java    7382 haitao-yao    4r   CHR    1,8      0t0 4402 /dev/random
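The test program was still alive and holding /dev/random open for reading, so I concluded that the random number generation strategy was the cause.
A Google search turned up this Jetty bug (Jetty HashSessionIdManager hangs on startup), and the same problem has also been reported in the JDK bug database (see the linked report). In brief:
java.security.SecureRandom relies on /dev/random to generate random numbers. When the system has not gathered enough entropy (for example, because too few interrupts have occurred), a read from /dev/random can block, and the JDK hangs on that read. Jetty then cannot start, which in turn hangs the entire NameNode startup.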
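To check whether a given machine is affected without going through SecureRandom at all, one option is to read a few bytes from the device directly and give up after a timeout. The following is only a rough sketch: the class name EntropyProbe, the method readsWithin, and the 64-byte / 2-second values are my own illustrative choices, not anything taken from the JDK or from Hadoop.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: reports whether a device file delivers bytes within a time limit.
public class EntropyProbe {

    /** Returns true if `count` bytes could be read from `device` within `timeoutMillis`. */
    public static boolean readsWithin(final String device, final int count, long timeoutMillis)
            throws Exception {
        // Daemon thread: a read that stays blocked must not keep the JVM alive.
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "entropy-probe");
            t.setDaemon(true);
            return t;
        });
        Future<Integer> bytesRead = pool.submit(() -> {
            try (InputStream in = new FileInputStream(device)) {
                byte[] buf = new byte[count];
                int off = 0;
                while (off < count) {
                    int n = in.read(buf, off, count - off);
                    if (n < 0) {
                        break; // end of stream, should not happen for a random device
                    }
                    off += n;
                }
                return off;
            }
        });
        try {
            return bytesRead.get(timeoutMillis, TimeUnit.MILLISECONDS) == count;
        } catch (TimeoutException e) {
            return false; // the read is still blocked
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        for (String dev : new String[] {"/dev/urandom", "/dev/random"}) {
            boolean ok = readsWithin(dev, 64, 2000);
            System.out.println(dev + " -> " + (ok ? "read completed" : "blocked or short read"));
        }
    }
}
```

On a machine showing the hang described here, I would expect the /dev/random line to report a blocked read while /dev/urandom completes, but that depends on the kernel and on how much entropy is available at that moment.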
The fix has also been posted in Sun's bug database: add -Djava.security.egd=file:/dev/urandom to the Java startup options so that /dev/urandom is used as the random source.
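To confirm that the option actually reached the JVM, a small diagnostic like the one below can help. It is a sketch under my own assumptions: the class name SeedSourceInfo and the method describe are made up for illustration; the two property names it reads (the java.security.egd system property and the securerandom.source security property) are the standard JDK ones.

```java
import java.security.SecureRandom;
import java.security.Security;

// Hypothetical diagnostic: prints the configured seed source and times one SecureRandom call.
public class SeedSourceInfo {

    /** Describes the seed source settings visible to this JVM. */
    public static String describe() {
        // Set on the command line with -Djava.security.egd=...; null if not given.
        String egd = System.getProperty("java.security.egd");
        // Default configured in the JRE's lib/security/java.security file.
        String configured = Security.getProperty("securerandom.source");
        return "java.security.egd=" + egd + ", securerandom.source=" + configured;
    }

    public static void main(String[] args) {
        System.out.println(describe());
        long start = System.nanoTime();
        new SecureRandom().nextLong(); // forces the generator to seed itself
        long millis = (System.nanoTime() - start) / 1000000L;
        System.out.println("first SecureRandom value took " + millis + " ms");
    }
}
```

Running it as java -Djava.security.egd=file:/dev/urandom SeedSourceInfo should echo the option back and return quickly. For Hadoop itself, I would expect the option to go into the JVM options variable in conf/hadoop-env.sh (HADOOP_OPTS), though the exact variable to use may depend on the Hadoop version.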
For an analysis of the differences between /dev/random and /dev/urandom, see the linked article; I won't go into it here.
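Summary:
Hadoop exposes its internal state as HTML pages over HTTP, which is a very nice feature in a distributed system design. But because it embeds Jetty, and the embedded Jetty is not very configurable from within Hadoop, this problem surfaced. I really don't understand why Hadoop doesn't start its embedded Jetty in a separate thread. Serving status pages is, after all, not the NameNode's main job, and letting such an auxiliary feature take down the system's core function is a poor trade.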
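To make the "separate thread" idea concrete, here is a minimal sketch of starting an auxiliary component in the background so that a slow or blocked start cannot stall core initialization. This is not Hadoop's code: BackgroundStart, startInBackground and the simulated slowHttpStart task are all hypothetical stand-ins.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration: start an auxiliary service off the main thread.
public class BackgroundStart {

    /** Starts `task` on a named daemon thread and returns immediately. */
    public static Thread startInBackground(String name, Runnable task) {
        Thread t = new Thread(task, name);
        t.setDaemon(true); // the auxiliary service must not keep the JVM alive on its own
        t.start();
        return t;
    }

    public static void main(String[] args) throws Exception {
        final CountDownLatch httpReady = new CountDownLatch(1);

        // Stand-in for an embedded web server whose start is slow (or, as in this post, blocked).
        Runnable slowHttpStart = () -> {
            try {
                Thread.sleep(500);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            httpReady.countDown();
        };

        startInBackground("http-server-start", slowHttpStart);

        // Core initialization continues without waiting for the web UI.
        System.out.println("core service initialized; http ready yet: " + (httpReady.getCount() == 0));

        // Anything that really needs the web UI can wait for it with a timeout.
        boolean ready = httpReady.await(5, TimeUnit.SECONDS);
        System.out.println("http ready after waiting: " + ready);
    }
}
```

The trade-off is that a failed background start no longer stops the process by itself, so it has to be logged or surfaced some other way.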
-EOF-