2012年11月28日 出现故障," Unable to get data of znode /hbase/root-region-server"
问题比较诡异,两个机房,只有一个机房故障,5台服务器相续故障,错误日志相同。使用的HBase客户端版本为0.94.0
1)分析步骤:
1 jstack jmap 查看是否有死锁、block或内存溢出
jmap 看内存回收状况没有什么异常,内存和CPU占用都不多
jstack pid > test.log
pid: Unable to open socket file: target process not responding or HotSpot VM not loaded
The -F option can be used when the target process is not responding
出现这种错误,使用-F参数
jstack -l -F pid >jstack.log
Found one Java-level deadlock:
=============================
"catalina-exec-800":
waiting to lock monitor 0x000000005f1f6530 (object 0x0000000731902200, a java.lang.Object),
which is held by "catalina-exec-710"
"catalina-exec-710":
waiting to lock monitor 0x00002aaab9a05bd0 (object 0x00000007321f8708, a org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation),
which is held by "catalina-exec-29-EventThread"
"catalina-exec-29-EventThread":
waiting to lock monitor 0x000000005f9f0af0 (object 0x0000000732a9c7e0, a org.apache.hadoop.hbase.zookeeper.RootRegionTracker),
which is held by "catalina-exec-710"
Java stack information for the threads listed above:
===================================================
"catalina-exec-800":
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:943)
- waiting to lock <0x0000000731902200> (a java.lang.Object)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:836)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:807)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:725)
at org.apache.hadoop.hbase.client.ServerCallable.connect(ServerCallable.java:82)
at org.apache.hadoop.hbase.client.ServerCallable.withRetries(ServerCallable.java:162)
at org.apache.hadoop.hbase.client.HTable.get(HTable.java:685)
at org.apache.hadoop.hbase.client.HTablePool$PooledHTable.get(HTablePool.java:366)
at com.weibo.api.commons.hbase.CustomHBase.get(CustomHBase.java:171)
at com.weibo.api.commons.hbase.CustomHBase.get(CustomHBase.java:160)
at com.weibo.api.commons.hbase.CustomHBase.get(CustomHBase.java:150)
"catalina-exec-710":
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.resetZooKeeperTrackers(HConnectionManager.java:599)
- waiting to lock <0x00000007321f8708> (a org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.abort(HConnectionManager.java:1660)
at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.getData(ZooKeeperNodeTracker.java:158)
- locked <0x0000000732a9c7e0> (a org.apache.hadoop.hbase.zookeeper.RootRegionTracker)
at org.apache.hadoop.hbase.zookeeper.RootRegionTracker.getRootRegionLocation(RootRegionTracker.java:62)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:821)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:801)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:933)
......
at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:123)
at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:99)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.prefetchRegionCache(HConnectionManager.java:894)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:948)
- locked <0x0000000731902200> (a java.lang.Object)
"catalina-exec-29-EventThread":
at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.stop(ZooKeeperNodeTracker.java:98)
- waiting to lock <0x0000000732a9c7e0> (a org.apache.hadoop.hbase.zookeeper.RootRegionTracker)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.resetZooKeeperTrackers(HConnectionManager.java:604)
- locked <0x00000007321f8708> (a org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.abort(HConnectionManager.java:1660)
at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:374)
at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:271)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:521)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:497)
Found 1 deadlock.
warn.log中报告Interrupted异常 是由上述死锁引起的
2012-11-28 20:06:17 [WARN] hconnection-0x4384d0a47f41c63 Unable to get data of znode /hbase/root-region-server
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:485)
at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1253)
at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1129)
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:264)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataInternal(ZKUtil.java:522)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:498)
at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.getData(ZooKeeperNodeTracker.java:156)
at org.apache.hadoop.hbase.zookeeper.RootRegionTracker.getRootRegionLocation(RootRegionTracker.java:62)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:821)
......
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:807)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1042)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:836)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:807)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:725)
at org.apache.hadoop.hbase.client.ServerCallable.connect(ServerCallable.java:82)
at org.apache.hadoop.hbase.client.ServerCallable.withRetries(ServerCallable.java:162)
at org.apache.hadoop.hbase.client.HTable.get(HTable.java:685)
at org.apache.hadoop.hbase.client.HTablePool$PooledHTable.get(HTablePool.java:366)
2012-11-28 20:06:17 [ERROR] [hbase_error]
java.io.IOException: Giving up after tries=1
at org.apache.hadoop.hbase.client.ServerCallable.withRetries(ServerCallable.java:192)
at org.apache.hadoop.hbase.client.HTable.get(HTable.java:685)
at org.apache.hadoop.hbase.client.HTablePool$PooledHTable.get(HTablePool.java:366)