[Experience Share] HBase client deadlock when fetching root-region-server from ZooKeeper (zookeeper.ClientCnxn Unable to get data of znode /hbase/root-region-server)

Posted 2015-9-6 09:12:54
  On 2012-11-28 a failure occurred: "Unable to get data of znode /hbase/root-region-server".
  The problem was odd: of our two data centers only one failed, with 5 servers failing one after another, all with identical error logs. The HBase client version in use was 0.94.0.
  1) Analysis steps
  1. Use jstack and jmap to check for deadlocks, blocked threads, or memory leaks
  jmap showed nothing unusual in GC behavior; neither memory nor CPU usage was high.

jstack pid > test.log
pid: Unable to open socket file: target process not responding or HotSpot VM not loaded
The -F option can be used when the target process is not responding

When this error appears, retry with -F:

jstack -l -F pid > jstack.log

or, when the process responds normally:

jstack -l pid > jstack.log

In our case jstack.log contained nothing useful; the relevant output was in catalina.out. Note: take the service out of rotation before running jstack -F, since it can easily leave Tomcat hung.
catalina.out revealed a Java-level deadlock:




Found one Java-level deadlock:
=============================
"catalina-exec-800":
waiting to lock monitor 0x000000005f1f6530 (object 0x0000000731902200, a java.lang.Object),
which is held by "catalina-exec-710"
"catalina-exec-710":
waiting to lock monitor 0x00002aaab9a05bd0 (object 0x00000007321f8708, a org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation),
which is held by "catalina-exec-29-EventThread"
"catalina-exec-29-EventThread":
waiting to lock monitor 0x000000005f9f0af0 (object 0x0000000732a9c7e0, a org.apache.hadoop.hbase.zookeeper.RootRegionTracker),
which is held by "catalina-exec-710"
Java stack information for the threads listed above:
===================================================
"catalina-exec-800":
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:943)
- waiting to lock <0x0000000731902200> (a java.lang.Object)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:836)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:807)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:725)
at org.apache.hadoop.hbase.client.ServerCallable.connect(ServerCallable.java:82)
at org.apache.hadoop.hbase.client.ServerCallable.withRetries(ServerCallable.java:162)
at org.apache.hadoop.hbase.client.HTable.get(HTable.java:685)
at org.apache.hadoop.hbase.client.HTablePool$PooledHTable.get(HTablePool.java:366)
at com.weibo.api.commons.hbase.CustomHBase.get(CustomHBase.java:171)
at com.weibo.api.commons.hbase.CustomHBase.get(CustomHBase.java:160)
at com.weibo.api.commons.hbase.CustomHBase.get(CustomHBase.java:150)
"catalina-exec-710":
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.resetZooKeeperTrackers(HConnectionManager.java:599)
- waiting to lock <0x00000007321f8708> (a org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.abort(HConnectionManager.java:1660)
at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.getData(ZooKeeperNodeTracker.java:158)
- locked <0x0000000732a9c7e0> (a org.apache.hadoop.hbase.zookeeper.RootRegionTracker)
at org.apache.hadoop.hbase.zookeeper.RootRegionTracker.getRootRegionLocation(RootRegionTracker.java:62)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:821)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:801)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:933)
......
at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:123)
at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:99)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.prefetchRegionCache(HConnectionManager.java:894)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:948)
- locked <0x0000000731902200> (a java.lang.Object)
"catalina-exec-29-EventThread":
at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.stop(ZooKeeperNodeTracker.java:98)
- waiting to lock <0x0000000732a9c7e0> (a org.apache.hadoop.hbase.zookeeper.RootRegionTracker)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.resetZooKeeperTrackers(HConnectionManager.java:604)
- locked <0x00000007321f8708> (a org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.abort(HConnectionManager.java:1660)
at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:374)
at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:271)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:521)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:497)
Found 1 deadlock.
  The Interrupted exceptions reported in warn.log were caused by the deadlock above:

2012-11-28 20:06:17 [WARN] hconnection-0x4384d0a47f41c63 Unable to get data of znode /hbase/root-region-server
java.lang.InterruptedException
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:485)
        at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1253)
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1129)
        at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:264)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataInternal(ZKUtil.java:522)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:498)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.getData(ZooKeeperNodeTracker.java:156)
        at org.apache.hadoop.hbase.zookeeper.RootRegionTracker.getRootRegionLocation(RootRegionTracker.java:62)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:821)
        ......
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:807)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1042)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:836)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:807)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:725)
        at org.apache.hadoop.hbase.client.ServerCallable.connect(ServerCallable.java:82)
        at org.apache.hadoop.hbase.client.ServerCallable.withRetries(ServerCallable.java:162)
        at org.apache.hadoop.hbase.client.HTable.get(HTable.java:685)
        at org.apache.hadoop.hbase.client.HTablePool$PooledHTable.get(HTablePool.java:366)

2012-11-28 20:06:17 [ERROR] [hbase_error]
java.io.IOException: Giving up after tries=1
        at org.apache.hadoop.hbase.client.ServerCallable.withRetries(ServerCallable.java:192)
        at org.apache.hadoop.hbase.client.HTable.get(HTable.java:685)
        at org.apache.hadoop.hbase.client.HTablePool$PooledHTable.get(HTablePool.java:366)
Caused by: java.lang.InterruptedException: sleep interrupted
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.hbase.client.ServerCallable.withRetries(ServerCallable.java:189)
        ... 13 more

https://issues.apache.org/jira/browse/HBASE-5060 ("HBase client is blocked forever") looks similar to this problem, but did not resolve it.
  2. Start a separate thread in the same process to read root-region-server



// check if zk is ok
ZooKeeper zk = null;
Watcher watch = new Watcher() {
    public void process(WatchedEvent event) {
    }
};
String zookeeperQuorum = wbobjectHBaseConfMap.get("hbase.zookeeper.quorum");
if (StringUtils.isNotBlank(zookeeperQuorum)) {
    try {
        zk = new ZooKeeper(zookeeperQuorum, 30000, watch);
        byte[] data = zk.getData("/hbase/root-region-server", watch, null);
        ApiLogger.info(" get root-region-server success! ip:" + Util.toStr(data));
    } catch (Exception e) {
        ApiLogger.error(" get root-region-server error!" + e.getMessage());
    } finally {
        try {
            if (zk != null) {  // guard against NPE when the ZooKeeper constructor itself failed
                zk.close();
            }
        } catch (InterruptedException e) {
            ApiLogger.error("close zk error!");
        }
    }
}

  This standalone thread kept working even while the whole process was deadlocked; once HBase's own ZooKeeper instance deadlocked it never recovered, so scan operations timed out at 30s+ (they normally finish in milliseconds) and never came back.
  We therefore concluded the problem lay in how the HBase client connects to ZooKeeper. The sequence:
  network jitter or a move of the ROOT region causes a region-cache miss; the client re-reads root-region-server, the read fails, and the client resets its ZooKeeper trackers.
  Studying the deadlocked code:




When HConnectionImplementation sees the get of root-region-server fail, it has already locked the RootRegionTracker and now tries to lock itself to run resetZooKeeperTrackers, which calls RootRegionTracker.stop to reset ZooKeeper.

Meanwhile, another thread already inside resetZooKeeperTrackers has locked the HConnectionImplementation and, to execute stop (a synchronized method), also needs the RootRegionTracker's monitor, which the first thread holds.

In the HConnectionImplementation code:

this.rootRegionTracker = new RootRegionTracker(this.zooKeeper, this);

and there is no timeout or cancel on these calls, so the two resources end up in a circular wait: a classic deadlock.

2) Reproducing the deadlock

The following small program simulates the deadlock in simplified form:


import java.util.concurrent.TimeUnit;

// Callback interface standing in for HBase's Abortable (it was not defined in the original post)
interface AbortAble {
    void resetZooKeeperTrackers();
}

class ZooKeeperNodeTracker {
    private boolean stopped = false;
    private AbortAble hConnectionImplementation;

    public ZooKeeperNodeTracker(AbortAble hConnectionImplementation) {
        this.hConnectionImplementation = hConnectionImplementation;
    }

    public synchronized void stop() throws InterruptedException {
        this.stopped = true;
        System.out.println(Thread.currentThread() + "|" + Thread.currentThread().getId() + " stop zknode");
        TimeUnit.MICROSECONDS.sleep(100);
        notifyAll();
    }

    public boolean condition() {
        return stopped;
    }

    public boolean start() {
        stopped = false;
        return true;
    }

    public synchronized boolean getData(int i) throws InterruptedException {
        // simulate an error while getting the root region server
        if (i % 100 == 0) {
            hConnectionImplementation.resetZooKeeperTrackers();
            throw new InterruptedException("interrupted");
        }
        return true;
    }
}

public class HConnectionManagerTest {
    static class HConnectionImplementation implements AbortAble {
        private volatile ZooKeeperNodeTracker rootRegionTracker;

        public HConnectionImplementation() {
            rootRegionTracker = new ZooKeeperNodeTracker(this);
        }

        @Override
        public synchronized void resetZooKeeperTrackers() {
            try {
                if (rootRegionTracker != null) {
                    rootRegionTracker.stop();
                    rootRegionTracker = null;
                    System.out.println(Thread.currentThread() + "|" + Thread.currentThread().getId() + " resetZooKeeperTrackers");
                }
            } catch (InterruptedException e) {
                System.out.println(Thread.currentThread() + "----------resetZooKeeperTrackers Interrupted-----------");
            }
        }

        public void testGetData(String name) {
            int i = 1;
            while (i > 0) {  // loop "forever", failing every 100th read
                i++;
                try {
                    rootRegionTracker.getData(i);
                } catch (Exception e) {
                    resetZooKeeperTrackers();
                }
                if (i % 100 == 0) {
                    rootRegionTracker = new ZooKeeperNodeTracker(this);
                    System.out.println(name + " restart test");
                }
            }
        }
    }

    public static void main(String[] args) {
        final HConnectionImplementation hcon = new HConnectionImplementation();
        Thread scan1 = new Thread(new Runnable() {
            public void run() {
                hcon.testGetData("[scan1]");
            }
        });
        Thread scan2 = new Thread(new Runnable() {
            public void run() {
                hcon.testGetData("[scan2]");
            }
        });
        try {
            scan1.start();
            scan2.start();
            TimeUnit.SECONDS.sleep(2);
        } catch (InterruptedException e) {
            System.out.println("----------testgetdata -------interrupt");
        }
    }
}
  The class and constructor names mimic the HBase client, with the getData failure case amplified. When two scans run concurrently, the first scan fails to read root-region-server and enters ZooKeeperNodeTracker.stop(); meanwhile the second scan enters resetZooKeeperTrackers() on the HConnectionImplementation. Each thread then holds one of the two monitors (the ZooKeeperNodeTracker and the HConnectionImplementation) while waiting for the other's, which is the deadlock. The simulated deadlock can be removed by changing

public synchronized boolean getData

to

public boolean getData

or by changing

public synchronized void resetZooKeeperTrackers() {
    try {
        if (rootRegionTracker != null) {
            rootRegionTracker.stop();
            ...

to

public void resetZooKeeperTrackers() {
    try {
        if (rootRegionTracker != null) {
            synchronized (rootRegionTracker) {
                rootRegionTracker.stop();
                ...

Either change breaks the circular wait between the two monitors and resolves the deadlock.
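To check that narrowing the lock really removes the cycle, here is a self-contained sketch of the second fix; the class and method names (Tracker, Conn, reset) are simplified stand-ins for the simulation's classes, not HBase's real API. Because the reset path synchronizes only on the tracker, every thread acquires monitors in a single direction and a bounded two-thread run always completes:

```java
import java.util.concurrent.TimeUnit;

public class DeadlockFixDemo {
    interface Resettable { void reset(); }

    static class Tracker {
        private final Resettable owner;
        Tracker(Resettable owner) { this.owner = owner; }

        synchronized void stop() throws InterruptedException {
            TimeUnit.MICROSECONDS.sleep(100);
        }

        synchronized void getData(int i) throws InterruptedException {
            if (i % 100 == 0) {  // simulated "cannot read root-region-server"
                owner.reset();   // re-entrant: this thread already holds this tracker's monitor
                throw new InterruptedException("interrupted");
            }
        }
    }

    static class Conn implements Resettable {
        volatile Tracker tracker = new Tracker(this);

        // Fixed version: not `synchronized` on the connection itself; we lock only
        // the tracker, so no thread ever holds the connection monitor while waiting
        // for the tracker monitor, and the circular wait cannot form.
        public void reset() {
            Tracker t = tracker;
            if (t != null) {
                synchronized (t) {
                    try { t.stop(); } catch (InterruptedException ignored) { }
                }
                tracker = null;
            }
        }

        void loop(int rounds) {
            for (int i = 1; i <= rounds; i++) {
                try {
                    Tracker t = tracker;
                    if (t != null) t.getData(i);
                } catch (Exception e) {
                    reset();
                }
                if (tracker == null) tracker = new Tracker(this);
            }
        }
    }

    // Returns true when both scan threads finish, i.e. no deadlock occurred.
    static boolean runBounded() throws InterruptedException {
        Conn c = new Conn();
        Thread s1 = new Thread(() -> c.loop(10_000));
        Thread s2 = new Thread(() -> c.loop(10_000));
        s1.start();
        s2.start();
        s1.join(10_000);
        s2.join(10_000);
        return !s1.isAlive() && !s2.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runBounded() ? "completed without deadlock" : "possible deadlock");
    }
}
```

Running the same bounded loops with the original two-monitor locking hangs within a couple of reset cycles, which is exactly the behavior catalina.out captured.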
  3) Final solution
  We asked HBase committer Ted Yu in person at the Hadoop conference; he said 0.94.0 is unstable with many bugs and recommended upgrading to 0.94.2. Checking the release notes (after locating the problem spots in the HBase source and reading the corresponding svn change history), the two official patches below were already included in 0.94.1:
1. Fix RPC threads hanging by avoiding nested retry loops: https://issues.apache.org/jira/browse/HBASE-6326
2. Avoid the deadlock by waiting until the -ROOT- region's address has been set in the root region tracker: https://issues.apache.org/jira/browse/HBASE-6115
  

Original thread: https://www.yunweiku.com/thread-109959-1-1.html