
[Experience Sharing] An HBase outage caused by a ZooKeeper disturbance


Posted on 2017-4-19 10:17:41
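When operating HBase, the biggest problem, apart from HBase's own bugs, is network latency. Such latency makes Hadoop append operations time out. That would be a minor matter in itself, but it causes HBase to exit because it can no longer append to the WAL.
This time, however, the problem was in ZooKeeper.
Our cluster has three ZooKeeper servers. First, the connection between the leader (A) and one of the followers, B (xx.xx.xx.85), failed, and follower B then proceeded to exit.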

2011-08-01 03:28:30,013 [LearnerHandler-/xx.xx.xx.85:48270] ERROR org.apache.zookeeper.server.quorum.LearnerHandler: Unexpected exception causing shutdown while sock still open
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
at java.io.DataInputStream.readInt(DataInputStream.java:370)
at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:84)
at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:375)
2011-08-01 03:28:30,013 [LearnerHandler-/xx.xx.xx.85:48270] WARN org.apache.zookeeper.server.quorum.LearnerHandler: ******* GOODBYE /xx.xx.xx.85:48270 ********


B tried to exit, but the exit failed. A large number of session connections were closed.
After that, follower C also hit an exception.

2011-08-01 03:29:38,562 [CommitProcessor:0] ERROR org.apache.zookeeper.server.NIOServerCnxn: Unexpected Exception:
java.nio.channels.CancelledKeyException
at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59)
at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:148)
at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1043)
at org.apache.zookeeper.server.NIOServerCnxn.process(NIOServerCnxn.java:1080)
at org.apache.zookeeper.server.DataTree.setWatches(DataTree.java:1154)
at org.apache.zookeeper.server.ZKDatabase.setWatches(ZKDatabase.java:383)
at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:297)
at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)


Throughout this process, the sessions between ZooKeeper and HBase were all broken, which caused the master to hit a fatal error and exit.

2011-08-01 03:29:38,953 [main-EventThread] FATAL org.apache.hadoop.hbase.master.HMaster: Unexpected zk exception getting RS nodes
org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /SPN-hbase/rs
at org.apache.zookeeper.KeeperException.create(KeeperException.java:104)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1468)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenAndWatchForNewChildren(ZKUtil.java:307)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.watchAndGetNewChildren(ZKUtil.java:418)
at org.apache.hadoop.hbase.zookeeper.RegionServerTracker.nodeChildrenChanged(RegionServerTracker.java:86)
at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:315)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:560)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:536)
2011-08-01 03:29:38,953 [main-EventThread] INFO org.apache.hadoop.hbase.master.HMaster: Aborting
2011-08-01 03:29:38,954 [main-EventThread] WARN org.apache.hadoop.hbase.zookeeper.ZKUtil: master:8100-0x230684e82d6738d-0x230684e82d6738d Unable to list children of znode /SPN-hbase/tokenauth/keys
org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /SPN-hbase/tokenauth/keys
at org.apache.zookeeper.KeeperException.create(KeeperException.java:104)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1468)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenAndWatchForNewChildren(ZKUtil.java:307)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.watchAndGetNewChildren(ZKUtil.java:418)
at org.apache.hadoop.hbase.security.token.ZKSecretWatcher.nodeChildrenChanged(ZKSecretWatcher.java:116)
at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:315)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:560)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:536)
2011-08-01 03:29:38,954 [main-EventThread] ERROR org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: master:8100-0x230684e82d6738d-0x230684e82d6738d Received unexpected KeeperException, re-throwing exception
org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /SPN-hbase/tokenauth/keys
at org.apache.zookeeper.KeeperException.create(KeeperException.java:104)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1468)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenAndWatchForNewChildren(ZKUtil.java:307)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.watchAndGetNewChildren(ZKUtil.java:418)
at org.apache.hadoop.hbase.security.token.ZKSecretWatcher.nodeChildrenChanged(ZKSecretWatcher.java:116)
at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:315)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:560)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:536)
2011-08-01 03:29:38,954 [main-EventThread] ERROR org.apache.hadoop.hbase.security.token.ZKSecretWatcher: Error reading data from zookeeper
org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /SPN-hbase/tokenauth/keys
at org.apache.zookeeper.KeeperException.create(KeeperException.java:104)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1468)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenAndWatchForNewChildren(ZKUtil.java:307)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.watchAndGetNewChildren(ZKUtil.java:418)
at org.apache.hadoop.hbase.security.token.ZKSecretWatcher.nodeChildrenChanged(ZKSecretWatcher.java:116)
at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:315)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:560)
at


We also had a backup master, but because of the ZooKeeper problem it could not work properly either.
After that, a large number of regionservers went down.

2011-08-01 03:29:38,565 [ZKSecretWatcher-leaderElector] INFO org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Unexpected error from ZK, stopping candidate
2011-08-01 03:29:38,565 [ZKSecretWatcher-leaderElector] INFO org.apache.hadoop.hbase.security.token.AuthenticationTokenSecretManager: Stopping leader election, because: Unexpected error from ZK: KeeperErrorCode = InvalidACL for /SPN-hbase/tokenauth/keymaster

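Throughout this incident we saw how a single ZooKeeper anomaly dealt a fatal blow to HBase.
For now, all we can do is add a watchdog on the regionservers and the ZooKeeper servers that quickly restarts any server that goes down, so that this kind of outage is avoided.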
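The watchdog idea can be sketched roughly as follows. This is not the script we run in production, only a minimal illustration. It assumes that ZooKeeper's four-letter command `ruok` is reachable on the client port (newer ZooKeeper releases may require enabling it via `4lw.commands.whitelist`). The function names, the polling interval, and the restart commands are all hypothetical.

```python
import socket
import subprocess
import time
from typing import Callable, Dict, List


def zk_is_ok(host: str, port: int = 2181, timeout: float = 3.0) -> bool:
    """Send ZooKeeper's 'ruok' command; a healthy server answers 'imok'."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            return sock.recv(16) == b"imok"
    except OSError:
        # Refused connections and timeouts both count as "not healthy".
        return False


def watchdog_pass(checks: Dict[str, Callable[[], bool]],
                  restart: Callable[[str], None]) -> List[str]:
    """Run every health check once and restart the services that fail."""
    restarted = []
    for name, is_healthy in checks.items():
        if not is_healthy():
            restart(name)
            restarted.append(name)
    return restarted


def restart_by_command(commands: Dict[str, List[str]]) -> Callable[[str], None]:
    """Build a restart callback that runs one shell command per service."""
    def restart(name: str) -> None:
        subprocess.run(commands[name], check=False)
    return restart


def run_forever(checks: Dict[str, Callable[[], bool]],
                restart: Callable[[str], None],
                interval_s: float = 10.0) -> None:
    """Poll the health checks at a fixed interval (assumed value)."""
    while True:
        watchdog_pass(checks, restart)
        time.sleep(interval_s)


# Example wiring; the install path below is an assumption:
# checks = {"zookeeper": lambda: zk_is_ok("127.0.0.1")}
# restart = restart_by_command(
#     {"zookeeper": ["/opt/zookeeper/bin/zkServer.sh", "restart"]})
# run_forever(checks, restart)
```

A regionserver check could be added to `checks` in the same way, for example by testing whether its process is still alive. Keeping the check and restart callbacks separate from the loop makes a single pass easy to dry-run before letting it restart anything.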
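The HBase community is also aware of this problem:
https://issues.apache.org/jira/browse/HBASE-3065
That issue tries to keep HBase running as far as possible during a ZooKeeper disturbance by adding more retries.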
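To illustrate what "more retries" means in general, here is a generic retry-with-backoff sketch. It is not the HBase implementation from that issue. `TransientZkError` is a hypothetical stand-in for a retryable error such as a connection loss, and the attempt count and delays are assumed values.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


class TransientZkError(Exception):
    """Hypothetical stand-in for a retryable ZooKeeper error."""


def retry_with_backoff(op: Callable[[], T],
                       retryable: Tuple[Type[BaseException], ...] = (TransientZkError,),
                       max_attempts: int = 4,
                       base_delay_s: float = 0.5,
                       max_delay_s: float = 10.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call op(); on a retryable error, wait and try again with a growing delay."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except retryable:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller decide what to do
            # Exponential backoff capped at max_delay_s, plus a little jitter
            # so that many clients do not retry at the same instant.
            delay = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 10))
    raise AssertionError("unreachable")
```

Errors that are not listed in `retryable` propagate immediately, so only failures that are expected to clear up on their own get retried.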
