thinkhk posted on 2019-1-8 06:18:51

A ZooKeeper Migration Failure

      A while ago a colleague migrated our ZooKeeper ensemble (the one used by Kafka and Flume). Afterwards, every Flume producer and consumer started logging the same errors:
  
  
07 Jan 2014 01:19:32,571 INFO (org.apache.zookeeper.ClientCnxn$SendThread.startConnect:1058)- Opening socket connection to server xxx:2181
07 Jan 2014 01:19:32,572 INFO (org.apache.zookeeper.ClientCnxn$SendThread.primeConnection:947)- Socket connection established to xxx:2181, initiating session
07 Jan 2014 01:19:32,573 INFO (org.apache.zookeeper.ClientCnxn$SendThread.run:1183)- Unable to read additional data from server sessionid 0x142f42b91871911, likely server has closed socket, closing socket connection and attempting reconnect
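07 Jan 2014 01:19:32,845 INFO (org.apache.zookeeper.ClientCnxn$SendThread.startConnect:1058)- Opening socket connection to server xxx:2181

      The clients kept connecting, failing, and retrying. I asked my colleague, who said network connectivity had been verified long before. The server-side log then showed: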

  
  
  
2014-01-06 23:59:59,987 - INFO - Accepted socket connection from /xxx:45282
2014-01-06 23:59:59,987 - WARN - Connection request from old client xxx:45282; will be dropped if server is in r-o mode
2014-01-06 23:59:59,987 - INFO - Refusing session request for client xxx:45282 as it has seen zxid 0x60fd15564 our last zxid is 0x10000000f client must try another server
2014-01-06 23:59:59,987 - INFO - Closed socket connection for client xxx:45282 (no session established for client)
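2014-01-06 23:59:59,989 - INFO - Accepted socket connection from xxx:45285

      Flume was still holding the zxid it had last seen on the old cluster (0x60fd15564), while the new cluster's zxid had started over from nearly zero (0x10000000f in the log above), so the server rejected the session. This is the check in the ZooKeeper server code that produces the message: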

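if (connReq.getLastZxidSeen() > zkDb.dataTree.lastProcessedZxid) {
    // The client has seen a newer zxid than this server has processed: refuse the session.
    String msg = "Refusing session request for client "
        + cnxn.getRemoteSocketAddress()
        + " as it has seen zxid 0x"
        + Long.toHexString(connReq.getLastZxidSeen())
        + " our last zxid is 0x"
        + Long.toHexString(getZKDatabase().getDataTreeLastProcessedZxid())
        + " client must try another server";
    LOG.info(msg);
    throw new CloseRequestException(msg);
}

      Later I asked my colleague how the migration had been done. He started a completely new cluster, shut down the old one, and ran an haproxy on one of the old cluster's IPs (port 2181) that forwarded to the new cluster, assuming this would make the migration transparent =.=. In fact it triggered ZooKeeper bug 832 (ZOOKEEPER-832), which makes clients retry the connection endlessly. Only restarting Flume fixes it.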
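      To make the rejection concrete, here is a minimal sketch that replays the comparison with the two zxid values from the server log. It assumes the usual zxid layout (high 32 bits are the leader epoch, low 32 bits are a counter within that epoch). The class and method names are made up for illustration and are not part of ZooKeeper.

```java
// Illustrative only: ZxidCheckDemo and its methods are hypothetical names,
// not ZooKeeper APIs. They mirror the comparison in the server snippet.
class ZxidCheckDemo {
    // High 32 bits of a zxid: the leader epoch.
    static long epoch(long zxid) {
        return zxid >>> 32;
    }

    // Low 32 bits of a zxid: the transaction counter within the epoch.
    static long counter(long zxid) {
        return zxid & 0xFFFFFFFFL;
    }

    // Same condition as the server check: refuse when the client has seen
    // a zxid newer than the last one this server has processed.
    static boolean serverRefuses(long clientLastZxidSeen, long serverLastProcessedZxid) {
        return clientLastZxidSeen > serverLastProcessedZxid;
    }

    public static void main(String[] args) {
        long clientZxid = 0x60fd15564L; // remembered by Flume from the old cluster
        long serverZxid = 0x10000000fL; // last zxid of the freshly started cluster

        System.out.printf("client: epoch=%d counter=0x%x%n", epoch(clientZxid), counter(clientZxid));
        System.out.printf("server: epoch=%d counter=0x%x%n", epoch(serverZxid), counter(serverZxid));
        System.out.println("refused: " + serverRefuses(clientZxid, serverZxid));
    }
}
```

      The client is at epoch 6 and the new cluster at epoch 1, so the comparison is true on every attempt. Because the haproxy leaves the client no other server to try, the reconnect loop never ends until the client process is restarted and forgets its old zxid.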

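      The correct way to migrate is to add the new nodes to the old cluster, then update the Flume configuration, wait a while (Flume reloads its configuration automatically), and only then shut down the old cluster. That way the problem is never triggered.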
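      As a rough sketch of the "update the configuration, then shut down the old cluster" step, the helper below builds a ZooKeeper connect string (comma-separated host:port pairs) for the new nodes and checks that a client's connect string no longer mentions any old node before the old cluster is stopped. The host names and the class itself are hypothetical. How the string gets into the Flume configuration depends on the Flume and Kafka plugin versions in use and is not shown.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative only: hypothetical helper, not part of Flume or ZooKeeper.
class MigrationConnectString {
    // Builds "host1:port,host2:port,..." from a list of host names.
    static String connectString(List<String> hosts, int port) {
        return hosts.stream()
                .map(h -> h + ":" + port)
                .collect(Collectors.joining(","));
    }

    // Returns true if any entry of the connect string points at one of the given hosts.
    static boolean referencesAny(String connectString, List<String> hosts) {
        for (String entry : connectString.split(",")) {
            String host = entry.trim().split(":")[0];
            if (hosts.contains(host)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Assumed host names, for illustration.
        List<String> oldHosts = Arrays.asList("zk-old-1", "zk-old-2", "zk-old-3");
        List<String> newHosts = Arrays.asList("zk-new-1", "zk-new-2", "zk-new-3");

        String updated = connectString(newHosts, 2181);
        System.out.println("new connect string: " + updated);
        // Only stop the old nodes once no client configuration references them.
        System.out.println("still references old nodes: " + referencesAny(updated, oldHosts));
    }
}
```

      Because the new nodes joined the existing ensemble first, they already carry the current zxid, so a client that reconnects to them passes the server check instead of being refused.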



