【ZooKeeper Notes 29】 Fixing the bug where the ZooKeeper client logs the currently connected server address as null
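Please credit the source when reposting: @ni掌柜 nileader@gmail.com
Problem description
During several data-center disaster-recovery drills at our company, we often simulated the failure of an entire data center by cutting its network: machines inside the data center could still reach each other, but had no connectivity to the outside. During these drills we found a large number of log entries like the following in applications using the ZK client: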
An exception was thrown while closing send thread for session 0x for server null, unexpected error, closing socket connection and attempting
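In this log entry, the server address is printed as null (the part highlighted in red in the original post). This was puzzling: under normal circumstances this field should show the IP of the ZK machine deployed in the isolated data center, yet null appeared instead.
A detailed description is also available here: https://issues.apache.org/jira/browse/ZOOKEEPER-1480
Locating the problem
Looking at the ZooKeeper code in version 3.4.3 and earlier, the problem lies in the logic that prints this log entry: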
} catch (Throwable e) {
    if (closing) {
        if (LOG.isDebugEnabled()) {
            // closing so this is expected
            LOG.debug("An exception was thrown while closing send thread for session 0x"
                    + Long.toHexString(getSessionId())
                    + " : " + e.getMessage());
        }
        break;
    } else {
        // this is ugly, you have a better way speak up
        if (e instanceof SessionExpiredException) {
            LOG.info(e.getMessage() + ", closing socket connection");
        } else if (e instanceof SessionTimeoutException) {
            LOG.info(e.getMessage() + RETRY_CONN_MSG);
        } else if (e instanceof EndOfStreamException) {
            LOG.info(e.getMessage() + RETRY_CONN_MSG);
        } else if (e instanceof RWServerFoundException) {
            LOG.info(e.getMessage());
        } else {
            LOG.warn(
                    "Session 0x"
                    + Long.toHexString(getSessionId())
                    + " for server "
                    + clientCnxnSocket.getRemoteSocketAddress()
                    + ", unexpected error"
                    + RETRY_CONN_MSG, e);
        }
As you can see, the log statement obtains the address of the currently connected server through clientCnxnSocket.getRemoteSocketAddress(). Here is that method:
/**
 * Returns the address to which the socket is connected.
 * @return ip address of the remote side of the connection or null if not connected
 */
@Override
SocketAddress getRemoteSocketAddress() {
    // a lot could go wrong here, so rather than put in a bunch of code
    // to check for nulls all down the chain let's do it the simple
    // yet bulletproof way
    try {
        return ((SocketChannel) sockKey.channel()).socket()
                .getRemoteSocketAddress();
    } catch (NullPointerException e) {
        return null;
    }
}
This in turn delegates to getRemoteSocketAddress() of java.net.Socket in the JDK:
/**
 * Returns the address of the endpoint this socket is connected to, or
 * null if it is unconnected.
 * @return a SocketAddress representing the remote endpoint of this
 * socket, or null if it is not connected yet.
 * @see #getInetAddress()
 * @see #getPort()
 * @see #connect(SocketAddress, int)
 * @see #connect(SocketAddress)
 * @since 1.4
 */
public SocketAddress getRemoteSocketAddress() {
    if (!isConnected())
        return null;
    return new InetSocketAddress(getInetAddress(), getPort());
}
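So the problem can now essentially be pinned down. If the socket connection to the server is broken abnormally (for example, when the data center's network is cut during a disaster-recovery drill), then either a reference in the call chain is null and the NullPointerException branch is taken, or the socket is not connected and isConnected() returns false. In both cases getRemoteSocketAddress returns null, which is why null appears in the log.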
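The behaviour described above can be reproduced without ZooKeeper. The following is a minimal sketch using only the JDK; the class name UnconnectedSocketDemo and the helper remoteAddressOf are hypothetical names introduced for illustration, and the helper only imitates the null-tolerant style of the ZooKeeper method rather than reproducing it.

```java
import java.io.IOException;
import java.net.Socket;
import java.net.SocketAddress;

public class UnconnectedSocketDemo {

    /**
     * Hypothetical helper: returns the remote address of the socket,
     * or null if the socket reference is null or the socket is unconnected.
     */
    static SocketAddress remoteAddressOf(Socket socket) {
        try {
            return socket.getRemoteSocketAddress();
        } catch (NullPointerException e) {
            // No socket at all: same outcome as an unconnected socket.
            return null;
        }
    }

    public static void main(String[] args) throws IOException {
        // A socket that was created but never connected to any server.
        try (Socket neverConnected = new Socket()) {
            System.out.println("isConnected = " + neverConnected.isConnected()); // false
            System.out.println("remote = " + remoteAddressOf(neverConnected));   // null
        }
        System.out.println("remote (no socket) = " + remoteAddressOf(null));     // null
    }
}
```

Either path loses the information about which server the client was trying to reach, which is what the patch below works around.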
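Solution
This log output matters a great deal to developers: when troubleshooting, it tells you exactly which server had the problem at that moment. Once it prints null, there is nothing to go on. I made some improvements to ensure that the client can print the IP of the affected server when a problem occurs. The patch can be downloaded here: https://github.com/downloads/nileader/taokeeper/getCurrentZooKeeperAddr_for_3.4.3.patch
First, add two methods to the org.apache.zookeeper.client.HostProvider interface: one returns the index of the address currently in use within the address list, and the other returns the full address list. For how the ZooKeeper client obtains and shuffles its address list, see the article 《ZooKeeper客户端地址列表的随机原理》 (on how the ZooKeeper client randomizes its address list).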
public interface HostProvider {
    …… ……
    /**
     * Get the index of the server address that is currently connecting or connected.
     * @see ZOOKEEPER-1480:https://issues.apache.org/jira/browse/ZOOKEEPER-1480
     */
    public int getCurrentIndex();
    /**
     * Get all server addresses configured for this ZooKeeper client.
     * @return List<InetSocketAddress>
     * @see ZOOKEEPER-1480:https://issues.apache.org/jira/browse/ZOOKEEPER-1480
     */
    public List<InetSocketAddress> getAllServerAddress();

}
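To make the two new methods concrete, here is a self-contained sketch of how a host provider might back them. The class IndexedHostProviderSketch is hypothetical and is not the ZooKeeper StaticHostProvider: it assumes a simple round-robin next() without shuffling, and it assumes the index is -1 until the first address has been handed out.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class IndexedHostProviderSketch {
    private final List<InetSocketAddress> serverAddresses;
    // Assumption for this sketch: -1 means "no address handed out yet".
    private int currentIndex = -1;

    public IndexedHostProviderSketch(List<InetSocketAddress> addresses) {
        if (addresses == null || addresses.isEmpty()) {
            throw new IllegalArgumentException("at least one server address is required");
        }
        // Defensive copy so callers cannot change the list afterwards.
        this.serverAddresses = Collections.unmodifiableList(new ArrayList<>(addresses));
    }

    /** Hands out the next address in round-robin order and remembers its index. */
    public synchronized InetSocketAddress next() {
        currentIndex = (currentIndex + 1) % serverAddresses.size();
        return serverAddresses.get(currentIndex);
    }

    /** Index of the address most recently returned by next(), or -1. */
    public synchronized int getCurrentIndex() {
        return currentIndex;
    }

    /** All configured server addresses, in the order they are tried. */
    public List<InetSocketAddress> getAllServerAddress() {
        return serverAddresses;
    }

    public static void main(String[] args) {
        IndexedHostProviderSketch provider = new IndexedHostProviderSketch(Arrays.asList(
                InetSocketAddress.createUnresolved("10.0.0.1", 2181),
                InetSocketAddress.createUnresolved("10.0.0.2", 2181)));
        provider.next();
        InetSocketAddress current =
                provider.getAllServerAddress().get(provider.getCurrentIndex());
        System.out.println(current.getHostString() + ":" + current.getPort());
    }
}
```

Because the index is recorded when the address is handed out, it stays available even after the socket itself has been torn down.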
Next, modify the log output logic in the org.apache.zookeeper.ClientCnxn class:
/**
 * Get the ZooKeeper address that the client is currently connected or connecting to.
 * Note: this method returns null if the host address cannot be determined.
 */
private InetSocketAddress getCurrentZooKeeperAddr() {
    try {
        InetSocketAddress addr = null;
        if (null == hostProvider || null == hostProvider.getAllServerAddress())
            return addr;
        int index = hostProvider.getCurrentIndex();
        if (index >= 0) {
            addr = hostProvider.getAllServerAddress().get(index);
        }
        return addr;
    } catch (Exception e) {
        // e.g. an out-of-range index: fall back to null rather than fail while logging
        return null;
    }
}
…… ……
    // get the current ZK host for logging
    InetSocketAddress addr = getCurrentZooKeeperAddr();

    LOG.warn(
            "Session 0x"
            + Long.toHexString(getSessionId())
            + " for server ip: " + addr + ", detail conn: "
            + clientCnxnSocket.getRemoteSocketAddress()
            + ", unexpected error"
            + RETRY_CONN_MSG, e);
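As a standalone illustration of the idea behind the patched log call, the sketch below builds a similar message from a configured address list plus a possibly-null live socket address. Everything here is hypothetical (the class ServerAddressLogSketch, the methods currentAddress and describe, and the "unknown" placeholder); it also uses an explicit bounds check where the patch relies on a catch-all, which is a design variation and not the author's code.

```java
import java.net.InetSocketAddress;
import java.net.SocketAddress;
import java.util.Arrays;
import java.util.List;

public class ServerAddressLogSketch {

    /** Returns the configured address at index, or null if it cannot be determined. */
    static InetSocketAddress currentAddress(List<InetSocketAddress> all, int index) {
        if (all == null || index < 0 || index >= all.size()) {
            return null;
        }
        return all.get(index);
    }

    /** Builds a log message that still names the server when the live address is null. */
    static String describe(long sessionId, InetSocketAddress configured, SocketAddress live) {
        String configuredText = (configured == null)
                ? "unknown"
                : configured.getHostString() + ":" + configured.getPort();
        return "Session 0x" + Long.toHexString(sessionId)
                + " for server ip: " + configuredText
                + ", detail conn: " + live
                + ", unexpected error";
    }

    public static void main(String[] args) {
        List<InetSocketAddress> servers = Arrays.asList(
                InetSocketAddress.createUnresolved("10.0.0.1", 2181),
                InetSocketAddress.createUnresolved("10.0.0.2", 2181));
        // The live socket address is null, as in the disaster-recovery drill.
        System.out.println(describe(0x1234L, currentAddress(servers, 1), null));
        // prints: Session 0x1234 for server ip: 10.0.0.2:2181, detail conn: null, unexpected error
    }
}
```

Keeping both values in the message is useful: the configured address identifies the target server, while the live address shows whether a connection actually existed at the time.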
Attachment: http://down.运维网.com/data/2361716