This page details specific problems people have seen, solutions (where found) to those issues, and the types of steps taken to troubleshoot them. Feel free to update it with your experiences.
The HBase troubleshooting page also has useful insight.

Monitoring
It is important to monitor the ZK environment (hardware, network, processes, etc.) in order to more easily troubleshoot problems; without it you miss out on important information for determining the cause of a problem. What type of monitoring are you doing on your cluster? You can monitor at the host level -- cpu, memory, disk, network, etc. -- which will give you some insight on where to look. You can also monitor at the process level: the ZooKeeper server JMX interface will give you information about latencies and the like (you can also use the four letter words for that if you want to hack up some scripts instead of using JMX). JMX will also give you insight into the JVM's workings -- for example, you could confirm or rule out GC pauses causing the JVM's Java threads to hang for long periods of time (see below).
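If you want to script checks against the four letter words mentioned above, a minimal client is just a TCP socket. The sketch below sends one command (e.g. "ruok" or "stat") to a server and returns the text response; the host and port are assumptions for illustration (2181 is the conventional client port).

```python
import socket

def four_letter_word(host, port, cmd, timeout=5.0):
    """Send a ZooKeeper four-letter word (e.g. 'ruok', 'stat', 'mntr')
    over a raw TCP connection and return the server's text response."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            # The server writes its response and closes the connection.
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")

# Example (assumes a ZooKeeper server on localhost:2181):
# print(four_letter_word("localhost", 2181, "ruok"))
```

A healthy server replies "imok" to "ruok"; "stat" returns the latency figures discussed in this page, which makes this easy to wire into a cron-based monitor.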
Without monitoring, troubleshooting will be more difficult, but not impossible. JMX can be used through jconsole, or you can access the stats through the four letter words; the log4j log also contains much important and useful information.
Troubleshooting Checklist
The following can be a useful checklist when you are having issues with your ZK cluster, in particular if you are seeing large numbers of timeouts, session expirations, poor performance, or high operation latencies. Use the following on all servers, and potentially on clients as well:
- hdparm with the -t and -T options to test your disk IO
- time dd if=/dev/urandom bs=512000 of=/tmp/memtest count=1050
- time md5sum /tmp/memtest; time md5sum /tmp/memtest; time md5sum /tmp/memtest
- See ECC memory section below for more on this
- ethtool to check the configuration of your network
- ifconfig also to check network and examine error counts
- ZK uses TCP for network connectivity, errors on the NICs can cause poor performance
- scp/ftp/etc... can be used to verify connectivity, try copying large files between nodes
- these smoke and latency tests can be useful to verify a cluster
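The dd/md5sum memory check in the list above can also be scripted so it is easy to repeat on every server. The sketch below is a rough Python analogue (not the literal shell commands): it writes random data, then times three md5 passes over the file. The path and sizes are illustrative; what matters is comparing the timings across supposedly identical machines.

```python
import hashlib
import os
import time

def write_and_hash(path, size_bytes):
    """Rough analogue of the dd + repeated md5sum check above.
    Returns (write_seconds, [hash_seconds, hash_seconds, hash_seconds])."""
    start = time.monotonic()
    with open(path, "wb") as f:
        f.write(os.urandom(size_bytes))
        f.flush()
        os.fsync(f.fileno())  # force the data to disk, like dd to a real file
    write_s = time.monotonic() - start

    hash_times = []
    for _ in range(3):
        start = time.monotonic()
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        hash_times.append(time.monotonic() - start)
    return write_s, hash_times
```

Run it with the same size on each host (e.g. ~500MB, matching the dd command above) and compare: a single server that is consistently several times slower than its peers deserves a hardware look.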
Compare your results to some baselines
See the Latency Overview page for some latency baselines. You can also compare the cpu/disk/memory/etc. that you have available to what was used in that test.
A word or two about heartbeats
Keep in mind that the session timeout period is used by both the client and the server. If the ZK leader doesn't hear from the client within the timeout (say it's 5 seconds) it will expire the session. The client sends a ping after 1/3 of the timeout period has elapsed, and expects to hear a response before another 1/3 of the timeout elapses, after which it will attempt to re-sync to another server in the cluster. In the 5 second timeout case you are therefore allowing about 1.7 seconds for the request to go to the server, the server to respond back to the client, and the client to process the response. Check the latencies in ZK's JMX to get insight into this; if the server latency is high -- say because of IO issues, JVM swapping, VM latency, etc. -- that will cause the client sessions to time out.
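The 1/3-rule above can be made concrete with a little arithmetic. This helper (an illustration, not ZooKeeper client code) shows the budgets implied by a given session timeout:

```python
def heartbeat_budget(session_timeout_s):
    """Budgets implied by the 1/3-rule described above:
    - the client pings after 1/3 of the session timeout,
    - it expects a response within another 1/3 (the round-trip budget),
    - past 2/3 of the timeout it tries to re-sync to another server."""
    return {
        "ping_after": session_timeout_s / 3.0,
        "round_trip_budget": session_timeout_s / 3.0,
        "resync_after": 2.0 * session_timeout_s / 3.0,
    }

# With a 5 second timeout, the whole ping round trip must fit in ~1.67s:
# heartbeat_budget(5.0)
```

If the server-side latencies reported over JMX (or by "stat") approach the round-trip budget, disconnects and expirations are the expected outcome, not an anomaly.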
Frequent client disconnects & session expirations
ZooKeeper is a canary in a coal mine of sorts. Because of the heartbeating performed by the clients and servers, ZooKeeper-based applications are very sensitive to things like network and system latencies. We often see client disconnects and session expirations associated with these types of problems.
Take a look at this section to start.
Client disconnects due to client side swapping
This link specifically discusses the negative impact of swapping in the context of the server. However, swapping can be an issue for clients as well: it will delay, or potentially even stop for a significant period, the heartbeats from client to server, resulting in session expirations.
As told by a user:
"This issue is clearly linked to heavy utilization or swapping on the clients. I find that if I keep the clients from swapping, this error does not materialize."

As told by an HBase user:
"After looking ganglia history, it's clear that the nodes in question were starved of memory, swapping like crazy. The expired scanner lease, the region shutting down, and as you noted, the Zookeeper session expiry, were not a causal chain, but all the result of the machine grinding to a halt from swapping. The MapReduce tasks were allocated too much memory, and an apparent memory leak in the job we were running was causing the tasks to eat into the RegionServer's share of the machine's memory. I've reduced the memory allocated to tasks in hadoop's "mapred.child.java.opts" to ensure that the HADOOP_HEAPSIZE + total maximum memory allocated to tasks + the HBASE_HEAPSIZE is not greater than the memory available on the machine."
Hardware misconfiguration - NIC
In one case there was a cluster of 5k ZK clients attaching to a ZK ensemble; ~20% of the clients had misconfigured NICs, which was causing high TCP packet loss (and therefore high network latency), which in turn caused disconnects (timeout exceeded) -- but only under fairly high network load, which made it hard to track down! In the end, special processes were set up to continuously monitor client-server network latency; any spikes in the observed latencies were then correlated with the ZK logs (timeouts). Ultimately the NICs on all of these hosts were reconfigured.
Hardware - network switch
Another issue from the same user as the NIC issue -- a cluster of 5k ZK clients attaching to a ZK ensemble. It turned out that the network switches had bad firmware which caused high packet latencies under heavy load. At certain times of day we would see high numbers of ZK client disconnects. These turned out to be periods of heavy network activity, exacerbated by the ZK client session expirations (they caused even more network traffic). In the end the operations team spent a number of days testing and loading the network infrastructure until they were able to pin down the issue as being the switch firmware.

Hardware - ifconfig is your friend
In a recent issue we saw extremely poor performance from a 3-server ZK ensemble (cluster). Average and max latencies on operations, as reported by the "stat" command on the servers, were very high (multiple seconds). It turned out that one of the servers had a NIC that was dropping large numbers of packets due to framing problems. Swapping that server out for another (with no NIC issue) resolved the problem. The weird thing was that SSH/SCP/ping etc. reported no problems.
Moral of the story: use ifconfig to verify the network interface if you are seeing issues on the cluster.
Hardware - hdparm is your friend
Poor disk IO will also result in increased operation latencies. Use hdparm with the -t and -T options to verify the performance of persistent storage.
Hardware - ECC memory problems can be hard to track down
I've seen a particularly nasty problem where bad ECC memory was causing a single server to run an order of magnitude slower than the rest of the servers in the cluster. This caused some particularly nasty, random problems that were nearly impossible to track down (since the machine kept running, just slowly). Ops replaced the ECC memory and all was fine. See the troubleshooting checklist at the top of this page -- the dd/md5sum commands listed there can help to sniff this out (hint: compare the results on all of your servers and verify that they are at least "close").
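Comparing the dd/md5sum results by eye works for a handful of servers; for larger ensembles a trivial outlier check is easier. This sketch (the threshold factor is an arbitrary choice, not a ZooKeeper recommendation) flags any host whose timing is far off the median of its peers:

```python
import statistics

def flag_slow_servers(timings, factor=2.0):
    """timings: {hostname: seconds taken by the md5sum check above}.
    Flags hosts slower than `factor` times the median -- a crude way
    to spot a single machine (e.g. with bad ECC memory) running far
    slower than the rest of the cluster."""
    median = statistics.median(timings.values())
    return sorted(host for host, t in timings.items() if t > factor * median)
```

For the ECC case described above (one machine an order of magnitude slower), any reasonable factor will single it out immediately.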
Virtual environments
We've seen situations where users run the entire ZK cluster on a set of VMware VMs, all on the same host system. Latency in this configuration was >>> 10 seconds in some cases due to resource issues (in particular IO -- see the link provided above; dedicated log devices are critical to low-latency operation of the ZK cluster). Obviously no one should run this configuration in production -- in particular, there is no resilience to failure of the single host.

Virtual environments - "Cloud Computing"
In one scenario involving EC2, ZK was seeing frequent client disconnects. The user had configured a timeout of 5 seconds, which is too low -- probably much too low. Why? You are running in a virtualized environment on non-dedicated hardware outside your control and inspection. There is typically no way to tell (unless you are running on the 8-core EC2 systems) whether the EC2 host you are running on is over- or under-subscribed with other VMs. There is no way to control disk latency either. You could be seeing large latencies due to resource contention on the EC2 host alone. On top of that, I've heard that network latencies in EC2 are high.

GC pressure
The Java GC can cause starvation of the Java threads in the VM. This manifests itself as client disconnects and session expirations due to starvation of the heartbeat thread: while the GC runs, all Java threads are locked out from running.
You can get an idea of what the GC is doing by enabling GC logging with JVM options such as:

-Xloggc:gc.log -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGC -XX:+PrintGCTimeStamps -XX:+PrintGCDetails
gchisto is a useful tool for analyzing GC logs https://gchisto.dev.java.net/
Additionally you can use 'jstat' on a running jvm to gain more insight into realtime GC activity, see: http://java.sun.com/j2se/1.5.0/docs/tooldocs/share/jstat.html
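Besides gchisto and jstat, a quick script can pull the stop-the-world pauses out of a GC log produced with the flags above. The line format assumed below ("Total time for which application threads were stopped: N seconds") is what -XX:+PrintGCApplicationStoppedTime emits on HotSpot; adjust the pattern if your JVM's output differs.

```python
import re

STOPPED_RE = re.compile(
    r"Total time for which application threads were stopped:"
    r"\s+([0-9.]+)\s+seconds")

def long_pauses(gc_log_lines, threshold_s=1.0):
    """Return the stop-the-world pause durations (in seconds) that
    meet or exceed `threshold_s`. Pauses approaching the session
    timeout are prime suspects for session expirations."""
    pauses = []
    for line in gc_log_lines:
        m = STOPPED_RE.search(line)
        if m:
            seconds = float(m.group(1))
            if seconds >= threshold_s:
                pauses.append(seconds)
    return pauses
```

Correlating the timestamps of any long pauses it finds with the disconnect/expiration entries in the ZooKeeper client log is usually enough to confirm or rule out GC as the cause.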
This issue can be resolved in a few ways:
First, look at using one of the alternative GC algorithms, e.g. the following JVM options: -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC
Secondly, you might try the solution used by HBase: spawn a non-Java (JNI) thread to manage your ephemeral znodes. This is a pretty advanced option, however; try the GC tuning first.

Performance tuning
Some things to keep in mind while tuning ZooKeeper performance:
- Verify that logging isn't at DEBUG. Check your log4j.properties file and change the line log4j.rootLogger=DEBUG, ROLLINGFILE to log4j.rootLogger=WARN, ROLLINGFILE. Logging to disk on every action can greatly affect performance.
- Verify that you are using fast local disk for the journal.
Test with http://github.com/phunt/zk-smoketest. This should help verify basic operation and the latency characteristics of your cluster.