[Experience Sharing] Troubleshooting ZooKeeper Operating Environment

[复制链接]

尚未签到

发表于 2019-1-8 06:13:03 | 显示全部楼层 |阅读模式
  This page details specific problems people have seen, solutions (where found), and the types of steps taken to troubleshoot each issue. Feel free to update with your experiences.

  The HBase troubleshooting page also has insight for these kinds of issues.
Monitoring
  It is important to monitor the ZK environment (hardware, network, processes, etc...) in order to more easily troubleshoot problems. Otherwise you miss out on important information for determining the cause of the problem. What type of monitoring are you doing on your cluster? You can monitor at the host level -- that will give you some insight on where to look: cpu, memory, disk, network, etc... You can also monitor at the process level -- the ZooKeeper server JMX interface will give you information about latencies and such (you can also use the four letter words for that if you want to hack up some scripts instead of using JMX). JMX will also give you insight into the JVM workings - so for example you could confirm/rule out GC pauses causing the JVM Java threads to hang for long periods of time (see below).
  Without monitoring, troubleshooting will be more difficult, but not impossible. JMX can be used through jconsole, or you can access the stats through the four letter words; the log4j log also contains much important/useful information.
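  For a quick look without JMX, the four letter words can be issued over the client port with nc (or telnet). A minimal sketch, assuming a server listening on the default client port 2181 on localhost:

  # is the server up and serving requests? prints "imok" if so
  echo ruok | nc localhost 2181
  # dump min/avg/max request latencies, connection counts and server mode (leader/follower)
  echo stat | nc localhost 2181
  # reset the latency counters once you have recorded them
  echo srst | nc localhost 2181

  Watching the latency line from "stat" over time is often enough to spot a struggling server before digging into JMX.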
Troubleshooting Checklist
  The following can be a useful checklist when you are having issues with your ZK cluster, in particular if you are seeing large numbers of timeouts, session expirations, poor performance, or high operation latencies. Use the following on all servers and potentially on clients as well (a combined sketch of the disk and memory checks is shown after the list):

  •   hdparm with the -t and -T options to test your disk IO
  •   time dd if=/dev/urandom bs=512000 of=/tmp/memtest count=1050

    •   time md5sum /tmp/memtest; time md5sum /tmp/memtest; time md5sum /tmp/memtest
    •   See ECC memory section below for more on this

  •   ethtool to check the configuration of your network
  •   ifconfig also to check network and examine error counts

    •   ZK uses TCP for network connectivity, errors on the NICs can cause poor performance

  •   scp/ftp/etc... can be used to verify connectivity, try copying large files between nodes
  •   these smoke and latency tests can be useful to verify a cluster
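  As referenced above, here is a minimal sketch of the disk and memory checks from this list, runnable on each server (the device name /dev/sda is an assumption - substitute your own, and note that hdparm needs root):

  # uncached and cached disk read performance
  sudo hdparm -t -T /dev/sda
  # write ~500MB of random data, then checksum it three times;
  # the three timings should be close to each other and similar across all servers
  time dd if=/dev/urandom bs=512000 of=/tmp/memtest count=1050
  time md5sum /tmp/memtest
  time md5sum /tmp/memtest
  time md5sum /tmp/memtest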
Compare your results to some baselines
  See the Latency Overview page for some latency baselines. You can also compare the performance of cpu/disk/mem/etc... that you have available to what is used in this test.
A word or two about heartbeats
  Keep in mind that the session timeout period is used by both the client and the server. If the ZK leader doesn't hear from the client within the timeout (say it's 5 sec) it will expire the session. The client sends a ping after 1/3 of the timeout period has elapsed. It expects to hear a response before another 1/3 of the timeout elapses, after which it will attempt to re-sync to another server in the cluster. In the 5 sec timeout case you are therefore allowing roughly 1.7 seconds (one third of the timeout) for the request to go to the server, the server to respond back to the client, and the client to process the response. Check the latencies in ZK's JMX in order to get insight into this, i.e. if the server latency is high, say because of io issues, jvm swapping, vm latency, etc..., that will cause the clients/sessions to time out.
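  The timeout itself is negotiated between the client and the server. A minimal sketch of the related server settings, assuming a standard conf/zoo.cfg (the values shown are illustrative, not recommendations):

  # conf/zoo.cfg
  tickTime=2000
  # by default a requested session timeout is clamped to [2*tickTime, 20*tickTime];
  # these two settings override that range
  minSessionTimeout=4000
  maxSessionTimeout=40000

  The client requests a timeout when it connects (e.g. the sessionTimeout argument of the Java ZooKeeper constructor) and the server hands back the negotiated value, so check both ends when debugging expirations.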
Frequent client disconnects & session expirations
  ZooKeeper is a canary in a coal mine of sorts. Because of the heart-beating performed by the clients and servers, ZooKeeper-based applications are very sensitive to things like network and system latencies. We often see client disconnects and session expirations associated with these types of problems.
  Take a look at the troubleshooting checklist section above to start.
Client disconnects due to client side swapping
  This link specifically discusses the negative impact of swapping in the context of the server. However, swapping can be an issue for clients as well. It will delay, or potentially even stop for a significant period, the heartbeats from client to server, resulting in session expirations.
  As told by a user:

  "This issueis clearly linked to heavy utilization or swapping on the clients. I find that if I keep the clients from swapping that this error materializes>  As told by a HBase user:
  "After looking ganglia history, it's clear that the nodes in question were starved of memory, swapping like crazy.  The expired scanner lease, the region shutting down, and as you noted, the Zookeeper session expiry, were not a causal chain, but all the result of the machine grinding to a halt from swapping.  The MapReduce tasks were allocated too much memory, and an apparent memory leak in the job we were running was causing the tasks to eat into the RegionServer's share of the machine's memory.  I've reduced the memory allocated to tasks in hadoop's "mapred.child.java.opts" to ensure that the HADOOP_HEAPSIZE + total maximum memory allocated to tasks + the HBASE_HEAPSIZE is not greater than the memory available on the machine."
Hardware misconfiguration - NIC
  In one case there was a cluster of 5k ZK clients attaching to a ZK cluster; ~20% of the clients had mis-configured NICs, which caused high TCP packet loss (and therefore high network latency), which in turn caused disconnects (timeout exceeded), but only under fairly high network load (which made it hard to track down!). Special processes were set up to continuously monitor client-server network latency, and any spikes in the observed latencies were then correlated to the ZK logs (timeouts). In the end all of the NICs on these hosts were reconfigured.
Hardware - network switch

  Another issue with the same user as the NIC issue - a cluster of 5k ZK clients attaching to a ZK cluster. It turned out that the network switches had bad firmware which caused high packet latencies under heavy load. At certain times of day we would see high numbers of ZK client disconnects. It turned out that these were periods of heavy network activity, exacerbated by the ZK client session expirations (they caused even more network traffic). In the end the operations team spent a number of days testing/loading the network infrastructure until they were able to pin down the issue as being switch related.
Hardware - ifconfig is your friend
  In a recent issue we saw extremely poor performance from a 3 server ZK ensemble (cluster). Average and max latencies on operations as reported by the "stat" command on the servers were very high (multiple seconds). It turned out that one of the servers had a NIC that was dropping large numbers of packets due to framing problems. Switching out that server with another (no NIC issue) resolved the issue. The weird thing was that SSH/SCP/PING etc reported no problems.
  Moral of the story: use ifconfig to verify the network interface if you are seeing issues on the cluster.
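  A minimal sketch of what to look for (the interface name eth0 is an assumption):

  # error/dropped/overruns/frame counters should be near zero and, more importantly,
  # should not be increasing over time
  ifconfig eth0 | grep -iE 'errors|dropped|overruns|frame'
  # negotiated speed/duplex mismatches are another common culprit
  sudo ethtool eth0 | grep -iE 'speed|duplex'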
Hardware - hdparm is your friend
  Poor disk IO will also result in increased operation latencies. Use hdparm with the -t and -T options to verify the performance of persistent storage.
Hardware - ECC memory problems can be hard to track down
  I've seen a particularly nasty problem where bad ECC memory was causing a single server to run an order of magnitude slower than the rest of the servers in the cluster. This caused some particularly nasty/random problems that were nearly impossible to track down (since the machine kept running, just slowly). Ops replaced the ECC memory and all was fine. See the troubleshooting checklist at the top of this page -- the dd/md5sum commands listed there can help to sniff this out (hint: compare the results on all of your servers and verify they are at least "close").
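  A sketch of how that comparison might be scripted across the ensemble (the hostnames zk1..zk3 and passwordless ssh with bash on the remote side are assumptions):

  for h in zk1 zk2 zk3; do
    echo "== $h =="
    ssh "$h" "bash -c 'dd if=/dev/urandom of=/tmp/memtest bs=512000 count=1050 2>/dev/null; \
      time md5sum /tmp/memtest; time md5sum /tmp/memtest; time md5sum /tmp/memtest'"
  done

  A server whose md5sum timings are consistently several times slower than its peers deserves a hardware check.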
Virtual environments

  We've seen situations where users run the entire zk cluster on a set of VMWare vms, all on the same host system. Latency on this configuration was >>> 10sec in some cases due to resource issues (in particular io - see the link provided above, dedicated log devices are critical to low latency operation of the ZK cluster). Obviously no one should be running this configuration in production - in particular there is no resilience to failure of the single host.
Virtual environments - "Cloud Computing"

  In one scenario involving EC2, ZK was seeing frequent client disconnects. The user had configured a timeout of 5 seconds, which is too low, probably much too low. Why? You are running in virtualized environments on non-dedicated hardware outside your control/inspection. There is typically no way to tell (unless you are running on the 8 core ec2 systems) if the ec2 host you are running on is over/under subscribed (other vms). There is no way to control disk latency either. You could be seeing large latencies due to resource contention on the ec2 host alone. In addition to that I've heard that network latencies in ec2 are high as well.
GC pressure
  The Java GC can cause starvation of the Java threads in the VM. This manifests itself as client disconnects and session expirations due to starvation of the heartbeat thread. The GC runs, locking out all Java threads from running.

  You can get an idea of GC activity in the JVM by enabling GC logging with options such as:
  -Xloggc:gc.log -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGC -XX:+PrintGCTimeStamps -XX:+PrintGCDetails
  gchisto is a useful tool for analyzing GC logs https://gchisto.dev.java.net/
  Additionally you can use 'jstat' on a running jvm to gain more insight into realtime GC activity, see: http://java.sun.com/j2se/1.5.0/docs/tooldocs/share/jstat.html
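  For example, to sample GC activity on a running server once a second (the pid lookup assumes the standard server main class QuorumPeerMain and that the JDK tools are on the PATH):

  # find the ZooKeeper server pid
  jps -l | grep QuorumPeerMain
  # heap occupancy and GC counts/times, sampled every 1000 ms (substitute the pid found above)
  jstat -gcutil <pid> 1000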
  This issue can be resolved in a few ways:

  First look at using one of the alternative garbage collectors available in the JVM, e.g. the following JVM options: -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC
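  Whichever collector you pick, the flags have to reach the server JVM. A minimal sketch, assuming a standard distribution where zkServer.sh/zkEnv.sh source conf/java.env and append SERVER_JVMFLAGS to the server command line (check your own start scripts; the heap sizes and paths here are illustrative):

  # conf/java.env
  SERVER_JVMFLAGS="-Xms2g -Xmx2g \
    -XX:+UseConcMarkSweepGC -XX:ParallelGCThreads=8 \
    -Xloggc:/var/log/zookeeper/gc.log -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
    -XX:+PrintGCApplicationStoppedTime"
  export SERVER_JVMFLAGS

  Keep the heap small enough that it never pushes the machine into swap; as discussed above, swapping hurts ZooKeeper far more than a slightly smaller heap does.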

  Secondly, you might try the solution used by HBase: spawn a non-Java (JNI) thread to manage your ephemeral znodes. This is a pretty advanced option however; try the GC tuning options above first.
Performance tuning
  Some things to keep in mind while tuning ZooKeeper performance:

  •   Verify that logging isn't at DEBUG. Check your log4j.properties file and change the line log4j.rootLogger=DEBUG, ROLLINGFILE to log4j.rootLogger=WARN, ROLLINGFILE. Logging to disk on every action can greatly affect performance.
  •   Verify that you are using a fast, dedicated local disk for the journal (transaction log); see the zoo.cfg sketch after this list.

  •   Test with http://github.com/phunt/zk-smoketest. This should help you verify basic operation of the cluster and give you a baseline for the latencies to expect.
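  As mentioned in the journal item above, the transaction log benefits greatly from its own device. A minimal conf/zoo.cfg sketch (the paths are illustrative):

  # conf/zoo.cfg
  dataDir=/var/lib/zookeeper/data    # snapshots and myid
  dataLogDir=/zk-txlog               # transaction log (journal) on a dedicated, fast local disk

  If dataLogDir is not set the journal is written under dataDir, so any other IO on that device (snapshots, logging, etc...) directly inflates write latencies.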

