Posted by eddik on 2019-1-8 07:15:06

Monitoring ZooKeeper Service Health with Zabbix

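  1 Use Case
  Our company's current workloads rarely use ZooKeeper as a coordination service. However, we are going to deploy Redis clusters with Codis, and Codis relies on ZooKeeper to store its configuration, so monitoring ZooKeeper properly matters as well.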
  

  2 Key Points for Monitoring ZooKeeper
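  System monitoring
  Memory usage         ZooKeeper should run entirely in memory and must never touch swap. The Java heap size must not exceed the available memory.
  Swap usage           Swapping degrades ZooKeeper performance; set vm.swappiness = 0.
  Network bandwidth    If ZooKeeper performance drops, check bandwidth usage and packet loss. A typical ZooKeeper workload is about 20% writes and 80% reads.
  Disk usage           Keep an eye on the usage of the ZooKeeper data directory.
  Disk I/O             ZooKeeper writes to disk asynchronously, so it should not generate heavy I/O. If ZooKeeper shares disks with other I/O-intensive services, watch disk I/O.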
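
  The swap guidance above can be checked from a script. The following is a minimal sketch written for Python 3 (unlike the Python 2 script later in this post), assuming a Linux host that exposes /proc/meminfo. The function names parse_meminfo and swap_used_kb are illustrative and not part of the original article.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB for sizes)."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(':')
        fields = rest.split()
        # Keep only well-formed "Name:   <number> [kB]" lines
        if sep and fields and fields[0].isdigit():
            info[name.strip()] = int(fields[0])
    return info


def swap_used_kb(info):
    """Swap in use = SwapTotal - SwapFree (both reported in kB)."""
    return info.get('SwapTotal', 0) - info.get('SwapFree', 0)


def read_swap_used_kb(path='/proc/meminfo'):
    """Read the live value; assumes a Linux host where this path exists."""
    with open(path) as f:
        return swap_used_kb(parse_meminfo(f.read()))
```

  On a ZooKeeper host, read_swap_used_kb() returning anything above 0 would suggest that the machine has started swapping.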
  

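  ZooKeeper monitoring
  zk_avg/min/max_latency           Time taken to respond to a client request. Alerting when it exceeds 10 ticks is recommended.
  zk_outstanding_requests          Number of queued requests. It grows when ZooKeeper is pushed beyond its capacity; a suggested alert threshold is 10.
  zk_packets_received              Number of request packets received from clients.
  zk_packets_sent                  Number of packets sent to clients, mainly responses and notifications.
  zk_max_file_descriptor_count     Maximum number of open files allowed, controlled by ulimit.
  zk_open_file_descriptor_count    Number of open files. Alert when it exceeds 85% of the allowed maximum.
  Mode                             Role of the server: standalone if it has not joined an ensemble, otherwise follower or leader.
  zk_followers                     Reported by the leader only: number of followers in the ensemble. The normal value is the number of ensemble members minus 1.
  zk_pending_syncs                 Reported by the leader only: number of pending syncs.
  zk_znode_count                   Number of znodes.
  zk_watch_count                   Number of watches.
  Java Heap Size                   Java heap size of the ZooKeeper Java process.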
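
  The alert thresholds listed above can be expressed as a small check over a dictionary of metrics. This is only a sketch in Python 3: the function name check_thresholds is hypothetical, and the default tick_ms=2000 is an assumption (it matches the tickTime in ZooKeeper's sample configuration, but your zoo.cfg may differ). In practice these thresholds would normally live in Zabbix triggers rather than in a script.

```python
def check_thresholds(stats, tick_ms=2000, max_outstanding=10, fd_ratio=0.85):
    """Return the names of the metrics that breach the suggested thresholds.

    stats is a dict of mntr metrics with integer values.
    tick_ms is an assumed tickTime in milliseconds.
    """
    breached = []
    # Latency is reported in milliseconds; alert above 10 ticks
    if stats.get('zk_avg_latency', 0) > 10 * tick_ms:
        breached.append('zk_avg_latency')
    # A growing request queue means the server is over capacity
    if stats.get('zk_outstanding_requests', 0) > max_outstanding:
        breached.append('zk_outstanding_requests')
    # Open file descriptors above 85% of the ulimit-controlled maximum
    max_fd = stats.get('zk_max_file_descriptor_count', 0)
    open_fd = stats.get('zk_open_file_descriptor_count', 0)
    if max_fd and open_fd > fd_ratio * max_fd:
        breached.append('zk_open_file_descriptor_count')
    return breached
```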
  

# echo ruok|nc 127.0.0.1 2181
imok

# echo mntr|nc 127.0.0.1 2181
zk_version	3.4.6-1569965, built on 02/20/2014 09:09 GMT
zk_avg_latency	0
zk_max_latency	0
zk_min_latency	0
zk_packets_received	11
zk_packets_sent	10
zk_num_alive_connections	1
zk_outstanding_requests	0
zk_server_state	leader
zk_znode_count	17159
zk_watch_count	0
zk_ephemerals_count	1
zk_approximate_data_size	6666471
zk_open_file_descriptor_count	29
zk_max_file_descriptor_count	102400
zk_followers	2
zk_synced_followers	2
zk_pending_syncs	0

# echo srvr|nc 127.0.0.1 2181
Zookeeper version: 3.4.6-1569965, built on 02/20/2014 09:09 GMT
Latency min/avg/max: 0/0/0
Received: 26
Sent: 25
Connections: 1
Outstanding: 0
Zxid: 0x500000000
Mode: leader
Node count: 17159  
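
  The nc calls above can be reproduced in Python with the socket module. A minimal Python 3 sketch follows; the names exchange and four_letter_word are illustrative, and the host, port and timeout defaults are assumptions that match the examples in this post. It reads until the server closes the connection instead of relying on a single fixed-size recv, so a long mntr reply is not cut short.

```python
import socket


def exchange(sock, cmd):
    """Send cmd on an already connected socket and read until the peer closes."""
    sock.sendall(cmd.encode('ascii'))
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:  # peer closed the connection: the reply is complete
            break
        chunks.append(chunk)
    return b''.join(chunks).decode('utf-8', 'replace')


def four_letter_word(cmd, host='127.0.0.1', port=2181, timeout=1.0):
    """Connect to a ZooKeeper server, send a four-letter command, return the reply."""
    with socket.create_connection((host, port), timeout=timeout) as s:
        return exchange(s, cmd)
```

  With a server listening locally, four_letter_word('ruok') would be expected to return the same text as echo ruok|nc 127.0.0.1 2181.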

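  3 Writing the Script and Configuration File for Zabbix to Monitor ZooKeeper
  There are two ways for Zabbix to collect these metrics. One is to fetch each item individually through the zabbix agent, as either active or passive checks. The other is to push all the metrics to Zabbix in one go with zabbix_sender. We use the second approach here, which means the script cannot be written the way a per-item zabbix agent script would be.
  First gather the metrics into a dictionary, then iterate over it and send each key:value pair with zabbix_sender's -k and -o options.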
  

  echo mntr|nc 127.0.0.1 2181
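  This command can be run from Python with the subprocess module, or you can use the socket module to connect to port 2181 and send the command yourself. The mntr output then has to be turned into a dictionary.
  That is, data in this form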
zk_version	3.4.6-1569965, built on 02/20/2014 09:09 GMT
zk_avg_latency	0
zk_max_latency	0
zk_min_latency	0
zk_packets_received	91
zk_packets_sent	90
zk_num_alive_connections	1
zk_outstanding_requests	0
zk_server_state	follower
zk_znode_count	17159
zk_watch_count	0
zk_ephemerals_count	1
zk_approximate_data_size	6666471
zk_open_file_descriptor_count	27
zk_max_file_descriptor_count	102400

  has to be converted into this
{'zk_followers': 2, 'zk_outstanding_requests': 0, 'zk_approximate_data_size': 6666471, 'zk_packets_sent': 2089, 'zk_pending_syncs': 0, 'zk_avg_latency': 0, 'zk_version': '3.4.6-1569965, built on 02/20/2014 09:09 GMT', 'zk_watch_count': 2, 'zk_packets_received': 2090, 'zk_open_file_descriptor_count': 30, 'zk_server_ruok': 'imok', 'zk_server_state': 'leader', 'zk_synced_followers': 2, 'zk_max_latency': 28, 'zk_num_alive_connections': 2, 'zk_min_latency': 0, 'zk_ephemerals_count': 1, 'zk_znode_count': 17159, 'zk_max_file_descriptor_count': 102400}  
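
  The conversion described above comes down to splitting each line on the tab character and turning numeric values into integers. A minimal Python 3 sketch, with parse_mntr as an illustrative name that is not part of the original script:

```python
def parse_mntr(text):
    """Turn tab-separated mntr output into a dict, converting integer values."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition('\t')
        key, value = key.strip(), value.strip()
        if not sep or not key:
            continue  # skip blank or malformed lines
        try:
            stats[key] = int(value)
        except ValueError:
            stats[key] = value  # non-numeric, e.g. zk_version or zk_server_state
    return stats
```

  Values such as zk_version and zk_server_state stay as strings, so the matching Zabbix items need a text or character type rather than a numeric one.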

  

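  The data finally sent with zabbix_sender looks like this.
  zookeeper.status is the name of the key.
zookeeper.status[zk_outstanding_requests]:0
zookeeper.status[zk_approximate_data_size]:6666471
zookeeper.status[zk_packets_sent]:48
zookeeper.status[zk_avg_latency]:0
zookeeper.status[zk_version]:3.4.6-1569965, built on 02/20/2014 09:09 GMT
zookeeper.status[zk_watch_count]:0
zookeeper.status[zk_packets_received]:49
zookeeper.status[zk_open_file_descriptor_count]:27
zookeeper.status[zk_server_ruok]:imok
zookeeper.status[zk_server_state]:follower
zookeeper.status[zk_max_latency]:0
zookeeper.status[zk_num_alive_connections]:1
zookeeper.status[zk_min_latency]:0
zookeeper.status[zk_ephemerals_count]:1
zookeeper.status[zk_znode_count]:17159
zookeeper.status[zk_max_file_descriptor_count]:102400  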
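
  Each of these pairs corresponds to one zabbix_sender call with the -k and -o options. The sketch below (Python 3) only builds the argument lists and does not run anything. The name build_sender_commands is hypothetical, and the default paths are the ones used by the full script in this post, so adjust them to your installation.

```python
def build_sender_commands(stats,
                          sender='/opt/app/zabbix/sbin/zabbix_sender',
                          conf='/opt/app/zabbix/conf/zabbix_agentd.conf'):
    """Return one zabbix_sender argument list per metric in stats."""
    commands = []
    for name in sorted(stats):
        # Item key in the form zookeeper.status[<metric name>]
        key = 'zookeeper.status[%s]' % name
        commands.append([sender, '-c', conf, '-k', key, '-o', str(stats[name])])
    return commands
```

  Each returned list can be passed to subprocess.call as is. Passing a list instead of a shell string means a value containing spaces, such as zk_version, needs no extra quoting.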

  

  A simplified version of the code:
#!/usr/bin/python
# Python 2 script: fetch the mntr output and build the two dictionaries
import socket
from cStringIO import StringIO

# Send the four-letter command to the local ZooKeeper server
s=socket.socket()
s.connect(('localhost',2181))
s.send('mntr')
data_mntr=s.recv(2048)
s.close()

h=StringIO(data_mntr)
result={}   # metric name -> value
zresult={}  # Zabbix item key -> value
for line in h.readlines():
    # each mntr line is "key<TAB>value"
    key,value=map(str.strip,line.split('\t'))
    zkey='zookeeper.status' + '[' + key + ']'
    zvalue=value
    result[key]=value
    zresult[zkey]=zvalue
print result
print '\n\n'
print zresult

# python test.py
{'zk_outstanding_requests': '0', 'zk_approximate_data_size': '6666471', 'zk_max_latency': '0', 'zk_avg_latency': '0', 'zk_version': '3.4.6-1569965, built on 02/20/2014 09:09 GMT', 'zk_watch_count': '0', 'zk_num_alive_connections': '1', 'zk_open_file_descriptor_count': '27', 'zk_server_state': 'follower', 'zk_packets_sent': '542', 'zk_packets_received': '543', 'zk_min_latency': '0', 'zk_ephemerals_count': '1', 'zk_znode_count': '17159', 'zk_max_file_descriptor_count': '102400'}

{'zookeeper.status': '0', 'zookeeper.status': '0', 'zookeeper.status': '0', 'zookeeper.status': '6666471', 'zookeeper.status': 'follower', 'zookeeper.status': '1', 'zookeeper.status': '0', 'zookeeper.status': '0', 'zookeeper.status': '543', 'zookeeper.status': '1', 'zookeeper.status': '17159', 'zookeeper.status': '542', 'zookeeper.status': '27', 'zookeeper.status': '102400', 'zookeeper.status': '3.4.6-1569965, built on 02/20/2014 09:09 GMT'}  

  

  The full code:
#!/usr/bin/python

""" Check Zookeeper Cluster
zookeeper version should be newer than 3.4.x
# echo mntr|nc 127.0.0.1 2181
zk_version	3.4.6-1569965, built on 02/20/2014 09:09 GMT
zk_avg_latency	0
zk_max_latency	4
zk_min_latency	0
zk_packets_received	84467
zk_packets_sent	84466
zk_num_alive_connections	3
zk_outstanding_requests	0
zk_server_state	follower
zk_znode_count	17159
zk_watch_count	2
zk_ephemerals_count	1
zk_approximate_data_size	6666471
zk_open_file_descriptor_count	29
zk_max_file_descriptor_count	102400
# echo ruok|nc 127.0.0.1 2181
imok
"""
import sys
import socket
import re
import subprocess
from StringIO import StringIO
import os

zabbix_sender = '/opt/app/zabbix/sbin/zabbix_sender'
zabbix_conf = '/opt/app/zabbix/conf/zabbix_agentd.conf'
send_to_zabbix = 1

############# get zookeeper server status
class ZooKeeperServer(object):
    def __init__(self, host='localhost', port='2181', timeout=1):
      self._address = (host, int(port))
      self._timeout = timeout
      self._result= {}
    def _create_socket(self):
      return socket.socket()

    def _send_cmd(self, cmd):
      """ Send a 4letter word command to the server """
      s = self._create_socket()
      s.settimeout(self._timeout)
      s.connect(self._address)
      s.send(cmd)
      data = s.recv(2048)
      s.close()
      return data
    def get_stats(self):
      """ Get ZooKeeper server stats as a map """
      data_mntr = self._send_cmd('mntr')
      data_ruok = self._send_cmd('ruok')
      # start from empty dicts so that an empty reply cannot cause a NameError
      result_mntr = {}
      result_ruok = {}
      if data_mntr:
            result_mntr = self._parse(data_mntr)
      if data_ruok:
            result_ruok = self._parse_ruok(data_ruok)
      self._result = dict(result_mntr.items() + result_ruok.items())
      # these three metrics are only exposed by the leader; report 0 for them on other servers
      for name in ('zk_followers', 'zk_synced_followers', 'zk_pending_syncs'):
            self._result.setdefault(name, 0)
      return self._result

    def _parse(self, data):
      """ Parse the output from the 'mntr' 4letter word command """
      h = StringIO(data)
      result = {}
      for line in h.readlines():
            try:
                key, value = self._parse_line(line)
                result[key] = value
            except ValueError:
                pass # ignore broken lines
      return result
    def _parse_ruok(self, data):
      """ Parse the output from the 'ruok' 4letter word command """
      h = StringIO(data)
      result = {}
      ruok = h.readline()
      if ruok:
         result['zk_server_ruok'] = ruok
      return result

    def _parse_line(self, line):
      try:
            key, value = map(str.strip, line.split('\t'))
      except ValueError:
            raise ValueError('Found invalid line: %s' % line)
      if not key:
            raise ValueError('The key is mandatory and should not be empty')
      try:
            value = int(value)
      except (TypeError, ValueError):
            pass
      return key, value

    def get_pid(self):
#ps -ef|grep java|grep zookeeper|awk '{print $2}'
         pidarg = '''ps -ef|grep java|grep zookeeper|grep -v grep|awk '{print $2}' '''
         pidout = subprocess.Popen(pidarg,shell=True,stdout=subprocess.PIPE)
         pid = pidout.stdout.readline().strip('\n')
         return pid

    def send_to_zabbix(self, metric):
         key = "zookeeper.status[" + metric + "]"
         value = str(self._result[metric])
         if send_to_zabbix > 0:
             #print key + ":" + value
             try:
                # FNULL is the module-level handle on os.devnull, opened before this method is called
                subprocess.call([zabbix_sender, "-c", zabbix_conf, "-k", key, "-o", value], stdout=FNULL, stderr=FNULL, shell=False)
             except OSError, detail:
                print "Something went wrong while executing zabbix_sender : ", detail
         else:
                print "Simulation: the following command would be executed :\n", zabbix_sender, "-c", zabbix_conf, "-k", key, "-o", value, "\n"


def usage():
      """Display program usage"""
      print "\nUsage : ", sys.argv[0], " alive|all"
      print "Modes : \n\talive : Return pid of running zookeeper\n\tall : Send zookeeper stats as well"
      sys.exit(1)

accepted_modes = ['alive', 'all']
if len(sys.argv) == 2 and sys.argv[1] in accepted_modes:
      mode = sys.argv[1]
else:
      usage()


zk = ZooKeeperServer()
#print zk.get_stats()
pid = zk.get_pid()
if pid != "" and mode == 'all':
   zk.get_stats()
   # print zk._result
   FNULL = open(os.devnull, 'w')
   for key in zk._result:
       zk.send_to_zabbix(key)
   FNULL.close()
   print pid
elif pid != "" and mode == "alive":
    print pid
else:
    print 0  

  

  

  Zabbix configuration file check_zookeeper.conf
UserParameter=zookeeper.status[*],/usr/bin/python /opt/app/zabbix/sbin/check_zookeeper.py $1  

  Restart the zabbix agent service.
  
  4 Creating the Zabbix Template for ZooKeeper and Setting Alert Thresholds
  See the attachment for the template.
  
  References:
  https://blog.serverdensity.com/how-to-monitor-zookeeper/
  https://github.com/apache/zookeeper/tree/trunk/src/contrib/monitoring
  http://john88wang.blog.运维网.com/2165294/1708302
  
Attachment: http://down.运维网.com/data/2367399
