[Experience Share] Monitoring custom metrics with Ganglia in practice

Posted 2015-11-26 09:58:01

Ganglia is a monitoring system open-sourced by UC Berkeley, designed from the outset for monitoring distributed clusters. It covers both the resource level (CPU, memory, disk, I/O, network load, and so on) and the business level: since users can easily add custom metrics, it can track indicators such as service performance, load, and error rate, for example a web service's QPS or its rate of HTTP error statuses. In addition, when integrated with Nagios it can trigger alerts whenever a metric crosses a threshold.

Compared with Zabbix, Ganglia's advantage is that the collection agent on each client (gmond) imposes very little system overhead, so it does not affect the performance of the monitored services.

Ganglia has three main components:



  • gmond: deployed on every monitored machine; periodically collects metrics and broadcasts them via multicast or unicast.
  • gmetad: deployed on the server side; periodically pulls the data collected by gmond from the hosts listed in its configured data_source entries.
  • ganglia-web: presents the collected metrics on web pages.
  Installation is not covered here; see: http://www.it165.net/admin/html/201302/770.html
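The hosts gmetad pulls from are declared as data_source entries in its configuration; a minimal /etc/ganglia/gmetad.conf fragment for illustration (the cluster name, interval, and host names below are placeholders, not from this article):

```
# data_source "<cluster name>" <poll interval in seconds> <host:port> ...
# Additional hosts on one line are failover sources for the same
# cluster, not separate clusters.
data_source "my_cluster" 15 node1:8649 node2:8649
```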
  


This article focuses on how to develop custom metrics, so you can monitor the indicators you care about. There are three main approaches.

1. Use gmetric directly

Machines with gmond installed also get /usr/bin/gmetric, a command that broadcasts a metric's name, value, and related attributes. For example:



  

/usr/bin/gmetric -c /etc/ganglia/gmond.conf --name=test --type=int32 --units=sec --value=2   
The full set of gmetric options is documented at: http://manpages.ubuntu.com/manpages/hardy/man1/gmetric.1.html


Besides invoking gmetric on the command line, you can use bindings for common languages such as Go, Ruby, Java, and Python; the bindings are available on GitHub and only need to be imported:

  • Go: https://github.com/ganglia/ganglia_contrib/tree/master/ganglia-go
  • Ruby: https://github.com/igrigorik/gmetric/blob/master/lib/gmetric.rb
  • Java: https://github.com/ganglia/ganglia_contrib/tree/master/gmetric-java
  • Python: https://github.com/ganglia/ganglia_contrib/tree/master/gmetric-python


2. Use a third-party tool built on gmetric

This article takes ganglia-logtailer as the example: https://github.com/ganglia/ganglia_contrib/tree/master/ganglia-logtailer
  

The tool builds on the logtail (Debian) / logcheck (CentOS) package to tail the log on a schedule; the class named via --classname then parses the new lines, computes custom metrics from the fields you care about, and broadcasts them through gmetric.





For example, to match our service's nginx log format, we modified NginxLogtailer.py as follows:



  

# -*- coding: utf-8 -*-
###
###  This plugin for logtailer will crunch nginx logs and produce these metrics:
###    * hits per second
###    * GETs per second
###    * average query processing time
###    * ninetieth percentile query processing time
###    * number of HTTP 200, 300, 400, and 500 responses per second
###
###  Note that this plugin depends on a certain nginx log format, documented in
###  __init__.

import time
import threading
import re

# local dependencies
from ganglia_logtailer_helper import GangliaMetricObject
from ganglia_logtailer_helper import LogtailerParsingException, LogtailerStateException

class NginxLogtailer(object):
    # only used in daemon mode
    period = 30

    def __init__(self):
        '''This function should initialize any data structures or variables
        needed for the internal state of the line parser.'''
        self.reset_state()
        self.lock = threading.RLock()
        # this is what will match the nginx lines
        # log_format ganglia-logtailer
        #     '$host '
        #     '$server_addr '
        #     '$remote_addr '
        #     '- '
        #     '"$time_iso8601" '
        #     '$status '
        #     '$body_bytes_sent '
        #     '$request_time '
        #     '"$http_referer" '
        #     '"$request" '
        #     '"$http_user_agent" '
        #     '$pid';
        # NOTE: nginx 0.7 doesn't support $time_iso8601, use $time_local instead
        # original apache log format string:
        # %v %A %a %u %{%Y-%m-%dT%H:%M:%S}t %c %s %>s %B %D \"%{Referer}i\" \"%r\" \"%{User-Agent}i\" %P
        # host.com 127.0.0.1 127.0.0.1 - "2008-05-08T07:34:44" - 200 200 371 103918 - "-" "GET /path HTTP/1.0" "-" 23794
        # match keys: server_name, local_ip, remote_ip, date, status, size,
        #               req_time, referrer, request, user_agent, pid
        self.reg = re.compile('^(?P<remote_ip>[^ ]+) (?P<server_name>[^ ]+) (?P<hit>[^ ]+) \[(?P<date>[^\]]+)\] "(?P<request>[^"]+)" (?P<status>[^ ]+) (?P<size>[^ ]+) "(?P<referrer>[^"]+)" "(?P<user_agent>[^"]+)" "(?P<forward_to>[^"]+)" "(?P<req_time>[^"]+)"')
        # assume we're in daemon mode unless set_check_duration gets called
        self.dur_override = False

    # example function for parse line
    # takes one argument (text) line to be parsed
    # returns nothing
    def parse_line(self, line):
        '''This function should digest the contents of one line at a time,
        updating the internal state variables.'''
        self.lock.acquire()
        try:
            regMatch = self.reg.match(line)
            if regMatch:
                linebits = regMatch.groupdict()
                if '-' == linebits['request'] or 'file2get' in linebits['request']:
                    self.lock.release()
                    return
                self.num_hits += 1
                # capture GETs
                if 'GET' in linebits['request']:
                    self.num_gets += 1
                # capture HTTP response code
                rescode = float(linebits['status'])
                if (rescode >= 200) and (rescode < 300):
                    self.num_two += 1
                elif (rescode >= 300) and (rescode < 400):
                    self.num_three += 1
                elif (rescode >= 400) and (rescode < 500):
                    self.num_four += 1
                elif (rescode >= 500) and (rescode < 600):
                    self.num_five += 1
                # capture request duration
                dur = float(linebits['req_time'])
                self.req_time += dur
                # store for 90th % calculation
                self.ninetieth.append(dur)
            else:
                raise LogtailerParsingException, "regmatch failed to match"
        except Exception, e:
            self.lock.release()
            raise LogtailerParsingException, "regmatch or contents failed with %s" % e
        self.lock.release()

    # example function for deep copy
    # takes no arguments
    # returns one object
    def deep_copy(self):
        '''This function should return a copy of the data structure used to
        maintain state.  This copy should be different from the object that is
        currently being modified so that the other thread can deal with it
        without fear of it changing out from under it.  The format of this
        object is internal to the plugin.'''
        myret = dict( num_hits=self.num_hits,
                      num_gets=self.num_gets,
                      req_time=self.req_time,
                      num_two=self.num_two,
                      num_three=self.num_three,
                      num_four=self.num_four,
                      num_five=self.num_five,
                      ninetieth=self.ninetieth )
        return myret

    # example function for reset_state
    # takes no arguments
    # returns nothing
    def reset_state(self):
        '''This function resets the internal data structure to 0 (saving
        whatever state it needs).  This function should be called
        immediately after deep copy with a lock in place so the internal
        data structures can't be modified in between the two calls.  If the
        time between calls to get_state is necessary to calculate metrics,
        reset_state should store now() each time it's called, and get_state
        will use the time since that now() to do its calculations'''
        self.num_hits = 0
        self.num_gets = 0
        self.req_time = 0
        self.num_two = 0
        self.num_three = 0
        self.num_four = 0
        self.num_five = 0
        self.ninetieth = list()
        self.last_reset_time = time.time()

    # example for keeping track of runtimes
    # takes no arguments
    # returns float number of seconds for this run
    def set_check_duration(self, dur):
        '''This function only used if logtailer is in cron mode.  If it is
        invoked, get_check_duration should use this value instead of calculating
        it.'''
        self.duration = dur
        self.dur_override = True

    def get_check_duration(self):
        '''This function should return the time since the last check.  If called
        from cron mode, this must be set using set_check_duration().  If in
        daemon mode, it should be calculated internally.'''
        if self.dur_override:
            duration = self.duration
        else:
            cur_time = time.time()
            duration = cur_time - self.last_reset_time
            # the duration should be within 10% of period
            acceptable_duration_min = self.period - (self.period / 10.0)
            acceptable_duration_max = self.period + (self.period / 10.0)
            if duration < acceptable_duration_min or duration > acceptable_duration_max:
                raise LogtailerStateException, "time calculation problem - duration (%s) > 10%% away from period (%s)" % (duration, self.period)
        return duration

    # example function for get_state
    # takes no arguments
    # returns a dictionary of (metric => metric_object) pairs
    def get_state(self):
        '''This function should acquire a lock, call deep copy, get the
        current time if necessary, call reset_state, then do its
        calculations.  It should return a list of metric objects.'''
        # get the data to work with
        self.lock.acquire()
        try:
            mydata = self.deep_copy()
            check_time = self.get_check_duration()
            self.reset_state()
            self.lock.release()
        except LogtailerStateException, e:
            # if something went wrong with deep_copy or the duration, reset and continue
            self.reset_state()
            self.lock.release()
            raise e
        # crunch data to how you want to report it
        hits_per_second = mydata['num_hits'] / check_time
        gets_per_second = mydata['num_gets'] / check_time
        if mydata['num_hits'] != 0:
            avg_req_time = mydata['req_time'] / mydata['num_hits']
        else:
            avg_req_time = 0
        two_per_second = mydata['num_two'] / check_time
        three_per_second = mydata['num_three'] / check_time
        four_per_second = mydata['num_four'] / check_time
        five_per_second = mydata['num_five'] / check_time
        # calculate 90th % request time
        ninetieth_list = mydata['ninetieth']
        ninetieth_list.sort()
        num_entries = len(ninetieth_list)
        if num_entries != 0:
            ninetieth_element = ninetieth_list[int(num_entries * 0.9)]
        else:
            ninetieth_element = 0
        # package up the data you want to submit
        hps_metric = GangliaMetricObject('nginx_hits', hits_per_second, units='hps')
        gps_metric = GangliaMetricObject('nginx_gets', gets_per_second, units='hps')
        avgdur_metric = GangliaMetricObject('nginx_avg_dur', avg_req_time, units='sec')
        ninetieth_metric = GangliaMetricObject('nginx_90th_dur', ninetieth_element, units='sec')
        twops_metric = GangliaMetricObject('nginx_200', two_per_second, units='hps')
        threeps_metric = GangliaMetricObject('nginx_300', three_per_second, units='hps')
        fourps_metric = GangliaMetricObject('nginx_400', four_per_second, units='hps')
        fiveps_metric = GangliaMetricObject('nginx_500', five_per_second, units='hps')
        # return a list of metric objects
        return [ hps_metric, gps_metric, avgdur_metric, ninetieth_metric,
                 twops_metric, threeps_metric, fourps_metric, fiveps_metric, ]
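The 90th-percentile step in get_state() is easy to check in isolation; a small sketch (the standalone function name here is ours, not part of the plugin):

```python
def ninetieth_percentile(durations):
    """Sort the samples and pick the element at index int(n * 0.9),
    mirroring the calculation in get_state(); returns 0 when empty."""
    ordered = sorted(durations)
    n = len(ordered)
    if n == 0:
        return 0
    return ordered[int(n * 0.9)]

# With exactly 10 samples, int(10 * 0.9) == 9, so the largest sample is chosen.
print(ninetieth_percentile([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]))
```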
  

After deploying ganglia-logtailer on the monitored machine, create a cron job with the following entry:


*/1 * * * * root   /usr/local/bin/ganglia-logtailer --classname NginxLogtailer --log_file /usr/local/nginx-video/logs/access.log  --mode cron --gmetric_options '-C test_cluster -g nginx_status'


Reload the crond service; after about a minute the corresponding metrics appear in the Ganglia web UI.



For ganglia-logtailer deployment details, see: https://github.com/ganglia/ganglia_contrib/tree/master/ganglia-logtailer
  


3. Write your own module in a supported language; this article uses Python as the example

Ganglia supports user-written Python modules. The brief introduction from GitHub:


Writing a Python module is very simple. You just need to write it following a template and put the resulting Python module (.py) in /usr/lib(64)/ganglia/python_modules.

A corresponding Python configuration (.pyconf) file needs to reside in /etc/ganglia/conf.d/.

For example, here is a sample Python module that reports the machine's temperature:


  

acpi_file = "/proc/acpi/thermal_zone/THRM/temperature"

def temp_handler(name):
    try:
        f = open(acpi_file, 'r')
    except IOError:
        return 0
    for l in f:
        line = l.split()
        return int(line[1])

def metric_init(params):
    global descriptors, acpi_file
    if 'acpi_file' in params:
        acpi_file = params['acpi_file']
    d1 = {'name': 'temp',
          'call_back': temp_handler,
          'time_max': 90,
          'value_type': 'uint',
          'units': 'C',
          'slope': 'both',
          'format': '%u',
          'description': 'Temperature of host',
          'groups': 'health'}
    descriptors = [d1]
    return descriptors

def metric_cleanup():
    '''Clean up the metric module.'''
    pass

# This code is for debugging and unit testing
if __name__ == '__main__':
    metric_init({})
    for d in descriptors:
        v = d['call_back'](d['name'])
        print 'value for %s is %u' % (d['name'], v)

Alongside the module file, you also need a matching configuration file (placed at /etc/ganglia/conf.d/temp.pyconf) in the following format:
  

modules {
  module {
    name = "temp"
    language = "python"
    # The following params are examples only
    #   They are not actually used by the temp module
    param RandomMax {
      value = 600
    }
    param ConstantValue {
      value = 112
    }
  }
}

collection_group {
  collect_every = 10
  time_threshold = 50
  metric {
    name = "temp"
    title = "Temperature"
    value_threshold = 70
  }
}
  

With these two files in place, the module has been added successfully.


For more user-contributed modules, see https://github.com/ganglia/gmond_python_modules


These include modules for common services such as Elasticsearch, filecheck, nginx_status, and MySQL. They are very useful; with minor modifications they can usually be adapted to your own needs.





Other useful user-contributed tools:

  • ganglia-alert: fetches gmetad data and raises alerts, https://github.com/ganglia/ganglia_contrib/tree/master/ganglia-alert
  • ganglia-docker: run Ganglia inside Docker, https://github.com/ganglia/ganglia_contrib/tree/master/docker
  • gmetad-health-check: monitors the gmetad service and restarts it if it goes down, https://github.com/ganglia/ganglia_contrib/tree/master/gmetad_health_checker
  • chef-ganglia: deploy Ganglia with Chef, https://github.com/ganglia/chef-ganglia
  • ansible-ganglia: automated Ganglia deployment with Ansible, https://github.com/remysaissy/ansible-ganglia
  • ganglia-nagios: integrates Nagios with Ganglia, https://github.com/ganglia/ganglios
  • ganglia-api: exposes a REST API that returns gmetad-collected data in a defined format, https://github.com/guardian/ganglia-api


Questions and comments are welcome.
  
