zabbix 优化之路
Zabbix架构:未采用proxy,未使用snmp,采用单zabbix server ,多agent架构。
每天linux监控主机,有112个监控项目,windows监控主机有69个监控项目
目前监控指标:
http://blog.运维网.com/e/u261/themes/default/images/spacer.gif
监控项目:64273个
每秒新值: 906.69
监控主机:781台
一,zabbix 遇到瓶颈问题
症状:
1,打开web界面非常慢
2,数据库慢查询很多,插入慢
Count: 1Time=24.36s (24s)Lock=0.00s (0s)Rows=0.0 (0),
zabbix@localhost
insert into history_uint (itemid,clock,ns,value) values(N,N,N,N),(N,N,N,N),(N,N,N,N)
3,日益增长的数据非常快1个月上亿条记录
7roaddba@localhost|zabbix>show table status like 'history_uint'\G;
*************************** 1. row***************************
Name: history_uint
Engine: InnoDB
Version: 10
Row_format: Compact
Rows: 191142580
Avg_row_length: 59
Data_length: 11367792640
Max_data_length: 0
Index_length: 6764806144
Data_free: 151997382656
Auto_increment: NULL
Create_time: NULL
Update_time: NULL
Check_time: NULL
Collation: utf8_general_ci
Checksum: NULL
Create_options: partitioned
Comment:
二,瓶颈分析
1,,zabbix server性能瓶颈
http://blog.运维网.com/e/u261/themes/default/images/spacer.gif
可以看到,红色的为队列,这边队列非常多,表示zabbix处理不过来了
2, nginx +php 架构瓶颈
这个瓶颈排除,目前nginx+php的配置可供10w并发
3,磁盘IO 瓶颈
# iostat-k 1
Linux 2.6.32-220.el6.x86_64 (7road.com) 03/28/2014 _x86_64_ (8 CPU)
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 48.00 12.00 696.00 12 696
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 51.00 0.00 1228.00 0 1228
目前看来磁盘IO 并不高
4,mysql数据库瓶颈
Count: 1Time=24.36s (24s)Lock=0.00s (0s)Rows=0.0 (0), zabbix@localhost
insert into history_uint (itemid,clock,ns,value) values(N,N,N,N),(N,N,N,N),(N,N,N,N)
数据太大,导致数据的查询和插入都变得非常慢
三,各个突破
1,zabbix server性能优化
http://blog.运维网.com/e/u261/themes/default/images/spacer.gif
http://blog.运维网.com/e/u261/themes/default/images/spacer.gif
通过上图可以看出,zabbix处理能力已经跟不上请求数量了,开始有很多队列,然后cache的命中也变的很低。
优化措施: 增加zabbix相关进程,扩大zabbix的cache
修改zabbixserver的配置文件,修改如下:
StartPollers=360 #增加处理zabbix 数据的进程数
StartPollersUnreachable=60 #增加recheck的进程数
StartTrappers=30 #增加建立agent连接传输进程数
StartDiscoverers=30 #增加发现功能进程
CacheSize=256M #增加主机,监控项等配置缓存
StartDBSyncers=50 #增加写入db进程
HistoryCacheSize=128M #增加历史数据缓存
TrendCacheSize=64M #增加trend数据缓存
HistoryTextCacheSize=256M
ValueCacheSize=128M # 增加监控指标值缓存
调整后,发现zabbix的队列没有了,说明zabbix进程能够处理所有的agent请求了
http://blog.运维网.com/e/u261/themes/default/images/spacer.gif
2,mysql数据库优化
Zabbix数据库中有几张大表,同时也是影响zabbix的web性能
表:history,history_uint,events,trends,trends_uint
CREATE TABLE `history` (
`itemid` bigint(20) unsigned NOT NULL,
`clock` int(11) NOT NULL DEFAULT '0',
`value` double(16,4) NOT NULL DEFAULT '0.0000',
`ns` int(11) NOT NULL DEFAULT '0',
KEY `history_1` (`itemid`,`clock`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
CREATE TABLE `history_uint` (
`itemid` bigint(20) unsigned NOT NULL,
`clock` int(11) NOT NULL DEFAULT '0',
`value` bigint(20) unsigned NOT NULL DEFAULT '0',
`ns` int(11) NOT NULL DEFAULT '0',
KEY `history_uint_1` (`itemid`,`clock`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
CREATE TABLE `events` (
`eventid` bigint(20) unsigned NOT NULL,
`source` int(11) NOT NULL DEFAULT '0',
`object` int(11) NOT NULL DEFAULT '0',
`objectid` bigint(20) unsigned NOT NULL DEFAULT '0',
`clock` int(11) NOT NULL DEFAULT '0',
`value` int(11) NOT NULL DEFAULT '0',
`acknowledged` int(11) NOT NULL DEFAULT '0',
`ns` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`eventid`),
KEY `events_1` (`source`,`object`,`objectid`,`clock`),
KEY `events_2` (`source`,`object`,`clock`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
CREATETABLE `trends` (
`itemid` bigint(20) unsigned NOT NULL,
`clock` int(11) NOT NULL DEFAULT '0',
`num` int(11) NOT NULL DEFAULT '0',
`value_min` double(16,4) NOT NULL DEFAULT '0.0000',
`value_avg` double(16,4) NOT NULL DEFAULT '0.0000',
`value_max` double(16,4) NOT NULL DEFAULT '0.0000',
PRIMARY KEY (`itemid`,`clock`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
CREATE TABLE `trends_uint` (
`itemid` bigint(20) unsigned NOT NULL,
`clock` int(11) NOT NULL DEFAULT '0',
`num` int(11) NOT NULL DEFAULT '0',
`value_min` bigint(20) unsigned NOT NULL DEFAULT '0',
`value_avg` bigint(20) unsigned NOT NULL DEFAULT '0',
`value_max` bigint(20) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`itemid`,`clock`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
以上5张表,数据大表,数据都非常大
当前zabbix运行了1个月,监控主机陆续增加到600台
表-1.1月增长
表名
数据量(行)
History
75792140
history_uint
206047753
Events
572100
trends
2949014
trends_uint
6950823
当zabbix监控主机达到780台后,数据日增长
表-1.2 日增长
表名
数据量(行)
History
10197549
history_uint
17494708
Events
##
trends
303197
trends_uint
544824
2.2 分区优化
此次仅对zabbix数据库中几张大表做了分区,包括history,history_uint,trends,trends_uint
这里没有events,因为这个表有外检,而且数据量相对其他不大,但是会影响web页面刷新,写event计划去删就好了
分区表的创建,管理都是通过下面的存储过程来维护
DELIMITER $$
/*!50003 CREATE DEFINER=`root`@`%`PROCEDURE `partition_create`(SCHEMANAME varchar(64), TABLENAME varchar(64),PARTITIONNAME varchar(64), CLOCK int)
BEGIN
/*SCHEMANAME = The DB schema in which to make changes
TABLENAME = The on which to create PARTITIONNAME
PARTITIONNAME = The name of the partition to create
CLOCK = PARTITIONNAME should include values with a clock less thanthis number
*/
/*
Verify that the partitiondoes not already exist
*/
DECLARE RETROWS INT;
SELECTCOUNT(1)INTO RETROWS
FROM information_schema.partitions
WHERE table_schema = SCHEMANAME ANDTABLE_NAME= TABLENAME AND partition_name = PARTITIONNAME;
IF RETROWS =0THEN
/*
1. Print a message indicating thata partition was created.
2. Create the SQL to create thepartition.
3. Execute the SQL from #2.
*/
SELECT CONCAT("create_partition(", SCHEMANAME,",", TABLENAME,",",PARTITIONNAME,",", CLOCK,")")AS msg;
SET @SQL= CONCAT('ALTER TABLE ', SCHEMANAME,'.', TABLENAME,
' ADD PARTITION (PARTITION ', PARTITIONNAME,' VALUES LESS THAN (', CLOCK,'));');
PREPARE STMT FROM @SQL;
EXECUTE STMT;
DEALLOCATE PREPARE STMT;
ENDIF;
END*/$$
DELIMITER ;
DELIMITER $$
/*!50003 CREATE DEFINER=`root`@`%`PROCEDURE `partition_drop`(SCHEMANAME VARCHAR(64), TABLENAME VARCHAR(64),DELETE_BELOW_PARTITION_DATE VARCHAR(64))
BEGIN
/*
SCHEMANAME = The DB schema in which to make changes
TABLENAME = The table with partitions to potentially delete
DELETE_BELOW_PARTITION_DATE = Delete any partitions with names thatare dates older than this one (yyyymmdd)
*/
DECLARE done INTDEFAULTFALSE;
DECLARE drop_part_name VARCHAR(9);
/*
Get a list of all the partitions that are older than the date
in DELETE_BELOW_PARTITION_DATE. All partitions are prefixed with
a "p", so use SUBSTRING TO get rid of that character.
*/
DECLARE myCursor CURSOR FOR
SELECT partition_name
FROM information_schema.partitions
WHERE table_schema = SCHEMANAME ANDTABLE_NAME= TABLENAME ANDSUBSTRING(partition_name FROM2) CREATE_NEXT_DAYS THEN
LEAVE create_loop;
END IF;
END LOOP;
SET OLD_CLOCK = DATE_SUB(NOW(),INTERVAL KEEP_HISTORY_DAYS DAY);
SET OLDER_THAN_PARTITION_DATE = DATE_FORMAT( OLD_CLOCK, '%Y%m%d' );
CALL partition_drop( SCHEMA_NAME, "history",OLDER_THAN_PARTITION_DATE );
CALL partition_drop( SCHEMA_NAME, "history_log",OLDER_THAN_PARTITION_DATE );
CALL partition_drop( SCHEMA_NAME, "history_str",OLDER_THAN_PARTITION_DATE );
CALL partition_drop( SCHEMA_NAME, "history_text",OLDER_THAN_PARTITION_DATE );
CALL partition_drop( SCHEMA_NAME, "history_uint",OLDER_THAN_PARTITION_DATE );
SET OLD_CLOCK = DATE_SUB(NOW(),INTERVAL KEEP_TREND_DAYS DAY);
SET OLDER_THAN_PARTITION_DATE = DATE_FORMAT( OLD_CLOCK, '%Y%m%d' );
CALL partition_drop( SCHEMA_NAME, "trends",OLDER_THAN_PARTITION_DATE );
CALL partition_drop( SCHEMA_NAME, "trends_uint",OLDER_THAN_PARTITION_DATE );
END */$$
DELIMITER ;
DELIMITER $$
/*!50003 CREATE DEFINER=`root`@`%`PROCEDURE `partition_verify`(SCHEMANAME VARCHAR(64), TABLENAME VARCHAR(64))
BEGIN
DECLARE CUR_CLOCK TIMESTAMP;
DECLARE PARTITION_NAME VARCHAR(16);
DECLARE CLOCK INT;
DECLARE RETROWS INT;
SELECT COUNT(1) INTO RETROWS
FROM information_schema.partitions
WHERE table_schema = SCHEMANAME AND table_name = TABLENAME ANDpartition_name IS NULL;
IF RETROWS = 1 THEN
SET CUR_CLOCK = NOW();
SET PARTITION_NAME =DATE_FORMAT(CUR_CLOCK, 'p%Y%m%d');
SET CLOCK =UNIX_TIMESTAMP(DATE_FORMAT(DATE_ADD(CUR_CLOCK ,INTERVAL 1 DAY),'%Y-%m-%d00:00:00'));
SET @sql = CONCAT("ALTER TABLE", SCHEMANAME, ".", TABLENAME, " PARTITION BYRANGE(`clock`)");
SET @sql = CONCAT(@sql,"(PARTITION ", PARTITION_NAME, " VALUES LESS THAN (",CLOCK, "));");
PREPARE STMT FROM @sql;
EXECUTE STMT;
DEALLOCATE PREPARE STMT;
END IF;
END */$$
DELIMITER ;
一共包含4个存储过程
· partition_create :创建新的分区
· partition_drop:删除指定分区
· partition_maintenance:分区表维护
· partition_verify:检查分区可用性
2.3使用方法
(1),partition_create
partition_create(SCHEMANAME varchar(64),TABLENAME varchar(64), PARTITIONNAME varchar(64), CLOCK int)
Example: CALL partition_create("zabbix", "history","p20131216", 1387267200);
SCHEMANAME = 操作对象表所在数据库名
TABLENAME = 需要分区的表名
PARTITIONNAME = 分区名
CLOCK = 使用range,所以指定clock大小
(2),partition_drop
partition_drop(SCHEMANAME VARCHAR(64),TABLENAME VARCHAR(64), DELETE_BELOW_PARTITION_DATE VARCHAR(64))
Example: CALL partition_drop("zabbix", "history","20131216");
SCHEMANAME =操作对象表所在数据库名
TABLENAME = 需要分区的表名
DELETE_BELOW_PARTITION_DATE = 删除指定日期(会删除此日期前的所有分区)
(3),partition_maintenance
partition_maintenance(KEEP_HISTORY_DAYSINT, KEEP_TREND_DAYS INT, CREATE_NEXT_DAYS INT)
Example: CALL partition_maintenance(28,730,14);
KEEP_HISTORY_DAYS = 保留history数据的天数
KEEP_TREND_DAYS = 保留trend数据的天数
CREATE_NEXT_DAYS = 提前创建分区的天数
(4),partition_verify
partition_verify(SCHEMANAME VARCHAR(64),TABLENAME VARCHAR(64))
Example: CALL partition_verify("zabbix", "history");
SCHEMANAME = 操作对象表所在数据库名
TABLENAME =需要分区的表名
这里通过另外一个存储过程+event的方式去自动维护分区表
方案是: 每天凌晨1.30执行,保留各分区1个月数据,每天创建新分区(都是按天来做分区的) ; events表,仅保留5万条记录
2.4 分区表查询
例如: 查看history分区信息
>Show create table history
CREATE TABLE `history` (
`itemid` bigint(20) unsigned NOT NULL,
`clock` int(11) NOT NULL DEFAULT '0',
`value` double(16,4) NOT NULL DEFAULT '0.0000',
`ns` int(11) NOT NULL DEFAULT '0',
KEY`history_1` (`itemid`,`clock`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
/*!50100 PARTITION BY RANGE (`clock`)
(PARTITION p20140326 VALUES LESS THAN(1395849600) ENGINE = InnoDB,
PARTITION p20140327 VALUES LESS THAN(1395936000) ENGINE = InnoDB,
PARTITION p20140328 VALUES LESS THAN(1396022400) ENGINE = InnoDB,
PARTITION p20140329 VALUES LESS THAN(1396108800) ENGINE = InnoDB,
PARTITION p20140330 VALUES LESS THAN(1396195200) ENGINE = InnoDB,
PARTITIONp20140331 VALUES LESS THAN (1396281600) ENGINE = InnoDB,
PARTITION p20140401 VALUES LESS THAN(1396368000) ENGINE = InnoDB,
PARTITION p20140402 VALUES LESS THAN(1396454400) ENGINE = InnoDB,
PARTITION p20140403 VALUES LESS THAN(1396540800) ENGINE = InnoDB,
PARTITION p20140404 VALUES LESS THAN(1396627200) ENGINE = InnoDB,
PARTITION p20140405 VALUES LESS THAN(1396713600) ENGINE = InnoDB,
PARTITION p20140406 VALUES LESS THAN(1396800000) ENGINE = InnoDB,
PARTITION p20140407 VALUES LESS THAN(1396886400) ENGINE = InnoDB,
PARTITION p20140408 VALUES LESS THAN(1396972800) ENGINE = InnoDB,
PARTITION p20140409 VALUES LESS THAN(1397059200) ENGINE = InnoDB,
PARTITION p20140410 VALUES LESS THAN(1397145600) ENGINE = InnoDB,
PARTITION p20140411 VALUES LESS THAN(1397232000) ENGINE = InnoDB,
PARTITION p20140412 VALUES LESS THAN(1397318400) ENGINE = InnoDB,
PARTITION p20140413 VALUES LESS THAN(1397404800) ENGINE = InnoDB,
PARTITION p20140414 VALUES LESS THAN(1397491200) ENGINE = InnoDB,
PARTITIONp20140415 VALUES LESS THAN (1397577600) ENGINE = InnoDB,
PARTITION p20140416 VALUES LESS THAN(1397664000) ENGINE = InnoDB,
PARTITION p20140417 VALUES LESS THAN(1397750400) ENGINE = InnoDB,
PARTITION p20140418 VALUES LESS THAN(1397836800) ENGINE = InnoDB,
PARTITION p20140419 VALUES LESS THAN(1397923200) ENGINE = InnoDB,
PARTITION p20140420 VALUES LESS THAN(1398009600) ENGINE = InnoDB,
PARTITION p20140421 VALUES LESS THAN(1398096000) ENGINE = InnoDB,
PARTITION p20140422 VALUES LESS THAN (1398182400)ENGINE = InnoDB,
PARTITION p20140423 VALUES LESS THAN(1398268800) ENGINE = InnoDB,
PARTITION p20140424 VALUES LESS THAN(1398355200) ENGINE = InnoDB,
PARTITION p20140425 VALUES LESS THAN(1398441600) ENGINE = InnoDB)
查看当前分区数据量
zabbix>select * frominformation_schema.partitions where TABLE_SCHEMA='zabbix' andTABLE_NAME='history' limit 3\G;
*************************** 1. row***************************
TABLE_CATALOG: def
TABLE_SCHEMA: zabbix
TABLE_NAME: history
PARTITION_NAME: p20140326
SUBPARTITION_NAME: NULL
PARTITION_ORDINAL_POSITION: 1
SUBPARTITION_ORDINAL_POSITION: NULL
PARTITION_METHOD: RANGE
SUBPARTITION_METHOD: NULL
PARTITION_EXPRESSION: `clock`
SUBPARTITION_EXPRESSION: NULL
PARTITION_DESCRIPTION: 1395849600
TABLE_ROWS: 62428485
AVG_ROW_LENGTH: 67
DATA_LENGTH: 4202496000
MAX_DATA_LENGTH: NULL
INDEX_LENGTH: 2451488768
DATA_FREE: 0
CREATE_TIME: NULL
UPDATE_TIME: NULL
CHECK_TIME: NULL
CHECKSUM: NULL
PARTITION_COMMENT:
NODEGROUP: default
TABLESPACE_NAME: NULL
*************************** 2. row***************************
TABLE_CATALOG: def
TABLE_SCHEMA: zabbix
TABLE_NAME: history
PARTITION_NAME: p20140327
SUBPARTITION_NAME: NULL
PARTITION_ORDINAL_POSITION: 2
SUBPARTITION_ORDINAL_POSITION: NULL
PARTITION_METHOD: RANGE
SUBPARTITION_METHOD: NULL
PARTITION_EXPRESSION: `clock`
SUBPARTITION_EXPRESSION: NULL
PARTITION_DESCRIPTION: 1395936000
TABLE_ROWS: 10197549
AVG_ROW_LENGTH: 59
DATA_LENGTH: 611319808
MAX_DATA_LENGTH: NULL
INDEX_LENGTH: 434110464
DATA_FREE: 0
CREATE_TIME: NULL
UPDATE_TIME: NULL
CHECK_TIME: NULL
CHECKSUM: NULL
PARTITION_COMMENT:
NODEGROUP: default
TABLESPACE_NAME: NULL
*************************** 3. row***************************
TABLE_CATALOG: def
TABLE_SCHEMA: zabbix
TABLE_NAME: history
PARTITION_NAME: p20140328
SUBPARTITION_NAME: NULL
PARTITION_ORDINAL_POSITION: 3
SUBPARTITION_ORDINAL_POSITION: NULL
PARTITION_METHOD: RANGE
SUBPARTITION_METHOD: NULL
PARTITION_EXPRESSION: `clock`
SUBPARTITION_EXPRESSION: NULL
PARTITION_DESCRIPTION: 1396022400
TABLE_ROWS: 6893834
AVG_ROW_LENGTH: 59
DATA_LENGTH: 410894336
MAX_DATA_LENGTH: NULL
INDEX_LENGTH: 293601280
DATA_FREE: 0
CREATE_TIME: NULL
UPDATE_TIME: NULL
CHECK_TIME: NULL
CHECKSUM: NULL
PARTITION_COMMENT:
NODEGROUP: default
TABLESPACE_NAME: NULL
3 rows in set (0.22 sec)
查看zabbix库中,有哪些表做了分区
|zabbix>select distinct(TABLE_NAME) frominformation_schema.partitions where TABLE_SCHEMA='zabbix'and PARTITION_NAME is not null;
+--------------+
| TABLE_NAME |
+--------------+
| history |
| history_log|
| history_str|
| history_text |
| history_uint |
| trends |
| trends_uint|
+--------------+
分区表管理
删除问题:
Alter table history drop partitionp20140328
新增分区:
Alter table history add partition(partition p20140328 value less than 1396022400);
合并分区:
Alter table history REORGANIZE PARTITIONp20140325,p20140326,p20140327,p20140328 into (partitionp20140328value less than 1396022400);
存储过程和event见 :zabbix_sp.sql 和event.sql
更详细文章见:
https://www.zabbix.org/wiki/Docs/howto/mysql_partition
页:
[1]