lang110 发表于 2019-1-17 13:30:33

zabbix 优化之路

  Zabbix架构:
  未采用proxy,未使用snmp,采用单zabbix server ,多agent架构。
  每天linux监控主机,有112个监控项目,windows监控主机有69个监控项目
  目前监控指标:
  http://blog.运维网.com/e/u261/themes/default/images/spacer.gif
  监控项目:64273个
  每秒新值: 906.69
  监控主机:781台
  一,zabbix 遇到瓶颈问题
  症状:
  1,打开web界面非常慢
  2,数据库慢查询很多,插入慢
  Count: 1Time=24.36s (24s)Lock=0.00s (0s)Rows=0.0 (0),
  zabbix@localhost
  insert into history_uint (itemid,clock,ns,value) values(N,N,N,N),(N,N,N,N),(N,N,N,N)
  3,日益增长的数据非常快1个月上亿条记录
  7roaddba@localhost|zabbix>show table status like 'history_uint'\G;
  *************************** 1. row***************************
  Name: history_uint
  Engine: InnoDB
  Version: 10
  Row_format: Compact
  Rows: 191142580
  Avg_row_length: 59
  Data_length: 11367792640
  Max_data_length: 0
  Index_length: 6764806144
  Data_free: 151997382656
  Auto_increment: NULL
  Create_time: NULL
  Update_time: NULL
  Check_time: NULL
  Collation: utf8_general_ci
  Checksum: NULL
  Create_options: partitioned
  Comment:
  二,瓶颈分析
  1,,zabbix server性能瓶颈
http://blog.运维网.com/e/u261/themes/default/images/spacer.gif
  可以看到,红色的为队列,这边队列非常多,表示zabbix处理不过来了
  2, nginx +php 架构瓶颈
  这个瓶颈排除,目前nginx+php的配置可供10w并发
  3,磁盘IO 瓶颈
  # iostat-k 1
  Linux 2.6.32-220.el6.x86_64 (7road.com)         03/28/2014      _x86_64_         (8 CPU)
  Device:            tps    kB_read/s   kB_wrtn/s    kB_read    kB_wrtn
  sda            48.00      12.00       696.00         12      696
  Device:            tps    kB_read/s   kB_wrtn/s    kB_read    kB_wrtn
  sda            51.00         0.00   1228.00          0       1228
  目前看来磁盘IO 并不高
  4,mysql数据库瓶颈
  Count: 1Time=24.36s (24s)Lock=0.00s (0s)Rows=0.0 (0), zabbix@localhost
  insert into history_uint (itemid,clock,ns,value) values(N,N,N,N),(N,N,N,N),(N,N,N,N)
  数据太大,导致数据的查询和插入都变得非常慢
  三,各个突破
  1,zabbix server性能优化
http://blog.运维网.com/e/u261/themes/default/images/spacer.gif
http://blog.运维网.com/e/u261/themes/default/images/spacer.gif
  通过上图可以看出,zabbix处理能力已经跟不上请求数量了,开始有很多队列,然后cache的命中也变的很低。
  优化措施: 增加zabbix相关进程,扩大zabbix的cache
  修改zabbixserver的配置文件,修改如下:
  StartPollers=360                   #增加处理zabbix 数据的进程数
  StartPollersUnreachable=60          #增加recheck的进程数
  StartTrappers=30                   #增加建立agent连接传输进程数
  StartDiscoverers=30                #增加发现功能进程
  CacheSize=256M                   #增加主机,监控项等配置缓存
  StartDBSyncers=50               #增加写入db进程
  HistoryCacheSize=128M             #增加历史数据缓存
  TrendCacheSize=64M                #增加trend数据缓存
  HistoryTextCacheSize=256M
  ValueCacheSize=128M             # 增加监控指标值缓存
  调整后,发现zabbix的队列没有了,说明zabbix进程能够处理所有的agent请求了
http://blog.运维网.com/e/u261/themes/default/images/spacer.gif
  2,mysql数据库优化
  Zabbix数据库中有几张大表,同时也是影响zabbix的web性能
  表:history,history_uint,events,trends,trends_uint
  CREATE TABLE `history` (
  `itemid` bigint(20) unsigned NOT NULL,
  `clock` int(11) NOT NULL DEFAULT '0',
  `value` double(16,4) NOT NULL DEFAULT '0.0000',
  `ns` int(11) NOT NULL DEFAULT '0',
  KEY `history_1` (`itemid`,`clock`)
  ) ENGINE=InnoDB DEFAULT CHARSET=utf8
  CREATE TABLE `history_uint` (
  `itemid` bigint(20) unsigned NOT NULL,
  `clock` int(11) NOT NULL DEFAULT '0',
  `value` bigint(20) unsigned NOT NULL DEFAULT '0',
  `ns` int(11) NOT NULL DEFAULT '0',
  KEY `history_uint_1` (`itemid`,`clock`)
  ) ENGINE=InnoDB DEFAULT CHARSET=utf8
  CREATE TABLE `events` (
  `eventid` bigint(20) unsigned NOT NULL,
  `source` int(11) NOT NULL DEFAULT '0',
  `object` int(11) NOT NULL DEFAULT '0',
  `objectid` bigint(20) unsigned NOT NULL DEFAULT '0',
  `clock` int(11) NOT NULL DEFAULT '0',
  `value` int(11) NOT NULL DEFAULT '0',
  `acknowledged` int(11) NOT NULL DEFAULT '0',
  `ns` int(11) NOT NULL DEFAULT '0',
  PRIMARY KEY (`eventid`),
  KEY `events_1` (`source`,`object`,`objectid`,`clock`),
  KEY `events_2` (`source`,`object`,`clock`)
  ) ENGINE=InnoDB DEFAULT CHARSET=utf8
  CREATETABLE `trends` (
  `itemid` bigint(20) unsigned NOT NULL,
  `clock` int(11) NOT NULL DEFAULT '0',
  `num` int(11) NOT NULL DEFAULT '0',
  `value_min` double(16,4) NOT NULL DEFAULT '0.0000',
  `value_avg` double(16,4) NOT NULL DEFAULT '0.0000',
  `value_max` double(16,4) NOT NULL DEFAULT '0.0000',
  PRIMARY KEY (`itemid`,`clock`)
  ) ENGINE=InnoDB DEFAULT CHARSET=utf8
  CREATE TABLE `trends_uint` (
  `itemid` bigint(20) unsigned NOT NULL,
  `clock` int(11) NOT NULL DEFAULT '0',
  `num` int(11) NOT NULL DEFAULT '0',
  `value_min` bigint(20) unsigned NOT NULL DEFAULT '0',
  `value_avg` bigint(20) unsigned NOT NULL DEFAULT '0',
  `value_max` bigint(20) unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (`itemid`,`clock`)
  ) ENGINE=InnoDB DEFAULT CHARSET=utf8
  以上5张表,数据大表,数据都非常大
  当前zabbix运行了1个月,监控主机陆续增加到600台
  表-1.1月增长
表名
数据量(行)
History
75792140
history_uint
206047753
Events
572100
trends
2949014
trends_uint
6950823

  当zabbix监控主机达到780台后,数据日增长
表-1.2 日增长
表名
数据量(行)
History
10197549
history_uint
17494708
Events
##
trends
303197
trends_uint
544824
  2.2 分区优化
  此次仅对zabbix数据库中几张大表做了分区,包括history,history_uint,trends,trends_uint
  这里没有events,因为这个表有外检,而且数据量相对其他不大,但是会影响web页面刷新,写event计划去删就好了
  分区表的创建,管理都是通过下面的存储过程来维护
DELIMITER $$

/*!50003 CREATE DEFINER=`root`@`%`PROCEDURE `partition_create`(SCHEMANAME varchar(64), TABLENAME varchar(64),PARTITIONNAME varchar(64), CLOCK int)
BEGIN
   /*SCHEMANAME = The DB schema in which to make changes
      TABLENAME = The on which to create PARTITIONNAME
      PARTITIONNAME = The name of the partition to create
      CLOCK = PARTITIONNAME should include values with a clock less thanthis number
   */
    /*
      Verify that the partitiondoes not already exist
   */

    DECLARE RETROWS INT;
    SELECTCOUNT(1)INTO RETROWS
    FROM information_schema.partitions
    WHERE table_schema = SCHEMANAME ANDTABLE_NAME= TABLENAME AND partition_name = PARTITIONNAME;

    IF RETROWS =0THEN
      /*
         1. Print a message indicating thata partition was created.
         2. Create the SQL to create thepartition.
         3. Execute the SQL from #2.
      */
      SELECT CONCAT("create_partition(", SCHEMANAME,",", TABLENAME,",",PARTITIONNAME,",", CLOCK,")")AS msg;
      SET @SQL= CONCAT('ALTER TABLE ', SCHEMANAME,'.', TABLENAME,
      ' ADD PARTITION (PARTITION ', PARTITIONNAME,' VALUES LESS THAN (', CLOCK,'));');
      PREPARE STMT FROM @SQL;
      EXECUTE STMT;
      DEALLOCATE PREPARE STMT;
    ENDIF;
END*/$$
DELIMITER ;

DELIMITER $$

/*!50003 CREATE DEFINER=`root`@`%`PROCEDURE `partition_drop`(SCHEMANAME VARCHAR(64), TABLENAME VARCHAR(64),DELETE_BELOW_PARTITION_DATE VARCHAR(64))
BEGIN
   /*
      SCHEMANAME = The DB schema in which to make changes
      TABLENAME = The table with partitions to potentially delete
      DELETE_BELOW_PARTITION_DATE = Delete any partitions with names thatare dates older than this one (yyyymmdd)
   */
    DECLARE done INTDEFAULTFALSE;
    DECLARE drop_part_name VARCHAR(9);

    /*
      Get a list of all the partitions that are older than the date
      in DELETE_BELOW_PARTITION_DATE.   All partitions are prefixed with
      a "p", so use SUBSTRING TO get rid of that character.
   */
    DECLARE myCursor CURSOR FOR
      SELECT partition_name
      FROM information_schema.partitions
      WHERE table_schema = SCHEMANAME ANDTABLE_NAME= TABLENAME ANDSUBSTRING(partition_name FROM2) CREATE_NEXT_DAYS THEN
            LEAVE create_loop;
      END IF;
   END LOOP;

   SET OLD_CLOCK = DATE_SUB(NOW(),INTERVAL KEEP_HISTORY_DAYS DAY);
   SET OLDER_THAN_PARTITION_DATE = DATE_FORMAT( OLD_CLOCK, '%Y%m%d' );
   CALL partition_drop( SCHEMA_NAME, "history",OLDER_THAN_PARTITION_DATE );
   CALL partition_drop( SCHEMA_NAME, "history_log",OLDER_THAN_PARTITION_DATE );
   CALL partition_drop( SCHEMA_NAME, "history_str",OLDER_THAN_PARTITION_DATE );
   CALL partition_drop( SCHEMA_NAME, "history_text",OLDER_THAN_PARTITION_DATE );
   CALL partition_drop( SCHEMA_NAME, "history_uint",OLDER_THAN_PARTITION_DATE );

   SET OLD_CLOCK = DATE_SUB(NOW(),INTERVAL KEEP_TREND_DAYS DAY);
   SET OLDER_THAN_PARTITION_DATE = DATE_FORMAT( OLD_CLOCK, '%Y%m%d' );
   CALL partition_drop( SCHEMA_NAME, "trends",OLDER_THAN_PARTITION_DATE );
   CALL partition_drop( SCHEMA_NAME, "trends_uint",OLDER_THAN_PARTITION_DATE );
END */$$
DELIMITER ;

DELIMITER $$

/*!50003 CREATE DEFINER=`root`@`%`PROCEDURE `partition_verify`(SCHEMANAME VARCHAR(64), TABLENAME VARCHAR(64))
BEGIN
   DECLARE CUR_CLOCK TIMESTAMP;
   DECLARE PARTITION_NAME VARCHAR(16);
   DECLARE CLOCK INT;   
   DECLARE RETROWS INT;

   SELECT COUNT(1) INTO RETROWS
   FROM information_schema.partitions
   WHERE table_schema = SCHEMANAME AND table_name = TABLENAME ANDpartition_name IS NULL;

   IF RETROWS = 1 THEN
      SET CUR_CLOCK = NOW();
      SET PARTITION_NAME =DATE_FORMAT(CUR_CLOCK, 'p%Y%m%d');
      SET CLOCK =UNIX_TIMESTAMP(DATE_FORMAT(DATE_ADD(CUR_CLOCK ,INTERVAL 1 DAY),'%Y-%m-%d00:00:00'));

      SET @sql = CONCAT("ALTER TABLE", SCHEMANAME, ".", TABLENAME, " PARTITION BYRANGE(`clock`)");
      SET @sql = CONCAT(@sql,"(PARTITION ", PARTITION_NAME, " VALUES LESS THAN (",CLOCK, "));");

      PREPARE STMT FROM @sql;
      EXECUTE STMT;
      DEALLOCATE PREPARE STMT;
   END IF;
   END */$$
DELIMITER ;
  一共包含4个存储过程
·      partition_create :创建新的分区
·      partition_drop:删除指定分区
·      partition_maintenance:分区表维护
·      partition_verify:检查分区可用性
  2.3使用方法
  (1),partition_create
  partition_create(SCHEMANAME varchar(64),TABLENAME varchar(64), PARTITIONNAME varchar(64), CLOCK int)
  Example: CALL partition_create("zabbix", "history","p20131216", 1387267200);
  SCHEMANAME = 操作对象表所在数据库名
  TABLENAME = 需要分区的表名
  PARTITIONNAME = 分区名
  CLOCK = 使用range,所以指定clock大小
  (2),partition_drop
  partition_drop(SCHEMANAME VARCHAR(64),TABLENAME VARCHAR(64), DELETE_BELOW_PARTITION_DATE VARCHAR(64))
  Example: CALL partition_drop("zabbix", "history","20131216");
  SCHEMANAME =操作对象表所在数据库名
  TABLENAME = 需要分区的表名
  DELETE_BELOW_PARTITION_DATE = 删除指定日期(会删除此日期前的所有分区)
  (3),partition_maintenance
  partition_maintenance(KEEP_HISTORY_DAYSINT, KEEP_TREND_DAYS INT, CREATE_NEXT_DAYS INT)
  Example: CALL partition_maintenance(28,730,14);
  KEEP_HISTORY_DAYS = 保留history数据的天数
  KEEP_TREND_DAYS = 保留trend数据的天数
  CREATE_NEXT_DAYS = 提前创建分区的天数
  (4),partition_verify
  partition_verify(SCHEMANAME VARCHAR(64),TABLENAME VARCHAR(64))
  Example: CALL partition_verify("zabbix", "history");
  SCHEMANAME = 操作对象表所在数据库名
  TABLENAME =需要分区的表名
  这里通过另外一个存储过程+event的方式去自动维护分区表
  方案是: 每天凌晨1.30执行,保留各分区1个月数据,每天创建新分区(都是按天来做分区的) ; events表,仅保留5万条记录
  2.4 分区表查询
  例如: 查看history分区信息
  >Show create table history
  CREATE TABLE `history` (
  `itemid` bigint(20) unsigned NOT NULL,
  `clock` int(11) NOT NULL DEFAULT '0',
  `value` double(16,4) NOT NULL DEFAULT '0.0000',
  `ns` int(11) NOT NULL DEFAULT '0',
  KEY`history_1` (`itemid`,`clock`)
  ) ENGINE=InnoDB DEFAULT CHARSET=utf8
  /*!50100 PARTITION BY RANGE (`clock`)
  (PARTITION p20140326 VALUES LESS THAN(1395849600) ENGINE = InnoDB,
  PARTITION p20140327 VALUES LESS THAN(1395936000) ENGINE = InnoDB,
  PARTITION p20140328 VALUES LESS THAN(1396022400) ENGINE = InnoDB,
  PARTITION p20140329 VALUES LESS THAN(1396108800) ENGINE = InnoDB,
  PARTITION p20140330 VALUES LESS THAN(1396195200) ENGINE = InnoDB,
  PARTITIONp20140331 VALUES LESS THAN (1396281600) ENGINE = InnoDB,
  PARTITION p20140401 VALUES LESS THAN(1396368000) ENGINE = InnoDB,
  PARTITION p20140402 VALUES LESS THAN(1396454400) ENGINE = InnoDB,
  PARTITION p20140403 VALUES LESS THAN(1396540800) ENGINE = InnoDB,
  PARTITION p20140404 VALUES LESS THAN(1396627200) ENGINE = InnoDB,
  PARTITION p20140405 VALUES LESS THAN(1396713600) ENGINE = InnoDB,
  PARTITION p20140406 VALUES LESS THAN(1396800000) ENGINE = InnoDB,
  PARTITION p20140407 VALUES LESS THAN(1396886400) ENGINE = InnoDB,
  PARTITION p20140408 VALUES LESS THAN(1396972800) ENGINE = InnoDB,
  PARTITION p20140409 VALUES LESS THAN(1397059200) ENGINE = InnoDB,
  PARTITION p20140410 VALUES LESS THAN(1397145600) ENGINE = InnoDB,
  PARTITION p20140411 VALUES LESS THAN(1397232000) ENGINE = InnoDB,
  PARTITION p20140412 VALUES LESS THAN(1397318400) ENGINE = InnoDB,
  PARTITION p20140413 VALUES LESS THAN(1397404800) ENGINE = InnoDB,
  PARTITION p20140414 VALUES LESS THAN(1397491200) ENGINE = InnoDB,
  PARTITIONp20140415 VALUES LESS THAN (1397577600) ENGINE = InnoDB,
  PARTITION p20140416 VALUES LESS THAN(1397664000) ENGINE = InnoDB,
  PARTITION p20140417 VALUES LESS THAN(1397750400) ENGINE = InnoDB,
  PARTITION p20140418 VALUES LESS THAN(1397836800) ENGINE = InnoDB,
  PARTITION p20140419 VALUES LESS THAN(1397923200) ENGINE = InnoDB,
  PARTITION p20140420 VALUES LESS THAN(1398009600) ENGINE = InnoDB,
  PARTITION p20140421 VALUES LESS THAN(1398096000) ENGINE = InnoDB,
  PARTITION p20140422 VALUES LESS THAN (1398182400)ENGINE = InnoDB,
  PARTITION p20140423 VALUES LESS THAN(1398268800) ENGINE = InnoDB,
  PARTITION p20140424 VALUES LESS THAN(1398355200) ENGINE = InnoDB,
  PARTITION p20140425 VALUES LESS THAN(1398441600) ENGINE = InnoDB)
  查看当前分区数据量
  zabbix>select * frominformation_schema.partitions where TABLE_SCHEMA='zabbix' andTABLE_NAME='history' limit 3\G;
  *************************** 1. row***************************
  TABLE_CATALOG: def
  TABLE_SCHEMA: zabbix
  TABLE_NAME: history
  PARTITION_NAME: p20140326
  SUBPARTITION_NAME: NULL
  PARTITION_ORDINAL_POSITION: 1
  SUBPARTITION_ORDINAL_POSITION: NULL
  PARTITION_METHOD: RANGE
  SUBPARTITION_METHOD: NULL
  PARTITION_EXPRESSION: `clock`
  SUBPARTITION_EXPRESSION: NULL
  PARTITION_DESCRIPTION: 1395849600
  TABLE_ROWS: 62428485
  AVG_ROW_LENGTH: 67
  DATA_LENGTH: 4202496000
  MAX_DATA_LENGTH: NULL
  INDEX_LENGTH: 2451488768
  DATA_FREE: 0
  CREATE_TIME: NULL
  UPDATE_TIME: NULL
  CHECK_TIME: NULL
  CHECKSUM: NULL
  PARTITION_COMMENT:
  NODEGROUP: default
  TABLESPACE_NAME: NULL
  *************************** 2. row***************************
  TABLE_CATALOG: def
  TABLE_SCHEMA: zabbix
  TABLE_NAME: history
  PARTITION_NAME: p20140327
  SUBPARTITION_NAME: NULL
  PARTITION_ORDINAL_POSITION: 2
  SUBPARTITION_ORDINAL_POSITION: NULL
  PARTITION_METHOD: RANGE
  SUBPARTITION_METHOD: NULL
  PARTITION_EXPRESSION: `clock`
  SUBPARTITION_EXPRESSION: NULL
  PARTITION_DESCRIPTION: 1395936000
   TABLE_ROWS: 10197549
  AVG_ROW_LENGTH: 59
  DATA_LENGTH: 611319808
  MAX_DATA_LENGTH: NULL
  INDEX_LENGTH: 434110464
  DATA_FREE: 0
  CREATE_TIME: NULL
  UPDATE_TIME: NULL
  CHECK_TIME: NULL
  CHECKSUM: NULL
  PARTITION_COMMENT:
  NODEGROUP: default
  TABLESPACE_NAME: NULL
  *************************** 3. row***************************
  TABLE_CATALOG: def
  TABLE_SCHEMA: zabbix
  TABLE_NAME: history
  PARTITION_NAME: p20140328
  SUBPARTITION_NAME: NULL
  PARTITION_ORDINAL_POSITION: 3
  SUBPARTITION_ORDINAL_POSITION: NULL
  PARTITION_METHOD: RANGE
  SUBPARTITION_METHOD: NULL
  PARTITION_EXPRESSION: `clock`
  SUBPARTITION_EXPRESSION: NULL
  PARTITION_DESCRIPTION: 1396022400
   TABLE_ROWS: 6893834
  AVG_ROW_LENGTH: 59
  DATA_LENGTH: 410894336
  MAX_DATA_LENGTH: NULL
  INDEX_LENGTH: 293601280
  DATA_FREE: 0
  CREATE_TIME: NULL
  UPDATE_TIME: NULL
  CHECK_TIME: NULL
  CHECKSUM: NULL
  PARTITION_COMMENT:
  NODEGROUP: default
  TABLESPACE_NAME: NULL
  3 rows in set (0.22 sec)
  查看zabbix库中,有哪些表做了分区
  |zabbix>select distinct(TABLE_NAME) frominformation_schema.partitions where TABLE_SCHEMA='zabbix'and PARTITION_NAME is not null;
  +--------------+
  | TABLE_NAME   |
  +--------------+
  | history      |
  | history_log|
  | history_str|
  | history_text |
  | history_uint |
  | trends       |
  | trends_uint|
  +--------------+
  分区表管理
  删除问题:
  Alter table history drop partitionp20140328
  新增分区:
  Alter table history add partition(partition p20140328 value less than 1396022400);
  合并分区:
  Alter table history REORGANIZE PARTITIONp20140325,p20140326,p20140327,p20140328 into (partitionp20140328value less than 1396022400);
  存储过程和event见 :zabbix_sp.sql 和event.sql
  更详细文章见:
  https://www.zabbix.org/wiki/Docs/howto/mysql_partition
  




页: [1]
查看完整版本: zabbix 优化之路