1 Abstract
This paper introduces an effective, common performance test data monitoring and collection tool based on Nagios for Cisco video solutions. The contribution of this paper is a set of Nagios plugins that can be applied to most SPVTG video solution subsystems and that monitor and collect each subsystem's CPU usage, disk I/O, memory usage, and network traffic performance data. All performance data can be shown on the Nagios platform in host/service/graph views. The tool can also be used to automate the creation of performance test case result reports. It has proven to be a very efficient and effective tool for integration and performance testing, and the collected performance data is also very valuable for real customer deployments.
2 Background
In SPVTG we have many video solutions, such as Videoscape, IPnDVR, TSTV for YES, and MOS 2.0. All of them are end-to-end solutions whose subsystems are responsible for video encoding, transcoding, recording, storage, encapsulation, delivery, content management, playback, and so on.
Testing these solutions covers subsystem dev-test, subsystem performance test, solution integration test, and solution performance test. Performance testing in particular, whether at the subsystem or the solution level, requires load and scale tests that stress the subsystems to measure their performance and find their bottlenecks.
2.1 SPVTG popular video solution overview
2.1.1 Videoscape 4.5 solution Architecture
VSR 4.5 presents a migration to an all-IP based video and services infrastructure spanning the full service/content life cycle, from video content and metadata acquisition, through content and metadata preparation, to distribution and delivery to the end user.

Fig 1 VSR4.5 Architecture
As shown in Fig 1, this solution includes the following subsystems:
- Cisco Videoscape Media Suite (VMS, CentOS Linux OS)
- Cisco Media Processor (CMP, Windows 2008/XP OS)
- Cisco Transcoder Manager (CTM, Windows 2008/XP OS)
- Cisco Media Processor Management Control (CMPMC, Windows 2008 OS)
- Cisco Distribution Suite (CDS-IS, Cisco customized Linux OS)
- Magento (CentOS Linux OS)
- Apache origin server (CentOS Linux OS)
- Microsoft IIS origin server (Windows 2008 OS)
In the Videoscape 4.5 performance testbed we scale most subsystems for load and performance testing. We use the VLE (Video Load Emulator) tool to simulate thousands of requests to the subsystems, and we monitor them and collect their performance data.
The VMS subsystem is scaled to 14 nodes, the CTM subsystem to 15 nodes, and the CMP subsystem to 10 nodes. Together with the other subsystems, we have at least 40 nodes in our performance testbed.
 
Fig 2 VSR 4.5 Performance testbed UCS deployment
2.1.2 Media Origination System 2.0 solution Architecture
The Media Origination System provides the media data plane required to record and deliver content to various end points, as well as the origin storage/VOD ingest application required to ingest common format content into the system. The system also includes a set of functions that acquire schedules and manage the recording tasks among a set of recorder resources. More details about the system are described in the subsequent sections.
Fig 3 MOS 2.0 Architecture
This solution includes the following subsystems:
- Cisco Videoscape Media Suite (VMS, CentOS Linux OS)
- Cisco Media Processor (CMP, Windows 2008/XP OS)
- Cisco Transcoder Manager (CTM, Windows 2008/XP OS)
- Cisco Distribution Suite (CDS-IS, Cisco customized CDE Linux OS)
- Record Manager (RM, Cisco customized Linux OS)
- CDE 460/470 (Cisco customized CDE Linux OS)
- VDS-OS (VOS, Cisco customized CDE Linux OS)
- VOSM (Cisco customized CDE Linux OS)
- CDSM (Cisco customized CDE Linux OS)
- Key Management Server (KMS, CentOS Linux OS)
- Service Manager (SM, CentOS Linux OS)
We can see that the MOS 2.0 solution release shares several subsystems with VSR 4.5, such as VMS and CTM. Other video solutions and projects, such as IPnDVR, TSTV for YES, and VSR 5.0.1, also include these subsystems plus additional ones.
2.2 SPVTG video solution performance test challenge
In VSR 4.5 and MOS 2.0 we use the VLE performance test tool to simulate thousands of requests.
In VSR 4.5 performance testing, we use VLE to simulate 10K, 20K, or 50K video playback and transaction requests. At the same time, we need to open a session to each subsystem node and keep monitoring and collecting its CPU/Disk/RAM/network traffic data.
We usually assign two engineers to this job: one runs the test cases while the other collects the data. Moreover, we typically monitor for only 2 hours and estimate a rough average of the CPU/Disk/RAM/network traffic data, so we cannot get accurate performance figures. Sometimes we also run longevity tests, where test cases execute for 24 hours, 48 hours, or even a week; there we cannot get accurate performance data either.
Monitoring and collecting data takes up so much of our time that, together with the tight release schedule, very little time is left for system performance analysis and deeper data analysis.
For performance and integration testing we need to handle all of these types of subsystems, so we really need a common tool to monitor and collect basic system performance data. Such a common tool could be applied to all of our video solution tests as well as to subsystem dev tests.
3 Nagios introduction
3.1 Nagios overview
Nagios is a powerful monitoring system that enables organizations to identify and resolve IT infrastructure problems before they affect critical business processes. Nagios is a host and service monitor designed to inform you of network problems before your clients, end-users or managers do. The monitoring daemon runs intermittent checks on hosts and services you specify using external "plugins" which return status information to Nagios. When problems are encountered, the daemon can send notifications out to administrative contacts in a variety of different ways (email, instant message, SMS, etc.). Current status information, historical logs, and reports can all be accessed via a web browser. The monitoring of IT infrastructure is a highly visible and important component of any company's operations.
Nagios, Cacti, and Zabbix are popular IT monitoring tools. After comparing the advantages and disadvantages of these three tools, we chose Nagios as our performance test data monitoring and collection tool. Nagios' functions and advantages are listed below.
Nagios functions:
- Plan for infrastructure upgrades before outdated systems cause failures
- Respond to issues at the first sign of a problem
- Automatically fix problems when they are detected
- Coordinate technical team responses
- Ensure your organization's SLAs are being met
- Ensure IT infrastructure outages have a minimal effect on your organization's bottom line
- Monitor your entire infrastructure and business processes
Nagios advantages:
- open source
- large community
- many powerful plugins (and own plugins are easy to create: just write a program that prints a one-line string and set a certain return code)
- easy-to-use web frontend
- debugging plugins is moderately simple.
- many thought-out features like host groups or notification options that make your life easier
- dependencies (so that you don't get 100 alerts if a router between the Nagios server and other servers went down)
- nagvis plugin with a great interactive editor that draws nice management-suitable graphs (although the ndo2db interface can be hard to set up at first and a little flaky)
3.2 Why Nagios NRPE can't be used in our video solutions
The NRPE addon is designed to allow you to execute Nagios plugins on remote Linux/Unix machines. The main reason for doing this is to allow Nagios to monitor "local" resources (like CPU load, memory usage, etc.) on remote machines. Since these resources are not usually exposed to external machines, an agent like NRPE must be installed on the remote Linux/Unix machines. NRPE consists of two pieces:
- The check_nrpe plugin, which resides on the local monitoring machine.
- The NRPE daemon, which runs on the remote Linux/Unix machine.
NRPE works well on common Linux distributions such as CentOS and Red Hat. But some of our video solution subsystems, such as VDS-OS and CDS-IS, run Cisco customized Linux.
There are two reasons why we cannot use NRPE to monitor these subsystems:
First, installing NRPE requires compiling software, and the compilation needs basic libraries and packages such as gcc. Installing these prerequisites in turn pulls in further package dependencies. On our Cisco customized and reduced Linux systems, the shell environment is a zzdebugshell with a limited command set and no yum support at all, so we cannot use yum to install the packages and their dependencies.
Second, from a system security point of view, our subsystems may not be allowed to install such packages, because doing so could introduce potential security issues.
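For illustration, a typical NRPE source build on a standard Linux host looks like the steps below (approximate steps from the public NRPE documentation); every step assumes yum, a compiler toolchain, and extra libraries that the Cisco customized/reduced Linux images do not provide:
# Typical NRPE agent build steps on a standard Linux host (illustration only;
# these are exactly the steps that fail on the customized zzdebugshell systems)
yum install -y gcc glibc glibc-common openssl openssl-devel    # needs yum and repositories
tar xzf nrpe-*.tar.gz && cd nrpe-*                             # unpack the NRPE source
./configure                                                    # needs a working build toolchain
make all                                                       # compiles the nrpe daemon and check_nrpe
make install                                                   # installs the agent on the remote host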
4 Design of Performance test data monitor and collection tool based on Nagios
This paper presents a new Nagios plugin mechanism for solution and subsystem performance testing. The mechanism has been used in the VSR 4.6 performance testing project and has demonstrated that it can be applied to most of our SPVTG video solutions. It also solves the problem that NRPE cannot be used in our video solutions.
The advantages of our Nagios plugin mechanism over the NRPE mechanism are as follows:
- The plugin plays a role similar to NRPE, but it can be applied to most SPVTG video solution subsystems, whereas the NRPE mechanism can only be applied to standard Linux systems.
- The plugin is installed on the Nagios server side; the Nagios server takes the active role and uses the Expect tool to interoperate with the video solution systems to monitor and collect the performance data.
4.1 Framework design
In the framework shown in Fig 4 below, the green modules are the plugins we developed and the blue parts are the Nagios platform. Each remote common Linux subsystem needs the Core Plugins for its type deployed on it, and each remote Cisco customized Linux subsystem likewise needs its own type of Core Plugins deployed.
The Nagios server side includes four parts:
1. Nagios Core tool
Nagios Core is the monitoring and alerting engine that serves as the primary application around which hundreds of Nagios projects are built. It serves as the basic event scheduler, event processor, and alert manager for elements that are monitored. It features several APIs that are used to extend its capabilities to perform additional tasks, is implemented as a daemon written in C for performance reasons, and is designed to run natively on Linux/*nix systems.
Fig 4 Design of performance test data monitor and collection tool based on Nagios
2. Nagiosgraph tool
Nagiosgraph parses output and performance data from Nagios plugins and stores the data in RRD files. It then creates graphs and generates HTML pages with graphic reports from that data.
3. Core plugin development for subsystems on video solution
4. Nagios plugin development for Common/Cisco Customized Linux
4.2 Core plugin development for subsystems on video solution
For both common and Cisco customized Linux subsystems, we need to monitor and collect CPU usage, disk I/O, memory usage, and network traffic.
We have five core plugins in total: check_cpu.sh, check_disk.sh, check_ram.sh, check_traffic.sh, and check_netmon.sh.
check_cpu.sh, check_disk.sh, and check_ram.sh can be used on both common Linux and customized Linux subsystems.
check_traffic.sh can only be used on common Linux subsystems. It uses the SNMP protocol to monitor and collect the system's inbound and outbound network traffic, which requires the SNMP service to be configured on the subsystem; the customized Linux subsystems run a minimized Linux with limited functionality, and we are not allowed to configure SNMP there.
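For illustration only (the internals of check_traffic.sh are not shown here), an SNMP traffic check of this kind typically samples the standard IF-MIB octet counters and turns them into a rate; an equivalent manual query, using the same SNMP v2c settings as the execution example below, would be:
# Illustration: read the in/out octet counters of interface index 3 over SNMP v2c.
# A traffic plugin samples these counters periodically and converts them to bit rates.
snmpget -v 2c -c public 100.1.1.51 IF-MIB::ifInOctets.3 IF-MIB::ifOutOctets.3
# Equivalent numeric OIDs: .1.3.6.1.2.1.2.2.1.10.3 (in) and .1.3.6.1.2.1.2.2.1.16.3 (out)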
check_netmon.sh can only be used on customized Linux subsystems. It uses the netmon command on the subsystem to monitor and collect the system's inbound and outbound network throughput.
All five plugins are shell scripts and platform independent, so they can also be executed standalone on a Linux system. The plugin parameters and usage examples are documented in the comment section of each plugin. Each plugin's monitoring output format follows the Nagiosgraph standards, so Nagiosgraph can parse it and display the data as graphs; illustrative output is shown after the execution examples below.
You can download these core plugins from the following link:
\\10.74.15.187\ftp_map_to samba\tstv-nagios\nagios-core-scripts
Script execution examples:
./check_cpu.sh -w 80 -c 90
./check_disk.sh -d sda -w 1000,1000,1000 -c 3000,3000,3000
./check_ram.sh 20 10
./check_netmon.sh 100 200
./check_traffic.sh -V 2c -C public -H 100.1.1.51 -I 3 -w 10000,30000 -c 40000,60000 -K -B
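The output below is illustrative only (the values and labels are hypothetical), but it shows the standard Nagios plugin output convention that our core plugins and Nagiosgraph rely on: a one-line status text, then a '|' separator, then performance data in the form label=value;warn;crit;min;max:
./check_cpu.sh -w 80 -c 90
OK - CPU usage 12% | cpu=12%;80;90;0;100
./check_ram.sh 20 10
OK - free memory 35% | ram_free=35%;20;10;0;100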
4.3 Nagios plugin development for Common/Cisco Customized Linux
Nagios plugins are standalone extensions to Nagios Core that provide low-level intelligence on how to monitor anything and everything with Nagios Core. Plugins operate as standalone applications, but are generally designed to be executed by Nagios Core.
Plugins process command-line arguments, go about the business of performing a specific type of check, and then return the results to Nagios Core for further processing. Plugins can either be compiled binaries (written in C, C++, etc) or executable scripts (shell, Perl, PHP, etc).
Since some of our video solution subsystems run customized Linux that cannot support building and running programs in languages like C or C++, we chose Bash shell scripts for our Nagios plugin development.
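As background for the plugin code discussed in this section, a Nagios plugin only has to print a one-line status (optionally followed by '|' and performance data) and exit with the standard return codes 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN. The Bash sketch below is a simplified illustration in that spirit; it is not the actual check_cpu.sh:
#!/bin/bash
# Simplified Nagios plugin sketch (illustration only, not the real check_cpu.sh):
# print one status line with performance data and exit with the standard code.
WARN=${1:-80}; CRIT=${2:-90}
# Hypothetical measurement: take the idle column from vmstat and convert to usage.
IDLE=$(vmstat 1 2 | tail -1 | awk '{print $15}')
USAGE=$((100 - IDLE))
PERF="cpu=${USAGE}%;${WARN};${CRIT};0;100"
if [ "$USAGE" -ge "$CRIT" ]; then
echo "CRITICAL - CPU usage ${USAGE}% | $PERF"; exit 2
elif [ "$USAGE" -ge "$WARN" ]; then
echo "WARNING - CPU usage ${USAGE}% | $PERF"; exit 1
else
echo "OK - CPU usage ${USAGE}% | $PERF"; exit 0
fi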
In Nagios we define hosts, and each host has services to run. In our scenario, a service monitors CPU usage (or disk I/O, memory, or network traffic) on one remote host; a minimal host definition is sketched after Fig 5.
Fig 5 Nagios service, command, and plugin relation
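For context, a host is declared to Nagios Core with a host definition like the hypothetical one below (the template name and alias are placeholders; the address is our Nagios server from section 6). The service definitions later in this section attach to a host through its host_name field:
define host {
use linux-server ; standard host template (placeholder)
host_name tb1-nagios.vsr.com
alias TB#1 Nagios server
address 100.1.0.200
}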
For our Nagios plugin development for common/Cisco customized Linux, we design each data monitor in two parts.
The first part uses the Expect tool. Expect is a tool for automating interactive applications such as ssh, telnet, ftp, passwd, fsck, rlogin, tip, etc. This part is an Expect script that lets the Nagios server automatically interoperate with the remote subsystems: it calls the Core plugin on the remote system, collects the result, and sends it back to the Nagios server. This part differs between common and Cisco customized Linux, since their SSH login processes are different.
For common Linux, our Expect plugin uses Expect to ssh to the remote common Linux subsystems automatically. In this automated session, the plugin calls the Core plugins (check_cpu.sh, check_disk.sh, check_ram.sh, check_traffic.sh) that reside on the remote subsystems and collects their performance data back to the Nagios server.
The second part is a shell script that calls the first part and collects the performance data it returns. It follows the Nagios plugin writing standards and is the same for both common and Cisco customized Linux. A simplified sketch of the two parts follows.
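The sketch below shows the two parts combined into one wrapper script. The host, credentials, and remote plugin path are placeholders, and the handling of the different SSH prompts on common versus customized Linux is omitted; the real plugins deal with those differences:
#!/bin/bash
# Part 1 (simplified): use Expect to ssh to the remote node and run the Core plugin there.
# REMOTE_HOST/REMOTE_USER/REMOTE_PASS and the remote plugin path are placeholders.
REMOTE_HOST=100.1.1.51; REMOTE_USER=root; REMOTE_PASS=secret
RESULT=$(expect -c "
set timeout 30
spawn ssh $REMOTE_USER@$REMOTE_HOST /opt/nagios-core-scripts/check_cpu.sh -w 80 -c 90
expect {
\"*assword:\" { send \"$REMOTE_PASS\r\"; exp_continue }
eof
}
" | grep -E 'OK|WARNING|CRITICAL|UNKNOWN' | tail -1)
# Part 2 (simplified): re-emit the Core plugin's one-line result and map it to the
# standard Nagios exit codes so Nagios Core and Nagiosgraph can process it.
echo "$RESULT"
case "$RESULT" in
OK*) exit 0 ;;
WARNING*) exit 1 ;;
CRITICAL*) exit 2 ;;
*) exit 3 ;;
esac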
The screenshot below shows that our Nagios server needs to monitor 21 common/customized Linux remote systems.
Each remote system has four types of Nagios plugins: CPU usage, disk I/O, memory usage, and network traffic.
Each plugin can also be executed independently. In the screenshot above, the plugin's result is the same as the Core plugin execution result shown earlier; the difference is that here it is executed on the Nagios server, which interoperates with the remote Linux systems through the plugin, while in the screenshot below the scripts are executed directly on the remote Linux system.
Below is a sample Nagios command definition for our Nagios plugins.
###################################################################
###############TB#1 Oracle CPU Usage monitor#######################
###################################################################
define command {
command_name check_cpu_100.1.1.51
command_line /opt/nagios-scripts/remote-monitor-hosts/100.1.1.51-scripts/call_remote_check_cpu.sh
}
###################################################################
###############TB#1 Oracle Disk I/O monitor########################
###################################################################
define command {
command_name check_disk_100.1.1.51
command_line /opt/nagios-scripts/remote-monitor-hosts/100.1.1.51-scripts/call_remote_check_disk.sh
}
###################################################################
###############TB#1 Oracle Memory monitor##########################
###################################################################
define command {
command_name check_ram_100.1.1.51
command_line /opt/nagios-scripts/remote-monitor-hosts/100.1.1.51-scripts/call_remote_check_ram.sh
}
###################################################################
###############TB#1 Oracle Network traffic monitor#################
###################################################################
define command {
command_name check_traffic_100.1.1.51
command_line /opt/nagios-scripts/remote-monitor-hosts/100.1.1.51-scripts/check_traffic.sh -V 2c -C public -H 100.1.1.51 -I 3 -w 10000,30000 -c 40000,60000 -K -B
}
Below is a sample Nagios service definition for our Nagios plugins. These services call the commands defined above.
###################################################################
###############TB#1 Oracle CPU monitor#############################
###################################################################
define service{
use local-service,graphed-service
host_name tb1-nagios.vsr.com
service_description tb1-oracle1 CPU Usage Monitor
check_command check_cpu_100.1.1.51
}
define service{
use local-service,graphed-service
host_name tb1-nagios.vsr.com
service_description tb1-oracle2 CPU Usage Monitor
check_command check_cpu_100.1.1.61
}
###################################################################
###############TB#1 Oracle Disk I/O monitor########################
###################################################################
define service{
use local-service,graphed-service
host_name tb1-nagios.vsr.com
service_description tb1-oracle1 Disk I/O Monitor
check_command check_disk_100.1.1.51
}
###################################################################
###############TB#1 Oracle Memory monitor##########################
###################################################################
define service{
use local-service,graphed-service
host_name tb1-nagios.vsr.com
service_description tb1-oracle1 Memory Monitor
check_command check_ram_100.1.1.51
}
###################################################################
###############TB#1 Oracle Network Traffic monitor#################
###################################################################
define service{
use local-service,graphed-service
host_name tb1-nagios.vsr.com
service_description tb1-oracle1 Network traffic Monitor
check_command check_traffic_100.1.1.51
}
4.4 Automation of Performance test case result report
When we do performance testing, we have the following three requirements:
- We need to collect CPU/disk/memory/network traffic data for a specific duration for the video solution subsystems. Currently we use Nagiosgraph to display the performance output; it can display daily/weekly/monthly/yearly performance data as graphs on the Nagios web UI, but it cannot show a graph for an arbitrary user-defined duration.
- A video solution performance testbed may have at least 30 remote nodes, with each subsystem scaled to simulate a customer deployment scenario. Showing a performance graph per node would produce 120 graphs (each node has 4 graphs for CPU, disk, memory, and network traffic), which is far too many. We also want nodes of the same module to appear in the same graph; in short, we want to show several nodes' performance data in one graph, but in Nagiosgraph each graph is tied to a single remote node.
- We need the CPU/disk/memory/network graphs to be saved into an Excel file.
We designed a new plugin for these requirements based on rrdtool. rrdtool is a very powerful tool for storing time-series data and drawing graphs, and Nagiosgraph itself uses rrdtool to store data and render graphs.
In the screenshot below of the Nagiosgraph rrd directory, each node has 4 rrd files for CPU/disk/memory/network traffic, and Nagiosgraph displays each rrd file as one graph.
Our plugin mainly uses the "rrdtool graph" command to access these rrd files, apply our customizations, render them as graphs, and save them as JPEG images in a folder.
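As an illustration (the RRD paths, data source names, and time stamps below are placeholders; the real ones depend on the Nagiosgraph rrd directory and map rules), the plugin builds "rrdtool graph" commands along these lines to draw several nodes of the same module into one image for an exact test window:
# Illustration: draw the CPU usage of two VMS nodes for one exact test window
# (start/end are epoch seconds of the test run) into a single image file.
rrdtool graph /opt/perf-report/vms-cpu.png \
--start 1400551200 --end 1400558400 \
--title "VMS nodes CPU usage" --vertical-label percent \
DEF:cpu1=/var/nagiosgraph/rrd/vms-node1/CPU___cpu.rrd:data:AVERAGE \
DEF:cpu2=/var/nagiosgraph/rrd/vms-node2/CPU___cpu.rrd:data:AVERAGE \
LINE1:cpu1#FF0000:"vms-node1" \
LINE1:cpu2#0000FF:"vms-node2"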
For the automation of performance test result report creation, we have another plugin that uses a Microsoft Excel macro to put the above JPEG graphs into a single Excel document.
5 Installation and configuration documentation
We have the following five wikis for setting up a Nagios monitoring system for video solutions. They include detailed installation, configuration, and troubleshooting information.
Nagios Installation Guide
http://zed.cisco.com/confluence/display/SPVTG/Nagios+Installation+Guide
Nagiosgraph Installation and Configuration
http://zed.cisco.com/confluence/display/SPVTG/Nagiosgraph+Installation+and+Configuration
Nagios performance monitor for common linux OS subsystem setup
http://zed.cisco.com/confluence/display/SPVTG/Nagios+performance+monitor+for+common+linux+OS+subsystem+setup
Nagios performance monitor for customized linux OS subsystem setup
http://zed.cisco.com/confluence/display/SPVTG/Nagios+performance+monitor+for+customized+linux+OS+subsystem+setup
Nagios performance monitor for Windows OS subsystem setup
http://zed.cisco.com/confluence/display/SPVTG/Nagios+performance+monitor+for+Windows+OS+subsystem+setup
6 Demo
Nagios server login URL: http://100.1.0.200/nagios/
Go to Current Status -> Services; you can see that every node has 4 services: CPU usage, Disk I/O, Memory, and Network traffic.

Fig 6 Nagios monitoring main page

Fig 7 Nagios CPU usage monitoring service detail page

Fig 8 Nagios Disk I/O monitoring service detail page

Fig 9 Nagios Memory usage monitoring service detail page

Fig 10 Automation of Performance test case result report sample 1

Fig 11 Automation of Performance test case result report sample 2

Fig 12 Automation of Performance test case result report sample 3

Fig 13 Automation of Performance test case result report sample 4
7 Next steps: performance test data monitoring and collection at the service level
Our tool can now be used on most SPVTG video solution subsystems to monitor and collect basic performance data, including CPU usage, disk I/O, memory usage, and network traffic. As the next step, we need to develop Nagios plugins that monitor the resource usage and running status of each subsystem's key services and processes, and that report and alert on exceptions. With that we can get more accurate performance data and evaluate our systems better. Service providers have the same requirements for this data and reporting, which is very important for real customer deployments. This new tool will also give integration and performance testers more time to analyze system performance and bottlenecks.
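As a starting point for that work, a per-process check can reuse the same plugin pattern described in section 4. The sketch below is an illustration only (the process name, thresholds, and perfdata labels are placeholders); it reports the CPU and memory consumed by one key process and goes CRITICAL if the process is not running:
#!/bin/bash
# Sketch for the next step (illustration only): report CPU/memory of one key process.
PROC=${1:-httpd}    # placeholder process name
read CPU MEM <<< "$(ps -C "$PROC" -o %cpu=,%mem= | awk '{c+=$1; m+=$2} END {print c, m}')"
if [ -z "$CPU" ]; then
echo "CRITICAL - process $PROC is not running"
exit 2
fi
echo "OK - $PROC cpu=${CPU}% mem=${MEM}% | ${PROC}_cpu=${CPU}%;; ${PROC}_mem=${MEM}%;;"
exit 0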
8 References
http://www.nagios.org/projects/nagioscore
http://nagiosgraph.sourceforge.net/
http://www.nagios.org/about/overview
https://workaround.org/try-zabbix