
[Experience sharing] [Repost] KVM scalability and consolidation ratio: cache none vs cache writeback


Posted on 2015-4-10 16:39:44
  http://www.ilsistemista.net/index.php/virtualization/43-kvm-scalability-and-consolidation-ratio-cache-none-vs-cache-writeback.html?limitstart=0

  Over the last ten years, full-virtualization technologies have gained much traction. While this has sometimes led to an excessive proliferation of virtual machines, the key concept is very appealing: as CPU performance and memory capacity relentlessly grow over time, why not use this ever-increasing power to consolidate multiple operating system instances on a single, powerful server?
  If done correctly (i.e. without an unnecessary growth in the total number of OS instances), this consolidation process brings considerably lower operating costs, in terms of both electricity and maintenance/administration.
  However, in order to extract good performance from virtual machines, it is imperative to correctly size the virtualization host: the CPU, disk, memory and network subsystems should all be capable of sustaining the expected average workload, with some headroom for the inevitable usage peaks.
  Usually, the most stressed component in a virtualized environment is the I/O subsystem, especially considering the very slow random read/write speed offered by mechanical disks. As covered in previous articles, KVM gives you the choice to enable host OS caching on the image file or LVM volume backing a VM's virtual disk.
  As recent Qemu versions honor write barrier requests and pass them down to the host stack, using a write-back strategy is a real option. Sure, it is not a silver bullet: there are cases where a write-back cache can be a problem, for example in scenarios involving live migration (currently, libvirt/qemu advise you not to use a writeback cache during live migration, or data corruption may happen). However, in many cases it is an appropriate choice.
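  For readers who manage guests through libvirt, the cache mode is selected per disk with the cache attribute of the disk's driver element. The following is only a minimal sketch that generates such an element for an LVM-backed VirtIO disk; the volume path /dev/vg_kvm/guest1 and the helper name disk_xml are hypothetical and not taken from the benchmark setup.

```python
import xml.etree.ElementTree as ET

# Cache modes accepted by the cache attribute of libvirt's <driver> element.
CACHE_MODES = {"default", "none", "writethrough", "writeback", "directsync", "unsafe"}


def disk_xml(source_dev: str, cache: str, target_dev: str = "vda") -> str:
    """Return a libvirt <disk> element for a block-device-backed VirtIO disk."""
    if cache not in CACHE_MODES:
        raise ValueError(f"unknown cache mode: {cache}")
    disk = ET.Element("disk", type="block", device="disk")
    # cache="none" bypasses the host cache, cache="writeback" enables it.
    ET.SubElement(disk, "driver", name="qemu", type="raw", cache=cache)
    ET.SubElement(disk, "source", dev=source_dev)
    ET.SubElement(disk, "target", dev=target_dev, bus="virtio")
    return ET.tostring(disk, encoding="unicode")


# Hypothetical logical volume, used only for illustration.
print(disk_xml("/dev/vg_kvm/guest1", "writeback"))
```

  Switching the second argument between "none" and "writeback" corresponds to the two configurations compared in this article.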
  So, let's get straight to the point: how does the KVM cache setting affect VM performance, host resource usage and consolidation ratio?
What we are looking for, and when using a cache is not appropriate

  In this article we are going to evaluate whether, and by how much, different caching settings influence KVM performance, to the point where the consolidation ratio is affected. To answer this question, we will collect performance data from both the guests and the host machine.
  In a previous article, I explained why using a write-back cache is quite safe now. Basically, Qemu/KVM honors any flush operation issued by the guest, so if a guest writes sensitive data and issues a flush, it can be certain that the data has hit the physical disk platters.
  However, let me state very clearly that in some circumstances you should not use a write-back cache.
  The three most common reasons not to use a writeback cache are:

  • one or more guests don't support write barriers (which the host uses to decide when to flush its cache)
  • you need to live-migrate VMs between multiple hosts (currently libvirt warns you not to use live migration together with caching, or data corruption may happen)
  • your workload is so cache-unfriendly that the nocache option is the better-performing configuration

  Point 1 can be verified simply by looking at your guests: in a modern operating system write barriers are certainly supported, if not already enabled by default. For example, Win2000 and later automatically issue a cache flush operation each second, while EXT3-based Linux distributions often need barriers to be enabled explicitly with the “barrier=1” mount option.
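  Inside a Linux guest, a quick way to check point 1 is to look at the mount options of its EXT3 filesystems. The sketch below parses text in the /proc/mounts format and lists EXT3 mount points whose options do not explicitly enable barriers. It assumes that an enabled barrier is visible in the option string, and the sample lines are made up for illustration.

```python
def barriers_enabled(mount_options: str) -> bool:
    """Return True if the option string explicitly enables write barriers."""
    enabled = False
    for opt in mount_options.split(","):
        if opt in ("barrier", "barrier=1"):
            enabled = True
        elif opt in ("nobarrier", "barrier=0"):
            enabled = False
    return enabled


def ext3_mounts_without_barriers(mounts_text: str) -> list:
    """Return mount points of ext3 filesystems lacking an explicit barrier option."""
    missing = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts format: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[2] == "ext3":
            if not barriers_enabled(fields[3]):
                missing.append(fields[1])
    return missing


# Made-up sample in /proc/mounts format (hypothetical devices and mount points).
SAMPLE = """\
/dev/vda1 / ext3 rw,relatime,errors=continue,barrier=1,data=ordered 0 0
/dev/vda2 /var ext3 rw,relatime,errors=continue,data=ordered 0 0
/dev/vda3 /home ext4 rw,relatime 0 0
"""

print(ext3_mounts_without_barriers(SAMPLE))
```

  On a real guest, the same function can be fed the content of /proc/mounts instead of the sample text.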
  Regarding live migration, it is a matter of thinking about your requirements; in most environments, it is not used.
  Point 3 can be verified only through extensive testing; however, caching is often beneficial to a wide range of workloads, so it is reasonably safe to assume that it will increase performance rather than decrease it.
  I want to stress that I am not advocating always using caching. As stated above, there are reasonable use cases in which you should not use the OS cache. Still, if caching brings consistent and noticeable performance improvements in the general case, it may be worth using.
A small detour: buffering vs caching

  While this page is not strictly needed to understand the article, I feel it is important for clarifying the terminology.
  As you probably know, modern operating systems tend to cache data aggressively, using as much unused memory as they have at their disposal.
  Indeed, issuing the “free” command on a Linux terminal shows something similar to this:
             total         used           free        shared        buffers       cached
Mem:       7801072       210632        7590440             0           2296         23004
-/+ buffers/cache:       185332        7615740
Swap:      8388600            0        8388600


  Notice the “buffers” and “cached” columns: what do they mean?
  First we should note that, in Linux, every time you write to a file you are using the pagecache, while when you write to a block device (e.g. a logical volume), bypassing the filesystem, you are using buffering.
  Unsurprisingly, the “cached” column represents the filesystem blocks cached for later reuse (it is the so-called “pagecache”). When you read something from a file, its content ends up not only in your application, but in the pagecache as well. When you write something, your writes generally land first in the pagecache and hit the disks only after some seconds.
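  The “-/+ buffers/cache” row of the output above is plain arithmetic on the “Mem:” row: buffers and cached memory are subtracted from “used” and added to “free”. A small sketch, using the numbers from the listing (the function name is my own):

```python
def without_buffers_and_cache(used: int, free: int, buffers: int, cached: int):
    """Reproduce the '-/+ buffers/cache' row from the 'Mem:' row (values in KB)."""
    used_by_applications = used - buffers - cached
    free_for_applications = free + buffers + cached
    return used_by_applications, free_for_applications


# Values taken from the "Mem:" row shown above.
print(without_buffers_and_cache(used=210632, free=7590440, buffers=2296, cached=23004))
# prints (185332, 7615740), matching the "-/+ buffers/cache" row
```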
  What about the “buffers” column? Buffers are closely related to caching, but serve a somewhat different purpose: while a cache explicitly retains data until it becomes stale or is forcibly flushed, a buffer retains data only for the smallest amount of time needed to transfer it efficiently to/from the backing device. In other words, buffers are a necessity imposed by hardware constraints: for example, as small disk transfers are quite inefficient, a buffer accumulates smaller writes until the backing block device is closed. At that point, it flushes the data to the backing device and releases the memory allocated for buffering. In contrast, a cache accumulates writes up to a much higher threshold, basically ignoring the close syscall, and, in order to improve future reads, it keeps a local copy of the written data even after it has been flushed to the backing device.
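  The distinction can be illustrated with a toy model. This is in no way how the Linux kernel implements buffers or the pagecache; the class names are invented and the code only mirrors the behaviour described above: the buffer forgets its data when the device is closed, while the write-back cache keeps a copy after flushing and can serve later reads from memory.

```python
class Device:
    """Stand-in for a slow backing device; counts the writes that reach it."""

    def __init__(self):
        self.blocks = {}
        self.writes = 0

    def write(self, block_no, data):
        self.blocks[block_no] = data
        self.writes += 1


class Buffer:
    """Accumulates writes until close(), then flushes and releases them."""

    def __init__(self, dev):
        self.dev = dev
        self.pending = {}

    def write(self, block_no, data):
        self.pending[block_no] = data

    def close(self):
        for block_no, data in self.pending.items():
            self.dev.write(block_no, data)
        self.pending.clear()  # memory is released, nothing is kept for reads


class WriteBackCache:
    """Accumulates writes, flushes on demand and keeps copies for later reads."""

    def __init__(self, dev):
        self.dev = dev
        self.data = {}
        self.dirty = set()

    def write(self, block_no, data):
        self.data[block_no] = data
        self.dirty.add(block_no)

    def flush(self):
        for block_no in sorted(self.dirty):
            self.dev.write(block_no, self.data[block_no])
        self.dirty.clear()  # the copies stay in self.data

    def read(self, block_no):
        if block_no not in self.data:  # miss: fetch from the device
            self.data[block_no] = self.dev.blocks[block_no]
        return self.data[block_no]


dev = Device()
cache = WriteBackCache(dev)
cache.write(0, b"old")
cache.write(0, b"new")  # overwrites the dirty block before it reaches the device
cache.flush()
print(dev.writes, cache.read(0))
```

  Two guest writes to the same block reach the device as a single write, and the subsequent read is served from memory.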
  As my KVM setup uses LVM-based virtual disks, you may wonder why, in the rest of the article, I speak about “cache” and not “buffer”. The point is that buffers can effectively be used as a long-term cache. Remember what I wrote above? Buffers are flushed when the underlying device is closed via a close() syscall. This means that if we don't close the device, the buffers remain in place and data can be written to and read from them directly, rather than from the device. This is precisely what happens with Qemu/KVM on top of an LVM-based disk: while the qemu process is running, the buffers retain both read and written blocks, acting as a true cache. It is worth noting that a simple VM reboot will not drain the buffers, as the qemu process is still running. In order to completely discard the accumulated buffers, you have to shut down the VM (or kill the qemu process running it). This is the only significant difference between a buffer-backed LVM virtual disk and a real pagecache-backed file-based virtual disk: in the latter case, a shutdown will not drain the cache, so a subsequent VM start benefits from the old, still valid, data in cache.
  In short: while I am using LVM-based virtual disks that are not strictly pagecache-backed, the current Linux buffer implementation is, in this case, very similar to a classical cache. But if they behave the same, why spend an entire page arguing about the terminology? The answer is simple: historically, the pagecache was somewhat more CPU-hungry than buffering. The difference is very small, to the point that with a modern CPU it is negligible, but I am quite picky when describing benchmark results :)
Testbed and methods

  Benchmarks were performed on a system equipped with:

  • PhenomII 940 CPU (4 cores @ 3.0 GHz, 1.8 GHz Northbridge and 6 MB L3 cache)
  • 8 GB DDR2-800 DRAM (in unganged mode)
  • Asus M4A78 Pro motherboard (AMD 780G + SB700 chipset)
  • 4x 500 GB hard disks (1x WD Green, 3x Seagate Barracuda) in AHCI mode, configured in software RAID10 "near" layout
  • OS: CentOS 6.5 x64

  The operating system was installed with the “basic server” profile, and then I selectively installed the other software required (libvirtd, qemu, etc). Key system software was:

  • kernel-2.6.32-431.1.2.0.1.el6.x86_64
  • qemu-kvm-0.12.1.2-2.415.el6_5.3.x86_64
  • libvirt-0.10.2-29.el6_5.2.x86_64

  To test KVM in a true multi-guest environment, I created a basic “tile” of four guests, each with VirtIO drivers in place (for both disk and network devices). A tile is composed of:

  • two (#1 and #2) Windows Server 2012 R2 64 bit guests, each with 1 GB RAM and 32 GB disk
  • two CentOS 6.5 x86_64 guests (#3 and #4), each with 512 MB RAM and 8 GB disk.

  All virtual machines use LVM-based disks, carved out of a dedicated volume group.
  Inside the tile, each VM has the following role:

  • the two (#1 and #2) Windows Server 2012 R2 64 bit guests act as fileservers. The first Win2012 guest copies, via SMB/CIFS, a ~670 MB directory (with over 44,500 files) to the second Win2012 guest. After 30 seconds of idling, it copies the directory back from the peer Win2012 machine.
  • the first CentOS 6.5 x86_64 guest (#3) acts as a dynamic web and email server (using apache, mysql and postfix). This machine serves a Joomla 3.2.5 site. At the same time, it runs a postgresql benchmark (sysbench) against the fourth guest, issuing a total of 10,000 transactions with a concurrency level of 4.
  • the second CentOS 6.5 x86_64 guest (#4) acts as a pure database server (using postgresql). This guest is benchmarked by the previous CentOS VM, and at the same time it runs AB (apache benchmark) against that VM to stress it, issuing 2,000 requests with a concurrency level of 4. Moreover, it runs a shell script generating 100 batches of 5 emails, each of ~56 KB, with a one-second wait between batches. Finally, it sends 100 of those emails in a single, big burst. In total, it moves about 34 MB of data.
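  As a sanity check on the email workload, the stated total follows from the batch parameters, assuming that the final burst of 100 emails comes in addition to the 100 batches and that each email is exactly 56 KB (the article only gives an approximate size):

```python
BATCHES = 100
EMAILS_PER_BATCH = 5
BURST_EMAILS = 100
EMAIL_SIZE_KB = 56  # approximate size quoted in the article


def mail_volume():
    """Return (number of emails, total size in KB) for one benchmark run."""
    emails = BATCHES * EMAILS_PER_BATCH + BURST_EMAILS
    return emails, emails * EMAIL_SIZE_KB


emails, total_kb = mail_volume()
print(emails, "emails,", total_kb, "KB")  # 600 emails, 33600 KB
```

  That is roughly 33.6 MB, in line with the “about 34 MB” figure.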

  In short, those VMs benchmark each other. This not only ensures that the test is self-contained, without externally induced variables, but also stresses the internal virtual network switch created by Qemu.
  I fully understand that I am testing very specific scenarios, so let me know what you, the reader, think about that. Do you want a more web-specific test case? Is your focus on database performance? Or is fileserver speed all that matters to you?
  Let me know your ideas!
Total benchmarks run time

  This article focuses on how well the host machine manages an ever-increasing number of virtual machines. In order to present realistic results, I ran the benchmark using 1, 2 or 3 tiles (4, 8 or 12 VMs).
  The first graph depicts total wall-clock run time, i.e. how much time a complete benchmark run needs:
DSC0000.gif
  This first result is eloquent: enabling the write-back cache translates into much lower execution time, to the point that a 3-tile setup (12 virtual machines) performs better than a 2-tile setup (8 virtual machines) without caching.
  But where does the wb-enabled case gain the most?
DSC0001.gif
  As you can see, the filecopy benchmark is the one that speeds up the most. This was expected: apache benchmark is basically CPU-bound, while sysbench's complex test is fsync-write bound, a situation where a writeback cache is of little help. Still, the increased filecopy speed is a very nice bonus.
  Did you notice how emails seem to take basically no time? This depends on how the SMTP protocol works: even when overloaded by other activities, postfix tries hard to queue all incoming emails for later delivery. This delayed delivery phase is not directly timed, but it is another source of fsync-heavy writes.

Host scaling: CPU

  How does the host respond to the ever-increasing load? Let's start with CPU scaling:
DSC0002.gif
  At first, it seems that the write-back cache takes a noticeable toll on CPU performance: even accounting for the increased speed, total CPU load is quite high.
  However, a deeper analysis shows that the increased CPU load is really due to increased WAIT time, which is the time the CPU spends waiting for the I/O subsystem to catch up.
  Hey, wait a moment (no pun intended): why does enabling the wb cache actually result in increased disk WAIT time? The fact is that the writeback cache enables much more parallelism in the I/O stack, so the disks work much harder and more threads are concurrently marked as “executable” by the scheduler.
  While the CPU is waiting for I/O to complete, the process requesting the I/O operation is blocked, but the CPU is free to execute another thread. This means that the real CPU load should be obtained by summing the USER and SYS loads, and in this case we see only a very mild increase in CPU load between the nocache and wb-cache scenarios (perfectly justified by the increased performance).
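  To make the USER + SYS reasoning concrete, here is a minimal sketch that takes two samples of the aggregate “cpu” line from /proc/stat and reports the share of time spent in user+system separately from the share spent in iowait. The two sample lines are invented for illustration and are not measurements from this benchmark.

```python
# First eight counters of the aggregate "cpu" line in /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line: str) -> dict:
    """Parse the aggregate 'cpu' line of /proc/stat into named counters."""
    parts = line.split()
    if parts[0] != "cpu":
        raise ValueError("not the aggregate cpu line")
    return dict(zip(FIELDS, (int(v) for v in parts[1:1 + len(FIELDS)])))


def cpu_shares(before: dict, after: dict) -> dict:
    """Return user+system and iowait as fractions of the elapsed CPU time."""
    delta = {name: after[name] - before[name] for name in FIELDS}
    total = sum(delta.values())
    return {
        "user_plus_sys": (delta["user"] + delta["system"]) / total,
        "iowait": delta["iowait"] / total,
    }


# Invented samples: most of the non-idle time is spent waiting for I/O.
t0 = parse_cpu_line("cpu 100 0 50 800 50 0 0 0")
t1 = parse_cpu_line("cpu 200 0 100 1400 300 0 0 0")
print(cpu_shares(t0, t1))
```

  In this made-up interval the CPU looks 40% “loaded” if iowait is counted, but only 15% of the time is actually spent executing code.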
  In short: enabling the write-back cache is a non-issue from a CPU standpoint, at least when using buffered I/O (as is the case with direct access to LVM-based disks).  

Host scalability: disks load

  We previously stated that enabling the write-back cache led to increased disk performance. The following chart supports that claim:
DSC0003.gif
  The wb-enabled tiles have disk utilization similar to the no-cache ones, but they provide superior speed: if we normalize for performance, the write-back cache provides much higher efficiency.
  Let's spend a few words on the increased average access time (await). What happens here? The answer is simple: as the write-back cache enables more I/O threads to be concurrently active, total throughput is higher, but at the same time the average single-request access time grows.
  Don't let await scare you: when the cache is disabled, the running threads enjoy lower access times, but you have a lower total number of active I/O threads. If your application depends on multiple, concurrent I/O write operations, it will be condemned to executing many of them serially, leading to the perception of a slower system. Enabling the wb-cache gives your application a real chance to execute multiple concurrent I/O writes, and the host system can even coalesce some of them.
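  The trade-off between await and throughput can be sketched with Little's law (requests in flight = throughput × average time per request). The numbers below are hypothetical and only show that a higher await is compatible with a higher total throughput when more requests are in flight.

```python
def throughput_iops(requests_in_flight: float, await_ms: float) -> float:
    """Little's law rearranged: throughput = requests in flight / average latency."""
    return requests_in_flight * 1000.0 / await_ms


# Hypothetical figures, not taken from the charts in this article.
serial = throughput_iops(requests_in_flight=1, await_ms=10)
parallel = throughput_iops(requests_in_flight=8, await_ms=40)
print(serial, parallel)  # 100.0 200.0
```

  In this example each request takes four times longer, yet the disks complete twice as many requests per second.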
  So, while for some specific, controlled, low-latency workloads the nocache configuration can be the better choice, generally the write-back cache is the preferred one.
  Detailed disk perf data:
DSC0004.gif
  The wb-enabled cases show much higher read and write speeds, indeed.  

Host scalability: memory utilization

  Does using some RAM for caching lower total free memory? Let's check:
DSC0005.gif
  This chart really needs an explanation. The “MEM (w/cache)” line shows total system memory usage, while the “MEM (w/out cache)” line shows real memory utilization. The key concept here is that for Linux (and the same goes for other modern OSes) the memory used for caching and/or buffering is not really marked as “used memory”, as it can be readily freed at any time.
  So, real memory utilization is depicted by the red line. Looking at this line, you can see that the wb-cache configuration is no more memory-hungry than the nocache one.
  In other words: Linux only uses otherwise unused memory for caching and, when an application requires more memory, it immediately drops cached data to give the application what it asked for.
  The careful reader should have a question now: how is it possible for an 8 GB host machine to happily run as many as 12 virtual machines, for a total estimated guest memory usage of over 9 GB, without heavy swapping? The answer lies in a very useful Linux feature called KSM (kernel samepage merging). KSM enables the host system to periodically check for duplicate memory chunks and to deduplicate them when found. In short, if two memory locations have the same content, KSM marks the first location as a CoW (copy-on-write) one and frees the second location. If an application wants to modify the shared location, the system first re-duplicates it and then modifies the newly created location.
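  The overcommit is easy to verify from the tile definition given in the testbed section (two 1 GB guests and two 512 MB guests per tile). The sketch counts only the RAM configured for the guests and ignores any per-VM overhead of the qemu processes:

```python
TILE_GUEST_RAM_MB = [1024, 1024, 512, 512]  # two Windows and two CentOS guests
HOST_RAM_MB = 8 * 1024


def configured_guest_ram_mb(tiles: int) -> int:
    """Total RAM configured for all guests with the given number of tiles."""
    return tiles * sum(TILE_GUEST_RAM_MB)


for tiles in (1, 2, 3):
    total = configured_guest_ram_mb(tiles)
    print(tiles, "tile(s):", total, "MB, exceeds host RAM:", total > HOST_RAM_MB)
```

  With three tiles the guests alone are configured for 9216 MB, more than the 8192 MB installed in the host.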
  In practice KSM works surprisingly well, especially for short-lived Windows machines: as Windows has the habit of zeroing all free memory at startup, KSM has plenty of opportunities to coalesce these zeroed pages.
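  On a host where KSM is available, the kernel exposes its counters under /sys/kernel/mm/ksm. The following sketch reads a few of them and estimates the memory saved from the pages_sharing counter, assuming 4 KB pages; it returns an empty result on systems without KSM. No such figures were published for this benchmark, so the example value is purely illustrative.

```python
from pathlib import Path

KSM_DIR = Path("/sys/kernel/mm/ksm")


def read_ksm_counters(ksm_dir: Path = KSM_DIR) -> dict:
    """Read selected KSM counters; missing files are simply skipped."""
    counters = {}
    for name in ("run", "pages_shared", "pages_sharing"):
        path = ksm_dir / name
        if path.exists():
            counters[name] = int(path.read_text().strip())
    return counters


def saved_mib(pages_sharing: int, page_size: int = 4096) -> float:
    """Approximate memory saved by KSM, assuming the given page size in bytes."""
    return pages_sharing * page_size / (1024 * 1024)


print(read_ksm_counters())
print(saved_mib(262144))  # illustrative value: 262144 pages of 4 KB are 1024.0 MiB
```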

Conclusions

  It is clear that enabling caching/buffering is very beneficial to the specific workload tested in this article. The cache led to much more efficient disk usage and, as the disk subsystem is often the weak link of a server, this means a higher potential consolidation ratio. The story doesn't end here, obviously: sometimes RAM capacity plays an even bigger role in defining the maximum consolidation ratio. Linux is very well equipped in this area, thanks to KSM.
