XEN Power Management
  目录



  • CPU P-states (cpufreq)


    • Domain0 based cpufreq

    • Hypervisor based cpufreq



  • CPU C-states (cpuidle)


    • cpuidle overview

    • cpuidle in Xen

    • Potential side-effects of cpuidle



  • xenpm - userspace control tool

  • Xen 3.4 improvements in power management

  • Xen hypervisor clocksource= option

  • Note about AMD CPUs

  • Reference
  This document describes the power management features of Xen, including CPU P-states (cpufreq) and CPU C-states (cpuidle).

CPU P-states (cpufreq)
  CPU P-states (performance states) are one kind of processor power-saving state defined in the ACPI spec. P-states save power by changing the CPU frequency and voltage. Among the P-states P0, P1, …, Pn, P0 has the highest frequency and thus the highest power consumption, while Pn has the lowest frequency and thus the lowest power consumption. Many processors have their own P-state implementations; EIST (Enhanced Intel SpeedStep® Technology) and AMD PowerNow!™ technology are typical examples.
  Xen supports this feature via a cpufreq driver. The logic is similar to that of a commodity OS: system software periodically measures system status such as CPU utilization, figures out the appropriate CPU frequency according to the cpufreq policy and the current system status, and finally issues a platform-dependent command to change the CPU frequency. The logic is shown in Fig 1.

  Figure 1: cpufreq
  For historical reasons, Xen has two implementations. The first is domain0-based cpufreq, which implements the cpufreq logic in domain0. The second is hypervisor-based cpufreq, which implements the cpufreq logic in the hypervisor. The default is the hypervisor-based cpufreq.

Domain0 based cpufreq
  Domain0-based cpufreq reuses the domain0 kernel cpufreq code and lets domain0 handle the cpufreq logic. The Xen hypervisor provides two platform hypercalls, XENPF_change_freq and XENPF_getidletime, to let the domain0 kernel get the system status and change the CPU frequency.
  Enabling domain0-based cpufreq is simple: just add the xen boot option “cpufreq=dom0-kernel” to the grub entry, for example:

title Xen

root (hd0,0)

kernel /boot/xen.gz cpufreq=dom0-kernel

module /boot/vmlinuz-2.6.18.8-xen ro root=/dev/sda1

module /boot/initrd-2.6-xen.img
  Notes: Domain0-based cpufreq has one limitation: the number of domain0 VCPUs must equal the number of physical CPUs, and each domain0 VCPU must be pinned to its physical CPU.
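  Since the policy runs in the domain0 kernel in this mode, it can be inspected and tuned from inside domain0 through the standard Linux cpufreq sysfs interface. A minimal sketch, assuming the dom0 kernel exposes the usual cpufreq sysfs nodes:

# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
# echo ondemand > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor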

Hypervisor based cpufreq
  Hypervisor-based cpufreq implements most of the cpufreq logic in the hypervisor. Figure 2 illustrates the logic:

  Figure 2: hypervisor based cpufreq
  Hypervisor has three components: the cpufreq core, the cpufreq governor and the cpufreq CPU driver. The cpufreq core controls the overall logic. The cpufreq governor controls the cpufreq policy, i.e. which frequency the CPU should go to. The cpufreq CPU driver issues commands to change the physical CPU frequency. Xen currently has four governors:


  • ondemand: choose the frequency that best fits the current CPU load
  • userspace: choose the frequency specified by the user
  • performance: select the highest frequency
  • powersave: select the lowest frequency
  The default governor is the userspace governor. Xen also supports three CPU drivers: ACPI (IA32) for Intel x86 processors, ACPI (IA64) for Intel Itanium processors, and PowerNow! K8 for AMD processors.
  Domain0 has two components: the ACPI parser and the xenpm tool. The ACPI parser parses the ACPI tables and passes the information to the hypervisor cpufreq core. Xenpm is a userspace cpufreq control tool which can select the cpufreq governor, specify the userspace governor frequency, etc.; the later xenpm section has more details.
  Usage: the xen boot option "cpufreq=xen" enables hypervisor-based cpufreq. From c/s 18950 it is enabled by default with the userspace governor, so no xen boot option is needed after c/s 18950.
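  For builds older than c/s 18950, a grub entry sketch (reusing the kernel and initrd paths from the earlier example) would be:

title Xen
root (hd0,0)
kernel /boot/xen.gz cpufreq=xen
module /boot/vmlinuz-2.6.18.8-xen ro root=/dev/sda1
module /boot/initrd-2.6-xen.img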
  xenpm can be used to control cpufreq. Some typical commands are:


  • set cpufreq governor

# xenpm set-scaling-governor ondemand|performance|powersave

  • get cpufreq P-state status

# xenpm get-cpufreq-states

cpu id        : 0

total P-states    : 4

usable P-states    : 4

current frequency  : 800 MHz

P0          : freq       [2534 MHz]

transition [00000000000000000003]

residency  [00000000000000002668 ms]

P1          : freq       [2533 MHz]

transition [00000000000000000000]

residency  [00000000000000000000 ms]

P2          : freq       [1600 MHz]

transition [00000000000000000000]

residency  [00000000000000000000 ms]

*P3         : freq       [0800 MHz]

transition [00000000000000000003]

residency  [00000000000000000237 ms]

  • get cpufreq parameters

# xenpm get-cpufreq-para

cpu id         : 0

affected_cpus      : *0 1

cpuinfo frequency   : max [2534000] min [800000] cur [800000]

scaling_driver     : acpi-cpufreq

scaling_avail_gov    : userspace performance powersave ondemand

current_governor    : ondemand

ondemand specific :

sampling_rate  : max [10000000] min [10000] cur [20000]

up_threshold  : 80

scaling_avail_freq   : 2534000 2533000 1600000 *800000

scaling frequency   : max [2534000] min [800000] cur [800000]
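  The ondemand-specific parameters shown above (sampling_rate and up_threshold) can be tuned at runtime with the xenpm sub-commands listed later in this document. A sketch, assuming the [cpuid] <value> argument order from the usage listing and using purely illustrative values:

# xenpm set-scaling-governor ondemand
# xenpm set-sampling-rate 0 20000
# xenpm set-up-threshold 0 80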
CPU C-states (cpuidle)

cpuidle overview
  ACPI defines processor power states as C-states: C0, C1, C2, C3, …, Cn. C0 is the normal working state in which the CPU executes instructions. C1 ~ Cn are sleeping states in which the CPU stops executing instructions and powers down some internal components to save power. The deeper the C-state, the more power it saves, but also the longer the wakeup latency. A processor in a C1~Cn state is woken up by a breaking event (e.g. an interrupt) and transitions back to C0.
  There are several ways to make a processor enter a C-state:


  • The HLT instruction makes the processor enter the C1 state.
  • Reading certain I/O ports makes the processor enter different C-states. The I/O port numbers are platform specific and can be retrieved from the ACPI tables.
  • The platform can also provide specific instructions to enter C-states. For example, on Intel processors the monitor/mwait instruction pair can also be used to enter a C-state.

cpuidle in Xen
  Xen also supports CPU C-states. The logic can be explained by answering the following two questions:


  •   When to enter a C-state
    : this is quite straightforward. When a physical CPU has no task (VCPU) assigned, it runs the idle vcpu, which in turn puts the CPU into a C-state. When a breaking event (e.g. an interrupt) happens, the CPU is brought out of the C-state and back to work.

  •   Which C-state to enter
    : this is more complicated. A deeper C-state saves more power, but also has more latency. A good algorithm should balance power saving and performance. Xen uses the menu governor to select the deepest C-state that still satisfies the latency requirement.

  How to enable cpuidle: since most platforms support the C1 state (i.e. the HLT instruction), Xen by default only lets the idle vcpu enter the C1 state. If the platform supports C-states deeper than C1 (e.g. C2, C3), the user can add the xen boot option “cpuidle” to enable full C-state support. A grub entry example follows:

title Xen

root (hd0,0)

kernel /boot/xen.gz cpuidle

module /boot/vmlinuz-2.6.18.8-xen ro root=/dev/sda1

module /boot/initrd-2.6-xen.img
  Tips: how to check whether the platform supports more C-states. After adding the “cpuidle” xen boot option, check the C-state related info in the “xm dmesg” output, for example:

(XEN) cpu4 cx acpi info:

(XEN)   count = 3

(XEN)   flags: bm_cntl[1], bm_chk[1], has_cst[1],

(XEN)      pwr_setup_done[1], bm_rld_set[0]
  The above “count = 3” means the platform supports three non-C0 C-states, e.g. C1, C2, C3. If count = 1, the platform only supports C1, and the “cpuidle” option will probably have no effect in this case.
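  To check the count on all CPUs at once, a quick filter over the hypervisor log can be used (a sketch based on the output format above):

# xm dmesg | grep -A 1 "cx acpi info"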
  When cpuidle is enabled, xenpm can be used to view the C-state status:

# xenpm get-cpuidle-states 0

cpu id         : 0

total C-states     : 4

idle time(ms)      : 66629024

C0            : transition [00000000000013725238]

residency  [00000000000000558798 ms]

C1            : transition [00000000000000002263]

residency  [00000000000000000018 ms]

C2            : transition [00000000000000311572]

residency  [00000000000000296852 ms]

C3            : transition [00000000000013411403]

residency  [00000000000066041515 ms]
Potential side-effects of cpuidle
  In Xen 3.4, cpuidle is enabled by default since c/s 19421. But some side-effects may exist under different H/W C-state implementations or H/W configurations, so the user may occasionally observe latency or system time/TSC skew. This section describes the conditions causing these side-effects and the ways to mitigate them.

Latency
  Latency can be caused by two factors: C-state entry/exit latency and extra latency caused by the broadcast mechanism.
  C-state entry/exit latency is inevitable, since powering gates on/off takes time. Normally a shallower C-state incurs less latency but saves less power, and vice versa for a deeper C-state. The cpuidle governor tries to balance the performance/power tradeoff at a high level, which is one area where we'll continue to tune.
  Broadcast is necessary to handle the APIC timer stopping in deep C-states (>= C3) on some platforms. One platform timer source is chosen to carry the per-cpu timer deadlines and then wake up CPUs in deep C-states timely at the expected expiry. So far Xen 3.4 supports PIT and HPET as the broadcast source. In the current implementation PIT broadcast runs in periodic mode (10ms), which means up to 10ms of extra latency can be added to the expected expiry of sleeping CPUs. This is just an initial implementation choice which could of course be enhanced to an on-demand on/off mode in the future; we didn't go into that complexity in the current implementation, due to the PIT's slow access and short wrap count. So HPET broadcast is always preferred once that facility is available, since it adds negligible overhead with timely wakeup. The world is not always perfect, however, and some side-effects also exist with HPET.
  Detail is listed as below:

1. For h/w supporting ACPI C1 (halt) only (as reported by the BIOS in the ACPI _CST method)
  It is immune from these side-effects, as only instruction execution is halted.

2. For h/w supporting ACPI C2 in which the TSC and APIC timer don't stop
  The ACPI C2 type is a bit special: it is sometimes an alias for a deep CPU C-state, and thus the current Xen 3.4 treats the ACPI C2 type in the same manner as the ACPI C3 type (i.e. broadcast is activated). If the user knows that on that platform the ACPI C2 type does not have that h/w limitation, 'lapic_timer_c2_ok' can be added to grub to deactivate the software mitigation.
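  A grub kernel line sketch for that case, keeping full C-state support enabled:

kernel /boot/xen.gz cpuidle lapic_timer_c2_ok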

3. For the remaining implementations supporting ACPI C2+ in which the APIC timer is stopped:

3.1 HPET as broadcast timer source (clocksource)
  HPET can deliver timely wakeup events to CPUs sleeping in deep C-states with negligible overhead, as stated earlier. But the HPET mode being used does make some differences worth noting:


  • If the h/w supports per-channel MSI delivery mode (interrupt via FSB), it is the best broadcast mechanism known so far. There is no side-effect regarding latency, and the IPIs used to broadcast the wakeup event can be reduced by a factor of the number of available channels (each channel can independently serve one or several sleeping CPUs). As long as this feature is available, it is always preferred automatically.
  • When MSI delivery mode is absent, we have to use legacy replacement mode with only one HPET channel available. It is not that bad, since this one channel can serve all sleeping CPUs by using IPIs to wake them up. However another side-effect occurs: PIT/RTC interrupts (IRQ0/IRQ8) are replaced by the HPET channel, so the RTC alarm feature in dom0 is lost, unless we add RTC emulation between dom0's rtc module and Xen's hpet logic (which has not been implemented so far).
  Due to the above side-effect, this broadcast option was disabled by default in the past; in that case, PIT broadcast was the default. If the user is sure he does not need the RTC alarm, the 'hpetbroadcast' grub option can be used to force-enable it.
  A recent improvement is that legacy-mode HPET becomes the default broadcast source if MSI mode is absent. If the user tries to use the RTC alarm, 'max_cstate' is automatically limited to C1 (or C2 if 'lapic_timer_c2_ok' is given), which does not need broadcast, and the legacy-mode HPET channel is disabled to make RTC interrupts available. If the user does not care about the RTC alarm and does not want unknown RTC alarm usage to trigger this cpuidle auto-degrading, please specify the 'hpetbroadcast' grub option.
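  A grub kernel line sketch for forcing HPET broadcast when the RTC alarm is not needed (the 'cpuidle' option is redundant on builds where it is already on by default):

kernel /boot/xen.gz cpuidle hpetbroadcast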

3.2 PIT as broadcast timer source (clocksource)
  In the past, if MSI-based HPET interrupt delivery was not available or HPET was missing, PIT broadcast was the default in all cases. As said earlier, PIT broadcast is implemented in 10ms periodic mode, which can thus incur up to 10ms of latency for each deep C-state entry/exit. One natural result is observing 'many lost ticks' in some guests.
  Currently, cpuidle is selectively disabled by default when HPET is not available. So for a system with only PIT, cpuidle is still disabled by default. If the user really wants the cpuidle feature with PIT as the broadcast source, please specify the 'cpuidle' grub option.

4. Recommendations
  So, if the user does not care about power consumption while the platform does expose deep C-states, one mitigation is to add the 'max_cstate=' boot option to restrict the maximum allowed C-state (if limiting it to C2, make sure to also add 'lapic_timer_c2_ok' where applicable). Runtime modification of 'max_cstate' is allowed via xenpm, as sketched below.
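  Grub kernel line sketches for capping the C-state (illustrative values):

kernel /boot/xen.gz cpuidle max_cstate=1
kernel /boot/xen.gz cpuidle max_cstate=2 lapic_timer_c2_ok

  For runtime changes, newer xenpm builds provide a set-max-cstate sub-command; it is not listed in the usage output later in this document, so treat its availability as an assumption for your version:

# xenpm set-max-cstate 1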
  If the user does care about power consumption and has no requirement on the RTC alarm, then always using HPET (specify 'hpetbroadcast') is preferred.
  Last, we could either add RTC emulation on top of HPET or enhance PIT broadcast to use single-shot mode, but we would like to see comments from the community on whether it is worth doing.

System time/TSC skew
  Similarly to the APIC timer stopping, the TSC is also stopped in deep C-states in some implementations, which requires Xen to recover the lost counts on exit from a deep C-state by software means. Such software methods naturally introduce some error. For the details of how TSC skew can occur, its side-effects and possible solutions, you can refer to our Xen Summit presentation: XenSummit09pm.pdf

  Below is a brief introduction to which algorithm is available in different implementations:


  • The best case is a non-stop TSC at the h/w implementation level. For example, Intel Core i7 processors support this feature, which can be detected via CPUID. Xen will do nothing once this feature is detected, and thus there is no extra software-caused skew besides dozens of cycles due to crystal drift.
  • If the TSC frequency is invariant across frequency/voltage scaling (true for all Intel processors supporting VT-x) and all processors within the box share one crystal, please give the boot option 'consistent_tscs'. With this option, Xen syncs the APs' TSCs to the BSP's at a 1 second interval in the per-cpu time calibration, and meanwhile recovers in a per-cpu style, where the platform counter elapsed since the last calibration point is compensated into the local TSC with a boot-time-calculated scale factor. This global synchronization along with per-cpu compensation limits TSC skew to the ns level in most cases.
  • If the TSC frequency varies across frequency/voltage scaling or the boot option 'consistent_tscs' is not specified, Xen only recovers in a per-cpu style, where the platform counter elapsed since the last calibration point is compensated into the local TSC with a local scale factor. In this manner TSC skew across cpus accumulates and is easy to observe after the system has been up for some time.
  Recommendations

  Once you observe obvious system time/TSC skew, and you do not particularly care about power consumption, then, similarly to handling broadcast latency:
  Limit 'max_cstate' to C1, or limit 'max_cstate' to a real C2 and give the 'lapic_timer_c2_ok' option.
  Or, if you care about power, and are sure that the processor TSC frequency is invariant and all processors within the box share one crystal, give the boot option 'consistent_tscs', as in the sketch below.
  Or, better, run your work on a newer platform with either constant TSC frequency or the non-stop TSC feature supported.
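  A grub kernel line sketch for the invariant-TSC case, keeping deep C-states enabled:

kernel /boot/xen.gz cpuidle consistent_tscs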

xenpm - userspace control tool
  Xenpm is a userspace tool that can list the power information of the available processors and control the power policy according to the user's preference. Its name stands for Xen Power Management (xenpm for short).
  xenpm usage is as follows:

usage: xenpm <command> [args]

xenpm command list:

 get-cpuidle-states    [cpuid]            list cpu idle info of CPU <cpuid> or all
 get-cpufreq-states    [cpuid]            list cpu freq info of CPU <cpuid> or all
 get-cpufreq-para      [cpuid]            list cpu freq parameters of CPU <cpuid> or all
 set-scaling-maxfreq   [cpuid] <HZ>       set max cpu frequency <HZ> on CPU <cpuid>
                                          or all CPUs
 set-scaling-minfreq   [cpuid] <HZ>       set min cpu frequency <HZ> on CPU <cpuid>
                                          or all CPUs
 set-scaling-speed     [cpuid] <num>      set scaling speed on CPU <cpuid> or all
                                          it is used in userspace governor.
 set-scaling-governor  [cpuid] <gov>      set scaling governor on CPU <cpuid> or all
                                          as userspace/performance/powersave/ondemand
 set-sampling-rate     [cpuid] <num>      set sampling rate on CPU <cpuid> or all
                                          it is used in ondemand governor.
 set-up-threshold      [cpuid] <num>      set up threshold on CPU <cpuid> or all
                                          it is used in ondemand governor.
 start                                    start collecting Cx/Px statistics,
                                          output after CTRL-C or SIGINT.
  Basically it has four functions:


  • Show C-state status with sub-command "get-cpuidle-states"
  • Show P-state status with sub-command "get-cpufreq-states"
  • Control cpufreq parameter, with sub-command "get-cpufreq-para", "set-scaling-maxfreq", "set-scaling-minfreq", "set-scaling-speed", "set-scaling-governor", "set-sampling-rate", "set-up-threshold"
  • Sampling C/P state status in specified period, with sub-command "start"
  The first three functions have already been illustrated in the previous cpufreq and cpuidle sections; the fourth function aims to provide an overall picture of PM activity over a specified period. For example, the following command measures the C/P activity over 5 seconds:

# xenpm start &  sleep 5; killall -s SIGINT xenpm

Elapsed time (ms): 5028

CPU0:

residency       percentage

C0    59 ms   1.18%

C1    0 ms    0.00%

C2    56 ms   1.13%

C3    4912 ms 97.69%

P0    0 ms    0.00%

P1    0 ms    0.00%

P2    0 ms    0.00%

P3    7 ms    100.00%

CPU1:

residency       percentage

C0    118 ms  2.36%

C1    0 ms    0.00%

C2    36 ms   0.73%

C3    4873 ms 96.91%

P0    0 ms    0.00%

P1    0 ms    0.00%

P2    0 ms    0.00%

P3    63 ms   100.00%
Xen 3.4 improvements in power management


  • Better support for deep C-states with APIC timer/TSC stop
  • More efficient cpuidle 'menu' governor
  • More cpufreq governors (performance, userspace, powersave, ondemand) and drivers supported
  • Enhanced xenpm tool to monitor and control Xen power management activities
  • MSI-based HPET delivery, with less broadcast traffic when cpus are in deep C-states
  • Power aware option for credit scheduler - sched_smt_power_savings
  • Timer optimization for reduced break events (range timer, vpt align)

Xen hypervisor clocksource= option
  Some people (running AMD-based systems) have noticed high ntpd jitter/noise in dom0 when running the Xen hypervisor (xen.gz) with the "clocksource=hpet" option, but running with "clocksource=acpi" or "clocksource=pit" works fine without jitter for them.
  You can check the current Xen hypervisor clocksource by running "xm dmesg | grep -i timer" and looking for messages like:
  "(XEN) Platform timer is 3.579MHz ACPI PM Timer" or
  "(XEN) Platform timer is 1.193MHz PIT" or
  "(XEN) Platform timer is 14.284MHz HPET".
  You can change/set the timer by editing the grub.conf bootloader configuration and specifying "clocksource=hpet/acpi/pit" for the Xen hypervisor (xen.gz), as sketched below. Note that clocksource= is not a dom0 kernel option!
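  A grub kernel line sketch selecting the ACPI PM timer (one of the workarounds for the HPET jitter case reported above):

kernel /boot/xen.gz clocksource=acpi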

Note about AMD CPUs
  Xen 3.4 and 4.0 support power management on AMD family 0x10 (16) and later CPUs, but NOT on family 0xf (15) CPUs.

Reference
  1. ACPI: Advanced Configuration & Power Interface (http://www.acpi.info/)
