Virtualization in Xen 3.0
By Rami Rosen on Thu, 2006-03-02 02:00.

Dive into the new Xen release and find out what it offers for paravirtualization, split drivers and Intel's new virtualization technology.
Editor's Note: This article has been updated since its original posting.
Virtualization has existed for over 40 years. Back in the 1960s, IBM developed virtualization support on a mainframe. Since then, many virtualization projects have become available for UNIX/Linux and other operating systems, including VMware, FreeBSD Jail, coLinux, Microsoft's Virtual PC and Solaris's Containers and Zones.
The problem with these virtualization solutions is low performance. The Xen Project, however, offers impressive performance results--close to native--and this is one of its key advantages. Another impressive feature is live migration, which I discussed in a previous article. After much anticipation, Version 3.0 of Xen recently was released, and it is the focus of this article.
The main goal of Xen is achieving better utilization of computer resources and server consolidation by way of paravirtualization and virtual devices. Here, we discuss how Xen 3.0 implements these ideas. We also investigate the new VT-x/VT-i processors from Intel, which have built-in support for virtualization, and their integration into Xen.
Paravirtualization
The idea behind Xen is to run guest operating systems not in ring 0, but in a higher and less privileged ring. Running guest OSes in a ring higher than 0 is called "ring deprivileging". The default Xen installation on x86 runs guest OSes in ring 1, termed Current Privilege Level 1 (or CPL 1) of the processor. It runs a virtual machine monitor (VMM), the "hypervisor", in CPL 0. The applications run in ring 3 without any modification.
[Figure 1: http://www.linuxjournal.com/articles/web/2006-03/8909/8909f1.png]

About 250 instructions are contained in the IA-32 instruction set, of which 17 are problematic in terms of running them in ring 1. These instructions can be problematic in two senses. First, running the instruction in ring 1 can cause a general protection exception (GPE), which also may be called a general protection fault (GPF). For example, running HLT immediately causes a GPF. Some instructions, such as CLI and STI, may cause a GPF if a certain condition is met. That is, a GPF occurs if the CPL is greater than the IOPL of the current program or procedure and, as a result, has less privilege.
The second problem occurs with instructions that do not cause a GPF but still fail. Many Xen articles use the term "fail silently" to describe these cases. For example, POPF fails silently when the restored EFLAGS has a different interrupt flag (IF) value than the current EFLAGS.
How does Xen handle these problematic instructions? In some cases, such as the HLT instruction, the instruction in ring 1--where the guest OSes run--is replaced by a hypercall. For example, consider the cpu_idle() method in sparse/arch/xen/i386/kernel/process.c. Instead of calling the HLT instruction, as is done eventually in the Linux kernel, we call the xen_idle() method. It performs a hypercall instead, namely the HYPERVISOR_sched_op(SCHEDOP_block, 0) hypercall.
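A minimal sketch of this substitution follows; the hypercall stub stands in for the real trap into the hypervisor, and the SCHEDOP_block value follows Xen's public sched.h as I understand it:

    /* Simplified sketch of the idle-loop substitution; the real code lives in
     * sparse/arch/xen/i386/kernel/process.c. The hypercall is stubbed out here. */
    #define SCHEDOP_block 1   /* "block this VCPU until an event arrives" */

    static int HYPERVISOR_sched_op(int cmd, unsigned long arg)
    {
        /* In a real guest this traps into the hypervisor (int 0x82). */
        (void)cmd; (void)arg;
        return 0;
    }

    static void xen_idle(void)
    {
        /* Instead of executing HLT, which would fault in ring 1,
         * ask Xen to block this virtual CPU until the next event. */
        HYPERVISOR_sched_op(SCHEDOP_block, 0);
    }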
A hypercall is Xen's analog to a Linux system call. A system call is an interrupt (0x80) called in order to move from user space (CPL 3) to kernel space (CPL 0). A hypercall also is an interrupt (0x82). It passes control from ring 1, where the guest domains run, to ring 0, where Xen runs. The implementation of a system call and a hypercall is quite similar. Both pass the number of the syscall/hypercall in the eax register. Passing other parameters is done in the same way. In addition, both the system call table and the hypercall table are defined in the same file, entry.S.
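As a rough sketch of how a one-argument hypercall stub could be issued on 32-bit x86, assuming the int 0x82 mechanism and the Linux-style ebx-first argument convention described above (the real stubs are generated by macros in the sparse tree's hypercall headers):

    /* Illustrative hypercall stub: eax carries the hypercall number, ebx the
     * first argument, and int 0x82 transfers control from ring 1 to ring 0. */
    static inline long hypercall1(unsigned long op, unsigned long arg1)
    {
        long ret;
        __asm__ __volatile__ (
            "int $0x82"
            : "=a" (ret)
            : "a" (op), "b" (arg1)
            : "memory");
        return ret;
    }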
You can batch some hypercalls into one multicall by building an array of hypercalls. You can do this by using a multicall_entry_t struct. You then can use one hypercall, HYPERVISOR_multicall. This way, the number of entries to and exits from the hypervisor is reduced. Of course, reducing such interprivilege transitions when possible results in better performance. The netback virtual driver, for example, uses this multicall mechanism.
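The sketch below shows the batching idea. The multicall_entry_t layout mirrors what I believe the Xen interface uses (an op, a result and an argument array), and the hypercall itself is stubbed out for illustration:

    /* Batching two hypothetical hypercalls into one multicall (illustrative). */
    typedef struct multicall_entry {
        unsigned long op;        /* hypercall number */
        unsigned long result;    /* filled in by the hypervisor */
        unsigned long args[6];   /* hypercall arguments */
    } multicall_entry_t;

    static int HYPERVISOR_multicall(multicall_entry_t *calls, int nr_calls)
    {
        /* The real version performs a single ring transition for all entries. */
        (void)calls; (void)nr_calls;
        return 0;
    }

    static void batched_updates(unsigned long op_a, unsigned long op_b)
    {
        multicall_entry_t mc[2] = { { 0 } };

        mc[0].op = op_a;               /* e.g. a page-table update */
        mc[1].op = op_b;               /* e.g. a virtual-address remap */

        HYPERVISOR_multicall(mc, 2);   /* one entry into the hypervisor, two calls */
    }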
Here's another example: the CLTS instruction clears the task switch (TS) flag in CR0. This instruction causes a GPF when issued in ring 1, as is the case with HLT. But the CLTS instruction itself is not replaced by some hypercall. Instead, it is delegated to ring 0 in the following way. When it is issued in ring 1, we get a GPF. This GPF is handled by do_general_protection(), located in xen/arch/x86/traps.c. Note, though, that do_general_protection() is the hypervisor handler, which runs in ring 0. Under certain circumstances, this handler scans the opcode of the faulting instruction. In the case of CLTS, whose opcode is 0x06, it calls do_fpu_taskswitch(0). Eventually, do_fpu_taskswitch(0) calls the CLTS instruction, but this time it is called from ring 0. Note: _VCPUF_fpu_dirtied must be set to enable this.
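A condensed sketch of the delegation path: the ring 0 fault handler inspects the faulting opcode and performs the privileged operation on the guest's behalf. The opcode check follows the article's description (0x06 after the 0x0F escape byte); everything else is simplified and illustrative.

    /* Runs in ring 0 inside the hypervisor's #GP handler (illustrative). */
    static void do_fpu_taskswitch(int set_ts)
    {
        if (!set_ts)
            __asm__ __volatile__ ("clts");   /* legal here: we are in ring 0 */
    }

    /* Returns 1 if the faulting instruction was emulated, 0 otherwise. */
    static int emulate_clts(const unsigned char *insn)
    {
        if (insn[0] == 0x0f && insn[1] == 0x06) {   /* CLTS = 0F 06 */
            do_fpu_taskswitch(0);
            return 1;       /* handled; the guest's instruction pointer is advanced */
        }
        return 0;           /* not CLTS; fall through to other emulation or a fault */
    }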
Those who are curious about further details can look at the emulate_privileged_op() method in that same file, xen/arch/x86/traps.c. The instructions that may "fail silently" usually are replaced by others.
Virtual Split Drivers
The idea behind split devices is safe hardware isolation. Domain 0 is the only one that has direct access to the hardware devices, and it uses the original Linux drivers. But domain 0 has another layer, the backend, that contains the netback and blockback virtual drivers. (On a side note, support for usbback will be added in the future, and work on the USB layer is being done by Harry Butterworth.)
Similarly, the unprivileged domains have access to a frontend layer, which consists of the netfront and blockfront virtual drivers. The unprivileged domains issue I/O requests to the frontend in the same way that I/O requests are sent to an ordinary Linux kernel. However, because the frontend is only a virtual interface with no access to real hardware, these requests are delegated to the backend. From there they are sent to the real devices.
[Figure 2: http://www.linuxjournal.com/articles/web/2006-03/8909/8909f2.inline.png]

When an unprivileged domain is created, it creates an interdomain event channel between itself and domain 0. This is done with the HYPERVISOR_event_channel_op hypercall, where the command is EVTCHNOP_bind_interdomain. In the case of the network virtual drivers, the event channel is created by netif_map() in sparse/drivers/xen/netback/interface.c. The event channel is a lightweight channel for passing notifications, such as saying when an I/O operation has completed.
A shared memory area exists between each guest domain and domain 0. This shared memory is used to pass requests and data. The shared memory is created and handled using the grant tables API.
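For illustration, a frontend might share one page of a request ring with domain 0 roughly as follows; the gnttab_grant_foreign_access() prototype is modelled on the sparse-tree helper and should be treated as an assumption rather than a definitive API:

    #include <stdint.h>

    typedef uint16_t domid_t;

    /* Placeholder prototype modelled on the sparse tree's grant-table helper:
     * grants the given domain access to one machine frame and returns a grant
     * reference the peer can use to map it. */
    extern int gnttab_grant_foreign_access(domid_t domid, unsigned long frame,
                                           int readonly);

    static int share_ring_with_dom0(unsigned long ring_mfn)
    {
        /* Domain 0 maps this frame through its backend driver and uses it to
         * read requests and write responses. */
        return gnttab_grant_foreign_access(0 /* dom0 */, ring_mfn, 0 /* RW */);
    }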
When an interrupt is asserted by the interrupt controller, the APIC, we arrive at the do_IRQ() method, which also can be found in the Linux kernel (arch/x86/irq.c). The hypervisor handles only timer and serial interrupts. Other interrupts are passed to the domains by calling __do_IRQ_guest(). In fact, the IRQ_GUEST flag is set for all interrupts except for timer and serial interrupts.
__do_IRQ_guest() sends the interrupt by calling send_guest_pirq() for all guests registered on this IRQ. The send_guest_pirq() creates an event channel--an instance of evtchn--and sets the pending flag of this event channel by calling evtchn_set_pending(). Then, asynchronously, Xen notifies this domain of the interrupt, and it is handled appropriately.
Xen and the New Intel VT-x Processors
Intel currently is developing the VT-x and VT-i technologies for x86 and Itanium processors, respectively, which will provide virtualization extensions. Support for the VT-x/VT-i extensions is part of the Xen 3.0 official code; it can be found in xen/arch/x86/vmx*.c, xen/include/asm-x86/vmx*.h and xen/arch/x86/x86_32/entry.S.
The most important structure in Xen's implementation of VT-x/VT-i is the VMCS (vmcs_struct in the code), which represents the VMCS region. The VMCS region contains six logical regions; most relevant to our discussion are the Guest-state area and Host-state area. The other four regions are VM-execution control fields, VM-exit control fields, VM-entry control fields and VM-exit information fields.
Intel added 10 new opcodes in VT-x/VT-i to support Intel Virtualization Technology. Let's take a look at the new opcodes and their wrappers in the code (a sketch of one such wrapper follows the list):
[*] VMCALL: (VMCALL_OPCODE in vmx.h) This simply calls the VM monitor, causing the VM to exit.
[*] VMCLEAR: (VMCLEAR_OPCODE in vmx.h) copies VMCS data to memory in case it is not written there already. Wrapper: _vmpclear (u64 addr) in vmx.h.
[*] VMLAUNCH: (VMLAUNCH_OPCODE in vmx.h) launches a virtual machine and changes the launch state of the VMCS to "launched", provided it is "clear".
[*] VMPTRLD: (VMPTRLD_OPCODE in vmx.h) loads a pointer to the VMCS. Wrapper: _vmptrld (u64 addr) in vmx.h.
[*] VMPTRST: (VMPTRST_OPCODE in vmx.h) stores a pointer to the VMCS. Wrapper: _vmptrst (u64 addr) in vmx.h.
[*] VMREAD: (VMREAD_OPCODE in vmx.h) reads a specified field from the VMCS. Wrapper: _vmread (x, ptr) in vmx.h.
[*] VMRESUME: (VMRESUME_OPCODE in vmx.h) resumes a virtual machine. In order for it to resume the VM, the launch state of the VMCS must be "launched".
[*] VMWRITE: (VMWRITE_OPCODE in vmx.h) writes a specified field in the VMCS. Wrapper: _vmwrite (field, value) in vmx.h.
[*] VMXOFF: (VMXOFF_OPCODE in vmx.h) terminates VMX operation. Wrapper: _vmxoff (void) in vmx.h.
[*] VMXON: (VMXON_OPCODE in vmx.h) starts VMX operation. Wrapper: _vmxon (u64 addr) in vmx.h.
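To give a feel for what these wrappers look like, here is a sketch of a VMXON-style wrapper. It uses the vmxon mnemonic, which modern assemblers understand; the 2006-era vmx.h instead emitted the raw opcode bytes through macros such as VMXON_OPCODE, because assemblers of the day did not know the VMX instructions. Treat the details as illustrative rather than the actual Xen code.

    #include <stdint.h>

    /* Illustrative wrapper: start VMX operation using the 4KB VMXON region
     * whose physical address is passed in `addr` (the article's u64).
     * Ring 0 only, and CR4.VMXE must already be set (see below). */
    static inline void _vmxon(uint64_t addr)
    {
        __asm__ __volatile__ (
            "vmxon %0"
            : : "m" (addr)
            : "cc", "memory");
    }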
When using this technology, Xen runs in VMX root operation mode. The guest domains, which are unmodified OSes, run in VMX non-root operation mode. Because the guest domains run in non-root operation mode, they are more restricted, meaning that certain actions cause a VM exit to occur.
Xen enters VMX operation in the start_vmx() method in xen/arch/x86/vmx.c. This method is called from the init_intel() method in xen/arch/x86/cpu/intel.c; CONFIG_VMX must be defined.
First, we check the X86_FEATURE_VMXE bit in the ecx register to see if the cpuid shows support for VMX in the processor. For IA-32, Intel added a bit to the CR4 control register that specifies whether we want to enable VMX. Therefore, we must set this bit to enable VMX on the processor, which is done by calling set_in_cr4(X86_CR4_VMXE). It is bit 13 in CR4 (VMXE).
We then call _vmxon to start the VMX operation. If we try to start the VMX operation with _vmxon when the VMXE bit in CR4 is not set, we get an #UD exception, telling us we have an undefined opcode.
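A sketch of that enable sequence, with illustrative CR4 accessors (ring 0 only); X86_CR4_VMXE here is simply bit 13, matching the text, and _vmxon is the wrapper sketched after the opcode list above:

    #include <stdint.h>

    #define X86_CR4_VMXE (1UL << 13)   /* VMX-enable bit in CR4 */

    static inline unsigned long read_cr4(void)
    {
        unsigned long cr4;
        __asm__ __volatile__ ("mov %%cr4, %0" : "=r" (cr4));
        return cr4;
    }

    static inline void write_cr4(unsigned long cr4)
    {
        __asm__ __volatile__ ("mov %0, %%cr4" : : "r" (cr4));
    }

    static void enable_vmx(uint64_t vmxon_region_pa)
    {
        /* Set CR4.VMXE first; executing VMXON without it raises #UD. */
        write_cr4(read_cr4() | X86_CR4_VMXE);
        _vmxon(vmxon_region_pa);   /* VMXON takes the physical address of
                                    * a 4KB VMXON region */
    }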
Some instructions can cause a VM exit unconditionally, and some can cause a VM exit depending on certain VM-execution control fields. (See the discussion about the VMCS region above.) The following instructions cause a VM exit unconditionally: CPUID, INVD, MOV from CR3, RDMSR, WRMSR and all the new VT-x instructions listed above. Other instructions, such as HLT, INVLPG (the invalidate TLB entry instruction), MWAIT and others, cause a VM exit if a corresponding VM-execution control was set.
Apart from VM-execution control fields, two bitmaps are used for determining whether to perform a VM exit. The first is the exception bitmap (see EXCEPTION_BITMAP in the vmcs_field enum in xen/include/asm-x86/vmx_vmcs.h), which is a 32-bit field. When a bit is set in this bitmap, it causes a VM exit if the corresponding exception occurs. By default, the entries set are EXCEPTION_BITMAP_PG, for page fault, and EXCEPTION_BITMAP_GP, for general protection (see MONITOR_DEFAULT_EXCEPTION_BITMAP in vmx.h).
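Interpreting that default setting: bit 13 corresponds to #GP and bit 14 to #PF, so the default bitmap covering both might be expressed as follows (the macro names mirror the ones mentioned above; the exact definitions are an assumption):

    /* Each bit n of the 32-bit exception bitmap forces a VM exit when
     * exception vector n is raised in the guest. */
    #define EXCEPTION_BITMAP_GP (1u << 13)   /* #GP, general protection */
    #define EXCEPTION_BITMAP_PG (1u << 14)   /* #PF, page fault         */

    #define MONITOR_DEFAULT_EXCEPTION_BITMAP \
            (EXCEPTION_BITMAP_PG | EXCEPTION_BITMAP_GP)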
The second bitmap is the I/O bitmap. In truth, there are two 4KB I/O bitmaps, A and B, which control I/O instructions on various ports. I/O bitmap A contains the ports in the range 0000-7FFF, and I/O bitmap B contains the ports in the range 8000-FFFF. (See IO_BITMAP_A and IO_BITMAP_B in the vmcs_field enum.)
When a VM exit occurs, we are sent to the vmx_vmexit_handler() in vmx.c. We handle the VM exit according to the exit reason provided, which we can read from the VMCS region. There are 43 basic exit reasons; you can find some of them in vmx.h. The field names start with EXIT_REASON_, such as EXIT_REASON_EXCEPTION_NMI (which is exit reason 0) and so on.
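The handler is essentially a large switch on that exit reason. A skeleton of the dispatch shape, with only two of the 43 reasons shown and the bodies left as placeholders, might look like this (the numeric values follow Intel's exit-reason numbering):

    #define EXIT_REASON_EXCEPTION_NMI  0    /* exception or non-maskable interrupt */
    #define EXIT_REASON_CPUID          10   /* guest executed CPUID */

    /* Skeleton dispatcher in the spirit of vmx_vmexit_handler(). */
    static void handle_vmexit(unsigned int exit_reason)
    {
        switch (exit_reason) {
        case EXIT_REASON_EXCEPTION_NMI:
            /* Reflect the exception into the guest or handle it in the VMM. */
            break;
        case EXIT_REASON_CPUID:
            /* Emulate CPUID for the guest and advance its instruction pointer. */
            break;
        default:
            /* Unhandled reason: typically the guest domain is crashed. */
            break;
        }
    }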
When working with VT-x/VT-i, guest operating systems cannot work in real mode. This is the reason why we load the guests with a special loader, the vmxloader. The vmxloader loads ROMBIOS at 0xF0000, VGABIOS at 0xC0000 and then VMXAssist at D000:0000. VMXAssist is an emulator for real mode that uses the virtual-8086 mode of IA-32. After setting up virtual-8086 mode, the vmxloader executes in a 16-bit environment.
Certain instructions are not recognized in virtual-8086 mode, however, such as LIDT (load interrupt descriptor table) and LGDT (load global descriptor table). Attempting to run these instructions in protected mode produces #GP(0) errors. VMXAssist checks the opcode of the instructions being executed and handles them so that they do not cause GPFs.
HVM, a Common Interface for VT-x/VT-i and AMD SVM
VT-x/VT-i and AMD's SVM architectures have much in common, which was the motivation for developing their common interface layer, the Hardware Virtual Machine (HVM). The code for the HVM layer was written by Leendert van Doorn from the Watson Research Center at IBM, and it resides in a separate branch in the Xen repository.
An example of a common interface for VT-x/VT-i and AMD SVM is the domain builder, xc_hvm_build(), located in xc_hvm_build.c. Because the loader now is common to both architectures, the vmxloader is now called the hvmloader. The hvmloader identifies the processor simply by issuing the CPUID instruction; see tools/firmware/hvmloader/hvmloader.c.
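A self-contained sketch of that kind of check: CPUID leaf 0 returns the vendor string in EBX, EDX and ECX, which distinguishes "GenuineIntel" from "AuthenticAMD". This illustrates the technique only and is not the actual hvmloader code.

    #include <string.h>

    static int running_on_intel(void)
    {
        unsigned int eax, ebx, ecx, edx;
        char vendor[13];

        /* CPUID leaf 0: maximum leaf in eax, vendor string in ebx/edx/ecx. */
        __asm__ __volatile__ ("cpuid"
                              : "=a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx)
                              : "a" (0));
        memcpy(vendor + 0, &ebx, 4);
        memcpy(vendor + 4, &edx, 4);
        memcpy(vendor + 8, &ecx, 4);
        vendor[12] = '\0';

        return strcmp(vendor, "GenuineIntel") == 0;
    }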
The AMD SVM has a paged real mode, which virtualizes real mode inside of protected mode. So in the case of AMD SVM, we only need to reset the guest to real mode, using SVM_VMMCALL_RESET_TO_REALMODE. In the case of VT-x/VT-i, we should use VMXAssist, as explained above.
HVM defines a table called hvm_function_table, which is a structure containing functions that are common to both VT-x/VT-i and AMD SVM. These methods, including initialize_guest_resources() and store_cpu_guest_regs(), are implemented differently in VT-x/VT-i and AMD SVM.
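The idea is an ordinary table of function pointers that each vendor backend fills in with its own implementations. A trimmed-down sketch, keeping only the two methods named above and using simplified placeholder types, could look like this:

    /* Trimmed-down sketch in the spirit of hvm_function_table. */
    struct vcpu;                 /* opaque here */
    struct cpu_user_regs;        /* opaque here */

    struct hvm_ops_sketch {
        void (*initialize_guest_resources)(struct vcpu *v);
        void (*store_cpu_guest_regs)(struct vcpu *v, struct cpu_user_regs *regs);
    };

    /* A VT-x build would install its implementations once at start-up,
     * e.g. hvm_ops = vmx_ops; an SVM build would install svm_ops instead. */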
Xen 3.0 also includes support for the AMD SVM processor. One of SVM's benefits is a tagged TLB: guests are mapped to address spaces different from what the VMM sets. The TLB is tagged with address space identifiers (ASIDs), so a TLB flush does not occur when there is a context switch.
Live Migration
One of the fascinating features of Xen is live migration, which can be used as a solution for load balancing and maintenance. The downtime when using live migration is quite low--tens of milliseconds. Live migration implementation in Xen is managed by domain 0.
There are two stages to live migration. The first stage is "pre-copying", in which the physical memory is copied to the target by way of TCP while the migrating domain continues to run. After some iterations, during which only the pages that were dirtied since the last iteration are copied, the migrating domain stops running. Then, in the second stage, the remaining pages are copied, and the domain resumes its work on the target machine.
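In rough C pseudocode, the control flow reads as follows; every helper, threshold and round limit here is a placeholder standing in for the real save/restore machinery in the Xen tools, not an actual API:

    /* Placeholder helpers, not Xen APIs. */
    static unsigned long send_all_pages(void)        { return 0; }
    static unsigned long send_dirtied_pages(void)    { return 0; }
    static void pause_domain(void)                   { }
    static void resume_domain_on_target(void)        { }

    static void live_migrate(void)
    {
        unsigned long dirtied;
        int round = 0;

        /* Stage 1: pre-copy. The domain keeps running while memory is sent. */
        send_all_pages();
        do {
            dirtied = send_dirtied_pages();   /* only pages dirtied since last round */
        } while (++round < 30 && dirtied > 1024 /* arbitrary threshold */);

        /* Stage 2: stop-and-copy. Pause briefly, send the remainder, resume. */
        pause_domain();
        send_dirtied_pages();
        resume_domain_on_target();
    }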
In addition, Jacob Gorm Hansen, from the University of Copenhagen, Denmark, is doing some interesting work on "self migration". In self migration, the unprivileged domain being migrated handles the migration itself. Although there are some benefits to having this ability, such as security, self migration is more complex than live migration. For instance, the memory pages containing the code that manages the migration are dirtied during the transfer.
Conclusion
In the future, it appears as though all of Intel's new 64-bit processors will have virtualization extension support, and Xen seems to be focusing mainly on CPUs with virtualization support. Currently, Xen has support for VT-x and VT-i in the official tree, and a branch in the repository has AMD SVM support.
Overall, Xen is an interesting virtualization project with many features and benefits. And, there's a chance that Xen will be integrated into the official Linux kernel tree sometime in the future, as happened with UML and LVS.
Rami Rosen is a computer science graduate of the Technion, the Israel Institute of Technology, located in Haifa. He works as a Linux kernel programmer for a networking start-up, and he can be reached at ramirose@gmail.com. In his spare time, he likes running, solving cryptic puzzles and helping everyone he knows move to this wonderful operating system, Linux.