Features/KQemu
Introduction
The QEMU Accelerator (KQEMU) is a driver allowing a user application to run x86 code in a Virtual Machine (VM). The code can be either user or kernel code, in 64, 32 or 16 bit protected mode. KQEMU is very similar in essence to the VM86 Linux system call, but it adds some new concepts to improve memory handling.
KQEMU has been ported to many host OSes (currently Linux, Windows, FreeBSD, Solaris). It can execute code from many guest OSes (e.g. Linux, Windows 2000/XP) even if the host CPU does not support hardware virtualization.
In this document, we assume that the reader has a good knowledge of the x86 processor and of the problems associated with the virtualization of x86 code.
API definition
We describe version 1.3.0 of the Linux implementation. The implementations on other OSes use the same calls, so they can be understood by reading the Linux API specification.
RAM, Physical and Virtual addresses
KQEMU manipulates three kinds of addresses:
- RAM addresses are between 0 and the available VM RAM size minus one. They are currently stored in 32 bit words.
- Physical addresses are addresses after MMU translation.
- Virtual addresses are addresses before MMU translation.
KQEMU has a physical page table which is used to associate a RAM address or a device I/O address range to a given physical page. It also tells if a given RAM address is visible as read-only memory. The same RAM address can be mapped at several different physical addresses. Only 4 GB of physical address space is supported in the current KQEMU implementation. Hence the bits of order >= 32 of the physical addresses are ignored.
RAM page dirtiness
It is very important for the VM to be able to tell if a given RAM page has been modified. It can be used to optimize VGA refreshes, to flush a dynamic translator cache (when used with QEMU), to handle live migration or to optimize MMU emulation.
In KQEMU, each RAM page has an associated dirty byte in the array init_params.ram_dirty. The dirty byte is set to 0xff if the corresponding RAM page is modified. That way, at most 8 clients can manage a dirty bit in each page.
KQEMU reserves one dirty bit 0x04 for its internal use.
The client must notify KQEMU if some entries of the array init_params.ram_dirty were modified from 0xff to a different value. The addresses of the corresponding RAM pages are stored by the client in the array init_params.ram_pages_to_update.
The client must also notify KQEMU if a RAM page has been modified independently of the init_params.ram_dirty state. It is done with the init_params.modified_ram_pages array.
Symmetrically, KQEMU notifies the client if a RAM page has been modified with the init_params.modified_ram_pages array. The client can use this information for example to invalidate a dynamic translation cache.
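As a rough illustration of the client side of this protocol, the sketch below clears one client-owned dirty bit and reports whether KQEMU has to be told about the change. The bit value DEMO_CLIENT_DIRTY_BIT and the helper name are assumptions made for this example; only the 0xff convention and the reserved 0x04 bit come from the text above.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT        12    /* 4 KB pages, one dirty byte per page */
#define DEMO_KQEMU_DIRTY_BIT   0x04  /* reserved by KQEMU (see above) */
#define DEMO_CLIENT_DIRTY_BIT  0x01  /* hypothetical bit owned by this client */

/*
 * Clear the client's dirty bit for the page containing ram_addr.
 * Returns 1 if the byte changed from 0xff to another value, in which case
 * the caller must store ram_addr in init_params.ram_pages_to_update.
 * Returns 0 if no notification is needed.
 */
static int demo_clear_client_dirty(uint8_t *ram_dirty, uint64_t ram_addr)
{
    uint8_t *d = &ram_dirty[ram_addr >> DEMO_PAGE_SHIFT];
    uint8_t old = *d;

    /* A client must never use the bit reserved by KQEMU. */
    assert((DEMO_CLIENT_DIRTY_BIT & DEMO_KQEMU_DIRTY_BIT) == 0);

    *d = old & (uint8_t)~DEMO_CLIENT_DIRTY_BIT;
    return old == 0xff && *d != 0xff;
}
```

A VGA refresh loop, for instance, could call such a helper for each page it has redrawn and queue the pages for which it returns 1.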
`/dev/kqemu' device
A user client wishing to create a new virtual machine must open the device `/dev/kqemu'. There is no hard limit on the number of virtual machines that can be created and run at the same time, except for the available memory.
KQEMU_GET_VERSION ioctl
It returns the KQEMU API version as an int. The client must use it to determine if it is compatible with the KQEMU driver.
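A minimal sketch of how a client might open the device and check the version is shown below. The request code and the expected version are passed in by the caller because they come from the kqemu.h header of the installed driver, which is not reproduced here. Treating any version mismatch as incompatible is an assumption of this example, as is the helper name.

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Open the KQEMU device and compare the driver API version with the one
 * the client was built against.
 * get_version_req: the KQEMU_GET_VERSION request code (from kqemu.h).
 * expected_version: the API version known to the client (from kqemu.h).
 * Returns an open file descriptor, or -1 if the device cannot be opened,
 * the ioctl fails, or the versions differ (assumed policy).
 */
static int demo_open_kqemu(const char *path, unsigned long get_version_req,
                           int expected_version)
{
    int version = 0;
    int fd = open(path, O_RDWR);

    if (fd < 0)
        return -1;
    if (ioctl(fd, get_version_req, &version) < 0 ||
        version != expected_version) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A client that gets -1 back would typically fall back to running without the accelerator.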
KQEMU_INIT ioctl
Input parameter: struct kqemu_init init_params
It must be called once to initialize the VM. The following structure is used as input parameter:
struct kqemu_init {
    uint8_t *ram_base;
    uint64_t ram_size;
    uint8_t *ram_dirty;
    uint64_t *pages_to_flush;
    uint64_t *ram_pages_to_update;
    uint64_t *modified_ram_pages;
};
The pointers ram_base, ram_dirty, pages_to_flush, ram_pages_to_update and modified_ram_pages must be page aligned and must point to user allocated memory.
On Linux, due to a kernel bug related to memory swapping, the corresponding memory must be mmaped from a file. We plan to remove this restriction in a future implementation.
ram_size must be a multiple of 4K and is the quantity of RAM allocated to the VM.
ram_base is a pointer to the VM RAM. It must contain at least ram_size bytes.
ram_dirty is a pointer to a byte array of length ram_size / 4096. Each byte indicates if the corresponding VM RAM page has been modified (see the section RAM page dirtiness).
pages_to_flush is a pointer to the first element of an array of KQEMU_MAX_PAGES_TO_FLUSH longs. It is used to indicate which TLB must be flushed before executing code in the VM.
ram_pages_to_update is a pointer to the first element of an array of KQEMU_MAX_RAM_PAGES_TO_UPDATE longs. It is used to notify the VM that some RAM pages have been dirtied.
modified_ram_pages is a pointer to the first element of an array of KQEMU_MAX_MODIFIED_RAM_PAGES longs. It is used to notify the VM or the client that RAM pages have been modified.
The value 0 is returned if the ioctl succeeded.
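The constraints on struct kqemu_init can be checked before issuing the ioctl. The following is a sketch only: the helper names and DEMO_PAGE_SIZE are invented for the example, and the check covers only the alignment and size rules stated above, not the Linux requirement that the memory be mapped from a file.

```c
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u  /* page size assumed by the rules above */

struct kqemu_init {
    uint8_t *ram_base;
    uint64_t ram_size;
    uint8_t *ram_dirty;
    uint64_t *pages_to_flush;
    uint64_t *ram_pages_to_update;
    uint64_t *modified_ram_pages;
};

/* Non-null and aligned on a 4 KB boundary. */
static int demo_is_page_aligned(const void *p)
{
    return p != 0 && ((uintptr_t)p % DEMO_PAGE_SIZE) == 0;
}

/* Number of dirty bytes needed: one per 4 KB page of VM RAM. */
static uint64_t demo_ram_dirty_len(uint64_t ram_size)
{
    return ram_size / DEMO_PAGE_SIZE;
}

/* Returns 1 if the documented constraints hold, 0 otherwise. */
static int demo_init_params_ok(const struct kqemu_init *p)
{
    return p->ram_size % DEMO_PAGE_SIZE == 0 &&
           demo_is_page_aligned(p->ram_base) &&
           demo_is_page_aligned(p->ram_dirty) &&
           demo_is_page_aligned(p->pages_to_flush) &&
           demo_is_page_aligned(p->ram_pages_to_update) &&
           demo_is_page_aligned(p->modified_ram_pages);
}
```

Running such a check first makes a misaligned buffer show up as a clear client-side error rather than as a failed ioctl.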
KQEMU_SET_PHYS_MEM ioctl
The following structure is used as input parameter:
struct kqemu_phys_mem { uint64_t phys_addr; uint64_t size; uint64_t ram_addr; uint32_t io_index; uint32_t padding1; };
The ioctl modifies the internal KQEMU physical to RAM mappings. After the ioctl is executed, the physical address range [phys_addr, phys_addr + size) is mapped to the RAM addresses [ram_addr, ram_addr + size) if io_index is KQEMU_IO_MEM_RAM or KQEMU_IO_MEM_ROM. If KQEMU_IO_MEM_ROM is used, the writes to the RAM are ignored.
When io_index is KQEMU_IO_MEM_UNASSIGNED, it means the physical memory range corresponds to a device I/O region. When a memory access is done to it, KQEMU_EXEC returns with cpu_state.retval set to KQEMU_RET_SOFTMMU.
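To make the mapping semantics concrete, here is a small user space model of a physical to RAM table. It is not the driver's implementation: the DEMO_IO_MEM_* values are placeholders (the real constants are in kqemu.h), and the rules that later mappings override earlier ones and that unmapped addresses behave as unassigned are assumptions of the model.

```c
#include <stddef.h>
#include <stdint.h>

/* Placeholder values; the real KQEMU_IO_MEM_* constants are in kqemu.h. */
#define DEMO_IO_MEM_RAM        0u
#define DEMO_IO_MEM_ROM        1u
#define DEMO_IO_MEM_UNASSIGNED 2u

struct kqemu_phys_mem {
    uint64_t phys_addr;
    uint64_t size;
    uint64_t ram_addr;
    uint32_t io_index;
    uint32_t padding1;
};

#define DEMO_MAX_REGIONS 16

struct demo_phys_map {
    struct kqemu_phys_mem regions[DEMO_MAX_REGIONS];
    size_t count;
};

/* Record a mapping. Returns 0 on success, -1 if the table is full. */
static int demo_set_phys_mem(struct demo_phys_map *m,
                             const struct kqemu_phys_mem *r)
{
    if (m->count == DEMO_MAX_REGIONS)
        return -1;
    m->regions[m->count++] = *r;
    return 0;
}

/*
 * Look up a physical address. Returns the io_index of the matching region
 * and, for RAM and ROM, stores the RAM address in *ram_addr. The most
 * recently added region wins (model assumption).
 */
static uint32_t demo_lookup(const struct demo_phys_map *m, uint64_t phys,
                            uint64_t *ram_addr)
{
    phys &= 0xffffffffu; /* bits of order >= 32 are ignored */
    for (size_t i = m->count; i-- > 0; ) {
        const struct kqemu_phys_mem *r = &m->regions[i];
        if (phys >= r->phys_addr && phys - r->phys_addr < r->size) {
            if (r->io_index != DEMO_IO_MEM_UNASSIGNED)
                *ram_addr = r->ram_addr + (phys - r->phys_addr);
            return r->io_index;
        }
    }
    return DEMO_IO_MEM_UNASSIGNED;
}
```

In this model an access that resolves to DEMO_IO_MEM_UNASSIGNED corresponds to the case where KQEMU_EXEC returns KQEMU_RET_SOFTMMU and the client emulates the device access itself.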
KQEMU_MODIFY_RAM_PAGE ioctl
Input parameter: int nb_pages
Notify the VM that nb_pages RAM pages were modified. The corresponding RAM page addresses are written by the client in the init_params.modified_ram_pages array given with the KQEMU_INIT ioctl.
Note: This ioctl currently does nothing, but clients must use it for later compatibility.
KQEMU_EXEC ioctl
Input/Output parameter: struct kqemu_cpu_state cpu_state
Structure definitions:
struct kqemu_segment_cache {
    uint16_t selector;
    uint16_t padding1;
    uint32_t flags;
    uint64_t base;
    uint32_t limit;
    uint32_t padding2;
};

struct kqemu_cpu_state {
    uint64_t regs[16];
    uint64_t eip;
    uint64_t eflags;
    struct kqemu_segment_cache segs[6]; /* selector values */
    struct kqemu_segment_cache ldt;
    struct kqemu_segment_cache tr;
    struct kqemu_segment_cache gdt; /* only base and limit are used */
    struct kqemu_segment_cache idt; /* only base and limit are used */
    uint64_t cr0;
    uint64_t cr2;
    uint64_t cr3;
    uint64_t cr4;
    uint64_t a20_mask;
    /* sysenter registers */
    uint64_t sysenter_cs;
    uint64_t sysenter_esp;
    uint64_t sysenter_eip;
    uint64_t efer;
    uint64_t star;
    uint64_t lstar;
    uint64_t cstar;
    uint64_t fmask;
    uint64_t kernelgsbase;
    uint64_t tsc_offset;
    uint64_t dr0;
    uint64_t dr1;
    uint64_t dr2;
    uint64_t dr3;
    uint64_t dr6;
    uint64_t dr7;
    uint8_t cpl;
    uint8_t user_only;
    uint16_t padding1;
    uint32_t error_code; /* error_code when exiting with an exception */
    uint64_t next_eip; /* next eip value when exiting with an interrupt */
    uint32_t nb_pages_to_flush;
    int32_t retval;
    uint32_t nb_ram_pages_to_update;
    uint32_t nb_modified_ram_pages;
};
Execute x86 instructions in the VM context. The full x86 CPU state is defined in this structure. It contains in particular the value of the 8 (or 16 for x86_64) general purpose registers, the contents of the segment caches, the RIP and EFLAGS values, etc...
If cpu_state.user_only is 1, a user only emulation is done. cpu_state.cpl must be 3 in that case.
KQEMU_EXEC does the following:
- Update the internal dirty state of the cpu_state.nb_ram_pages_to_update RAM pages from the array init_params.ram_pages_to_update. If cpu_state.nb_ram_pages_to_update has the value KQEMU_RAM_PAGES_UPDATE_ALL, it means that all the RAM pages may have been dirtied. The array init_params.ram_pages_to_update is ignored in that case.
- Update the internal KQEMU state by taking into account that the cpu_state.nb_modified_ram_pages RAM pages from the array init_params.modified_ram_pages were modified by the client.
- Flush the virtual CPU TLBs corresponding to the virtual addresses from the array init_params.pages_to_flush of length cpu_state.nb_pages_to_flush. If cpu_state.nb_pages_to_flush is KQEMU_FLUSH_ALL, all the TLBs are flushed. The array init_params.pages_to_flush is ignored in that case.
- Load the virtual CPU state from cpu_state.
- Execute some code in the VM context.
- Save the virtual CPU state into cpu_state.
- Indicate the reason for which the execution was stopped in cpu_state.retval.
- Update cpu_state.nb_pages_to_flush and init_params.pages_to_flush to notify the client that some virtual CPU TLBs were flushed. The client can use this notification to synchronize its own virtual TLBs with KQEMU.
- Set cpu_state.nb_ram_pages_to_update to 1 if some RAM dirty bytes were transitioned from dirty (0xff) to a non dirty value. Otherwise, cpu_state.nb_ram_pages_to_update is set to 0.
- Update cpu_state.nb_modified_ram_pages and init_params.modified_ram_pages to notify the client that some RAM pages were modified.
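On the input side, the client fills the shared arrays and the matching counters before each KQEMU_EXEC. The sketch below queues one virtual address for TLB flushing. The DEMO_* limits are placeholders for KQEMU_MAX_PAGES_TO_FLUSH and KQEMU_FLUSH_ALL from kqemu.h, and falling back to a full flush when the array is full is an assumed policy, not something the text above prescribes.

```c
#include <stdint.h>

/* Placeholder values; the real limits are defined in kqemu.h. */
#define DEMO_MAX_PAGES_TO_FLUSH 512u
#define DEMO_FLUSH_ALL          (DEMO_MAX_PAGES_TO_FLUSH + 1u)

/*
 * Queue one virtual address for TLB flushing before the next KQEMU_EXEC.
 * pages_to_flush stands for init_params.pages_to_flush and
 * *nb_pages_to_flush for cpu_state.nb_pages_to_flush.
 * When the array is full, request a flush of all the TLBs instead.
 */
static void demo_queue_tlb_flush(uint64_t *pages_to_flush,
                                 uint32_t *nb_pages_to_flush,
                                 uint64_t vaddr)
{
    if (*nb_pages_to_flush == DEMO_FLUSH_ALL)
        return; /* everything will be flushed anyway */
    if (*nb_pages_to_flush >= DEMO_MAX_PAGES_TO_FLUSH) {
        *nb_pages_to_flush = DEMO_FLUSH_ALL;
        return;
    }
    pages_to_flush[(*nb_pages_to_flush)++] = vaddr;
}
```

The same pattern would apply to ram_pages_to_update with KQEMU_RAM_PAGES_UPDATE_ALL as the overflow value.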
cpu_state.retval indicates the reason why the execution was stopped:
KQEMU_RET_EXCEPTION | n
The virtual CPU raised an exception and KQEMU cannot handle it. The exception number n is stored in the 8 low order bits. The field cpu_state.error_code contains the exception error code if it is needed. It should be noted that in user only emulation, KQEMU handles no exceptions by itself.
KQEMU_RET_INT | n
(user only emulation) The virtual CPU generated a software interrupt (INT instruction for example). The interrupt number n is stored in the 8 low order bits. The field cpu_state.next_eip contains the value of RIP after the instruction raising the interrupt. cpu_state.eip contains the value of RIP at the instruction raising the interrupt.
KQEMU_RET_SOFTMMU
The virtual CPU could not handle the current instruction. This is not a fatal error. Usually the client just needs to interpret it. It can happen because of the following reasons:
- memory access to an unassigned address or unknown device type;
- an instruction cannot be accurately executed by KQEMU (e.g. SYSENTER, HLT, ...);
- more than KQEMU_MAX_MODIFIED_RAM_PAGES were modified;
- some unsupported bits were modified in CR0 or CR4;
- GDT.base or LDT.base are not a multiple of 8;
- the GDT or LDT tables were modified while CPL = 3;
- EFLAGS.VM was set.
KQEMU_RET_INTR
A signal from the OS interrupted KQEMU.
KQEMU_RET_SYSCALL
(user only emulation) The SYSCALL instruction was executed. The field cpu_state.next_eip contains the value of RIP after the instruction. cpu_state.eip contains the RIP of the instruction.
KQEMU_RET_ABORT
An unrecoverable error was detected. This is usually due to a bug in KQEMU, so it should never happen.
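A client typically dispatches on cpu_state.retval after each KQEMU_EXEC. The sketch below shows one way to decode it. The numeric DEMO_RET_* values are assumptions made so that the example compiles on its own; a real client must use the KQEMU_RET_* constants from kqemu.h. The action names are also invented for the example.

```c
#include <stdint.h>

/* Assumed encodings for illustration only; use the values from kqemu.h. */
#define DEMO_RET_ABORT     (-1)
#define DEMO_RET_EXCEPTION 0x0000 /* low 8 bits: exception number */
#define DEMO_RET_INT       0x0100 /* low 8 bits: interrupt number */
#define DEMO_RET_SOFTMMU   0x0200
#define DEMO_RET_INTR      0x0201
#define DEMO_RET_SYSCALL   0x0300

enum demo_action {
    DEMO_ACT_EXCEPTION, /* deliver exception *vector to the guest */
    DEMO_ACT_INTERRUPT, /* handle software interrupt *vector */
    DEMO_ACT_SOFTMMU,   /* interpret the current instruction in the client */
    DEMO_ACT_SIGNAL,    /* a host signal interrupted KQEMU */
    DEMO_ACT_SYSCALL,   /* handle the SYSCALL instruction */
    DEMO_ACT_ABORT      /* unrecoverable error */
};

/* Decode retval; *vector is set to n for exceptions and interrupts, else -1. */
static enum demo_action demo_decode_retval(int32_t retval, int *vector)
{
    *vector = -1;
    if (retval == DEMO_RET_ABORT)
        return DEMO_ACT_ABORT;

    /* Codes that carry a number n in the 8 low order bits. */
    switch (retval & ~0xff) {
    case DEMO_RET_EXCEPTION:
        *vector = retval & 0xff;
        return DEMO_ACT_EXCEPTION;
    case DEMO_RET_INT:
        *vector = retval & 0xff;
        return DEMO_ACT_INTERRUPT;
    }

    /* Codes that must match exactly. */
    switch (retval) {
    case DEMO_RET_SOFTMMU: return DEMO_ACT_SOFTMMU;
    case DEMO_RET_INTR:    return DEMO_ACT_SIGNAL;
    case DEMO_RET_SYSCALL: return DEMO_ACT_SYSCALL;
    }
    return DEMO_ACT_ABORT; /* unknown code: treat as fatal (assumption) */
}
```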
KQEMU inner working and limitations
Inner working
The main priorities when implementing KQEMU were simplicity and security. Unlike other virtualization systems, it does not do any dynamic translation or code patching.
- KQEMU always executes the target code at CPL = 3 on the host processor. It means that KQEMU can use the page protections to ensure that the VM cannot modify the host OS nor the KQEMU monitor. Moreover, it means that KQEMU does not need to modify the segment limits to ensure memory protection. Another advantage is that this method works with 64 bit code too.
- KQEMU maintains a shadow page table simulating the TLBs of the virtual CPU. The shadow page table persists between calls to KQEMU_EXEC.
- When the target CPL is 3, the target GDT and LDT are copied to the host GDT and LDT so that the LAR and LSL instructions return a meaningful value. This is important for 16 bit code.
- When the target CPL is different from 3, the host GDT and LDT are cleared so that any segment loading causes a General Protection Fault. That way, KQEMU can intercept every segment loading.
- All the code running with EFLAGS.IF = 0 is interpreted so that EFLAGS.IF can be accurately reset in the VM. Fortunately, modern OSes tend to execute very little code with interrupts disabled.
- KQEMU maintains dirty bits for every RAM page so that modified RAM pages can be tracked. It is useful to know if the GDT and LDT are modified in user mode, and will be useful later to optimize shadow page table switching. It is also useful to maintain the coherency of the user space QEMU translation cache.
General limitations
Note 1: KQEMU does not currently use the hardware virtualization features of newer x86 CPUs. We expect that the limitations would be different in that case.
Note 2: KQEMU supports both x86 and x86_64 CPUs.
Before entering the VM, the following conditions must be satisfied:
- CR0.PE = 1 (protected mode must be enabled)
- CR0.MP = 1 (native math support)
- CR0.WP = 1 (write protection for user pages)
- EFLAGS.VM = 0 (no VM86 support)
- At least 8 consecutive GDT descriptors must be available (currently at a fixed location in the GDT).
- At least 32 MB of virtual address space must be free (currently at a fixed location).
- All the pages containing the LDT and GDT must be RAM pages.
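The register conditions in this list can be tested by the client before it calls KQEMU_EXEC. The sketch below covers only the CR0 and EFLAGS bits, assuming they apply to the virtual CPU state passed in cpu_state; the GDT, virtual address space and RAM page conditions are not checked. The bit positions are the architectural x86 ones, and the macro and function names are local to the example.

```c
#include <stdint.h>

/* Architectural x86 bit positions. */
#define X86_CR0_PE     (1ull << 0)   /* protected mode enable */
#define X86_CR0_MP     (1ull << 1)   /* monitor coprocessor */
#define X86_CR0_WP     (1ull << 16)  /* write protect */
#define X86_EFLAGS_VM  (1ull << 17)  /* virtual 8086 mode */

/*
 * Returns 1 if cr0 and eflags satisfy the register conditions listed
 * above (PE, MP and WP set, VM clear), 0 otherwise.
 */
static int demo_guest_state_supported(uint64_t cr0, uint64_t eflags)
{
    const uint64_t required = X86_CR0_PE | X86_CR0_MP | X86_CR0_WP;

    return (cr0 & required) == required &&
           (eflags & X86_EFLAGS_VM) == 0;
}
```

When the check fails, a client such as QEMU would keep executing the guest with its own emulator until the state becomes acceptable.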
If EFLAGS.IF is set, the following assumptions are made on the executing code:
- If EFLAGS.IOPL = 3, EFLAGS.IOPL = 0 is returned in EFLAGS.
- POPF cannot be used to clear EFLAGS.IF
- RDTSC returns host cycles (could be improved if needed).
- The values returned by SGDT, SIDT, SLDT are invalid.
- Reading CS.rpl and SS.rpl always returns 3 regardless of the CPL.
- In 64 bit mode with CPL != 3, reading SS.sel does not give 0 if the OS stored 0 in it.
- LAR, LSL, VERR, VERW return invalid results if CPL != 3.
- The CS and SS segment cache must be consistent with the descriptor tables.
- The DS, ES, FS, and GS segment cache must be consistent with the descriptor tables for CPL = 3.
- Some rarely used instructions trap to the user space client (performance issue).
If EFLAGS.IF is reset, the code is interpreted, so the VM code can be accurately executed. Some instructions trap to the user space emulator because the interpreter does not handle them. A limitation of the interpreter is that currently segment limits are not always tested.
Security
The VM code is always run with CPL = 3 on the host, so the VM code has no more privilege than regular user code.
The MMU is used to protect the memory used by the KQEMU monitor. That way, no segment limit patching is necessary. Moreover, the guest OS is free to use any virtual address, in particular the ones near the start or the end of the virtual address space. The price to pay is that CR3 must be modified at every emulated system call because different page tables are needed for user and kernel modes.
Development Ideas
- Instead of interpreting the code when IF=0, compile it dynamically. The dynamic compiler itself can be implemented in user space, so the kernel module would be simplified.
- Use APIs closer to KVM.
- Optimization of the page table shadowing. A shadow page table cache could be implemented by tracking the modification of the guest page tables. The exact performance gains are difficult to estimate because the tracking itself would introduce some performance loss.
- Support of guest SMP. There is no particular problem except when a RAM page must be unlocked because the host has not enough memory. This particular case needs specific Inter Processor Interrupts (IPI).
- Dynamic relocation of the monitor code so that a 32 MB hole in the guest address space is found automatically without making assumptions on the guest OS.