The QEMU Accelerator (KQEMU) is a driver that allows a user application to run x86 code in a Virtual Machine (VM). The code can be either user or kernel code, in 64-, 32- or 16-bit protected mode. KQEMU is very similar in essence to the Linux VM86 system call, but it adds some new concepts to improve memory handling.
KQEMU has been ported to many host OSes (currently Linux, Windows, FreeBSD and Solaris). It can execute code from many guest OSes (e.g. Linux, Windows 2000/XP) even if the host CPU does not support hardware virtualization.
In this document, we assume that the reader has a good knowledge of the x86 processor and of the problems associated with the virtualization of x86 code.
We describe version 1.3.0 of the Linux implementation. The implementations on other OSes use the same calls, so they can be understood by reading the Linux API specification.
KQEMU manipulates three kinds of addresses:
KQEMU has a physical page table which is used to associate a RAM address or a device I/O address range with a given physical page. It also tells whether a given RAM address is visible as read-only memory. The same RAM address can be mapped at several different physical addresses. Only 4 GB of physical address space is supported in the current KQEMU implementation. Hence bits 32 and above of the physical addresses are ignored.
It is very important for the VM to be able to tell whether a given RAM page has been modified. This information can be used to optimize VGA refreshes, to flush a dynamic translator cache (when used with QEMU), to handle live migration or to optimize MMU emulation.
In KQEMU, each RAM page has an associated dirty byte in the array init_params.ram_dirty. The dirty byte is set to 0xff if the corresponding RAM page is modified. That way, at most 8 clients can manage a dirty bit in each page.
KQEMU reserves one dirty bit, 0x04, for its internal use.
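The dirty-byte scheme can be illustrated with a few helper functions. This is only a sketch: the helper names, the macro names and the VGA bit value are hypothetical, and refusing to clear the 0x04 bit is a cautious assumption of the sketch, not a rule stated by KQEMU. Only the 0xff value, the reserved 0x04 bit and the 4 KB page size come from this document.

```c
#include <assert.h>
#include <stdint.h>

#define VM_PAGE_BITS        12    /* 4 KB pages, one dirty byte per page */
#define EX_KQEMU_DIRTY_FLAG 0x04  /* bit reserved by KQEMU; macro name is hypothetical */
#define EX_VGA_DIRTY_FLAG   0x01  /* hypothetical bit owned by one client */

/* Non-zero if the page containing ram_addr is dirty for the given client bit. */
int page_is_dirty(const uint8_t *ram_dirty, uint64_t ram_addr, uint8_t flag)
{
    return (ram_dirty[ram_addr >> VM_PAGE_BITS] & flag) != 0;
}

/* A modification sets the whole byte, so every client sees the page as dirty. */
void mark_page_modified(uint8_t *ram_dirty, uint64_t ram_addr)
{
    ram_dirty[ram_addr >> VM_PAGE_BITS] = 0xff;
}

/* A client acknowledges the modification by clearing only its own bit. */
void clear_dirty_flag(uint8_t *ram_dirty, uint64_t ram_addr, uint8_t flag)
{
    /* Assumption of this sketch: clients leave the KQEMU bit alone. */
    assert((flag & EX_KQEMU_DIRTY_FLAG) == 0);
    ram_dirty[ram_addr >> VM_PAGE_BITS] &= (uint8_t)~flag;
}
```

Because each client clears only its own bit, a page that the VGA code has already redrawn can still be reported as modified to another client.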
The client must notify KQEMU if some entries of the array init_params.ram_dirty were modified from 0xff to a different value. The addresses of the corresponding RAM pages are stored by the client in the array init_params.ram_pages_to_update.
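A minimal sketch of this bookkeeping follows. It assumes that the client stores page-aligned RAM addresses and keeps its own entry counter; the cpu_state structure has a field named nb_ram_pages_to_update that presumably carries this count, but that link is an assumption here. The helper name, the queue structure and the capacity value are hypothetical; the real capacity is KQEMU_MAX_RAM_PAGES_TO_UPDATE from the KQEMU header.

```c
#include <assert.h>
#include <stdint.h>

#define VM_PAGE_BITS 12
/* Placeholder for KQEMU_MAX_RAM_PAGES_TO_UPDATE; the real value may differ. */
#define EX_MAX_RAM_PAGES_TO_UPDATE 512u

struct dirty_update_queue {
    uint64_t *ram_pages_to_update;    /* array given in init_params */
    uint32_t  nb_ram_pages_to_update; /* number of valid entries */
};

/* Clear `flag` in the dirty byte of the page containing ram_addr.
 * If the byte leaves the 0xff state, record the page address so that
 * KQEMU can be told about it.
 * Returns 0 on success, -1 if the queue is full (byte left untouched). */
int clear_dirty_and_queue(uint8_t *ram_dirty, struct dirty_update_queue *q,
                          uint64_t ram_addr, uint8_t flag)
{
    uint64_t page = ram_addr >> VM_PAGE_BITS;

    assert(q->nb_ram_pages_to_update <= EX_MAX_RAM_PAGES_TO_UPDATE);
    if ((ram_dirty[page] & flag) == 0)
        return 0;                     /* bit already clear, nothing to do */
    if (ram_dirty[page] == 0xff) {
        if (q->nb_ram_pages_to_update == EX_MAX_RAM_PAGES_TO_UPDATE)
            return -1;                /* caller must drain the queue first */
        q->ram_pages_to_update[q->nb_ram_pages_to_update++] =
            page << VM_PAGE_BITS;
    }
    ram_dirty[page] &= (uint8_t)~flag;
    return 0;
}
```

Only the transition away from 0xff is queued: a second client clearing its bit on the same page does not add a new entry.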
The client must also notify KQEMU if a RAM page has been modified independently of the init_params.ram_dirty state. This is done with the init_params.modified_ram_pages array.
Symmetrically, KQEMU uses the init_params.modified_ram_pages array to notify the client that a RAM page has been modified. The client can use this information, for example, to invalidate a dynamic translation cache.
A user client wishing to create a new virtual machine must open the device `/dev/kqemu'. There is no hard limit on the number of virtual machines that can be created and run at the same time, except for the available memory.
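Opening the device can be sketched with the standard open() call. The helper name is hypothetical, and opening the device read-write is an assumption of this sketch; the path is passed as a parameter so that the failure path is easy to exercise.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Open the KQEMU device node.
 * Returns a file descriptor, or -1 if the device cannot be opened
 * (for example because the driver is not loaded). */
int kqemu_open_device(const char *path)
{
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDWR);   /* O_RDWR is an assumption of this sketch */
    if (fd < 0)
        perror(path);          /* report why the open failed */
    return fd;
}
```

A client would call kqemu_open_device("/dev/kqemu") once per virtual machine and fall back to pure software emulation when -1 is returned.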
It returns the KQEMU API version as an int. The client must use it to determine if it is compatible with the KQEMU driver.
Input parameter: struct kqemu_init init_params
It must be called once to initialize the VM. The following structure is used as input parameter:
struct kqemu_init {
    uint8_t *ram_base;
    uint64_t ram_size;
    uint8_t *ram_dirty;
    uint64_t *pages_to_flush;
    uint64_t *ram_pages_to_update;
    uint64_t *modified_ram_pages;
};
The pointers ram_base, ram_dirty, pages_to_flush, ram_pages_to_update and modified_ram_pages must be page aligned and must point to user allocated memory.
On Linux, due to a kernel bug related to memory swapping, the corresponding memory must be mmapped from a file. We plan to remove this restriction in a future implementation.
ram_size must be a multiple of 4K and is the quantity of RAM allocated to the VM.
ram_base is a pointer to the VM RAM. It must contain at least ram_size bytes.
ram_dirty is a pointer to a byte array of length ram_size/4096. Each byte indicates if the corresponding VM RAM page has been modified (see section 2.2 RAM page dirtiness).
pages_to_flush is a pointer to the first element of an array of KQEMU_MAX_PAGES_TO_FLUSH entries. It is used to indicate which TLB entries must be flushed before executing code in the VM.
ram_pages_to_update is a pointer to the first element of an array of KQEMU_MAX_RAM_PAGES_TO_UPDATE entries. It is used to notify the VM that some RAM pages have been dirtied.
modified_ram_pages is a pointer to the first element of an array of KQEMU_MAX_MODIFIED_RAM_PAGES entries. It is used to notify the VM or the client that RAM pages have been modified.
The value 0 is returned if the ioctl succeeded.
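The constraints on the kqemu_init fields can be checked before issuing the ioctl. The sketch below maps memory from an unlinked temporary file, as the Linux restriction above requires, and validates alignment and size. The helper names, the /tmp location and the choice of MAP_SHARED are assumptions of this sketch; the structure is copied from the definition above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define VM_PAGE_SIZE 4096u

struct kqemu_init {
    uint8_t *ram_base;
    uint64_t ram_size;
    uint8_t *ram_dirty;
    uint64_t *pages_to_flush;
    uint64_t *ram_pages_to_update;
    uint64_t *modified_ram_pages;
};

/* Map `size` bytes backed by an unlinked temporary file.
 * mmap() returns page-aligned addresses. Returns NULL on failure. */
void *map_file_backed(size_t size)
{
    char path[] = "/tmp/kqemu-ram-XXXXXX";  /* hypothetical location */
    int fd;
    void *p;

    assert(size > 0);
    fd = mkstemp(path);
    if (fd < 0)
        return NULL;
    unlink(path);                 /* the file vanishes once it is unmapped */
    if (ftruncate(fd, (off_t)size) < 0) {
        close(fd);
        return NULL;
    }
    p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                    /* the mapping stays valid after close */
    return p == MAP_FAILED ? NULL : p;
}

int is_page_aligned(const void *p)
{
    return p != NULL && ((uintptr_t)p % VM_PAGE_SIZE) == 0;
}

/* Number of bytes needed for ram_dirty: one byte per 4 KB RAM page. */
uint64_t ram_dirty_len(uint64_t ram_size)
{
    return ram_size / VM_PAGE_SIZE;
}

/* Non-zero if the structure satisfies the documented constraints. */
int init_params_valid(const struct kqemu_init *ip)
{
    return ip->ram_size % VM_PAGE_SIZE == 0
        && is_page_aligned(ip->ram_base)
        && is_page_aligned(ip->ram_dirty)
        && is_page_aligned(ip->pages_to_flush)
        && is_page_aligned(ip->ram_pages_to_update)
        && is_page_aligned(ip->modified_ram_pages);
}
```

Each of the five pointers would be obtained from its own map_file_backed() call, sized from ram_size, ram_dirty_len(ram_size) or the corresponding KQEMU_MAX_* constant.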
The following structure is used as input parameter:
struct kqemu_phys_mem {
    uint64_t phys_addr;
    uint64_t size;
    uint64_t ram_addr;
    uint32_t io_index;
    uint32_t padding1;
};
The ioctl modifies the internal KQEMU physical-to-RAM mappings. After the ioctl is executed, the physical address range [phys_addr; phys_addr + size[ is mapped to the RAM addresses [ram_addr; ram_addr + size[ if io_index is KQEMU_IO_MEM_RAM or KQEMU_IO_MEM_ROM. If KQEMU_IO_MEM_ROM is used, writes to the RAM are ignored.
When io_index is KQEMU_IO_MEM_UNASSIGNED, it means the physical memory range corresponds to a device I/O region. When a memory access is done to it, KQEMU_EXEC returns with cpu_state.retval set to KQEMU_RET_SOFTMMU.
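The effect of one mapping can be modelled in user space as follows. This sketch only mirrors the address arithmetic described above; it does not call the driver. The EX_IO_MEM_* values are placeholders (the real KQEMU_IO_MEM_* constants come from the KQEMU header and may differ), and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Placeholder values standing in for the KQEMU_IO_MEM_* constants. */
#define EX_IO_MEM_RAM        0u
#define EX_IO_MEM_ROM        1u
#define EX_IO_MEM_UNASSIGNED 2u

struct kqemu_phys_mem {
    uint64_t phys_addr;
    uint64_t size;
    uint64_t ram_addr;
    uint32_t io_index;
    uint32_t padding1;
};

/* Build a descriptor with the padding field cleared. */
struct kqemu_phys_mem make_phys_mem(uint64_t phys_addr, uint64_t size,
                                    uint64_t ram_addr, uint32_t io_index)
{
    struct kqemu_phys_mem m;

    memset(&m, 0, sizeof(m));
    m.phys_addr = phys_addr;
    m.size = size;
    m.ram_addr = ram_addr;
    m.io_index = io_index;
    return m;
}

/* Translate a physical address through one mapping.
 * Returns 1 and stores the RAM address if `phys` lies in the half-open
 * range and the range is RAM or ROM; returns 0 otherwise. */
int phys_to_ram(const struct kqemu_phys_mem *m, uint64_t phys, uint64_t *ram)
{
    assert(m != NULL && ram != NULL);
    if (m->io_index != EX_IO_MEM_RAM && m->io_index != EX_IO_MEM_ROM)
        return 0;                        /* device region: no RAM behind it */
    if (phys < m->phys_addr || phys - m->phys_addr >= m->size)
        return 0;                        /* outside the mapped range */
    *ram = m->ram_addr + (phys - m->phys_addr);
    return 1;
}

/* Writes reach the RAM only for plain RAM mappings; ROM writes are ignored. */
int write_allowed(const struct kqemu_phys_mem *m)
{
    return m->io_index == EX_IO_MEM_RAM;
}
```

For instance, a 64 KB ROM at physical address 0xc0000 backed by RAM address 0x200000 translates physical address 0xc1234 to RAM address 0x201234.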
Input parameter: int nb_pages
Notify the VM that nb_pages RAM pages were modified. The corresponding RAM page addresses are written by the client in the init_params.modified_ram_pages array given with the KQEMU_INIT ioctl.
Note: This ioctl currently does nothing, but clients must use it for later compatibility.
Input/Output parameter: struct kqemu_cpu_state cpu_state
Structure definitions:
struct kqemu_segment_cache {
    uint16_t selector;
    uint16_t padding1;
    uint32_t flags;
    uint64_t base;
    uint32_t limit;
    uint32_t padding2;
};
struct kqemu_cpu_state {
    uint64_t regs[16];
    uint64_t eip;
    uint64_t eflags;
    struct kqemu_segment_cache segs[6]; /* selector values */
    struct kqemu_segment_cache ldt;
    struct kqemu_segment_cache tr;
    struct kqemu_segment_cache gdt; /* only base and limit are used */
    struct kqemu_segment_cache idt; /* only base and limit are used */
    uint64_t cr0;
    uint64_t cr2;
    uint64_t cr3;
    uint64_t cr4;
    uint64_t a20_mask;
    /* sysenter registers */
    uint64_t sysenter_cs;
    uint64_t sysenter_esp;
    uint64_t sysenter_eip;
    uint64_t efer;
    uint64_t star;
    uint64_t lstar;
    uint64_t cstar;
    uint64_t fmask;
    uint64_t kernelgsbase;
    uint64_t tsc_offset;
    uint64_t dr0;
    uint64_t dr1;
    uint64_t dr2;
    uint64_t dr3;
    uint64_t dr6;
    uint64_t dr7;
    uint8_t cpl;
    uint8_t user_only;
    uint16_t padding1;
    uint32_t error_code; /* error_code when exiting with an exception */
    uint64_t next_eip; /* next eip value when exiting with an interrupt */
    uint32_t nb_pages_to_flush;
    int32_t retval;
    uint32_t nb_ram_pages_to_update;
    uint32_t nb_modified_ram_pages;
};
Execute x86 instructions in the VM context. The full x86 CPU state is defined in this structure. It contains in particular the values of the 8 (or 16 for x86_64) general purpose registers, the contents of the segment caches, the RIP and EFLAGS values, etc.
If cpu_state.user_only is 1, a user only emulation is done. cpu_state.cpl must be 3 in that case.
KQEMU_EXEC does the following:
cpu_state.retval indicates the reason why the execution was stopped:
KQEMU_RET_EXCEPTION | n
The virtual CPU raised an exception and KQEMU cannot handle it. The exception number n is stored in the 8 low order bits. The field cpu_state.error_code contains the exception error code if it is needed. It should be noted that in user only emulation, KQEMU handles no exceptions by itself.
KQEMU_RET_INT | n
(user only emulation) The virtual CPU generated a software interrupt (INT instruction for example). The interrupt number n is stored in the 8 low order bits. The field cpu_state.next_eip contains the value of RIP after the instruction raising the interrupt. cpu_state.eip contains the value of RIP at the instruction raising the interrupt.
KQEMU_RET_SOFTMMU
The virtual CPU could not handle the current instruction. This is not a fatal error. Usually the client just needs to interpret it. It can happen because of the following reasons:
* memory access to an unassigned address or unknown device type;
* an instruction cannot be accurately executed by KQEMU (e.g. SYSENTER, HLT, ...);
* more than KQEMU_MAX_MODIFIED_RAM_PAGES RAM pages were modified;
* some unsupported bits were modified in CR0 or CR4;
* GDT.base or LDT.base is not a multiple of 8;
* the GDT or LDT tables were modified while CPL = 3;
* EFLAGS.VM was set.
KQEMU_RET_INTR
A signal from the OS interrupted KQEMU.
KQEMU_RET_SYSCALL
(user only emulation) The SYSCALL instruction was executed. The field cpu_state.next_eip contains the value of RIP after the instruction. cpu_state.eip contains the RIP of the instruction.
KQEMU_RET_ABORT
An unrecoverable error was detected. This is usually due to a bug in KQEMU, so it should never happen.
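A client typically decodes cpu_state.retval right after KQEMU_EXEC returns. The sketch below shows one way to split the value into a stop reason and the number n carried in the 8 low order bits. The EX_RET_* encodings are placeholders chosen for illustration; the real KQEMU_RET_* values are defined by the KQEMU header and may differ. The enum, structure and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder encodings standing in for the KQEMU_RET_* constants. */
#define EX_RET_EXCEPTION 0x0000
#define EX_RET_INT       0x0100
#define EX_RET_SOFTMMU   0x0200
#define EX_RET_INTR      0x0201
#define EX_RET_SYSCALL   0x0300
#define EX_RET_ABORT     (-1)

enum stop_kind {
    STOP_EXCEPTION,  /* exception KQEMU cannot handle */
    STOP_SOFT_INT,   /* software interrupt, user only emulation */
    STOP_SOFTMMU,    /* instruction must be interpreted by the client */
    STOP_SIGNAL,     /* interrupted by a host OS signal */
    STOP_SYSCALL,    /* SYSCALL executed, user only emulation */
    STOP_ABORT,      /* unrecoverable error */
    STOP_UNKNOWN
};

struct stop_reason {
    enum stop_kind kind;
    uint8_t vector;  /* exception or interrupt number n, else 0 */
};

struct stop_reason decode_retval(int32_t retval)
{
    struct stop_reason r = { STOP_UNKNOWN, 0 };

    /* The two "| n" classes must leave the 8 low order bits free for n. */
    assert((EX_RET_EXCEPTION & 0xff) == 0 && (EX_RET_INT & 0xff) == 0);

    switch (retval) {            /* codes that carry no number */
    case EX_RET_SOFTMMU: r.kind = STOP_SOFTMMU; return r;
    case EX_RET_INTR:    r.kind = STOP_SIGNAL;  return r;
    case EX_RET_SYSCALL: r.kind = STOP_SYSCALL; return r;
    case EX_RET_ABORT:   r.kind = STOP_ABORT;   return r;
    default: break;
    }
    switch (retval & ~0xff) {    /* codes of the form CLASS | n */
    case EX_RET_EXCEPTION: r.kind = STOP_EXCEPTION; break;
    case EX_RET_INT:       r.kind = STOP_SOFT_INT;  break;
    default: return r;
    }
    r.vector = (uint8_t)(retval & 0xff);
    return r;
}
```

With these placeholder values, decode_retval(EX_RET_EXCEPTION | 14) reports an exception with number 14 (a page fault), and the client would then read cpu_state.error_code.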
The main priorities when implementing KQEMU were simplicity and security. Unlike other virtualization systems, it does not do any dynamic translation or code patching.
Note 1: KQEMU does not currently use the hardware virtualization features of newer x86 CPUs. We expect that the limitations would be different in that case.
Note 2: KQEMU supports both x86 and x86_64 CPUs.
Before entering the VM, the following conditions must be satisfied:
If EFLAGS.IF is set, the following assumptions are made on the executing code:
If EFLAGS.IF is reset, the code is interpreted, so the VM code can be accurately executed. Some instructions trap to the user space emulator because the interpreter does not handle them. A limitation of the interpreter is that currently segment limits are not always tested.
The VM code is always run with CPL = 3 on the host, so the VM code has no more privilege than regular user code.
The MMU is used to protect the memory used by the KQEMU monitor. That way, no segment limit patching is necessary. Moreover, the guest OS is free to use any virtual address, in particular the ones near the start or the end of the virtual address space. The price to pay is that CR3 must be modified at every emulated system call because different page tables are needed for user and kernel modes.