https://wiki.qemu.org/api.php?action=feedcontributions&user=Aglitke&feedformat=atom
QEMU - User contributions [en]
2024-03-28T19:41:47Z
User contributions
MediaWiki 1.39.1
https://wiki.qemu.org/index.php?title=Google_Summer_of_Code_2011&diff=996
Google Summer of Code 2011
2011-03-02T17:13:36Z
<p>Aglitke: </p>
<hr />
<div>= Introduction =<br />
<br />
As we [[Google_Summer_of_Code_2010| did last year]], QEMU is going to apply as a mentoring organization for [http://socghop.appspot.com/ Google Summer of Code 2011]. This page contains our ideas list and some additional information for students and mentors.<br />
<br />
Please note that QEMU, as a GSoC organization, also includes the following projects:<br />
<br />
* The Linux Kernel's [http://www.linux-kvm.org/page/Main_Page KVM] module<br />
* [http://www.libvirt.org Libvirt], the virtualization library (pending OK from libvirt people)<br />
<br />
= Organization =<br />
<br />
For any question, request, or problem regarding QEMU's participation in GSoC 2011, please contact one of the following people.<br />
<br />
* [[User:LuizCapitulino|Luiz Capitulino]]<br />
* [[User:AnthonyLiguori|Anthony Liguori]]<br />
<br />
= Find Us =<br />
<br />
* IRC (devel): #qemu on irc.oftc.net<br />
* IRC (GSoC specific): #qemu-gsoc on irc.oftc.net<br />
* Mailing list: http://lists.nongnu.org/mailman/listinfo/qemu-devel<br />
<br />
= GSoC important pages =<br />
<br />
* [http://www.google-melange.com/document/show/gsoc_program/google/gsoc2011/faqs FAQ]<br />
* [http://www.google-melange.com/document/show/gsoc_program/google/gsoc2011/timeline Program Timeline]<br />
* [http://socghop.appspot.com/document/show/gsoc_program/google/gsoc2011/userguide Melange User Guide]<br />
* [http://code.google.com/p/google-summer-of-code/wiki/AdviceforMentors Advice for Mentors]<br />
<br />
= Information for students =<br />
<br />
We require students to provide (at least) the following information in their applications:<br />
<br />
* Contact information (email, irc nick, phone number)<br />
* A general personal description (skills, past experiences and possible open source contributions)<br />
* Why QEMU and why this project<br />
* A detailed description of the approach the student will take<br />
<br />
'''VERY IMPORTANT:''' Submitting a patch and having it merged by QEMU or KVM increases your chances of being accepted.<br />
<br />
= Projects Ideas =<br />
<br />
This is the listing of suggested project ideas. It might be useful to check last year's [[Google_Summer_of_Code_2010#Projects_Ideas|page]]. Also note that students are free to suggest their own projects.<br />
<br />
== QCOW2 <-> QED image converter ==<br />
<br />
'''Summary:''' Design and implement an in-place disk image converter that safely and efficiently changes between the QCOW2 and QED image formats.<br />
<br />
QEMU supports several disk image formats that make it possible to manage and share virtual machine disk images as files. The well known formats include qcow2 (QEMU) and vmdk (VMware), and the QEMU Enhanced Disk (QED) format has pushed new levels of performance.<br />
<br />
In order for users to go between formats, the ''qemu-img convert'' command reads a disk image in one format and outputs it in another format. This has two limitations:<br />
# Twice the amount of space is required since both the old and the new image are kept around.<br />
# Copying data is slow for large images.<br />
<br />
The aim is to design a safe in-place converter to change from QCOW2 to QED (and vice versa) without copying image data. This will require understanding the QCOW2 and QED image formats and how they organize image data. You will need to carefully design the process so image data is never at risk in the event of a crash during conversion. Finally, you will be responsible for adding tests that defend this feature to the ''qemu-iotests'' suite.<br />
<br />
'''Links:'''<br />
* http://people.gnome.org/~markmc/qcow-image-format.html<br />
* [[Features/QED]]<br />
<br />
'''Please get in touch before applying''' so we can chat about your ideas and get to know each other.<br />
<br />
* Component: QEMU<br />
* Skill level: medium<br />
* Language: C<br />
* Mentor: Stefan Hajnoczi <stefanha@gmail.com>, 'stefanha' on IRC<br />
* Suggested by: Stefan Hajnoczi <stefanha@gmail.com><br />
<br />
== Improved image format compatibility ==<br />
<br />
'''Summary:''' Add support for the latest versions of popular image formats.<br />
<br />
There are a number of disk image formats in common use today. The ''qemu-img'' tool already supports the popular image formats. Some formats have evolved since their original support was added and new files cannot be accessed with ''qemu-img''.<br />
<br />
The aim is to address the lag in image format support by understanding the latest formats and implementing their new layouts in QEMU. This will enable ''qemu-img'', ''qemu-io'', and ''qemu-nbd'' to operate on an even wider range of disk images.<br />
<br />
'''Links:'''<br />
* http://en.wikipedia.org/wiki/VHD_(file_format)<br />
* http://en.wikipedia.org/wiki/VMDK<br />
* http://en.wikipedia.org/wiki/VirtualBox#VirtualBox_and_VDI<br />
<br />
'''Please get in touch before applying''' so we can chat about your ideas and get to know each other.<br />
<br />
* Component: QEMU<br />
* Skill level: medium<br />
* Language: C<br />
* Mentor: Stefan Hajnoczi <stefanha@gmail.com>, Kevin Wolf <kwolf@redhat.com><br />
* Suggested by: Stefan Hajnoczi <stefanha@gmail.com><br />
<br />
== Tracepoint support for the gdbstub ==<br />
<br />
Recent gdb versions allow defining ad-hoc tracepoints that record memory contents or register states whenever the target code hits them. This is supposed to happen non-intrusively, while the target is executing (almost) as normal. QEMU could serve as a nice backend for gdb when it comes to using such dynamic tracepoints for (guest) kernel debugging. In contrast to approaches like [http://code.google.com/p/kgtp kgtp], which runs inside the guest kernel, QEMU is able to perform this in hypervisor context, at most requiring breakpoints to be inserted into guest-visible memory.<br />
<br />
In this project, the QEMU gdbstub shall be extended with support for tracepoints. Architecture specific parts shall at least support x86 guests. Tracepoints shall be usable both in emulation and KVM mode. Extending the KVM kernel services to accelerate tracepoints is not required in this first step. See [http://sourceware.org/gdb/onlinedocs/gdb/Tracepoints.html gdb documentation] and specifically the [http://sourceware.org/gdb/onlinedocs/gdb/Tracepoint-Packets.html gdb remote protocol] for further details.<br />
<br />
* Component: QEMU<br />
* Skill level: medium..high<br />
* Language: C<br />
* Mentor: Jan Kiszka <jan.kiszka@web.de><br />
* Suggested by: Jan Kiszka <jan.kiszka@web.de><br />
<br />
== Adding basic KVM support to MIPS architecture ==<br />
<br />
'''Summary:''' This project intends to add KVM virtualization support for the MIPS architecture.<br />
<br />
KVM supports several major CPU architectures, such as x86, PowerPC, and [http://www.ncl.cs.columbia.edu/publications/ols2010_kvmarm.pdf ARM], but there is currently no MIPS support. MIPS is one of the popular architectures in the embedded world, so adding KVM support for it would be valuable.<br />
<br />
The MIPS architecture does not have hardware virtualization support, so trap-and-emulate is the simplest approach to CPU virtualization. For memory virtualization, the complication is that MIPS's kseg0 segment, in which the kernel code resides, bypasses the MMU. We need either to trap every memory access to this segment or to remap the guest Linux kernel to another segment.<br />
<br />
Adding support for a new architecture to KVM is a big task for GSoC, so this project only aims to add ''basic'' support. The GSoC goal is a working KVM kernel module and a KVM user program that can boot a guest Linux kernel (without a rootfs).<br />
<br />
* Component: QEMU/KVM<br />
* Skill level: medium..high<br />
* Language: C<br />
* Mentor: Aurelien <aurelien@aurel32.net><br />
* Suggested by: yajin <yajin@vm-kernel.org><br />
<br />
== Upstreaming EHCI support ==<br />
<br />
A good foundation for EHCI support exists in an out-of-tree repository, but it has not been proposed for merging yet due to a few open issues. The list below reflects potential sub-tasks, but it is not necessarily up to date with the latest development:<br />
<br />
* testing and stabilizing host pass-through of various devices<br />
* periodic frames support<br />
* isochronous traffic support<br />
* split transactions support<br />
* improving NAK/reload support<br />
* throttle interrupt rate based on OS settings<br />
* code cleanup<br />
<br />
The primary goal of this task is to fix the most annoying issues of the EHCI emulation and prepare the result for upstream merge.<br />
<br />
* Component: QEMU<br />
* Skill level: medium..high<br />
* Language: C<br />
* Mentor: [[User:NataliaPortillo|Natalia Portillo]], any other welcome<br />
* Suggested by: Jan Kiszka <jan.kiszka@web.de><br />
<br />
== Improving USB emulation accuracy ==<br />
<br />
There are still a few rough edges in QEMU's USB support beyond the EHCI topic:<br />
<br />
* improve USB device emulation like mass storage devices, network adapters, etc.<br />
* improved topology configuration by modeling it via qdev<br />
* new device emulations, e.g. webcam (playing video files) or IO-warrior-like device<br />
* ...<br />
<br />
The task consists of identifying open issues and use cases by testing them against various guest systems, then fixing the deficits, and proposing the result in form of patch series for upstream merge.<br />
<br />
* Component: QEMU<br />
* Skill level: medium..high<br />
* Language: C<br />
* Mentor: [[User:NataliaPortillo|Natalia Portillo]], any other welcome<br />
* Suggested by: Jan Kiszka <jan.kiszka@web.de><br />
<br />
== Upstreaming some of the Android emulator bits ==<br />
<br />
The Android Emulator is based on an ancient version of QEMU. To kick off its upstream integration, the existing code shall be analyzed and core elements of the emulated reference platform shall be ported to current QEMU. The goal is to get some Android image booting, bringing it into a usable state so that simple applications can be tested.<br />
<br />
* Skill level: medium..high<br />
* Languages: C<br />
* Mentor: Jan Kiszka <jan.kiszka@web.de><br />
* Suggested by: Jan Kiszka <jan.kiszka@web.de><br />
<br />
== Add Macintosh to 68k system emulation ==<br />
<br />
In order to support Macintosh system emulation, almost every device must be implemented in QEMU (SCSI, CUDA, ADB, Apple framebuffers). How they work can be investigated in the Inside Macintosh documents and in other emulators (MESS, BasiliskII, vMac).<br />
<br />
* Skill level: high<br />
* Languages: C, 68k assembler<br />
* Mentor: [[User:NataliaPortillo|Natalia Portillo]]<br />
* Suggested by: [[User:NataliaPortillo|Natalia Portillo]]<br />
<br />
== Boot Mac OS >= 8.5 on PowerPC system emulation ==<br />
<br />
Most of the Power Macintosh hardware is already emulated; things only need to be cleaned up and OpenBIOS enhanced to support loading the Macintosh Toolbox from the "Mac OS ROM" file present in any Mac OS >= 8.5 system.<br />
<br />
* Skill level: medium<br />
* Languages: C, Forth<br />
* Mentor: [[User:NataliaPortillo|Natalia Portillo]]<br />
* Suggested by: [[User:NataliaPortillo|Natalia Portillo]]<br />
<br />
== Add a S3 Trio or S3 Virge ==<br />
<br />
More x86 guests have native drivers for that card than for Cirrus GD5446. It was also used for a lot of non-x86 machines (like IBM workstations and servers).<br />
<br />
* Skill level: medium<br />
* Languages: C, x86 assembler<br />
* Mentor: [[User:NataliaPortillo|Natalia Portillo]]<br />
* Suggested by: [[User:NataliaPortillo|Natalia Portillo]]<br />
<br />
== Enhance, update and integrate Acorn Archimedes system emulation ==<br />
<br />
ARM system emulation should include Acorn Archimedes system emulation. Work in progress was done against the 0.9.0 tree. Now that RISC OS has been open sourced, things could be easier. Most problems seem to be rarely used opcodes and 26-bit modes.<br />
<br />
* Skill level: medium<br />
* Languages: C, ARM assembler<br />
* Mentor: tbd ([[User:pbrook|Paul Brook]] proposed himself on GSoC 2010)<br />
* Suggested by: [[User:NataliaPortillo|Natalia Portillo]]<br />
<br />
== BeBox system emulation ==<br />
<br />
The BeBox system is just a CHRP compliant dual PowerPC 603 processor machine. Most of the devices are already emulated, only a couple need to be added. Original firmware can be reverse engineered as it is a very simple firmware (not OpenFirmware compliant).<br />
<br />
* Skill level: medium<br />
* Languages: C, maybe PowerPC assembler<br />
* Mentor: [[User:NataliaPortillo|Natalia Portillo]]<br />
* Suggested by: [[User:NataliaPortillo|Natalia Portillo]]<br />
<br />
== NeXT machines system emulation ==<br />
<br />
NeXT machines follow a design similar to that of the 68k Macintosh machines. Almost no documentation is available. The original firmware MUST be used. The MESS emulator project has started a NeXT emulation, but it is still a work in progress, so few ideas can be taken from it.<br />
<br />
* Skill level: high<br />
* Languages: C, 68k assembler<br />
* Mentor: [[User:NataliaPortillo|Natalia Portillo]]<br />
* Suggested by: [[User:NataliaPortillo|Natalia Portillo]]<br />
<br />
== Upstream and clean-up of USB Video Class device emulation ==<br />
<br />
In mid-2010 I implemented a webcam emulation (passthrough) using USB Video Class on the guest and Video4Linux on the host.<br />
The code was posted to the mailing list as an RFC and needs a couple of cleanups before it can be merged upstream.<br />
Once cleaned up, it needs support for multiple resolutions, multiple image formats, and isochronous transfers.<br />
Support for Win32 (using WIA or VFW) and Mac OS X hosts is also desirable.<br />
<br />
* Skill level: medium<br />
* Languages: C<br />
* Mentor: [[User:NataliaPortillo|Natalia Portillo]]<br />
* Suggested by: [[User:NataliaPortillo|Natalia Portillo]]<br />
<br />
== Implementation of FireWire OHCI ==<br />
<br />
Implementing FireWire OHCI emulation will allow us to pass through FireWire devices (mass storage, tape devices, video devices, IPo1394), or to emulate those devices using different host devices.<br />
<br />
FireWire extensively uses DMA and should be as easy to implement as USB protocol, without the issues of the bulk mode of USB protocol.<br />
<br />
* Skill level: high<br />
* Languages: C, 68k assembler<br />
* Mentor: [[User:NataliaPortillo|Natalia Portillo]]<br />
* Suggested by: [[User:NataliaPortillo|Natalia Portillo]]<br />
<br />
== Implementation of USB 3.0 XHCI ==<br />
<br />
USB 3.0 offers better support for virtualization and, unlike EHCI, does not require the presence of previous-generation controllers (EHCI, OHCI, UHCI) to handle previous-generation devices (USB 2.0, USB 1.x).<br />
<br />
This also requires taking into account Gerd's patches for multiple-speed support on existing USB devices.<br />
<br />
* Skill level: high<br />
* Languages: C, 68k assembler<br />
* Mentor: [[User:NataliaPortillo|Natalia Portillo]]<br />
* Suggested by: [[User:NataliaPortillo|Natalia Portillo]]<br />
<br />
== Virtagent Windows guest support ==<br />
<br />
'''Summary:''' Create a virtagent compatible guest agent for Windows operating systems.<br />
<br />
Virtagent is a host/guest communication protocol that is designed to enable easier and more reliable guest management. An RPC channel is created over either a virtio-serial or isa-serial device. Various commands are implemented such as: shutdown, ping and file retrieval. Currently, only Linux guests are supported. In order to make virtagent a more universal management interface, it should be supported on other operating systems (including Windows). This task can be broken down into the following activities:<br />
* virtio-serial support for Windows<br />
* Base Windows client support (Windows service programming / virtagent transport protocol implementation)<br />
* Implement RPC functions (OS Shutdown, file transfers, filesystem freeze/thaw, etc)<br />
* Windows package / installer<br />
<br />
Applicants will need to have experience with Windows system programming.<br />
<br />
* Component: QEMU<br />
* Skill level: medium<br />
* Language: C<br />
* Mentor: Adam Litke <agl@us.ibm.com>, 'aglitke' on IRC<br />
* Suggested by: Adam Litke <agl@us.ibm.com></div>
Aglitke
https://wiki.qemu.org/index.php?title=Features/QED&diff=660
Features/QED
2010-11-22T14:48:24Z
<p>Aglitke: Undo revision 659 by Aglitke (Talk)</p>
<hr />
<div>=Specification=<br />
<br />
The file format looks like this:<br />
<br />
+----------+----------+----------+-----+<br />
| cluster0 | cluster1 | cluster2 | ... |<br />
+----------+----------+----------+-----+<br />
<br />
The first cluster begins with the '''header'''. The header contains information about where regular clusters start; this allows the header to be extensible and store extra information about the image file. A regular cluster may be a '''data cluster''', an '''L2''', or an '''L1 table'''. L1 and L2 tables are composed of one or more contiguous clusters.<br />
<br />
Normally the file size will be a multiple of the cluster size. If the file size is not a multiple, extra information after the last cluster may not be preserved if data is written. Legitimate extra information should use space between the header and the first regular cluster.<br />
<br />
All fields are little-endian.<br />
<br />
==Header==<br />
 Header {<br />
     uint32_t magic; /* QED\0 */<br />
 <br />
     uint32_t cluster_size; /* in bytes */<br />
     uint32_t table_size; /* for L1 and L2 tables, in clusters */<br />
     uint32_t header_size; /* in clusters */<br />
 <br />
     uint64_t features; /* format feature bits */<br />
     uint64_t compat_features; /* compat feature bits */<br />
     uint64_t autoclear_features; /* self-resetting feature bits */<br />
 <br />
     uint64_t l1_table_offset; /* in bytes */<br />
     uint64_t image_size; /* total logical image size, in bytes */<br />
 <br />
     /* if (features & QED_F_BACKING_FILE) */<br />
     uint32_t backing_filename_offset; /* in bytes from start of header */<br />
     uint32_t backing_filename_size; /* in bytes */<br />
 }<br />
<br />
Field descriptions:<br />
* ''cluster_size'' must be a power of 2 in range [2^12, 2^26].<br />
* ''table_size'' must be a power of 2 in range [1, 16].<br />
* ''header_size'' is the number of clusters used by the header and any additional information stored before regular clusters.<br />
* ''features'', ''compat_features'', and ''autoclear_features'' are file format extension bitmaps. They work as follows:<br />
** An image with unknown ''features'' bits enabled must not be opened. File format changes that are not backwards-compatible must use ''features'' bits.<br />
** An image with unknown ''compat_features'' bits enabled can be opened safely. The unknown features are simply ignored and represent backwards-compatible changes to the file format.<br />
** An image with unknown ''autoclear_features'' bits enabled can be opened safely after clearing the unknown bits. This allows for backwards-compatible changes to the file format which degrade gracefully and can be re-enabled by a newer program later.<br />
* ''l1_table_offset'' is the offset of the first byte of the L1 table in the image file and must be a multiple of ''cluster_size''.<br />
* ''image_size'' is the block device size seen by the guest and must be a multiple of 512 bytes.<br />
* ''backing_filename'' is a string in (byte offset, byte size) form. It is not NUL-terminated and has no alignment constraints.<br />
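<br />
The field constraints above can be checked mechanically when an image is opened. The following is a minimal sketch in Python, not QEMU's implementation: it assumes the header is packed with no padding in the order of the listing, that the magic is stored as the bytes Q, E, D, NUL in file order, and the names <code>parse_header</code>, <code>QedHeader</code> and <code>HEADER_FORMAT</code> are hypothetical.<br />

```python
import struct
from collections import namedtuple

# Little-endian, no padding: 4 x uint32, 5 x uint64, 2 x uint32 (assumed packing).
HEADER_FORMAT = "<4I5Q2I"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)
# Assumption: the magic is the byte sequence "QED\0" read as a little-endian uint32.
QED_MAGIC = int.from_bytes(b"QED\0", "little")

QedHeader = namedtuple("QedHeader", [
    "magic", "cluster_size", "table_size", "header_size",
    "features", "compat_features", "autoclear_features",
    "l1_table_offset", "image_size",
    "backing_filename_offset", "backing_filename_size",
])


def is_power_of_two(n):
    return n > 0 and (n & (n - 1)) == 0


def parse_header(buf):
    """Unpack the fixed header fields and enforce the documented constraints."""
    if len(buf) < HEADER_SIZE:
        raise ValueError("buffer too short for header")
    hdr = QedHeader._make(struct.unpack_from(HEADER_FORMAT, buf))
    if hdr.magic != QED_MAGIC:
        raise ValueError("bad magic")
    if not (is_power_of_two(hdr.cluster_size) and 2**12 <= hdr.cluster_size <= 2**26):
        raise ValueError("cluster_size must be a power of 2 in [2^12, 2^26]")
    if not (is_power_of_two(hdr.table_size) and 1 <= hdr.table_size <= 16):
        raise ValueError("table_size must be a power of 2 in [1, 16]")
    if hdr.l1_table_offset % hdr.cluster_size != 0:
        raise ValueError("l1_table_offset must be a multiple of cluster_size")
    if hdr.image_size % 512 != 0:
        raise ValueError("image_size must be a multiple of 512 bytes")
    return hdr
```

A caller would read the first <code>HEADER_SIZE</code> bytes of the image file and pass them to <code>parse_header</code>; feature-bit handling is deliberately left out of this sketch.<br />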
<br />
Feature bits:<br />
* QED_F_BACKING_FILE = 0x01. The image uses a backing file. The backing filename string is given in the ''backing_filename_{offset,size}'' fields and may be an absolute path or relative to the image file.<br />
* QED_F_NEED_CHECK = 0x02. The image needs a consistency check before use.<br />
* QED_F_BACKING_FORMAT_NO_PROBE = 0x04. The backing file is a raw disk image and no file format autodetection should be attempted. This should be used to ensure that raw backing images are never detected as an image format if they happen to contain magic constants.<br />
<br />
There are currently no defined ''compat_features'' or ''autoclear_features'' bits.<br />
<br />
Fields predicated on a feature bit are only used when that feature is set. The fields always take up header space, regardless of whether or not the feature bit is set.<br />
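<br />
The three bitmaps differ only in how unknown bits are treated on open. A minimal sketch of that decision, using the bit values listed above; the function name <code>check_feature_bits</code> and the <code>KNOWN_*</code> masks are hypothetical, and the sketch assumes an implementation that understands exactly the three ''features'' bits defined here:<br />

```python
QED_F_BACKING_FILE = 0x01
QED_F_NEED_CHECK = 0x02
QED_F_BACKING_FORMAT_NO_PROBE = 0x04

# Masks of the bits this (hypothetical) implementation understands.
KNOWN_FEATURES = QED_F_BACKING_FILE | QED_F_NEED_CHECK | QED_F_BACKING_FORMAT_NO_PROBE
KNOWN_AUTOCLEAR_FEATURES = 0  # none are defined at present


def check_feature_bits(features, compat_features, autoclear_features):
    """Return the autoclear_features value to store back in the header.

    Raises ValueError if the image uses an unknown incompatible feature.
    """
    unknown = features & ~KNOWN_FEATURES
    if unknown:
        # Unknown 'features' bits: the image must not be opened.
        raise ValueError(f"unsupported feature bits: {unknown:#x}")
    # Unknown 'compat_features' bits are simply ignored.
    # Unknown 'autoclear_features' bits are cleared before the image is used.
    return autoclear_features & KNOWN_AUTOCLEAR_FEATURES
```

The caller is expected to write the returned value back to the header before modifying the image, so that a newer program can tell the feature was disabled in the meantime.<br />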
<br />
==Tables==<br />
<br />
Tables provide the translation from logical offsets in the block device to cluster offsets in the file.<br />
<br />
#define TABLE_NOFFSETS (table_size * cluster_size / sizeof(uint64_t))<br />
<br />
Table {<br />
uint64_t offsets[TABLE_NOFFSETS];<br />
}<br />
<br />
The tables are organized as follows:<br />
<br />
                      +----------+<br />
                      | L1 table |<br />
                      +----------+<br />
                 ,------'  |  '------.<br />
            +----------+   |    +----------+<br />
            | L2 table |  ...   | L2 table |<br />
            +----------+        +----------+<br />
        ,------'  |  '------.<br />
   +----------+   |    +----------+<br />
   |   Data   |  ...   |   Data   |<br />
   +----------+        +----------+<br />
<br />
A table is made up of one or more contiguous clusters. The table_size header field determines table size for an image file. For example, cluster_size=64 KB and table_size=4 results in 256 KB tables.<br />
<br />
The logical image size must be less than or equal to the maximum possible size of clusters rooted by the L1 table:<br />
header.image_size <= TABLE_NOFFSETS * TABLE_NOFFSETS * header.cluster_size<br />
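<br />
As a worked example of the bound above, the following sketch computes the number of offsets per table and the largest logical image size for a given geometry. The helper names are hypothetical; the arithmetic follows the TABLE_NOFFSETS definition, with <code>sizeof(uint64_t)</code> taken as 8.<br />

```python
def table_noffsets(cluster_size, table_size):
    """Number of 64-bit offsets that fit in one table."""
    return table_size * cluster_size // 8


def max_image_size(cluster_size, table_size):
    """Largest logical image size addressable through one L1 table."""
    n = table_noffsets(cluster_size, table_size)
    return n * n * cluster_size


# cluster_size = 64 KB and table_size = 4 give 256 KB tables.
entries = table_noffsets(64 * 1024, 4)
limit = max_image_size(64 * 1024, 4)
print(entries, limit)
```

With 64 KB clusters and 4-cluster tables, each table holds 32768 offsets and the bound works out to 2^46 bytes (64 TB).<br />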
<br />
All offsets in L1 and L2 tables are cluster-aligned. The least significant bits up to ''cluster_size'' are reserved and must be zero. This may be used in future format extensions to store per-offset information.<br />
<br />
The following offsets have special meanings:<br />
<br />
===L2 table offsets===<br />
* 0 - unallocated. The L2 table is not yet allocated.<br />
<br />
===Data cluster offsets===<br />
* 0 - unallocated. The data cluster is not yet allocated.<br />
<br />
===Unallocated L2 tables and data clusters===<br />
Reads to an unallocated area of the image file access the backing file. If there is no backing file, then zeroes are produced. The backing file may be smaller than the image file and reads of unallocated areas beyond the end of the backing file produce zeroes.<br />
<br />
Writes to an unallocated area cause a new data cluster to be allocated, and a new L2 table if that is also unallocated. The new data cluster is populated with data from the backing image (or zeroes if no backing image) and the data being written.<br />
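<br />
The read and write rules for unallocated areas can be sketched as follows. This is an illustration only: the backing file is modelled as a Python <code>bytes</code> object (or <code>None</code> when there is no backing file), the function names are hypothetical, and the write is assumed to fit inside a single cluster.<br />

```python
def read_unallocated(backing, offset, length):
    """Read from the backing image, zero-filling past its end or when absent."""
    if backing is None:
        return bytes(length)
    data = backing[offset:offset + length]  # may be short or empty
    return data + bytes(length - len(data))


def populate_new_cluster(backing, cluster_start, cluster_size, write_offset, data):
    """Build the contents of a newly allocated data cluster.

    cluster_start is the logical offset of the cluster in the block device;
    write_offset is the position of the written data inside the cluster.
    """
    cluster = bytearray(read_unallocated(backing, cluster_start, cluster_size))
    cluster[write_offset:write_offset + len(data)] = data
    return bytes(cluster)
```

The second function shows why an allocating write is a copy-on-write operation: the bytes of the cluster that the guest did not write still have to come from the backing image.<br />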
<br />
===Logical offset translation===<br />
Logical offsets are translated into cluster offsets as follows:<br />
<br />
  table_bits table_bits    cluster_bits<br />
  <--------> <--------> <---------------><br />
 +----------+----------+-----------------+<br />
 | L1 index | L2 index |   byte offset   |<br />
 +----------+----------+-----------------+<br />
<br />
 Structure of a logical offset<br />
<br />
 offset_mask = ~(cluster_size - 1) # mask for the image file byte offset<br />
<br />
 def logical_to_cluster_offset(l1_index, l2_index, byte_offset):<br />
     l2_offset = l1_table[l1_index]<br />
     l2_table = load_table(l2_offset)<br />
     cluster_offset = l2_table[l2_index] & offset_mask<br />
     return cluster_offset + byte_offset<br />
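<br />
The pseudocode above takes the three components of a logical offset as given. A minimal sketch of how they could be extracted from a byte position, assuming ''cluster_size'' and ''table_size'' are powers of two as the header requires; the function name <code>split_logical_offset</code> is hypothetical:<br />

```python
def split_logical_offset(pos, cluster_size, table_size):
    """Split a logical byte position into (l1_index, l2_index, byte_offset)."""
    cluster_bits = cluster_size.bit_length() - 1   # log2(cluster_size)
    noffsets = table_size * cluster_size // 8      # TABLE_NOFFSETS
    table_bits = noffsets.bit_length() - 1         # log2(TABLE_NOFFSETS)

    byte_offset = pos & (cluster_size - 1)
    l2_index = (pos >> cluster_bits) & (noffsets - 1)
    l1_index = pos >> (cluster_bits + table_bits)
    return l1_index, l2_index, byte_offset
```

For 64 KB clusters and ''table_size'' of 4, ''cluster_bits'' is 16 and ''table_bits'' is 15, so the L1 index starts at bit 31 of the logical offset.<br />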
<br />
==Consistency checking==<br />
<br />
This section is informational and included to provide background on the use of the QED_F_NEED_CHECK ''features'' bit.<br />
<br />
The QED_F_NEED_CHECK bit is used to mark an image as dirty before starting an operation that could leave the image in an inconsistent state if interrupted by a crash or power failure. A dirty image must be checked on open because its metadata may not be consistent.<br />
<br />
Consistency check includes the following invariants:<br />
# Each cluster is referenced once and only once. It is an inconsistency to have a cluster referenced more than once by L1 or L2 tables. A cluster has been leaked if it has no references.<br />
# Offsets must be within the image file size and must be ''cluster_size'' aligned.<br />
# Table offsets must be at least ''table_size'' * ''cluster_size'' bytes from the end of the image file so that there is space for the entire table.<br />
<br />
The consistency check process starts from ''l1_table_offset'' and scans all L2 tables. If the check completes with no errors other than leaks, the QED_F_NEED_CHECK bit can be cleared and the image can be accessed.<br />
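<br />
The scan described above can be sketched over an in-memory model of the tables. This is an illustration under stated assumptions, not QEMU's checker: the L1 table is a list of offsets, <code>l2_tables</code> is a hypothetical dictionary mapping each L2 table offset to its list of data cluster offsets, and leak detection (which would also need the header and L1 table locations) is left out.<br />

```python
def check_image(l1_table, l2_tables, file_size, cluster_size, table_size):
    """Return a list of error strings; an empty list means no errors were found."""
    errors = []
    refs = {}  # cluster offset -> number of references seen so far
    table_bytes = table_size * cluster_size

    def reference(offset, span, what):
        # Invariant 2: offsets are cluster-aligned and inside the file.
        if offset % cluster_size != 0:
            errors.append(f"{what} offset {offset:#x} is not cluster-aligned")
            return False
        # Invariant 3 for tables: the whole span must fit before end of file.
        if offset + span > file_size:
            errors.append(f"{what} offset {offset:#x} extends past the end of the file")
            return False
        # Invariant 1: no cluster may be referenced more than once.
        for cluster in range(offset, offset + span, cluster_size):
            refs[cluster] = refs.get(cluster, 0) + 1
            if refs[cluster] == 2:
                errors.append(f"cluster {cluster:#x} is referenced more than once")
        return True

    for l2_offset in l1_table:
        if l2_offset == 0:
            continue  # unallocated L2 table
        if not reference(l2_offset, table_bytes, "L2 table"):
            continue  # do not scan a table at a bogus offset
        for data_offset in l2_tables.get(l2_offset, []):
            if data_offset != 0:
                reference(data_offset, cluster_size, "data cluster")
    return errors
```

If the returned list is empty, a caller following the rule above would clear QED_F_NEED_CHECK in the header before allowing guest access.<br />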
<br />
=Operations=<br />
<br />
==Read==<br />
# If L2 table is not present in L1, read from backing image.<br />
# If data cluster is not present in L2, read from backing image or zero fill if no backing image.<br />
# Otherwise read data from cluster.<br />
<br />
==Write==<br />
# If L2 table is not present in L1, allocate new cluster and L2. Perform L2 and L1 link after writing data.<br />
# If data cluster is not present in L2, allocate new cluster. Perform L2 link after writing data.<br />
# Otherwise overwrite data cluster.<br />
<br />
The L2 link '''should''' be made after the data is in place on storage. However, when no ordering is enforced the worst case scenario is an L2 link to an unwritten cluster.<br />
<br />
The L1 link '''must''' be made after the L2 cluster is in place on storage. If the order is reversed then the L1 table may point to a bogus L2 table. (Is this a problem since clusters are allocated at the end of the file?)<br />
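<br />
The ordering rules for an allocating write can be made explicit as a list of steps. The sketch below only plans the order and performs no I/O; the function name and the step labels are hypothetical, and a "flush" step stands for whatever mechanism the implementation uses to make earlier writes stable.<br />

```python
def plan_allocating_write(l2_present, data_present):
    """Return the ordered steps for a write, given what is already allocated."""
    if l2_present and data_present:
        return ["overwrite data cluster"]

    steps = []
    if not l2_present:
        steps.append("allocate L2 table")
    steps += ["allocate data cluster", "write data cluster"]
    # The L2 link should only be made once the data is on storage.
    steps += ["flush", "L2 link"]
    if not l2_present:
        # The L1 link must only be made once the L2 table is on storage.
        steps += ["flush", "L1 link"]
    return steps
```

For a write into a completely unallocated region, the plan ends with the L1 link, so a crash at any earlier point leaves the L1 table pointing only at fully written L2 tables.<br />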
<br />
==Grow==<br />
# If TABLE_NOFFSETS * TABLE_NOFFSETS * cluster_size < new_image_size, fail -EOVERFLOW. The L1 table is not big enough.<br />
# Write new image_size header field.<br />
<br />
=Data integrity=<br />
==Write==<br />
Writes that complete before a flush must be stable when the flush completes.<br />
<br />
If storage is interrupted (e.g. power outage) then writes in progress may be lost, stable, or partially completed. The storage must not be otherwise corrupted or inaccessible after it is restarted.<br />
<br />
= Future Features =<br />
* [[Features/QED/Streaming|Streaming]]<br />
* [[Features/QED/OnlineDefrag|Online defragmentation]]<br />
* [[Features/QED/Trim|Trim]]<br />
* [[Features/QED/ParallelSubmission|Parallel submission]]<br />
* [[Features/QED/ScanAvoidance|Meta-data scan avoidance]]</div>
Aglitke
https://wiki.qemu.org/index.php?title=Features/QED&diff=659
Features/QED
2010-11-22T14:46:54Z
<p>Aglitke: /* Header */</p>
<hr />
<div>=Specification=<br />
<br />
The file format looks like this:<br />
<br />
+----------+----------+----------+-----+<br />
| cluster0 | cluster1 | cluster2 | ... |<br />
+----------+----------+----------+-----+<br />
<br />
The first cluster begins with the '''header'''. The header contains information about where regular clusters start; this allows the header to be extensible and store extra information about the image file. A regular cluster may be a '''data cluster''', an '''L2''', or an '''L1 table'''. L1 and L2 tables are composed of one or more contiguous clusters.<br />
<br />
Normally the file size will be a multiple of the cluster size. If the file size is not a multiple, extra information after the last cluster may not be preserved if data is written. Legitimate extra information should use space between the header and the first regular cluster.<br />
<br />
All fields are little-endian.<br />
<br />
==Header==<br />
 Header {<br />
     uint32_t magic; /* QED\0 */<br />
 <br />
     uint32_t cluster_size; /* in bytes */<br />
     uint32_t table_size; /* for L1 and L2 tables, in clusters */<br />
     uint32_t header_size; /* in clusters */<br />
 <br />
     uint64_t features; /* format feature bits */<br />
     uint64_t compat_features; /* compat feature bits */<br />
     uint64_t autoclear_features; /* self-resetting feature bits */<br />
 <br />
     uint64_t l1_table_offset; /* in bytes */<br />
     uint64_t image_size; /* total logical image size, in bytes */<br />
 <br />
     /* if (features & QED_F_BACKING_FILE) */<br />
     uint32_t backing_filename_offset; /* in bytes from start of header */<br />
     uint32_t backing_filename_size; /* in bytes */<br />
 }<br />
<br />
Field descriptions:<br />
* ''cluster_size'' must be a power of 2 in range [2^12, 2^26].<br />
* ''table_size'' must be a power of 2 in range [1, 16].<br />
* ''header_size'' is the number of clusters used by the header and any additional information stored before regular clusters.<br />
* ''features'', ''compat_features'', and ''autoclear_features'' are file format extension bitmaps. They work as follows:<br />
** An image with unknown ''features'' bits enabled must not be opened. File format changes that are not backwards-compatible must use ''features'' bits.<br />
** An image with unknown ''compat_features'' bits enabled can be opened safely. The unknown features are simply ignored and represent backwards-compatible changes to the file format.<br />
** An image with unknown ''autoclear_features'' bits enabled can be opened safely after clearing the unknown bits. This allows for backwards-compatible changes to the file format which degrade gracefully and can be re-enabled by a newer program later.<br />
* ''l1_table_offset'' is the offset of the first byte of the L1 table in the image file and must be a multiple of ''cluster_size''.<br />
* ''image_size'' is the block device size seen by the guest and must be a multiple of 512 bytes.<br />
* ''backing_filename'' is a string in (byte offset, byte size) form. It is not NUL-terminated and has no alignment constraints.<br />
<br />
Feature bits:<br />
* QED_F_BACKING_FILE = 0x01. The image uses a backing file. The backing filename string is given in the ''backing_filename_{offset,size}'' fields and may be an absolute path or relative to the image file.<br />
* QED_F_NEED_CHECK = 0x02. The image needs a consistency check before use.<br />
* QED_F_BACKING_FORMAT_NO_PROBE = 0x04. The backing file is a raw disk image and no file format autodetection should be attempted. This should be used to ensure that raw backing images are never detected as an image format if they happen to contain magic constants.<br />
<br />
There are currently no defined ''compat_features'' or ''autoclear_features'' bits.<br />
<br />
Fields predicated on a feature bit are only used when that feature is set. The fields always take up header space, regardless of whether or not the feature bit is set.<br />
<br />
==Tables==<br />
<br />
Tables provide the translation from logical offsets in the block device to cluster offsets in the file.<br />
<br />
#define TABLE_NOFFSETS (table_size * cluster_size / sizeof(uint64_t))<br />
<br />
Table {<br />
uint64_t offsets[TABLE_NOFFSETS];<br />
}<br />
<br />
The tables are organized as follows:<br />
<br />
                      +----------+<br />
                      | L1 table |<br />
                      +----------+<br />
                 ,------'  |  '------.<br />
            +----------+   |    +----------+<br />
            | L2 table |  ...   | L2 table |<br />
            +----------+        +----------+<br />
        ,------'  |  '------.<br />
   +----------+   |    +----------+<br />
   |   Data   |  ...   |   Data   |<br />
   +----------+        +----------+<br />
<br />
A table is made up of one or more contiguous clusters. The table_size header field determines table size for an image file. For example, cluster_size=64 KB and table_size=4 results in 256 KB tables.<br />
<br />
The logical image size must be less than or equal to the maximum possible size of clusters rooted by the L1 table:<br />
header.image_size <= TABLE_NOFFSETS * TABLE_NOFFSETS * header.cluster_size<br />
<br />
All offsets in L1 and L2 tables are cluster-aligned. The least significant bits up to ''cluster_size'' are reserved and must be zero. This may be used in future format extensions to store per-offset information.<br />
<br />
The following offsets have special meanings:<br />
<br />
===L2 table offsets===<br />
* 0 - unallocated. The L2 table is not yet allocated.<br />
<br />
===Data cluster offsets===<br />
* 0 - unallocated. The data cluster is not yet allocated.<br />
<br />
===Unallocated L2 tables and data clusters===<br />
Reads to an unallocated area of the image file access the backing file. If there is no backing file, then zeroes are produced. The backing file may be smaller than the image file and reads of unallocated areas beyond the end of the backing file produce zeroes.<br />
<br />
Writes to an unallocated area cause a new data cluster to be allocated, and a new L2 table if that is also unallocated. The new data cluster is populated with data from the backing image (or zeroes if there is no backing image) and the data being written.<br />
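<br />
The way a newly allocated data cluster is filled can be sketched as follows. This is a toy model, not the QEMU implementation: the backing image is assumed to be a plain bytes object (or None), and the function name and parameters are hypothetical.<br />

```python
def populate_new_cluster(backing, cluster_pos, in_cluster_offset, payload,
                         cluster_size):
    """Build the contents of a newly allocated data cluster.

    backing           -- bytes of the backing image, or None (assumption)
    cluster_pos       -- cluster-aligned logical offset of the cluster
    in_cluster_offset -- where the written data starts inside the cluster
    payload           -- the data being written
    """
    assert in_cluster_offset + len(payload) <= cluster_size
    if backing is None:
        base = bytearray(cluster_size)  # no backing image: all zeroes
    else:
        chunk = backing[cluster_pos:cluster_pos + cluster_size]
        # Areas beyond the end of the backing image read as zeroes.
        base = bytearray(chunk) + bytearray(cluster_size - len(chunk))
    # Overlay the data being written on top of the backing contents.
    base[in_cluster_offset:in_cluster_offset + len(payload)] = payload
    return bytes(base)
```

The result is always exactly one cluster long, so it can be written to the new cluster in a single operation.<br />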
<br />
===Logical offset translation===<br />
Logical offsets are translated into cluster offsets as follows:<br />
<br />
   table_bits table_bits    cluster_bits<br />
   <--------> <--------> <---------------><br />
  +----------+----------+-----------------+<br />
  | L1 index | L2 index |     byte offset |<br />
  +----------+----------+-----------------+<br />
<br />
Structure of a logical offset<br />
<br />
offset_mask = ~(cluster_size - 1) # mask for the image file byte offset<br />
<br />
 def logical_to_cluster_offset(l1_index, l2_index, byte_offset):<br />
     l2_offset = l1_table[l1_index]<br />
     l2_table = load_table(l2_offset)<br />
     cluster_offset = l2_table[l2_index] & offset_mask<br />
     return cluster_offset + byte_offset<br />
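<br />
The function above takes the three fields of a logical offset as inputs. A minimal sketch of how those fields might be extracted is shown below. It assumes that ''cluster_size'' and ''table_size'' are powers of two, so that each field is a whole bit range; the helper name is hypothetical.<br />

```python
def split_logical_offset(pos, cluster_size, table_size):
    """Split a logical offset into (l1_index, l2_index, byte_offset).

    Assumes cluster_size and table_size are powers of two.
    """
    noffsets = table_size * cluster_size // 8     # TABLE_NOFFSETS
    cluster_bits = cluster_size.bit_length() - 1  # log2(cluster_size)
    table_bits = noffsets.bit_length() - 1        # log2(TABLE_NOFFSETS)

    byte_offset = pos & (cluster_size - 1)
    l2_index = (pos >> cluster_bits) & (noffsets - 1)
    l1_index = pos >> (cluster_bits + table_bits)
    return l1_index, l2_index, byte_offset


# 64 KB clusters and table_size=4 give cluster_bits=16 and table_bits=15.
pos = (3 << 31) | (5 << 16) | 123
print(split_logical_offset(pos, 64 * 1024, 4))  # (3, 5, 123)
```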
<br />
==Consistency checking==<br />
<br />
This section is informational and included to provide background on the use of the QED_F_NEED_CHECK ''features'' bit.<br />
<br />
The QED_F_NEED_CHECK bit is used to mark an image as dirty before starting an operation that could leave the image in an inconsistent state if interrupted by a crash or power failure. A dirty image must be checked on open because its metadata may not be consistent.<br />
<br />
The consistency check verifies the following invariants:<br />
# Each cluster is referenced once and only once. It is an inconsistency to have a cluster referenced more than once by L1 or L2 tables. A cluster has been leaked if it has no references.<br />
# Offsets must be within the image file size and must be ''cluster_size'' aligned.<br />
# Table offsets must be at least ''table_size'' * ''cluster_size'' bytes from the end of the image file so that there is space for the entire table.<br />
<br />
The consistency check process starts from ''l1_table_offset'' and scans all L2 tables. If the check completes with no errors other than leaks, the QED_F_NEED_CHECK bit can be cleared and the image can be accessed.<br />
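<br />
The invariants above can be sketched against a toy in-memory model of an image. Everything about the model is an assumption made for illustration: tables are Python lists, ''l2_tables'' maps a file offset to an already loaded L2 table, and the header is assumed to occupy the first ''header_size'' clusters. It is not the QEMU implementation.<br />

```python
def check_image(l1_table, l2_tables, l1_table_offset, file_size,
                cluster_size, table_size, header_size=1):
    """Return (errors, leaked_cluster_offsets) for a toy image model."""
    table_bytes = table_size * cluster_size
    errors = []
    # Clusters already in use: the header and the L1 table itself.
    used = set(range(0, header_size * cluster_size, cluster_size))
    used.update(range(l1_table_offset, l1_table_offset + table_bytes,
                      cluster_size))

    def claim(offset, length, what):
        """Validate one reference and mark its clusters as used."""
        if offset % cluster_size:
            errors.append(f"{what} at {offset:#x}: not cluster-aligned")
            return False
        if offset + length > file_size:
            errors.append(f"{what} at {offset:#x}: extends past end of file")
            return False
        clusters = set(range(offset, offset + length, cluster_size))
        if clusters & used:
            errors.append(f"{what} at {offset:#x}: referenced more than once")
            return False
        used.update(clusters)
        return True

    for l2_offset in l1_table:
        if l2_offset == 0:
            continue  # unallocated L2 table
        if not claim(l2_offset, table_bytes, "L2 table"):
            continue
        for data_offset in l2_tables.get(l2_offset, []):
            if data_offset != 0:
                claim(data_offset, cluster_size, "data cluster")

    # Any cluster in the file that nothing refers to has been leaked.
    leaks = sorted(set(range(0, file_size, cluster_size)) - used)
    return errors, leaks
```

In this sketch an image would be considered clean enough to clear QED_F_NEED_CHECK when the returned error list is empty, even if some clusters are reported as leaked.<br />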
<br />
=Operations=<br />
<br />
==Read==<br />
# If L2 table is not present in L1, read from backing image or zero fill if no backing image.<br />
# If data cluster is not present in L2, read from backing image or zero fill if no backing image.<br />
# Otherwise read data from cluster.<br />
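<br />
The three read cases can be sketched for a single cluster as follows. The image is modelled as a plain dictionary with hypothetical keys (''l1_table'', ''l2_tables'', ''clusters'', ''backing'', ''noffsets''), which is an assumption made purely for illustration.<br />

```python
def read_from_backing(backing, pos, length):
    """Read from the backing image, zero-filling where it has no data."""
    if backing is None:
        return bytes(length)
    chunk = backing[pos:pos + length]
    return chunk + bytes(length - len(chunk))


def read_cluster(image, pos):
    """Read the cluster at cluster-aligned logical offset pos."""
    cluster_size = image["cluster_size"]
    # Derive the table indexes from the logical cluster number.
    l1_index, l2_index = divmod(pos // cluster_size, image["noffsets"])

    l2_offset = image["l1_table"][l1_index]
    if l2_offset == 0:  # case 1: L2 table not present in L1
        return read_from_backing(image["backing"], pos, cluster_size)

    data_offset = image["l2_tables"][l2_offset][l2_index]
    if data_offset == 0:  # case 2: data cluster not present in L2
        return read_from_backing(image["backing"], pos, cluster_size)

    return image["clusters"][data_offset]  # case 3: allocated cluster
```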
<br />
==Write==<br />
# If L2 table is not present in L1, allocate new cluster and L2. Perform L2 and L1 link after writing data.<br />
# If data cluster is not present in L2, allocate new cluster. Perform L2 link after writing data.<br />
# Otherwise overwrite data cluster.<br />
<br />
The L2 link '''should''' be made after the data is in place on storage. However, when no ordering is enforced the worst case scenario is an L2 link to an unwritten cluster.<br />
<br />
The L1 link '''must''' be made after the L2 cluster is in place on storage. If the order is reversed then the L1 table may point to a bogus L2 table. (Is this a problem since clusters are allocated at the end of the file?)<br />
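<br />
The ordering described above can be sketched with a toy model that records the order of the steps. The dictionary layout, the helper names and the step labels are all hypothetical, and real code would have to make each step stable on storage (for example with a flush) before starting the next one.<br />

```python
def allocate(image, nbytes):
    """Allocate space at the end of the image file and return its offset."""
    offset = image["file_size"]
    image["file_size"] += nbytes
    return offset


def write_cluster(image, pos, data):
    """Write one full cluster and return the ordered list of steps taken."""
    cs = image["cluster_size"]
    l1_index, l2_index = divmod(pos // cs, image["noffsets"])
    steps = []

    l2_offset = image["l1_table"][l1_index]
    new_l2 = (l2_offset == 0)
    if new_l2:
        l2_offset = allocate(image, image["table_size"] * cs)
        l2_table = [0] * image["noffsets"]  # in memory, not yet on storage
    else:
        l2_table = image["l2_tables"][l2_offset]

    data_offset = l2_table[l2_index]
    if data_offset != 0:
        image["clusters"][data_offset] = data
        steps.append("overwrite data")
        return steps

    data_offset = allocate(image, cs)
    image["clusters"][data_offset] = data  # 1. data first
    steps.append("write data")
    l2_table[l2_index] = data_offset  # 2. then the L2 link
    image["l2_tables"][l2_offset] = l2_table
    steps.append("link L2")
    if new_l2:
        image["l1_table"][l1_index] = l2_offset  # 3. L1 link last
        steps.append("link L1")
    return steps
```

Because the L1 entry is updated only after the L2 table has been stored, an interruption between steps leaves allocated but unreferenced clusters rather than an L1 entry pointing at a bogus table.<br />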
<br />
==Grow==<br />
# If TABLE_NOFFSETS * TABLE_NOFFSETS * cluster_size < new_image_size, fail with -EOVERFLOW. The L1 table is not big enough to address the new size.<br />
# Write new image_size header field.<br />
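<br />
A minimal sketch of the grow operation is shown below. It applies the limit given for ''header.image_size'' in the Tables section, models the header as a dictionary, and raises an OSError carrying EOVERFLOW. The function name and the dictionary model are assumptions made for this example.<br />

```python
import errno


def grow(header, new_image_size):
    """Update image_size if the L1 table can address the new size."""
    noffsets = header["table_size"] * header["cluster_size"] // 8
    limit = noffsets * noffsets * header["cluster_size"]
    if limit < new_image_size:
        raise OSError(errno.EOVERFLOW, "L1 table is not big enough")
    # In a real image this is where the header field would be rewritten.
    header["image_size"] = new_image_size
```

No table or data cluster has to be touched, because the newly exposed area is simply unallocated.<br />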
<br />
=Data integrity=<br />
==Write==<br />
Writes that complete before a flush must be stable when the flush completes.<br />
<br />
If storage is interrupted (e.g. power outage) then writes in progress may be lost, stable, or partially completed. The storage must not be otherwise corrupted or inaccessible after it is restarted.<br />
<br />
= Future Features =<br />
* [[Features/QED/Streaming|Streaming]]<br />
* [[Features/QED/OnlineDefrag|Online defragmentation]]<br />
* [[Features/QED/Trim|Trim]]<br />
* [[Features/QED/ParallelSubmission|Parallel submission]]<br />
* [[Features/QED/ScanAvoidance|Meta-data scan avoidance]]</div>Aglitke