Anonymization of virtual disk images
Summary: Extend the qemu-img utility to drop all data from the virtual disk while preserving image metadata.
Virtual disk images such as QCOW2, VHDX, or VMDK files may end up in a bad state during their lifecycle and then require debugging. This typically happens at cloud or hosting providers, where the images contain data that belongs to end users rather than to the provider. European cloud providers now treat this data as subject to GDPR privacy regulations, so these image files cannot easily be sent to developers for investigation.
The idea of this project is to drop all end-user data from images, including data blocks, memory saved inside internal snapshots, etc. At the same time, every bit and byte of the original image's metadata should be preserved, including the so-called "in-use" bit and other internal metadata state. This allows problematic image files to be debugged without transmitting the privacy-sensitive contents of the disk image.
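To make the idea concrete, here is a minimal sketch in C of wiping the data clusters referenced by a single QCOW2 L2 table while leaving the table itself untouched. It is only an illustration and not how qemu-img would implement the command: the function name wipe_l2_data and the in-memory image buffer are hypothetical, and the sketch assumes standard (non-extended) L2 entries, skips compressed clusters, and ignores snapshots, refcounts, and external data files. The bit masks follow the qcow2 specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bit layout of a standard qcow2 L2 entry (per the qcow2 specification). */
#define L2E_COMPRESSED  (1ULL << 62)            /* cluster is compressed */
#define L2E_OFFSET_MASK 0x00fffffffffffe00ULL   /* bits 9-55: host offset */

/* qcow2 metadata is stored big-endian. */
uint64_t load_be64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++) {
        v = (v << 8) | p[i];
    }
    return v;
}

/*
 * Hypothetical helper: zero every uncompressed data cluster referenced by
 * the L2 table at l2_offset inside an image held fully in memory.
 * The L2 entries themselves (metadata) are never modified.
 * Returns the number of clusters wiped, or -1 on an out-of-bounds reference.
 */
int wipe_l2_data(uint8_t *image, size_t image_size,
                 size_t l2_offset, size_t nb_entries, size_t cluster_size)
{
    int wiped = 0;

    assert(image != NULL);
    if (l2_offset > image_size || nb_entries > (image_size - l2_offset) / 8) {
        return -1;
    }
    for (size_t i = 0; i < nb_entries; i++) {
        uint64_t entry = load_be64(image + l2_offset + i * 8);
        uint64_t host = entry & L2E_OFFSET_MASK;

        /* Skip unallocated entries; compressed clusters are out of scope. */
        if (host == 0 || (entry & L2E_COMPRESSED)) {
            continue;
        }
        if (host > image_size || cluster_size > image_size - host) {
            return -1;
        }
        memset(image + host, 0, cluster_size);  /* drop guest data only */
        wiped++;
    }
    return wiped;
}
```

A real implementation would walk the L1 table (and the L1 tables of internal snapshots) to find every L2 table, and would also have to decide how to handle compressed clusters and saved VM state.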
The task is to implement a "qemu-img anonymize" command for the QCOW2 file format and also add support for the VHDX and VMDK file formats if time permits. This new command will not only help meet GDPR regulations but also make support more convenient for users because anonymized disk image files compress much better.
This project will allow you to learn about how disk image file formats work. You will become familiar with the internals of the QCOW2 file format and how data is laid out on disk.
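As a first taste of the on-disk layout, the following C sketch decodes a few fields from the start of a QCOW2 header. The field offsets and the magic value come from the qcow2 specification; the struct Qcow2HeaderInfo and the function qcow2_parse_header are hypothetical names chosen for this example and are not part of QEMU. The cluster_bits check assumes the specification's minimum of 9 and an upper bound of 21 (2 MiB clusters), which mirrors QEMU's current implementation limit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QCOW2_MAGIC 0x514649fbU   /* "QFI\xfb" */

/* Hypothetical container for the handful of fields this sketch decodes. */
struct Qcow2HeaderInfo {
    uint32_t version;
    uint64_t cluster_size;      /* 1 << cluster_bits */
    uint64_t virtual_size;      /* guest-visible disk size in bytes */
    uint32_t l1_size;           /* number of L1 table entries */
    uint64_t l1_table_offset;
    uint32_t nb_snapshots;
};

/* All qcow2 header fields are big-endian. */
uint32_t be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

uint64_t be64(const uint8_t *p)
{
    return ((uint64_t)be32(p) << 32) | be32(p + 4);
}

/*
 * Decode the fixed part of the header (the first 72 bytes).
 * Returns 0 on success, -1 if the buffer is too short, the magic is wrong,
 * or cluster_bits is outside the range assumed above.
 */
int qcow2_parse_header(const uint8_t *buf, size_t len,
                       struct Qcow2HeaderInfo *out)
{
    uint32_t cluster_bits;

    assert(buf != NULL && out != NULL);
    if (len < 72 || be32(buf) != QCOW2_MAGIC) {
        return -1;
    }
    cluster_bits = be32(buf + 20);
    if (cluster_bits < 9 || cluster_bits > 21) {
        return -1;
    }
    out->version = be32(buf + 4);
    out->cluster_size = 1ULL << cluster_bits;
    out->virtual_size = be64(buf + 24);
    out->l1_size = be32(buf + 36);
    out->l1_table_offset = be64(buf + 40);
    out->nb_snapshots = be32(buf + 60);
    return 0;
}
```

An anonymization tool would read these fields only to locate the L1 table and the snapshot table; the header bytes themselves would be copied to the output unchanged.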
Links:
- qemu-img utility source code
- qcow2 image file format specification
- Python script to anonymize the old QED file format
- General Data Protection Regulation
Details:
- Skill level: intermediate
- Language: C
- Mentor: Denis V. Lunev <den@openvz.org>
- Suggested by: Denis V. Lunev <den@openvz.org>