Internships/ProjectIdeas/ImageAnonymization: Difference between revisions

== Anonymization of virtual disk images ==
'''Summary:''' Extend the qemu-img utility to drop all data from the virtual disk while preserving image metadata.
 
Virtual disk images like QCOW2, VHDX, or VMDK files may reach a bad state during their lifecycle and require debugging. This typically happens at cloud or hosting providers, and the affected images contain end-user data rather than the provider's own. European cloud providers treat this data as subject to the GDPR privacy regulation, so these image files cannot easily be sent to developers for investigation.
 
The idea of this project is to drop all end-user data from images, including data blocks, memory inside internal snapshots, etc. On the other hand, every bit and byte of the original image's metadata should be preserved, including the so-called "in-use" bit and other internal metadata state. This will allow problematic image files to be debugged without transmitting the privacy-sensitive data contents of the disk image files.
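
To make the idea concrete, here is a minimal Python sketch (the linked QED script is also Python; the project itself would be written in C inside qemu-img). It assumes an unencrypted QCOW2 image held entirely in memory, without compressed clusters, extended L2 entries, or an external data file, and it walks only the active L1 table. The function name <code>anonymize_qcow2</code> is hypothetical. A real implementation would also have to wipe compressed clusters and the clusters and VM state referenced by internal snapshots.

```python
import struct

QCOW2_MAGIC = b"QFI\xfb"
# Bits 9-55 of an L1 or L2 entry hold the host offset (qcow2 specification).
OFFSET_MASK = 0x00FFFFFFFFFFFE00
# Bit 62 of an L2 entry marks a compressed cluster.
COMPRESSED_FLAG = 1 << 62


def anonymize_qcow2(image: bytearray) -> int:
    """Zero guest data clusters in place and return how many were wiped.

    Assumption: standard (uncompressed) clusters reachable from the active
    L1 table only. Header, L1 and L2 tables are never written to.
    """
    (magic, _version, _backing_off, _backing_size, cluster_bits, _size,
     _crypt, l1_size, l1_offset) = struct.unpack_from(">4sIQIIQIIQ", image, 0)
    if magic != QCOW2_MAGIC:
        raise ValueError("not a qcow2 image")

    cluster_size = 1 << cluster_bits
    l2_entries = cluster_size // 8  # each L2 entry is 8 bytes
    wiped = 0

    for i in range(l1_size):
        (l1_entry,) = struct.unpack_from(">Q", image, l1_offset + 8 * i)
        l2_offset = l1_entry & OFFSET_MASK
        if l2_offset == 0:
            continue  # no L2 table allocated for this range
        for j in range(l2_entries):
            (l2_entry,) = struct.unpack_from(">Q", image, l2_offset + 8 * j)
            if l2_entry & COMPRESSED_FLAG:
                continue  # not handled in this sketch (data is NOT wiped)
            data_offset = l2_entry & OFFSET_MASK
            if data_offset == 0:
                continue  # unallocated cluster
            # Overwrite the guest data; the mapping entry stays untouched.
            image[data_offset:data_offset + cluster_size] = bytes(cluster_size)
            wiped += 1
    return wiped
```

Because only the data clusters are overwritten, the header, the L1/L2 tables and any flag bits stored in them remain byte-for-byte identical, which is the property the project asks for.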
 
The task is to implement a "qemu-img anonymize" command for the QCOW2 file format and, if time permits, to add support for the VHDX and VMDK file formats. Besides helping providers comply with GDPR, the new command makes support more convenient for users because anonymized disk image files compress much better.
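
The compression benefit can be illustrated with a small Python sketch. It assumes a 64 KiB cluster (the QCOW2 default), uses random bytes as a stand-in for incompressible guest data, and assumes the anonymized cluster is simply zero-filled. The helper name <code>compressed_size</code> is made up for this example.

```python
import os
import zlib

CLUSTER_SIZE = 64 * 1024  # default qcow2 cluster size


def compressed_size(data: bytes) -> int:
    """Return the size of the data after zlib compression."""
    return len(zlib.compress(data))


guest_data = os.urandom(CLUSTER_SIZE)  # stand-in for real guest data
wiped_data = bytes(CLUSTER_SIZE)       # the same cluster after zero-filling

# The zero-filled cluster shrinks to a tiny fraction of the original.
print("guest data:", compressed_size(guest_data), "bytes")
print("wiped data:", compressed_size(wiped_data), "bytes")
```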
 
This project will allow you to learn about how disk image file formats work.  You will become familiar with the internals of the QCOW2 file format and how data is laid out on disk.


'''Links:'''
* [https://git.qemu.org/?p=qemu.git;a=blob;f=qemu-img.c;h=6233b8ca5682afd2f98bf33a6cb17997b5f1193e;hb=HEAD qemu-img utility source code]
* [https://git.qemu.org/?p=qemu.git;a=blob;f=docs/interop/qcow2.txt;h=af5711e5337191d2c01932b0b3d07566b389f2c2;hb=HEAD qcow2 image file format specification]
* [https://git.qemu.org/?p=qemu.git;a=blob;f=tests/qemu-iotests/qed.py;h=8adaaf46c4ace924e6584cfaef19a13b3b593f22;hb=HEAD#l182 Python script to anonymize the old QED file format]
* [https://en.wikipedia.org/wiki/General_Data_Protection_Regulation General Data Protection Regulation]
 
'''Details:'''
* Skill level: intermediate
* Language: C
* Mentor: Denis V. Lunev <den@openvz.org>
* Suggested by: Denis V. Lunev <den@openvz.org>

Revision as of 11:35, 27 January 2020