ToDo/Channel I/O Passthrough

Latest revision as of 11:18, 11 December 2019

This page lists areas in the implementation of vfio-ccw for channel I/O passthrough that still need work. It covers QEMU, the kernel part and the interface (most topics are expected to involve all three areas).

Missing architecture features

Support for unlimited prefetch

The current implementation prefetches the whole channel program, translates it and submits it to the hardware. This approach does not work if the guest does not want prefetching (e.g. because it wants to dynamically rewrite channel programs).

Other Indirect Data Address formats

The current implementation rejects any IDA format that uses 2K-byte blocks. Lifting this is probably low-reward, but it's an extra limitation bolted into the middle of the CCW processor.
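As a rough illustration of the restriction above, the check can be sketched as a test on two ORB control bits. The bit positions and the names `ORB_B5_C64`, `ORB_B5_I2K` and `orb_ida_format_supported` are assumptions made for this sketch (they are not taken from this page), and the sketch further assumes that format-2 IDAWs with 4K-byte blocks are the only combination that avoids 2K-byte blocks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Assumed masks for byte 5 of the ORB (bits 8-15 of the second word,
 * with bit 0 being the most significant bit). Illustrative only.
 */
#define ORB_B5_C64 0x02 /* assumed: format-2 IDAW control */
#define ORB_B5_I2K 0x01 /* assumed: 2K-IDAW control */

/*
 * Return true if the ORB requests format-2 IDAWs with 4K-byte blocks,
 * which this sketch treats as the only accepted combination.
 */
bool orb_ida_format_supported(const uint8_t *orb)
{
    assert(orb != NULL);

    bool format2 = (orb[5] & ORB_B5_C64) != 0;
    bool blocks_2k = (orb[5] & ORB_B5_I2K) != 0;

    return format2 && !blocks_2k;
}
```

Supporting the remaining formats would mean replacing this single check with per-format address translation, which is where the extra work would be.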

Status Modifier

This should be addressed with commit 48bd0eee8eca ("s390/cio: Fix vfio-ccw handling of recursive TICs") but some further testing/proof would be beneficial.

Non-zero storage key

Nothing will prevent a guest from specifying a non-zero storage key in its ORB, but the one issued by vfio-ccw to hardware will always be zero. That seems not great.
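To make the mismatch visible, a helper could at least detect when the guest asked for a key that the host will not honour. The sketch below assumes that the key occupies the four most significant bits of the second ORB word (the high nibble of byte 4); both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Extract the storage key from a raw ORB.
 * Assumption: the key is the high nibble of byte 4.
 */
uint8_t orb_storage_key(const uint8_t *orb_area)
{
    assert(orb_area != NULL);
    return (uint8_t)(orb_area[4] >> 4);
}

/*
 * Hypothetical policy helper: returns non-zero if the key requested by
 * the guest would be lost because the host always issues key 0.
 */
int orb_key_would_be_dropped(const uint8_t *orb_area)
{
    return orb_storage_key(orb_area) != 0;
}
```

Whether such a request should be rejected, logged, or passed through is an open design question; the helper only identifies the case.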

I/O instructions not executed on the hardware

We started out with only passing START SUBCHANNEL to the hardware, while relying on QEMU to emulate any other I/O instruction the guest uses. While this takes care of a huge part of what is needed, we need to handle some more.

HALT SUBCHANNEL and CLEAR SUBCHANNEL

[done]

Terminating a running channel program is especially useful during error recovery, but there are devices that use e.g. a csch during their startup procedure (nothing we currently plan to support, though). While both instructions terminate a running channel program, there are some differences:

hsch is accepted while there is at most the start function specified (i.e., neither the halt nor the clear function).
csch is accepted in any case (unless the subchannel is not operational). It will clear any start or halt function, and it will clear any pending I/O interrupt.
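The two acceptance rules can be sketched as predicates over the function control bits. The flag names and values are illustrative, and the sketch deliberately ignores status-pending conditions; it also assumes that hsch, like csch, requires an operational subchannel.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical function control flags; names and values are illustrative. */
#define FCTL_CLEAR 0x1u
#define FCTL_HALT  0x2u
#define FCTL_START 0x4u
#define FCTL_MASK  0x7u

/* hsch: accepted only if neither the halt nor the clear function is set. */
bool hsch_accepted(bool operational, unsigned int fctl)
{
    assert((fctl & ~FCTL_MASK) == 0);
    return operational && (fctl & (FCTL_HALT | FCTL_CLEAR)) == 0;
}

/* csch: accepted whenever the subchannel is operational. */
bool csch_accepted(bool operational, unsigned int fctl)
{
    assert((fctl & ~FCTL_MASK) == 0);
    return operational;
}

/*
 * Function control after an accepted csch: any start or halt function
 * is cleared, so only the clear function remains.
 */
unsigned int fctl_after_csch(unsigned int fctl)
{
    assert((fctl & ~FCTL_MASK) == 0);
    return FCTL_CLEAR;
}
```

For example, `hsch_accepted(true, FCTL_START)` holds, while `hsch_accepted(true, FCTL_START | FCTL_HALT)` does not; `csch_accepted` only depends on the operational state.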

There is an inherent race condition when issuing any of these instructions: The scsw after a stsch may indicate that a start/halt is still in progress, but the subchannel may have become status pending with final status immediately afterwards. This needs careful serialization so that the state machine does not get confused and we present a consistent status to the guest.

Linux patches included in Linux 5.2; QEMU patches included in QEMU 4.1.

CANCEL SUBCHANNEL

This instruction is used to cancel a start operation that has been accepted by the subchannel but not yet started executing. We have the same race condition as with hsch/csch. A major difference is that xsch will not generate an interrupt, nor will the guest get an interrupt for the ssch it issued.

The easiest option would be to give the guest a cc 2 in any case: That covers both

guest did a ssch/hsch, but did not get a status yet
guest did nothing (subchannel idle)

STORE SUBCHANNEL

The guest only gets QEMU's view of the subchannel when it executes stsch. It may want the hardware's view instead. Either pass this through, or trigger QEMU to update its view.

MODIFY SUBCHANNEL

Enable/disable is handled by QEMU (we keep the real subchannel enabled during usage by the mdev framework). Things become complicated if we want to support channel monitoring (currently emulated for virtio-ccw devices in QEMU).

SET CHANNEL MONITOR

Not a per-subchannel command, currently emulated by QEMU. This becomes hairy when we want to deal with both passthrough and emulated devices.

TEST PENDING INTERRUPTION and TEST SUBCHANNEL

It's probably fine to leave them as-is (emulated), as I/O interrupts have to be managed by the host anyway. We may need to make sure the control blocks are updated by tsch correctly (and avoid interference with other instructions).

RESUME SUBCHANNEL

It is unclear how rsch can work with the current infrastructure.

Interface considerations

Current status

For ssch processing, we added an I/O region:

 struct ccw_io_region {
 #define ORB_AREA_SIZE 12
         __u8    orb_area[ORB_AREA_SIZE];
 #define SCSW_AREA_SIZE 12
         __u8    scsw_area[SCSW_AREA_SIZE];
 #define IRB_AREA_SIZE 96
         __u8    irb_area[IRB_AREA_SIZE];
         __u32   ret_code;
 } __packed;

However, this has some problems:

Semantics of the fields are unclear (is the scsw_area supposed to contain the copy of an scsw, or is it used to convey a command by specifying the start function in the fctl field?)
While this is probably extensible for halt/clear handling, other commands may not work so well.
We mix up sending a command from user space to the vfio module and handling a status.

This is unfortunately not really well documented, either (Documentation/s390/vfio-ccw.txt only states that "scsw_area should be filled with the SCSW of the Virtual Subchannel").

In the future, this region should not be used for anything new; regions guarded via capabilities are a better choice.
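To make the layout of the existing region concrete, the following sketch mirrors it in user space with fixed-width types and checks the resulting offsets at compile time. The names `ccw_io_region_mirror` and `ccw_io_region_prepare` are hypothetical, and `__attribute__((packed))` assumes a GCC-compatible compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ORB_AREA_SIZE  12
#define SCSW_AREA_SIZE 12
#define IRB_AREA_SIZE  96

/* User-space mirror of struct ccw_io_region (hypothetical name). */
struct ccw_io_region_mirror {
    uint8_t  orb_area[ORB_AREA_SIZE];
    uint8_t  scsw_area[SCSW_AREA_SIZE];
    uint8_t  irb_area[IRB_AREA_SIZE];
    uint32_t ret_code;
} __attribute__((packed));

/* 12 + 12 + 96 bytes of areas, followed by a 4-byte return code. */
_Static_assert(offsetof(struct ccw_io_region_mirror, ret_code) == 120,
               "unexpected ret_code offset");
_Static_assert(sizeof(struct ccw_io_region_mirror) == 124,
               "unexpected region size");

/*
 * Hypothetical helper: zero the region, then copy in the guest's ORB and
 * the scsw of the virtual subchannel. The irb_area and ret_code stay zero
 * until the kernel side fills them in.
 */
void ccw_io_region_prepare(struct ccw_io_region_mirror *region,
                           const uint8_t *orb, const uint8_t *scsw)
{
    assert(region != NULL && orb != NULL && scsw != NULL);

    memset(region, 0, sizeof(*region));
    memcpy(region->orb_area, orb, ORB_AREA_SIZE);
    memcpy(region->scsw_area, scsw, SCSW_AREA_SIZE);
}
```

The helper shows the mixing criticized above: the same buffer carries the request (orb_area, scsw_area) and the result (irb_area, ret_code).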

Add documentation

Whatever else we do, we need to document everything properly:

Documentation/s390/vfio-ccw.txt should be more detailed/precise
More comments regarding the interface in the code
Anything else?

More I/O regions

Additional I/O regions for kernel/user space communication seem to be the way to go. They need to be guarded via capabilities, making the interface easily extensible.

A command area has been added for halt/clear handling.
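A dedicated command region could look roughly like the sketch below: user space writes a single command, and the result comes back in a separate field. The structure, field and macro names are illustrative and not necessarily those used in the kernel's interface header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical command bits for an asynchronous command region. */
#define ASYNC_CMD_HSCH (1u << 0)
#define ASYNC_CMD_CSCH (1u << 1)

/* Hypothetical layout: one command in, one return code out. */
struct async_cmd_region {
    uint32_t command;
    uint32_t ret_code;
};

/* Return 0 if exactly one known command is requested, -1 otherwise. */
int async_cmd_validate(const struct async_cmd_region *region)
{
    assert(region != NULL);

    switch (region->command) {
    case ASYNC_CMD_HSCH:
    case ASYNC_CMD_CSCH:
        return 0;
    default:
        return -1;
    }
}
```

Keeping the command separate from any status area avoids the ambiguity of the original region, where scsw_area could be read either as a status copy or as a command.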

Further regions that have been proposed:

Status area (containing scsw/pmcw/...)
CCW area
Measurements/statistics (for channel measurements etc.)

Big picture items

These may affect more than one guest at a time.

Channel path handling

[looked at by farman]

Who manages the paths (including path grouping), the guest or the host?
How do we reflect path changes to the guest?
Do we need special handling for DASD reserve/release?

Tracepoints

[looked at by farman]

We need some more of these, at strategic points.

Handling machine checks

This partially ties into path handling.

If we implement something like the ccw device 'disconnected' state, we need to relay IPI CRWs to the guest (both for 'device gone' and 'device operational again').
We need to decide to what extent we want to relay path-related CRWs to the guest (z/VM, for example, usually does not send these to guests).
There's also a machine check that indicates (path-related) information is available via CHSC, but there's no public documentation. Additionally, the same consideration as for the previous point applies.

Performance and scalability

We need to make sure that we don't include artificial bottlenecks.
Specifically, we need to check that the BQL isn't introducing scaling issues for many vfio-ccw devices.

Further features

Support for migration

[looked at by cohuck]

Migration for other types of vfio devices has been discussed already. Currently, all passthrough devices must be unplugged before we can migrate.

This should leverage a common framework for migration of vfio devices, to avoid code duplication etc. That framework is in the process of being designed -- we need to make sure that it accommodates our use case as well.

Current proposals by Intel and NVIDIA look extendable to ccw as well. In fact, ccw is probably easier to implement than pci.

Transport mode

We currently only support command mode (cf. the name 'ccw'). It might be feasible to support transport mode as well (handling tcws etc.), but it is unclear what benefit that would bring other than enabling special guests.

The code base might benefit from cleaning it up to a point that transport mode is cleanly split from command mode, though.

Things we won't support

This includes things that are either not really feasible, or would mean a lot of effort for little gain.

Non-I/O subchannels

No public documentation for CHSC, message, or EADM subchannels is available, and we don't know about possible pitfalls.

I/O subchannels in QDIO mode

No public documentation for QDIO is available, either.

Testing

All testing is currently performed manually; we should aim for some integration into testing frameworks, even if some setup still can't be automated.

KVM unit tests

It might be possible to integrate into the KVM unit tests framework as a 'nodefault' test, if we depend on setting up the host correctly and pass parameters like the subchannel id to the test.

Who is currently looking at what?

cohuck: Cornelia Huck <cohuck AT redhat dot com>

migration

farman: Eric Farman <farman AT ibm>

channel path handling
tracepoints
