ToDo/Channel I/O Passthrough: Difference between revisions
Latest revision as of 11:18, 11 December 2019
This page lists areas in the implementation of vfio-ccw for channel I/O passthrough that still need work. It covers QEMU, the kernel part and the interface (most topics are expected to involve all three areas).
Missing architecture features
Support for unlimited prefetch
The current implementation prefetches the whole channel program, translates it and submits it to the hardware. This approach does not work if the guest does not want prefetching (e.g. because it wants to dynamically rewrite channel programs).
Other Indirect Data Address formats
The current implementation rejects any IDA that uses 2K-byte blocks, in either IDA format. Probably low-reward, but it's an extra limitation bolted into the middle of the CCW processor.
Status Modifier
This should be addressed with commit 48bd0eee8eca ("s390/cio: Fix vfio-ccw handling of recursive TICs") but some further testing/proof would be beneficial.
Non-zero storage key
Nothing prevents a guest from specifying a non-zero storage key in its ORB, but the ORB issued by vfio-ccw to the hardware always carries a key of zero, so the guest's key is not honored.
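The ORB controls behind the limitations in this section (prefetch, IDA format and block size, storage key) can be picked out of the 12-byte ORB as sketched below. This is only an illustration: the bit positions assume the command-mode ORB layout (the same field order as the bitfields in Linux's struct cmd_orb), and the names orb_flags, orb_decode and orb_uses_2k_idaws are hypothetical, not part of vfio-ccw.

```c
#include <stdbool.h>
#include <stdint.h>

#define ORB_AREA_SIZE 12

/* Hypothetical decoded view of the ORB controls discussed above. */
struct orb_flags {
	uint8_t key;	/* storage key: word 1, bits 0-3 */
	bool pfch;	/* prefetch control: word 1, bit 9 */
	bool c64;	/* format-2 IDAW control: word 1, bit 14 */
	bool i2k;	/* 2K-IDAW control: word 1, bit 15 */
};

/* Word 1 of the ORB starts at byte 4; the ORB is big-endian. */
static struct orb_flags orb_decode(const uint8_t orb[ORB_AREA_SIZE])
{
	struct orb_flags f;

	f.key = orb[4] >> 4;
	f.pfch = (orb[5] & 0x40) != 0;
	f.c64 = (orb[5] & 0x02) != 0;
	f.i2k = (orb[5] & 0x01) != 0;
	return f;
}

/*
 * Assumption: format-1 IDAWs always address 2K blocks, while format-2
 * IDAWs address 2K blocks only if the 2K-IDAW control is set.
 */
static bool orb_uses_2k_idaws(const struct orb_flags *f)
{
	return !f->c64 || f->i2k;
}
```

A caller could use such a decode step to report exactly which unsupported feature a guest asked for (no prefetch, 2K-byte IDAWs, non-zero key) instead of failing the request generically.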
I/O instructions not executed on the hardware
We started out with only passing START SUBCHANNEL to the hardware, while relying on QEMU to emulate any other I/O instruction the guest uses. While this takes care of a huge part of what is needed, we need to handle some more.
HALT SUBCHANNEL and CLEAR SUBCHANNEL
[done]
Terminating a running channel program is especially useful during error recovery, but there are devices that use e.g. a csch during their startup procedure (nothing we currently plan to support, though). While they both terminate a running channel program, there are some differences:
- hsch is accepted while there is at most the start function specified (i.e., neither the halt nor the clear function).
- csch is accepted in any case (unless the subchannel is not operational). It will clear any start or halt function, and it will clear any pending I/O interrupt.
There is an inherent race condition when issuing any of these instructions: The scsw after a stsch may indicate that a start/halt is still in progress, but the subchannel may have become status pending with final status immediately afterwards. This needs careful serialization so that the state machine does not get confused and we present a consistent status to the guest.
Linux patches included in Linux 5.2; QEMU patches included in QEMU 4.1.
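The acceptance rules for hsch and csch listed above can be modeled on the function control (fctl) bits of the scsw. The following is a minimal sketch, not the vfio-ccw or QEMU code: the FCTL_* values are assumed to match the SCSW_FCTL_* constants in Linux, the helper names are hypothetical, and status-pending and not-operational checks are deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Function control bits; values assumed to match Linux's SCSW_FCTL_*. */
#define FCTL_CLEAR_FUNC 0x1
#define FCTL_HALT_FUNC  0x2
#define FCTL_START_FUNC 0x4

/* hsch: accepted while at most the start function is specified. */
static bool hsch_accepted(uint8_t fctl)
{
	return (fctl & (FCTL_HALT_FUNC | FCTL_CLEAR_FUNC)) == 0;
}

/* csch: accepted regardless of fctl (operational subchannel assumed). */
static bool csch_accepted(uint8_t fctl)
{
	(void)fctl;
	return true;
}

/* csch drops any start or halt function and indicates the clear function. */
static uint8_t fctl_after_csch(uint8_t fctl)
{
	fctl &= (uint8_t)~(FCTL_START_FUNC | FCTL_HALT_FUNC);
	return fctl | FCTL_CLEAR_FUNC;
}
```

The race described above is not captured here: the fctl value passed in may already be stale by the time the instruction is issued, which is why the real code needs serialization around these checks.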
CANCEL SUBCHANNEL
This instruction is used to cancel a start operation that has been accepted by the subchannel but not yet started executing. We have the same race condition as with hsch/csch. A major difference is that xsch will not generate an interrupt, nor will the guest get an interrupt for the ssch it issued.
The easiest option would be to give the guest a cc 2 in any case; that covers both of the following:
- guest did an ssch/hsch, but did not get a status yet
- guest did nothing (subchannel idle)
STORE SUBCHANNEL
The guest only gets QEMU's view of the subchannel when it executes stsch. It may want the hardware's view instead. Either pass this through, or trigger QEMU to update its view.
MODIFY SUBCHANNEL
Enable/disable is handled by QEMU (we keep the real subchannel enabled during usage by the mdev framework). Things become complicated if we want to support channel monitoring (currently emulated for virtio-ccw devices in QEMU).
SET CHANNEL MONITOR
Not a per-subchannel command, currently emulated by QEMU. This becomes hairy when we want to deal with both passthrough and emulated devices.
TEST PENDING INTERRUPTION and TEST SUBCHANNEL
It's probably fine to leave them as-is (emulated), as I/O interrupts have to be managed by the host anyway. We may need to make sure the control blocks are updated by tsch correctly (and avoid interference with other instructions).
RESUME SUBCHANNEL
It is unclear how rsch can work with the current infrastructure.
Interface considerations
Current status
For ssch processing, we added an I/O region:
struct ccw_io_region {
#define ORB_AREA_SIZE 12
        __u8    orb_area[ORB_AREA_SIZE];
#define SCSW_AREA_SIZE 12
        __u8    scsw_area[SCSW_AREA_SIZE];
#define IRB_AREA_SIZE 96
        __u8    irb_area[IRB_AREA_SIZE];
        __u32   ret_code;
} __packed;
However, this has some problems:
- Semantics of the fields are unclear (is the scsw_area supposed to contain the copy of an scsw, or is it used to convey a command by specifying the start function in the fctl field?)
- While this is probably extensible for halt/clear handling, other commands may not work so well.
- We mix up sending a command from user space to the vfio module and handling a status.
This is unfortunately not really well documented, either (Documentation/s390/vfio-ccw.txt only states that "scsw_area should be filled with the SCSW of the Virtual Subchannel").
In the future, this region should not be used for anything new; regions guarded via capabilities are a better choice.
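To make the mix of command and status in this region concrete, here is a rough user-space sketch of preparing a request. The structure is a stand-in that mirrors the kernel definition with fixed-width types; the helper ccw_io_region_prepare is hypothetical, and treating scsw_area as a plain copy of the guest's scsw is only one of the two readings discussed above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ORB_AREA_SIZE 12
#define SCSW_AREA_SIZE 12
#define IRB_AREA_SIZE 96

/* User-space stand-in for the kernel's struct ccw_io_region. */
struct ccw_io_region {
	uint8_t orb_area[ORB_AREA_SIZE];	/* request: ORB from the guest */
	uint8_t scsw_area[SCSW_AREA_SIZE];	/* request: scsw (semantics unclear) */
	uint8_t irb_area[IRB_AREA_SIZE];	/* status: IRB filled in later */
	uint32_t ret_code;			/* status: result of the request */
} __attribute__((packed));

/* Hypothetical helper: fill the request half and zero the status half. */
static void ccw_io_region_prepare(struct ccw_io_region *r,
				  const uint8_t orb[ORB_AREA_SIZE],
				  const uint8_t scsw[SCSW_AREA_SIZE])
{
	memset(r, 0, sizeof(*r));
	memcpy(r->orb_area, orb, ORB_AREA_SIZE);
	memcpy(r->scsw_area, scsw, SCSW_AREA_SIZE);
}
```

User space would then write the prepared structure to the device file descriptor at the region's offset and later read irb_area and ret_code back from the same region, which is exactly the command/status mixing criticized above.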
Add documentation
Whatever else we do, we need to document everything properly:
- Documentation/s390/vfio-ccw.txt should be more detailed/precise
- More comments regarding the interface in the code
- Anything else?
More I/O regions
Additional I/O regions for kernel/user space communication seem to be the way to go. They need to be guarded via capabilities, making the interface easily extensible.
A command area has been added for halt/clear handling.
Further regions that have been proposed:
- Status area (containing scsw/pmcw/...)
- CCW area
- Measurements/statistics (for channel measurements etc.)
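Guarding regions via capabilities means user space discovers each region by walking a chain of capability headers rather than assuming a fixed layout. A minimal sketch of such a walk follows, assuming a header shaped like vfio's struct vfio_info_cap_header (id, version, offset of the next entry, with 0 ending the chain). The struct is redefined locally to keep the example self-contained, and find_cap is a hypothetical helper, not an existing API.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local stand-in modeled on struct vfio_info_cap_header. */
struct cap_header {
	uint16_t id;		/* identifies the capability */
	uint16_t version;	/* version specific to the capability id */
	uint32_t next;		/* offset of next capability, 0 ends the chain */
};

/*
 * Walk the chain starting at offset 'first' inside 'buf' and look for 'id'.
 * Returns the offset of the matching header (copied to *out), or -1.
 */
static int find_cap(const uint8_t *buf, size_t len, uint32_t first,
		    uint16_t id, struct cap_header *out)
{
	uint32_t off = first;
	size_t hops = 0;

	/* The hop limit guards against a malformed, cyclic chain. */
	while (off != 0 && hops++ < len) {
		if (off > len || len - off < sizeof(*out))
			return -1;
		/* memcpy avoids unaligned access into the raw buffer. */
		memcpy(out, buf + off, sizeof(*out));
		if (out->id == id)
			return (int)off;
		off = out->next;
	}
	return -1;
}
```

With this pattern, older user space simply does not find the capability for a newer region and falls back to emulation, which is what makes the interface easy to extend.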
Big picture items
These may affect more than one guest at a time.
Channel path handling
[looked at by farman]
- Who manages the paths (including path grouping), the guest or the host?
- How do we reflect path changes to the guest?
- Do we need special handling for DASD reserve/release?
Tracepoints
[looked at by farman]
We need some more of these, at strategic points.
Handling machine checks
This partially ties into path handling.
- If we implement something like the ccw device 'disconnected' state, we need to relay IPI CRWs to the guest (both for 'device gone' and 'device operational again').
- We need to decide to what extent we want to relay path-related CRWs to the guest (z/VM, for example, usually does not send these to guests).
- There's also a machine check that indicates (path-related) information is available via CHSC, but there's no public documentation. The same question as in the previous point (how much to relay to the guest) applies here as well.
Performance and scalability
- We need to make sure that we don't include artificial bottlenecks.
- Specifically, we need to check that the BQL isn't introducing scaling issues for many vfio-ccw devices.
Further features
Support for migration
[looked at by cohuck]
Migration for other types of vfio devices has been discussed already. Currently, all passthrough devices must be unplugged before we can migrate.
This should leverage a common framework for migration of vfio devices, to avoid code duplication etc. That framework is in the process of being designed -- we need to make sure that it accommodates our use case as well.
Current proposals by Intel and NVIDIA look extensible to ccw as well. In fact, ccw is probably easier to implement than PCI.
Transport mode
We currently only support command mode (cf. the name 'ccw'). It might be feasible to support transport mode as well (handling tcws etc.), but it is unclear what benefit that would bring other than enabling special guests.
Even so, the code base might benefit from a cleanup that cleanly separates transport mode handling from command mode handling.
Things we won't support
This includes things that are either not really feasible, or would mean a lot of effort for little gain.
Non-I/O subchannels
No public documentation for CHSC, message, or EADM subchannels is available, and we don't know about possible pitfalls.
I/O subchannels in QDIO mode
No public documentation for QDIO is available, either.
Testing
All testing is currently performed manually; we should aim for some integration into testing frameworks, even if some setup still can't be automated.
KVM unit tests
It might be possible to integrate into the KVM unit tests framework as a 'nodefault' test, if we depend on setting up the host correctly and pass parameters like the subchannel id to the test.
Who is currently looking at what?
cohuck: Cornelia Huck <cohuck AT redhat dot com>
- migration
farman: Eric Farman <farman AT ibm>
- channel path handling
- tracepoints