KVM, QEMU and Big Iron: A vfio-ccw primer

While basic support for vfio-ccw has been included in Linux and QEMU for some time, work has recently started to ramp up again and it seems like a good time to give some basic overview.

Why vfio-ccw?

Historically, QEMU on s390x presented paravirtualized virtio devices to the guest; first, via a protocol inspired by lguest, later, as emulated channel devices. This satisfies most needs (you get block devices, network devices, a console device, and lots more), but the device types are different from those found on LPARs or z/VM guests, and you may have a need to use e.g. a DASD directly.

For that reason, we want to do the same thing as on other platforms: pass a host device to the guest directly via vfio.

How does this work?

vfio-ccw is using the vfio mediated device framework; see the kernel documentation for an overview.

In a nutshell: The subchannel to be passed to the guest is unbound from its normal host driver (in this case, the I/O subchannel driver) and bound to the vfio-ccw driver. Any I/O request is intercepted and executed on the real device, and interrupts from the real device are relayed back to the guest.

Why subchannels and not ccw devices?

The initial attempt to implement this actually worked at the ccw device level. However, this means that the Linux common I/O layer in the host will perform various actions like handling of channel paths - which may interfere with what the guest is trying to do. Therefore, it seemed like a better idea to keep out of the way as much as possible and just implement a minimal subchannel driver that does not do much beyond what the guest actually triggered itself.

How is an actual I/O request processed?

When the guest is ready to use a channel device, it will issue I/O requests via channel programs (see here for an explanation on how that works and what things like scsw and orb mean.) The channel I/O instructions are mandatory SIE intercepts, so the host will get control for any START SUBCHANNEL the guest issues. QEMU is in charge of interpretation of channel I/O instructions, so it will process the ssch as a request to a pass-through device.

All channel I/O instructions are privileged, which means that the host kernel now needs to get involved again. QEMU does so by writing to an I/O region: the scsw (which contains, amongst other things, the fctl field specifying the start function) and the orb (pointing to the channel program). The host kernel driver now has enough information to actually issue the request on the real device after translating the ccw chain and its addresses to host addresses (involving pinning, idals and other things I will not explain here for brevity.)

After the device has processed the I/O request, it will make the subchannel status pending and generate an I/O interrupt. The host kernel driver collects the state and makes it available via the same I/O region (the IRB field), and afterwards triggers QEMU via an eventfd. QEMU now has all information needed to update its internal structures for the devices so that the guest can obtain the information related to the I/O request.

Isn't that all a bit too synchronous?

Yes, it is. Channel I/O is supposed to be asynchronous (give the device an I/O request, collect status later), but our implementation isn't yet. Why? Short answer: It is hard, and we wanted something to get us going. But this is on the list of things to be worked on.

Where is the IOMMU for this?

Due to the way channel programs work, we don't have a real IOMMU.

Does this cover everything supported by the architecture?

Not yet. Channel program wise, we support the format Linux drivers use. Also, we're emulating things like HALT SUBCHANNEL and CLEAR SUBCHANNEL in QEMU, while they really should be handed through to the device (support for this is in the works).

On the whole, you should be able to pass an ECKD DASD to a Linux guest without (known) issues.

How can I try this out?

Recent QEMU and Linux versions should have everything you need in the host; see this wiki entry for details. As a guest, any guest that can run under KVM should be fine.

What's the deal with that "unrestricted cssids" thing?

If you look at this older article, you'll notice the 'fe' value for the cssid of virtio devices (with the promise to explain it later... which I sadly never did). The basic idea at the time was to put 'virtual' devices like virtio and 'non-virtual' devices like vfio-ccw into different channel subsystem images, so that e.g. channel paths (which are per channel subsystem image) don't clash. In other words, 'virtual' and 'non-virtual' devices (and channel paths) would have different cssids (the first part of their identifiers).

This sounded like a good idea at the time; however, there's a catch: A guest operating system will by default only see the devices in the default channel subsystem image. To see all of them, it needs to explicitly enable the Multiple Channel Subsystems Extended (MCSS-E) feature - and I do not know of any operating system that has done so as of today (not very surprising, as QEMU is the only implementation of MCSS-E I'm aware of).

To work around this, we originally introduced the 's390-squash-mcss' parameter to QEMU, which would put all devices into the default channel subsystem image. But as MCSS-E support is unlikely to arrive in any guest operating system anytime soon, we agreed to rather drop the restriction of virtual devices being in css fe and non-virtual devices everywhere else (since QEMU 2.12).

What are the plans for the future?

Several things are already actively worked on, while others may come up later.

Intial libvirt support for vfio-ccw has been posted here.
Reworking the Linux host driver to make things more asynchronous and to support halt/clear is in progress.
Improvements in channel path handling (for example, to enable the guest to see path availability changes) are also in progress. We may need to consider things like dasd reserve/release as well.

Thursday, May 3, 2018

A vfio-ccw primer