KVM, QEMU and Big Iron: 2018

Tuesday, December 4, 2018

Notes from KVM Forum 2018

KVM Forum 2018 took place October 24-26 in Edinburgh, Scotland. Better late than never, here are some of my notes and impressions. As always, there was a lot going on, and I could not attend everything that I would have found interesting. Fortunately, video recordings are available (see the page linked above or the YouTube channel); here, I'd like to thank the folks organizing the logistics, recording the talks, and uploading nicely edited versions!

This year, KVM Forum was again co-located with OSS Europe, and on the first day (which also featured the annual QEMU summit), talks were on a shared track. This meant an opportunity for people attending OSS to hear some KVM and virtualization related talks; unfortunately, it also meant that the room where the KVM Forum talks were held was very crowded. Nevertheless, it is always nice if a talk is interesting enough to attract a good number of people; I'm happy that my maintainership talk also attracted a nice audience. Other talks from the first day I enjoyed were Alex' talk about L1TF and Marc's talk about running huge libvirt installations.

The second and third day featured some more comfortable rooms; organization-wise, I liked that talks about similar topics were grouped back-to-back.

On these days, we had the keynotes for KVM, QEMU, and libvirt, as well as the contributor Q&A panel, with some good questions from the audience. Also check out Christian's talk about the various architectures supported by KVM and how much commonality there is (or not).

Most of the time, days two and three were dual-track. Some of the topics covered were vfio and migration with vfio; nested virtualization; not-so-common architectures (including s390!); testing and continuous integration. I find it hard to point out specific sessions and recommend browsing through the posted videos instead.

Some topics were delved into more deeply in BOF sessions; myself, I attended the vfio migration BOF which gave me a couple of things to think about. Many BOF sessions subsequently posted summaries on the relevant mailing lists.

One of the most important features of any conference is, of course, the hallway track: Meeting new people, seeing old acquaintances again, and impromptu discussions about a lot of different topics. I find that this is one of the most valuable experiences, both for putting a face to a name and for discussing things you did not even think about beforehand.

So, for an even shorter summary of my short notes: KVM Forum 2018 was great, go watch some videos, and consider attending future KVM Forums :)

Wednesday, November 14, 2018

s390x changes in QEMU 3.1

QEMU is now in the -rc phase for 3.1, with a release expected in early/mid December, and, as usual, this is a good time to summarize the s390x changes for that release.

CPU models

s390x now supports the 'max' cpu model as well (which somehow had been forgotten...). When using KVM, this behaves like the 'host' model; when using TCG, this is the 'qemu' model plus some additional, experimental features. Note that this is neither static nor migration-safe.

Devices

Support for vfio-ap has been added. This allows passing crypto cards on the AP bus through to the guest. The kernel side of this has been merged into Linux 4.20. As this is a rather large feature, I plan to do a separate writeup for this.

KVM

Support for enabling huge page backing has been added. This requires a host kernel of version 4.19 or higher. Note that this is only available for the s390-ccw-virtio-3.1 or later machines (due to compat handling), and that, as of this writing, it is incompatible with nested virtualization (which should change in the future).
Support for the etoken facility (spectre mitigation) has been added. This, as well, needs a host kernel of version 4.19 or higher.

TCG

Support for instruction flags and AFP registers has been added.

Miscellaneous

The deprecated 's390-squash-mcss' option has been removed.
And the usual fixes, cleanups and improvements.

Wednesday, August 1, 2018

s390x changes in QEMU 3.0

QEMU 3.0 is currently in the late -rc phase (with the final release expected early/mid August), so here's a quick summary of what has been changed for s390x.

CPU models

A CPU model for the z14 Model ZR1 has been added. This is the "small", single-frame z14.
The feature bits for Spectre mitigation (bpb and ppa15) are now included in the default CPU model for z196 and up. This means that these features will be available to the guest (given the host supports them) without needing to specify them explicitly.

Devices

You can now configure consoles via -serial as well.
vfio-ccw devices have gained a "force-orb-pfch" property. This is not very useful for Linux guests, but if you are trying to use vfio-ccw with a guest that does not specify "unlimited prefetch" for its requests but does not actually rely on the semantics, this will help you. Adding support to vfio-ccw to accommodate channel programs that must not be prefetched is unfortunately not straightforward and will not happen in the foreseeable future.
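As a rough sketch of how the new property might be set: the helper below only prints a -device option for a vfio-ccw device with "force-orb-pfch" switched on. The function name and the mdev path are made up for illustration, and I am assuming the device is specified via the "sysfsdev" property pointing at the mediated device.

```shell
# Print (not run) a -device option that forces the prefetch bit in the orb.
# The sysfs path passed in is a placeholder for your mediated device.
vfio_ccw_device_opt() {
    sysfsdev="$1"
    printf '%s\n' "-device vfio-ccw,sysfsdev=$sysfsdev,force-orb-pfch=on"
}

# Hypothetical mdev path; substitute the uuid of your own device.
vfio_ccw_device_opt /sys/bus/mdev/devices/00000000-0000-0000-0000-000000000001
```

Append the printed option to your usual qemu-system-s390x command line.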

Booting and s390 bios

The s390-netboot image has been enhanced: It now supports indirect loading via .INS files and pxelinux.cfg-style booting.
The boot menu can now also deal with non-sequential entries.

Miscellaneous

Handling of the TOD clock in tcg has been improved; CPU hotplug under tcg is now working.
And the usual fixes, cleanups and improvements.

Thursday, May 3, 2018

A vfio-ccw primer

While basic support for vfio-ccw has been included in Linux and QEMU for some time, work has recently started to ramp up again and it seems like a good time to give some basic overview.

Why vfio-ccw?

Historically, QEMU on s390x presented paravirtualized virtio devices to the guest; first, via a protocol inspired by lguest, later, as emulated channel devices. This satisfies most needs (you get block devices, network devices, a console device, and lots more), but the device types are different from those found on LPARs or z/VM guests, and you may have a need to use e.g. a DASD directly.

For that reason, we want to do the same thing as on other platforms: pass a host device to the guest directly via vfio.

How does this work?

vfio-ccw is using the vfio mediated device framework; see the kernel documentation for an overview.

In a nutshell: The subchannel to be passed to the guest is unbound from its normal host driver (in this case, the I/O subchannel driver) and bound to the vfio-ccw driver. Any I/O request is intercepted and executed on the real device, and interrupts from the real device are relayed back to the guest.
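The unbind/bind dance can be sketched as a small shell helper. This is only a sketch under assumptions: the subchannel id 0.0.0313 and the uuid are placeholders, the sysfs paths follow the layout given in the kernel's vfio-ccw documentation and may differ on your kernel, and the helper prints the commands instead of running them (they need root on an s390x host).

```shell
# Print the steps for handing a subchannel over to vfio-ccw.
# Nothing is executed here; run the printed commands as root on the host.
vfio_ccw_plan() {
    schid="$1"   # subchannel id, e.g. 0.0.0313 (placeholder)
    uuid="$2"    # uuid for the mediated device, e.g. from uuidgen
    dev="/sys/bus/css/devices/$schid"
    # 1. unbind the subchannel from its normal (I/O subchannel) driver
    printf '%s\n' "echo $schid > $dev/driver/unbind"
    # 2. bind it to the vfio-ccw driver
    printf '%s\n' "echo $schid > /sys/bus/css/drivers/vfio_ccw/bind"
    # 3. create the mediated device that QEMU will use
    printf '%s\n' "echo $uuid > $dev/mdev_supported_types/vfio_ccw-io/create"
}

vfio_ccw_plan 0.0.0313 00000000-0000-0000-0000-000000000001
```

Printing first makes it easy to double-check that you picked the right subchannel before taking it away from the host.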

Why subchannels and not ccw devices?

The initial attempt to implement this actually worked at the ccw device level. However, this means that the Linux common I/O layer in the host will perform various actions like handling of channel paths - which may interfere with what the guest is trying to do. Therefore, it seemed like a better idea to keep out of the way as much as possible and just implement a minimal subchannel driver that does not do much beyond what the guest actually triggered itself.

How is an actual I/O request processed?

When the guest is ready to use a channel device, it will issue I/O requests via channel programs (see here for an explanation on how that works and what things like scsw and orb mean.) The channel I/O instructions are mandatory SIE intercepts, so the host will get control for any START SUBCHANNEL the guest issues. QEMU is in charge of interpretation of channel I/O instructions, so it will process the ssch as a request to a pass-through device.

All channel I/O instructions are privileged, which means that the host kernel now needs to get involved again. QEMU does so by writing to an I/O region: the scsw (which contains, amongst other things, the fctl field specifying the start function) and the orb (pointing to the channel program). The host kernel driver now has enough information to actually issue the request on the real device after translating the ccw chain and its addresses to host addresses (involving pinning, idals and other things I will not explain here for brevity.)

After the device has processed the I/O request, it will make the subchannel status pending and generate an I/O interrupt. The host kernel driver collects the state and makes it available via the same I/O region (the IRB field), and afterwards triggers QEMU via an eventfd. QEMU now has all information needed to update its internal structures for the devices so that the guest can obtain the information related to the I/O request.

Isn't that all a bit too synchronous?

Yes, it is. Channel I/O is supposed to be asynchronous (give the device an I/O request, collect status later), but our implementation isn't yet. Why? Short answer: It is hard, and we wanted something to get us going. But this is on the list of things to be worked on.

Where is the IOMMU for this?

Due to the way channel programs work, we don't have a real IOMMU.

Does this cover everything supported by the architecture?

Not yet. Channel program wise, we support the format Linux drivers use. Also, we're emulating things like HALT SUBCHANNEL and CLEAR SUBCHANNEL in QEMU, while they really should be handed through to the device (support for this is in the works).

On the whole, you should be able to pass an ECKD DASD to a Linux guest without (known) issues.

How can I try this out?

Recent QEMU and Linux versions should have everything you need in the host; see this wiki entry for details. As a guest, any guest that can run under KVM should be fine.

What's the deal with that "unrestricted cssids" thing?

If you look at this older article, you'll notice the 'fe' value for the cssid of virtio devices (with the promise to explain it later... which I sadly never did). The basic idea at the time was to put 'virtual' devices like virtio and 'non-virtual' devices like vfio-ccw into different channel subsystem images, so that e.g. channel paths (which are per channel subsystem image) don't clash. In other words, 'virtual' and 'non-virtual' devices (and channel paths) would have different cssids (the first part of their identifiers).

This sounded like a good idea at the time; however, there's a catch: A guest operating system will by default only see the devices in the default channel subsystem image. To see all of them, it needs to explicitly enable the Multiple Channel Subsystems Extended (MCSS-E) feature - and I do not know of any operating system that has done so as of today (not very surprising, as QEMU is the only implementation of MCSS-E I'm aware of).

To work around this, we originally introduced the 's390-squash-mcss' parameter to QEMU, which would put all devices into the default channel subsystem image. But as MCSS-E support is unlikely to arrive in any guest operating system anytime soon, we agreed to rather drop the restriction of virtual devices being in css fe and non-virtual devices everywhere else (since QEMU 2.12).
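To make the "first part of the identifier" a bit more concrete: a channel device id has the form cssid.ssid.devno, so a virtio device following the old convention would look like fe.0.0001. A minimal sketch that splits such an id into its parts (the function name and the sample ids are made up for illustration):

```shell
# Split a channel device id of the form <cssid>.<ssid>.<devno>.
split_busid() {
    busid="$1"
    cssid="${busid%%.*}"   # everything before the first dot
    rest="${busid#*.}"     # everything after the first dot
    ssid="${rest%%.*}"
    devno="${rest#*.}"
    printf '%s\n' "cssid=$cssid ssid=$ssid devno=$devno"
}

split_busid fe.0.0001   # prints: cssid=fe ssid=0 devno=0001
split_busid 0.0.1234    # prints: cssid=0 ssid=0 devno=1234
```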

What are the plans for the future?

Several things are already actively worked on, while others may come up later.

Initial libvirt support for vfio-ccw has been posted here.
Reworking the Linux host driver to make things more asynchronous and to support halt/clear is in progress.
Improvements in channel path handling (for example, to enable the guest to see path availability changes) are also in progress. We may need to consider things like dasd reserve/release as well.

Monday, March 26, 2018

s390x changes in QEMU 2.12

As QEMU is now in hard freeze for 2.12 (with the final release expected in mid/late April), now is a good point in time to summarize some of the changes that made it into QEMU 2.12 for s390x.

I/O Devices

Channel I/O: Any device can now be put into any channel subsystem image, regardless of whether it is a virtual device (like virtio-ccw) or a device passed through via vfio-ccw. This obsoletes the s390-squash-mcss option (which was needed to explicitly squash vfio-ccw devices into the default channel subsystem image in order to make them visible to guests not enabling MCSS-E).
PCI: Fixes and refactoring, including handling of subregions. This enables usage of virtio-pci devices on s390x (although only if MSI-X is enabled, as s390x depends on it.) Previously, you could add virtio-pci devices on s390x, but they were not usable.
For more information about PCI, see this blog entry.

Booting and s390-ccw bios

Support for an interactive boot menu. Note that this is a bit different than on other architectures (although it hooks into the same infrastructure). The boot menu is written on the (virtual) disk via the 'zipl' program, and these entries need to be parsed and displayed via SCLP.

System Emulation

KVM: In case you were short on memory before: You can now run guests with 8 TB or more.
KVM: Support for the bpb and ppa15 CPU features (for spectre mitigation). These have been backported to 2.11.1 as well.
TCG: Lots of improvements: Implementation of missing instructions, full (non-experimental) SMP support.
TCG: Improvements in handling of the STSI instruction (you can look at some information obtained that way via /proc/sysinfo.) Note that a TCG guest reports itself as a KVM guest, rather than an LPAR: In many ways, a TCG guest is closer to KVM, and reporting itself as an LPAR makes the Linux guest code choose an undesired target for its console output by default.
TCG: Wire up the zPCI instructions; you can now use virtio-pci devices under TCG.
CPU models: Switch the 'qemu' model to a stripped-down z12, adding all features required by kernels on recent distributions. This means that you can now run recent distributions (Fedora 26/27, Ubuntu 18.04, ...) under TCG. Older distributions may not work (older kernels required some features not implemented under TCG), unless they were built for a z900, as Debian stable is.

Miscellaneous

Support for memory hotplug via SCLP has been removed. This was an odd interface: Unlike on other architectures, the guest could enable 'standby' memory if it had been supplied. Another problem was that this never worked with migration. Old command lines will continue to work, but no 'standby' memory will be available to the guest any more.
Memory hotplug on s390x will probably come back in the future with an interface that matches better what is done elsewhere, likely via some paravirtualized interface. Support for the SCLP interface might come back in the future as well, implemented in an architecture-specific way that does not try to look like memory hotplug elsewhere.
And of course, the usual fixes, cleanups and other improvements.

Monday, February 19, 2018

Notes on PCI on s390x

As QEMU 2.12 will finally support PCI devices under s390x/tcg, I thought now is a good time to talk about some of the peculiarities of PCI on the mainframe.

Oddities of PCI on s390x architecture

Oddity #1: No MMIO, but instructions

Everywhere else, you use MMIO when interacting with PCI devices. Not on s390x; you have a set of instructions instead. For example, if you want to read or write memory, you will need to use the PCILG or PCISTG instructions, and for refreshing translations, you will need to use the RPCIT instruction. Fortunately, these instructions can be matched to the primitives in the Linux kernel; unfortunately, all those instructions are privileged, which leads us to

Oddity #2: No user space I/O

As any interaction with PCI devices needs to be done through privileged instructions, Linux user space can't interact with the devices directly; the Linux kernel needs to be involved in every case. This means that there are none of the PCI user space implementations popular on other platforms available on s390x.

Oddity #3: No topology, but FID and UID

Usually, you'll find busses, slots and functions when you identify a certain PCI function. The PCI instructions on s390x, however, don't expose any topology to the caller. This means that an operating system will get a simple list of functions, each with a function id (FID) that can be mapped to a physical slot and a UID, which the Linux kernel will map to a domain number. A PCI identifier under Linux on s390x will therefore always be of the form <domain>:00:00.0.

Implications for the QEMU implementation of PCI on s390x

In order to support PCI on s390x in QEMU, some specialties had to be implemented.

Instruction handlers

Under KVM, every PCI instruction intercepts and is routed to user space. QEMU does the heavy lifting of emulating the operations and mapping to generic PCI code. This also implied that PCI under tcg did not work until the instructions had been wired up; this has now finally happened and will be in the 2.12 release.

Modelling and (lack of) topology

QEMU PCI code expects the normal topology present on other platforms. However, this (made-up) topology will be invisible to guests, as the PCI instructions do not relay it. Instead, there is a special "zpci" device with "fid" and "uid" properties that can be linked to a normal PCI device. If no zpci device is specified, QEMU will autogenerate the FID and the UID.

How can I try this out?

If you do not have a real mainframe with a real PCI card, you can use virtio-pci devices as of QEMU 2.12 (or current git as of the time of this writing). If you do have a mainframe and a PCI card, you can use vfio-pci (but not yet via libvirt).

Here's an example of how to specify a virtio-net-pci device for s390x, using tcg:

s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu qemu,zpci=on (...) -device zpci,uid=12,fid=2,target=vpci02,id=zpci2 -device virtio-net-pci,id="vpci02",addr=0x2

Some notes on this:

You need to explicitly enable the "zpci" feature in the qemu cpu model. Two other features, "aen" and "ais", are enabled by default ("aen" and "zpci" are mandatory, "ais" is needed for Linux guest kernels prior to 4.15. If you use KVM, the host kernel also needs support for ais.)
The zpci device is joined with the PCI device via the "target" property. The virtio-net-pci device does not know anything about zpci devices.
Only virtio-pci devices using MSI-X will work on s390x.

In the guest, this device will show up in lspci -v as

000c:00:00.0 Ethernet controller: Red Hat, Inc. Virtio network device
Subsystem: Red Hat, Inc. Device 0001
Physical Slot: 00000002

Note how the uid of 12 shows up as domain 000c and the fid of 2 as physical slot 00000002.
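The mapping is just the decimal values from the command line rendered as zero-padded hex. A small sketch to predict what a given uid/fid pair will look like in the guest (the function name is made up; the field widths are taken from the lspci output above):

```shell
# Render uid and fid the way they show up in the guest's lspci output:
# the uid as a 4-digit hex domain, the fid as an 8-digit hex physical slot.
zpci_ids() {
    uid="$1"
    fid="$2"
    printf 'domain=%04x slot=%08x\n' "$uid" "$fid"
}

zpci_ids 12 2   # prints: domain=000c slot=00000002
```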