Wednesday, July 10, 2019

s390x changes in QEMU 4.1

QEMU has just entered hard freeze for 4.1, so the time is here again to summarize the s390x changes for that release.

TCG

  • All instructions introduced with the "Vector Facility" of the z13 machines are now emulated by QEMU. In particular, this allows Linux distributions built for z13 or later to be run under TCG (vector instructions are generated when compiling for z13; other z13 facilities are optional).
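
For a quick check that a TCG guest really sees the vector facility, you can look at the guest's /proc/cpuinfo; a minimal sketch (the image name is a placeholder):

    # run a z13-built distribution under TCG
    $ qemu-system-s390x -machine s390-ccw-virtio,accel=tcg -m 4G \
        -drive file=guest.qcow2,if=virtio -nographic
    # inside the guest: "vx" should show up in the features line
    $ grep -w vx /proc/cpuinfo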

CPU Models

  • As the needed prerequisites have been implemented in TCG, the "qemu" cpu model now includes the "Vector Facility" and has been bumped to a stripped-down z13.
  • CPU models for the upcoming gen15 machines (the official name is not yet known) have been added, as well as some new facilities.
  • If the host kernel supports it, we now indicate the AP Queue Interruption facility. It is used by vfio-ap and allows AP interrupts to be delivered to the guest.
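
On the command line, this means something like the following now works under TCG ('vx' is the QEMU feature name for the vector facility; qemu-system-s390x -cpu help lists the models and features your build knows about):

    # the default model now includes vector support
    $ qemu-system-s390x -accel tcg -cpu qemu ...
    # or switch the facility off again explicitly
    $ qemu-system-s390x -accel tcg -cpu qemu,vx=off ...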

I/O Devices

  • vfio-ccw has gained support for relaying HALT SUBCHANNEL and CLEAR SUBCHANNEL requests from the guest to the device, if the host kernel vfio-ccw driver supports it. Otherwise, these instructions continue to be emulated by QEMU, as before.
  • The bios now supports IPLing (booting) from DASD attached via vfio-ccw.
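
Assuming the DASD has been set up for vfio-ccw as described in the vfio-ccw primer below, booting from it should look something like this (the bootindex property on vfio-ccw is my assumption here, and the sysfsdev path is a placeholder):

    $ qemu-system-s390x -machine s390-ccw-virtio \
        -device vfio-ccw,sysfsdev=/sys/bus/mdev/devices/<uuid>,bootindex=1 ...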

Booting

  • The bios now tolerates signatures written by zipl, if present, but does not actually handle them. See the 'secure' option for zipl introduced in s390-tools 2.9.0.
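
A hedged zipl sketch (the option name and values below are my recollection; check zipl(8) from s390-tools 2.9.0 or later for the exact syntax):

    # prepare the boot record, writing signatures where appropriate
    $ zipl --secure auto ...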

And the usual fixes and cleanups.

Tuesday, March 12, 2019

s390x changes in QEMU 4.0

QEMU is now entering softfreeze for the 4.0 release (expected in April), so here is the usual summary of s390x changes in that release.

CPU Models

  • A cpu model for the z14 GA 2 has been added. For now, it does not provide any new features compared to the z14 GA 1 model.
  • The cpu model for the z14 now does, however, include the multiple epoch and PTFF enhancement features by default.
  • The 'qemu' cpu model now includes the zPCI feature by default. No further prerequisites are needed for pci support (see below).
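
In practice, that means a pci device can now be used with the plain 'qemu' model; a sketch, pairing the pci device with an explicit zpci device so the s390 function ids are predictable (the ids and names below are made-up examples):

    $ qemu-system-s390x -cpu qemu \
        -device zpci,uid=1,fid=0,target=vpci1 \
        -device virtio-net-pci,id=vpci1 ...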

Devices

  • QEMU for s390x is now always built with pci support. For backwards compatibility, the s390 pci host bus needs to remain available in any case, so we cannot simply disable pci; it is easier to make pci support mandatory. Note that disabling pci was never supported by the normal build system anyway.
  • zPCI devices have gained support for instruction counters (on a Linux guest, these are exposed through /sys/kernel/debug/pci/<function>/statistics).
  • zPCI devices have always lacked support for migrating their s390-specific state (it simply was never implemented); if you tried to migrate a guest with a virtio-pci device on s390x, odd things might happen. To avoid surprises, the 'zpci' devices are now explicitly marked as unmigratable. (Support for migration will likely be added in the future.)
  • Hot(un)plug of the vfio-ap matrix device is now supported; see the sketch after this list.
  • Adding a vfio-ap matrix device no longer inhibits usage of a memory balloon: memory usage by vfio-ap does not clash with the concept of ballooning.
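
For hot(un)plug, something like the following should work on the HMP monitor (the mdev path is an example; see also the vfio-ap setup sketch in the QEMU 3.1 post below):

    (qemu) device_add vfio-ap,sysfsdev=/sys/devices/vfio_ap/matrix/<uuid>,id=ap0
    (qemu) device_del ap0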

TCG

  • Support for the floating-point extension facility has been added.
  • The first part of support for the z13 vector instructions has been added (the vector support instructions). Expect support for the remaining vector instructions in the next release, which should then cover enough of the instructions introduced with the z13 to run a distribution built for that cpu.

Tuesday, December 4, 2018

Notes from KVM Forum 2018

KVM Forum 2018 took place October 24 - 26 in Edinburgh, Scotland. Better late than never, here are some of my notes and impressions. As always, there was a lot going on, and I could not attend everything that I would have found interesting. Fortunately, video recordings are available (see the page linked above or the YouTube channel); here, I'd like to thank the folks organizing the logistics, recording the talks, and uploading nicely edited versions!

This year, KVM Forum was again co-located with OSS Europe, and on the first day (which also featured the annual QEMU summit), talks were on a shared track. This meant an opportunity for people attending OSS to hear some KVM and virtualization related talks; unfortunately, it also meant that the room where the KVM Forum talks were held was very crowded. Nevertheless, it is always nice if a talk is interesting enough to attract a good number of people; I'm happy that my maintainership talk also attracted a nice audience. Other talks from the first day I enjoyed were Alex's talk about L1TF and Marc's talk about running huge libvirt installations.

The second and third day featured some more comfortable rooms; organization-wise, I liked that talks about similar topics were grouped back-to-back.

On these days, we had the keynotes for KVM, QEMU, and libvirt, as well as the contributor Q&A panel - with some good questions from the audience. Also check out Christian's talk about the various architectures supported by KVM and how much commonality there is (or not).

Most of the time, days two and three were dual-track. Some of the topics covered were vfio and migration with vfio; nested virtualization; not-so-common architectures (including s390!); testing and continuous integration. I find it hard to point out specific sessions and recommend browsing through the posted videos instead.

Some topics were delved into more deeply in BOF sessions; myself, I attended the vfio migration BOF which gave me a couple of things to think about. Many BOF sessions subsequently posted summaries on the relevant mailing lists.

One of the most important features of any conference is, of course, the hallway track: Meeting new people, seeing old acquaintances again, and impromptu discussions about a lot of different topics. I find that this is one of the most valuable experiences, both for putting a face to a name and for discussing things you did not even think about beforehand.

So, for an even shorter summary of my short notes: KVM Forum 2018 was great, go watch some videos, and consider attending future KVM Forums :)

Wednesday, November 14, 2018

s390x changes in QEMU 3.1

QEMU is now in the -rc phase for 3.1, with a release expected in early/mid December, and, as usual, this is a good time to summarize the s390x changes for that release.

CPU models

  • s390x now supports the 'max' cpu model as well (which somehow had been forgotten...) When using KVM, this behaves like the 'host' model; when using TCG, this is the 'qemu' model plus some additional, experimental features. Note that this is neither static nor migration-safe.
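
For illustration, a minimal sketch:

    # with KVM, 'max' is equivalent to the 'host' model
    $ qemu-system-s390x -accel kvm -cpu max ...
    # with TCG, it is the 'qemu' model plus additional, experimental features
    $ qemu-system-s390x -accel tcg -cpu max ...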

Devices

  • Support for vfio-ap has been added. This allows passing crypto cards on the AP bus through to the guest. The necessary kernel support has been merged into Linux with 4.20. As this is a rather large feature, I plan to do a separate writeup for it.
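
Until that writeup materializes, here is a rough setup sketch along the lines of the kernel documentation; adapter and domain numbers are examples, and the queues first need to be removed from the default host drivers (via the apmask/aqmask sysfs attributes), which is omitted here:

    $ modprobe vfio_ap
    # create a mediated matrix device
    $ uuid=$(uuidgen)
    $ echo $uuid > /sys/devices/vfio_ap/matrix/mdev_supported_types/vfio_ap-passthrough/create
    # assign adapter 0x05 and domain 0x47 to it
    $ echo 0x05 > /sys/devices/vfio_ap/matrix/$uuid/assign_adapter
    $ echo 0x47 > /sys/devices/vfio_ap/matrix/$uuid/assign_domain
    # pass the matrix device to the guest
    $ qemu-system-s390x ... -device vfio-ap,sysfsdev=/sys/devices/vfio_ap/matrix/$uuid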

KVM

  • Support for enabling huge page backing has been added (a sketch follows below). This requires a host kernel of version 4.19 or higher. Note that this is only available for the s390-ccw-virtio-3.1 or later machines (due to compat handling), and that, as of this writing, it is incompatible with nested virtualization (which should change in the future).
  • Support for the etoken facility (spectre mitigation) has been added. This also needs a host kernel of version 4.19 or higher.
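
A minimal sketch for the huge page case, assuming hugetlbfs is mounted at /dev/hugepages with enough 1M pages reserved:

    $ qemu-system-s390x -machine s390-ccw-virtio-3.1,accel=kvm \
        -m 4G -mem-path /dev/hugepages ...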

TCG

  • Support for instruction flags and AFP registers has been added.

Miscellaneous

  • The deprecated 's390-squash-mcss' option has been removed.
  • And the usual fixes, cleanups and improvements.

Wednesday, August 1, 2018

s390x changes in QEMU 3.0

QEMU 3.0 is currently in the late -rc phase (with the final release expected early/mid August), so here's a quick summary of what has been changed for s390x.

CPU models

  • A CPU model for the z14 Model ZR1 has been added. This is the "small", single-frame z14.
  • The feature bits for Spectre mitigation (bpb and ppa15) are now included in the default CPU model for z196 and up. This means that these features will be available to the guest (provided the host supports them) without needing to specify them explicitly.
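
Since these are regular cpu features, they can still be toggled explicitly; a sketch:

    # switch the features off again, e.g. to keep the guest migratable to a
    # host lacking them
    $ qemu-system-s390x -cpu z196,bpb=off,ppa15=off ...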

Devices

  • You can now configure consoles via -serial as well.
  • vfio-ccw devices have gained a "force-orb-pfch" property. This is not very useful for Linux guests, but it will help if you are trying to use vfio-ccw with a guest that does not specify "unlimited prefetch" for its requests, yet does not actually rely on those semantics. Adding support to vfio-ccw for channel programs that must not be prefetched is unfortunately not straightforward and will not happen in the foreseeable future.
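
Hedged examples for both (the sysfsdev path is a placeholder, and I'm assuming the usual on/off syntax for the property):

    # hook the console up to stdio, multiplexed with the monitor
    $ qemu-system-s390x -nographic -serial mon:stdio ...
    # force the prefetch bit in the orb for a passed-through device
    $ qemu-system-s390x ... \
        -device vfio-ccw,sysfsdev=/sys/bus/mdev/devices/<uuid>,force-orb-pfch=on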

Booting and s390 bios

  • The s390-netboot image has been enhanced: It now supports indirect loading via .INS files and pxelinux.cfg-style booting.
  • The boot menu can now also deal with non-sequential entries.
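
For the pxelinux.cfg-style case, a configuration sketch (kernel and initrd paths are examples; the exact keyword support is best checked against the bios documentation):

    default linux
    label linux
      kernel boot/kernel.img
      initrd boot/initrd.img
      append root=/dev/ram0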

Miscellaneous

  • Handling of the TOD clock in tcg has been improved; CPU hotplug under tcg is now working (a quick sketch follows below).
  • And the usual fixes, cleanups and improvements.
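
A quick sketch of cpu hotplug under tcg, as mentioned above (the cpu type name and the core-id property reflect my understanding of the s390x hotplug interface; double-check with query-hotpluggable-cpus):

    # leave room for more cpus at startup
    $ qemu-system-s390x -accel tcg -smp 1,maxcpus=4 ...
    # then, on the HMP monitor:
    (qemu) device_add qemu-s390x-cpu,core-id=1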

Thursday, May 3, 2018

A vfio-ccw primer

While basic support for vfio-ccw has been included in Linux and QEMU for some time, work has recently started to ramp up again and it seems like a good time to give some basic overview.

Why vfio-ccw?

Historically, QEMU on s390x presented paravirtualized virtio devices to the guest; first via a protocol inspired by lguest, later as emulated channel devices. This satisfies most needs (you get block devices, network devices, a console device, and lots more), but the device types differ from those found on LPARs or under z/VM, and you may have a need to use e.g. a DASD directly.

For that reason, we want to do the same thing as on other platforms: pass a host device to the guest directly via vfio.

How does this work?

vfio-ccw uses the vfio mediated device framework; see the kernel documentation for an overview.

In a nutshell: The subchannel to be passed to the guest is unbound from its normal host driver (in this case, the I/O subchannel driver) and bound to the vfio-ccw driver. Any I/O request is intercepted and executed on the real device, and interrupts from the real device are relayed back to the guest.
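
In sysfs terms, the rebinding looks roughly like this (subchannel 0.0.013f is an example, and the mdev type name follows the kernel documentation):

    $ modprobe vfio_ccw
    # detach the subchannel from the standard I/O subchannel driver and
    # attach it to vfio_ccw
    $ echo 0.0.013f > /sys/bus/css/devices/0.0.013f/driver/unbind
    $ echo 0.0.013f > /sys/bus/css/drivers/vfio_ccw/bind
    # create a mediated device for it
    $ uuid=$(uuidgen)
    $ echo $uuid > /sys/bus/css/devices/0.0.013f/mdev_supported_types/vfio_ccw-io/create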

Why subchannels and not ccw devices?

The initial attempt to implement this actually worked at the ccw device level. However, this means that the Linux common I/O layer in the host will perform various actions like handling of channel paths - which may interfere with what the guest is trying to do. Therefore, it seemed like a better idea to keep out of the way as much as possible and just implement a minimal subchannel driver that does not do much beyond what the guest actually triggered itself.

How is an actual I/O request processed?

When the guest is ready to use a channel device, it will issue I/O requests via channel programs (see here for an explanation on how that works and what things like scsw and orb mean.) The channel I/O instructions are mandatory SIE intercepts, so the host will get control for any START SUBCHANNEL the guest issues. QEMU is in charge of interpretation of channel I/O instructions, so it will process the ssch as a request to a pass-through device.

All channel I/O instructions are privileged, which means that the host kernel now needs to get involved again. QEMU does so by writing to an I/O region: the scsw (which contains, amongst other things, the fctl field specifying the start function) and the orb (pointing to the channel program). The host kernel driver now has enough information to actually issue the request on the real device after translating the ccw chain and its addresses to host addresses (involving pinning, idals and other things I will not explain here for brevity.)

After the device has processed the I/O request, it will make the subchannel status pending and generate an I/O interrupt. The host kernel driver collects the state and makes it available via the same I/O region (the IRB field), and afterwards triggers QEMU via an eventfd. QEMU now has all information needed to update its internal structures for the devices so that the guest can obtain the information related to the I/O request.

Isn't that all a bit too synchronous?

Yes, it is. Channel I/O is supposed to be asynchronous (give the device an I/O request, collect status later), but our implementation isn't yet. Why? Short answer: It is hard, and we wanted something to get us going. But this is on the list of things to be worked on.

Where is the IOMMU for this?

Due to the way channel programs work, we don't have a real IOMMU.

Does this cover everything supported by the architecture?

Not yet. Channel program wise, we support the format that Linux drivers use. Also, we're emulating things like HALT SUBCHANNEL and CLEAR SUBCHANNEL in QEMU, while they really should be passed through to the device (support for this is in the works).

On the whole, you should be able to pass an ECKD DASD to a Linux guest without (known) issues.

How can I try this out?

Recent QEMU and Linux versions should have everything you need in the host; see this wiki entry for details. As a guest, any guest that can run under KVM should be fine.
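
On the QEMU side, attaching the mediated device created earlier boils down to something like this (devno and uuid being examples; devno gives the device number the guest will see):

    $ qemu-system-s390x ... \
        -device vfio-ccw,devno=fe.0.1234,sysfsdev=/sys/bus/mdev/devices/$uuid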

What's the deal with that "unrestricted cssids" thing?

If you look at this older article, you'll notice the 'fe' value for the cssid of virtio devices (with the promise to explain it later... which I sadly never did). The basic idea at the time was to put 'virtual' devices like virtio and 'non-virtual' devices like vfio-ccw into different channel subsystem images, so that e.g. channel paths (which are per channel subsystem image) don't clash. In other words, 'virtual' and 'non-virtual' devices (and channel paths) would have different cssids (the first part of their identifiers).

This sounded like a good idea at the time; however, there's a catch: A guest operating system will by default only see the devices in the default channel subsystem image. To see all of them, it needs to explicitly enable the Multiple Channel Subsystems Extended (MCSS-E) feature - and I do not know of any operating system that has done so as of today (not very surprising, as QEMU is the only implementation of MCSS-E I'm aware of).

To work around this, we originally introduced the 's390-squash-mcss' parameter to QEMU, which squashes all devices into the default channel subsystem image. But as MCSS-E support is unlikely to arrive in any guest operating system anytime soon, we instead agreed to drop the restriction of virtual devices living in css fe and non-virtual devices everywhere else (since QEMU 2.12).
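
So, since QEMU 2.12, nothing stops you from mixing the two in one channel subsystem image; a sketch (device ids and paths are examples):

    $ qemu-system-s390x ... \
        -device virtio-blk-ccw,drive=disk1,devno=fe.0.0001 \
        -device vfio-ccw,sysfsdev=/sys/bus/mdev/devices/$uuid,devno=fe.0.0002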

What are the plans for the future?

Several things are already actively worked on, while others may come up later.
  • Initial libvirt support for vfio-ccw has been posted here.
  • Reworking the Linux host driver to make things more asynchronous and to support halt/clear is in progress.
  • Improvements in channel path handling (for example, to enable the guest to see path availability changes) are also in progress. We may need to consider things like dasd reserve/release as well.

Monday, March 26, 2018

s390x changes in QEMU 2.12

As QEMU is now in hard freeze for 2.12 (with the final release expected in mid/late April), now is a good point in time to summarize some of the changes that made it into QEMU 2.12 for s390x.

I/O Devices

  • Channel I/O: Any device can now be put into any channel subsystem image, regardless whether it is a virtual device (like virtio-ccw) or a device passed through via vfio-ccw. This obsoletes the s390-squash-mcss option (which was needed to explicitly squash vfio-ccw devices into the default channel subsystem image in order to make it visible to guests not enabling MCSS-E).
  • PCI: Fixes and refactoring, including handling of subregions. This enables usage of virtio-pci devices on s390x (although only if MSI-X is enabled, as s390x depends on it). Previously, you could add virtio-pci devices on s390x, but they were not usable.
    For more information about PCI, see this blog entry.

Booting and s390-ccw bios

  • Support for an interactive boot menu. Note that this works a bit differently than on other architectures (although it hooks into the same infrastructure): the boot menu is written to the (virtual) disk via the 'zipl' program, and its entries need to be parsed and displayed via SCLP.
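
On the QEMU side, the menu is enabled via the usual boot options; a sketch (splash-time gives the menu timeout in milliseconds, if I remember the semantics correctly):

    $ qemu-system-s390x -boot menu=on,splash-time=5000 ...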

System Emulation

  • KVM: In case you were short on memory before: You can now run guests with 8 TB or more.
  • KVM: Support for the bpb and ppa15 CPU features (for spectre mitigation). These have been backported to 2.11.1 as well.
  • TCG: Lots of improvements: Implementation of missing instructions, full (non-experimental) SMP support.
  • TCG: Improvements in the handling of the STSI instruction (you can look at some information obtained that way via /proc/sysinfo). Note that a TCG guest reports itself as a KVM guest rather than an LPAR: in many ways, a TCG guest is closer to KVM, and reporting itself as an LPAR would make the Linux guest code choose an undesired target for its console output by default.
  • TCG: Wire up the zPCI instructions; you can now use virtio-pci devices under TCG.
  • CPU models: Switch the 'qemu' model to a stripped-down z12, adding all features required by kernels on recent distributions. This means that you can now run recent distributions (Fedora 26/27, Ubuntu 18.04, ...) under TCG. Older distributions may not work (their older kernels required some features not yet implemented under TCG), unless, like Debian stable, they were built for a z900.

Miscellaneous

  • Support for memory hotplug via SCLP has been removed. This was an odd interface: Unlike on other architectures, the guest could enable 'standby' memory if it had been supplied. Another problem was that this never worked with migration. Old command lines will continue to work, but no 'standby' memory will be available to the guest any more.
    Memory hotplug on s390x will probably come back in the future with an interface that matches better what is done elsewhere, likely via some paravirtualized interface. Support for the SCLP interface might come back in the future as well, implemented in an architecture-specific way that does not try to look like memory hotplug elsewhere.
  • And of course, the usual fixes, cleanups and other improvements.