Friday, July 10, 2020

s390x changes in QEMU 5.1

QEMU has entered softfreeze for 5.1, so it is time to summarize the s390x changes in that version.

Protected virtualization

One of the biggest features on the s390/KVM side in Linux 5.7 was protected virtualization, aka secure execution, which essentially prevents the (untrusted) hypervisor from accessing the guest's memory and delegates many tasks to the (trusted) ultravisor. QEMU 5.1 introduces the QEMU part of the feature.
In order to be able to run protected guests, you need to run on a z15 or a LinuxONE III, with at least a 5.7 kernel. You also need an up-to-date s390-tools installation. Some details are available in the QEMU documentation. For more information about what protected virtualization is, watch this talk from KVM Forum 2019 and this talk from 36C3.
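For the impatient, a minimal sketch of the workflow (file names and the host key document are placeholders; check the documentation linked above for the real procedure): the host kernel needs to be started with the prot_virt=1 parameter, and the guest image has to be prepared with the genprotimg tool from s390-tools:

# prepare a bootable secure execution image; HKD-example.crt stands in
# for the host key document matching your machine
genprotimg -i vmlinuz -r initrd.img -p parmfile -k HKD-example.crt -o se-image

The guest then IPLs the resulting image and transitions into protected mode during boot.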

vfio-ccw

vfio-ccw has also seen some improvements over the last release cycle.
  • Requests that do not explicitly allow prefetching in the ORB are no longer rejected out of hand (although the kernel may still reject them if you run a pre-5.7 version). The rationale behind this is that most device drivers never modify their channel programs dynamically, and the one common code path that does (IPL from DASD) is already accommodated by the s390-ccw bios. While you can instruct QEMU to ignore the prefetch requirement for selected devices, this is an additional administrative complication for little benefit; it is therefore no longer required.
  • In order to be able to relay changes in channel path status to the guest, two new regions have been added: a schib region to relay real subchannel information for stsch, and a crw region to relay channel reports. If, for example, a channel path is varied off on the host, all guests using a vfio-ccw device on that channel path now get a proper channel report for it.
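If you want to see this in action, something like the following should work (the chpid is, of course, just an example):

# on the host: vary off channel path 0.12 (chchp is part of s390-tools)
chchp -v 0 0.12
# in the guest: lschp should now show the path as not operational
lschp

Vary the path back on with chchp -v 1 0.12 afterwards.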

Other changes

Other than the bigger features mentioned above, there have been the usual fixes, improvements, and cleanups, both in the main s390x QEMU code and in the s390-ccw bios.

Wednesday, April 8, 2020

s390x changes in QEMU 5.0

QEMU is currently in hardfreeze, with the 5.0 release expected at the end of the month. Here's a quick list of some notable s390x changes.

  • You can finally enable Adapter Interrupt Suppression in the cpu model (ais=on) when running under KVM; see the sketch after this list. This had been working under TCG for some time, but KVM only gained the needed interface later, and QEMU now actually checks for that interface. This is mostly interesting for PCI.
  • QEMU had been silently fixing up odd memory sizes to something that can be reported via SCLP for some time. Silently changing user input is probably not such a good idea; compat machines will continue to do so in order to keep migration from old QEMUs for machines with odd sizes working, but will now print a warning. If you have such an old machine (and you can modify it), it might be a good idea to either specify the memory size it gets rounded to or to switch to the 5.0 machine type, where memory sizes can be more fine-grained due to the removal of support for memory hotplug. We may want to get rid of the fixup code at some point in the future.
  • QEMU now properly performs the whole set of initial, clear, and normal cpu resets.
  • And the usual fixes, cleanups, and improvements.
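Here's the promised sketch for enabling AIS; the rest of the command line is elided:

qemu-system-s390x -cpu host,ais=on ...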
For 5.1, expect more changes; support for protected virtualization will be a big item.

Wednesday, January 22, 2020

Channel Measurements: A Quick Overview

The s390 channel subsystem can gather some statistics on I/O performance for you, which might be useful if you are trying to figure out why something is not performing as well as you'd expect. From a QEMU/KVM perspective, this is currently mainly useful on the host.

Channel monitoring for ccw devices

The first kind of channel measurement is data collected per subchannel. For a detailed overview of what actually happens there, turn to the Principles of Operation, Chapter 17 ("I/O Support Functions"), "Channel Monitoring". I'll cover here what will most likely be of interest to people running a Linux (host) system.

Enabling channel measurements

If you are running a non-vintage machine (i.e. a z990 or later), you will not need any system-wide setup. Older machines should be fine as well, as long as you do not want to measure more than 1024 devices.

To enable measurements for a specific ccw device (say, 0.0.1234), simply issue:

chccwdev -a cmb_enable=1 0.0.1234
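Alternatively (and this is what chccwdev does under the covers), you should be able to write to the device's cmb_enable attribute directly:

echo 1 > /sys/bus/ccw/devices/0.0.1234/cmb_enable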

Measurements collected

Under /sys/bus/ccw/devices/0.0.1234/, you should now have a new subdirectory called cmf, which contains some files. For a system that has been running for some time, the contents may look something like the following:

head cmf/*
==> cmf/avg_control_unit_queuing_time <==
0
==> cmf/avg_device_active_only_time <==
0
==> cmf/avg_device_busy_time <==
0
==> cmf/avg_device_connect_time <==
829031
==> cmf/avg_device_disconnect_time <==
398526
==> cmf/avg_function_pending_time <==
142810
==> cmf/avg_initial_command_response_time <==
19170
==> cmf/avg_sample_interval <==
8401681344
==> cmf/avg_utilization <==
00.0%
==> cmf/sample_count <==
10803
==> cmf/ssch_rsch_count <==
10803

Note that all values but sample_count and ssch_rsch_count are averaged over time. We also see that samples seem to have been taken whenever the driver issued a ssch.

The device in our example shows an avg_utilization of 0%, which is consistent with a device that mostly sits idle. But what about a device where something is actually happening?

head cmf/*
==> cmf/avg_control_unit_queuing_time <==
0
==> cmf/avg_device_active_only_time <==
0
==> cmf/avg_device_busy_time <==
0
==> cmf/avg_device_connect_time <==
58454
==> cmf/avg_device_disconnect_time <==
16743818
==> cmf/avg_function_pending_time <==
99322
==> cmf/avg_initial_command_response_time <==
20284
==> cmf/avg_sample_interval <==
153014636
==> cmf/avg_utilization <==
11.0%
==> cmf/sample_count <==
1281
==> cmf/ssch_rsch_count <==
1281

Here, we see a higher avg_utilization, but actually not that many ssch invocations. The relatively high value of avg_device_disconnect_time is interesting: it indicates that there are quite long intervals during which the device and the channel subsystem do not talk to each other. That might, for example, happen if other LPARs on the same system drive a lot of I/O via the same channel paths as the device.
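As a plausibility check: if you add up the time the subchannel spent connected, pending, or disconnected and divide by the sample interval, you arrive at the reported utilization (this is my reading of the counters, not an official formula):

(58454 + 99322 + 16743818) / 153014636 ≈ 0.110, i.e. the 11.0% shown above.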

Help, I cannot enable channel measurements on my device!

There's one drawback when trying to enable channel measurements on a live device: it needs to execute a msch, which can only be done on an idle subchannel. For devices that execute separate ssch invocations to go about their business (e.g. dasd), the common I/O layer can squeeze in the msch between ssch invocations and all is well. However, some devices use a long-running channel program that will not conclude while the device is enabled; the most prominent examples are devices using QDIO, like zFCP adapters or OSA cards. In that case, the common I/O layer cannot squeeze in a msch; you might try disabling the device, but that's usually not something you want to do on a live system.
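If taking the device offline is acceptable (say, on a test system), the sequence might look like the following for a hypothetical device 0.0.5000; whether enabling succeeds at that point depends on the device and its driver:

chccwdev -d 0.0.5000                # take the device offline (disruptive!)
chccwdev -a cmb_enable=1 0.0.5000   # enable measurements while it is idle
chccwdev -e 0.0.5000                # bring it back online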

Extended channel measurements

What if you want to find out something not about an individual device, but about a channel path? There's a feature for that as well; issue

echo 1 > /sys/devices/css0/cm_enable

and you will find new entries (measurement, measurement_chars) under the various chp0.xx objects.

Unfortunately, these attributes only provide some binary data, which does not seem to be publicly documented, and I'm not aware of any tool that can parse them.
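You can still take a look at the raw bytes, of course (the chpid is again just an example):

hexdump -C /sys/devices/css0/chp0.40/measurement | head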

Channel measurements in QEMU guests

So far, all measurements have been collected on the host; but what about measurements in the guest?

The good news: You can turn on channel measurements for ccw devices in the guest. The bad news: They are not very useful.

Consider, for example, this virtio-ccw device:
head cmf/*
==> cmf/avg_control_unit_queuing_time <==
0
==> cmf/avg_device_active_only_time <==
0
==> cmf/avg_device_busy_time <==
0
==> cmf/avg_device_connect_time <==
0
==> cmf/avg_device_disconnect_time <==
0
==> cmf/avg_function_pending_time <==
0
==> cmf/avg_initial_command_response_time <==
0
==> cmf/avg_sample_interval <==
-1
==> cmf/avg_utilization <==
00.0%
==> cmf/sample_count <==
0
==> cmf/ssch_rsch_count <==
134

No samples, just a ssch count. Why? QEMU does not fully emulate the sampling infrastructure; only counting of ssch is done (which is very easy to implement). Moreover, virtio-ccw devices use channel programs mainly to set up queues, negotiate features, etc., so measurements here do not reflect what is going on on the virtqueues, which would be the interesting part for performance issues.

But what about a dasd passed through via vfio-ccw? That one should have more statistics, right?
head cmf/*           
==> cmf/avg_control_unit_queuing_time <==
0
==> cmf/avg_device_active_only_time <==
0
==> cmf/avg_device_busy_time <==
0
==> cmf/avg_device_connect_time <==
0
==> cmf/avg_device_disconnect_time <==
0
==> cmf/avg_function_pending_time <==
0
==> cmf/avg_initial_command_response_time <==
0
==> cmf/avg_sample_interval <==
-1
==> cmf/avg_utilization <==
00.0%
==> cmf/sample_count <==
0
==> cmf/ssch_rsch_count <==
144

No samples, just a ssch count, again. Why? Currently, vfio-ccw uses the same emulation infrastructure as the other emulated devices. In the future, we may implement some kind of passthrough for channel measurements, but that requires some work.

Friday, December 20, 2019

A 2019 recap (and a bit of an outlook)

The holiday season for 2019 will soon be upon us, so I decided to do a quick recap of what I consider the highlights for this year, from my perspective, regarding s390x virtualization, and the wider ecosystem.

Conferences

I attended the following conferences this year.

Linux Plumbers Conference

LPC 2019 was held in Lisbon, Portugal, on September 9-11. Of particular interest for me was the VFIO/IOMMU/PCI microconference. I talked a bit about cross-architecture considerations (and learned about some quirks on other architectures as well); the rest of the topics, while not currently concerning my work directly, were nevertheless helpful to move things forward. As usual at conferences, the hallway track is probably the most important one; I met some new folks, saw others once again, and talked more about s390 I/O stuff than I anticipated. I can recommend this conference for meeting people to talk to about (not only) deeply technical things.

KVM Forum

KVM Forum 2019 was held in Lyon, France, on October 30 - November 1. As usual, a great place to meet people and have discussions in the hallway, e.g. about vfio migration. No talk from me this year, but an assortment of interesting topics presented by others; I contributed to an article on LWN.net (https://lwn.net/Articles/805097/). Of note from an s390x perspective were the talks about protected virtualization and nested testing mentioned in the article, and also the presentation on running kvm unit tests beyond KVM.

s390x changes in QEMU and elsewhere

There's a new machine (z15) on the horizon, but support for older things has been introduced or enhanced as well.

Emulation

Lots of work has gone into TCG to emulate the vector instructions introduced with z13. Distributions are slowly switching to compiling for z13, which means gcc generates vector instructions. As of QEMU 4.2, you should be able to boot recent distributions under TCG once again.
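If you want to verify that a guest actually sees the vector facility (whether under TCG or KVM), check for the vx flag in the features line:

grep features /proc/cpuinfo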

vfio

vfio-ccw now has gained support for sending HALT SUBCHANNEL and CLEAR SUBCHANNEL to the real device; this is useful e.g. for error handling, when you want to make sure an operation is really terminated at the device. Also, it is now possible to boot from a DASD attached via vfio-ccw.

vfio-ap has seen some improvements as well, including support for hotplugging the matrix device and for interrupts.

Guest side

A big change on the guest side of things was support for protected virtualization (also see the talk given at KVM Forum). This is a bit like AMD's SEV, but (of course) different. Current Linux kernels should be ready to run as a protected guest; host side support is still in progress (see below).

Other developments of interest

mdev, mdev everywhere

There has been a lot of activity around mediated devices this year. They have been successfully used in various places in the past (GPUs, vfio-ccw, vfio-ap, ...). A new development is trying to push parts of it into userspace ('muser', see the talk at KVM Forum). An attempt was made to make use of the mediating part without the vfio part, but that was met with resistance. Ideas for aggregation are still being explored.

In order to manage and persist mdev devices, we introduced the mdevctl tool, which is currently included in at least Fedora and Debian.
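As a quick illustration of the tool (UUID and subchannel are made up; the vfio_ccw-io type assumes a subchannel that is already bound to the vfio-ccw driver):

uuid=$(uuidgen)
# define a persistent mediated device on subchannel 0.0.0313 and start it
mdevctl define -u $uuid -p 0.0.0313 -t vfio_ccw-io
mdevctl start -u $uuid
# list the defined devices
mdevctl list -d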

vfio migration

Efforts to introduce generic migration support for vfio (or at least, as a first shot, for pci) are still ongoing. Current concerns mostly revolve around dirty page tracking. It might make sense to take a stab at wiring up vfio-ccw once the interface is stable.

What's up next?

While there probably will be some not-yet-expected developments next year, some things are bound to come around in 2020.

Protected virtualization

Patch sets for KVM and QEMU to support protected virtualization on s390 have already been posted this year; expect new versions of the patch sets to show up in 2020 (and hopefully make their way into the kernel and QEMU, respectively).

vfio-ccw

Patches to support detecting path status changes and relaying them to the guest have already been posted; expect an updated version to make its way into the kernel and QEMU in 2020. Also likely: further cleanups and bugfixes, and probably some kind of testing support, e.g. via kvm unit tests. Migration support also might be on that list.

virtio-fs support on s390x

Instead of using virtio-9p, virtio-fs is a much better way to share files between host and guest; expect support on s390x to become available once QEMU can share memory file descriptors without requiring NUMA. Shared memory regions on s390x (for DAX support) still need discussion, however.

Wednesday, November 6, 2019

s390x changes in QEMU 4.2

You know the drill: QEMU is entering freeze (this time for 4.2), and there's a post on the s390x changes for the upcoming release.

TCG

  • Emulation of IEP (Instruction Execution Protection), a z14 feature, has been added.
  • A bunch of fixes in the vector instruction emulation and in the fault-handling code.

KVM

  • For quite some time now, the code has been implicitly relying on the presence of the 'flic' (floating interrupt controller) KVM device (which had been added in Linux 3.15). Nobody really complained, so we won't try to fix this up and instead make the dependency explicit.
  • The KVM memslot handling was reworked to be actually sane. Unfortunately, this breaks migration of huge KVM guests with more than 8TB of memory from older QEMUs. Migration of guests with less than 8TB continues to work, and migration of >8TB guests will keep working from 4.2 onwards.

CPU models

  • We now know that the gen15a is called 'z15', so this is now reflected in the cpu model description.
  • The 'qemu' and the 'max' models gained some more features.
  • Under KVM, 'query-machines' will now return the correct default cpu model ('host-s390x-cpu').

Misc

  • The usual array of bugfixes, including in SCLP handling and in the s390-ccw bios.

Wednesday, July 10, 2019

s390x changes in QEMU 4.1

QEMU has just entered hard freeze for 4.1, so the time is here again to summarize the s390x changes for that release.

TCG

  • All instructions that have been introduced with the "Vector Facility" in the z13 machines are now emulated by QEMU. In particular, this allows Linux distributions built for z13 or later to be run under TCG (vector instructions are generated when we compile for z13; other z13 facilities are optional).

CPU Models

  • As the needed prerequisites in TCG now have been implemented, the "qemu" cpu model now includes the "Vector Facility" and has been bumped to a stripped-down z13.
  • Models for the upcoming gen15 machines (the official name is not yet known) and some new facilities have been added.
  • If the host kernel supports it, we now indicate the AP Queue Interruption facility. This is used by vfio-ap and makes it possible to provide interrupts for AP devices to the guest.

I/O Devices

  • vfio-ccw has gained support for relaying HALT SUBCHANNEL and CLEAR SUBCHANNEL requests from the guest to the device, if the host kernel vfio-ccw driver supports it. Otherwise, these instructions continue to be emulated by QEMU, as before.
  • The bios now supports IPLing (booting) from DASD attached via vfio-ccw.

Booting

  • The bios now tolerates signatures written by zipl, if present, but does not actually handle them. See the 'secure' option for zipl introduced in s390-tools 2.9.0; a sketch follows below.
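A sketch of how that option would be used when running zipl (the modes are as I understand them from s390-tools 2.9.0; double-check the zipl man page):

zipl --secure auto    # 'auto' writes signatures when supported; '1' forces, '0' disables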
And the usual fixes and cleanups.

Tuesday, March 12, 2019

s390x changes in QEMU 4.0

QEMU is now entering softfreeze for the 4.0 release (expected in April), so here is the usual summary of s390x changes in that release.

CPU Models

  • A cpu model for the z14 GA 2 has been added; currently, it does not introduce any new features.
  • The cpu model for z14 now does, however, include the multiple epoch and PTFF enhancement features by default.
  • The 'qemu' cpu model now includes the zPCI feature by default. No further prerequisites are needed for pci support (see below).

Devices

  • QEMU for s390x is now always built with pci support. If we want to provide backwards compatibility, we cannot simply disable pci (we need the s390 pci host bus); it is easier to simply make pci mandatory. Note that disabling pci was never supported by the normal build system anyway.
  • zPCI devices have gained support for instruction counters (on a Linux guest, these are exposed through /sys/kernel/debug/pci/<function>/statistics).
  • zPCI devices have always lacked support for migrating their s390-specific state (it was simply not implemented); if you tried to migrate a guest with a virtio-pci device on s390x, odd things might happen. To avoid surprises, the 'zpci' devices are now explicitly marked as unmigratable. (Support for migration will likely be added in the future.)
  • Hot(un)plug of the vfio-ap matrix device is now supported.
  • Adding a vfio-ap matrix device no longer inhibits usage of a memory ballooner: Memory usage by vfio-ap does not clash with the concept of a memory balloon.

TCG

  • Support for the floating-point extension facility has been added.
  • The first part of support for the z13 vector instructions has been added (the vector support instructions). Expect support for the remaining vector instructions in the next release; TCG should then support enough of the instructions introduced with z13 to be able to run a distribution built for that cpu.