Friday, July 10, 2020

s390x changes in QEMU 5.1

QEMU has entered softfreeze for 5.1, so it is time to summarize the s390x changes in that version.

Protected virtualization

One of the biggest features on the s390/KVM side in Linux 5.7 had been protected virtualization aka secure execution, which basically restricts the (untrusted) hypervisor from accessing all of the guest's memory and delegates many tasks to the (trusted) ultravisor. QEMU 5.1 introduces the QEMU part of the feature.
In order to be able to run protected guests, you need to run on a z15 or a Linux One III, with at least a 5.7 kernel. You also need an up-to-date s390-tools installation. Some details are available in the QEMU documentation. For more information about what protected virtualization is, watch this talk from KVM Forum 2019 and this talk from 36C3.

vfio-ccw

vfio-ccw has also seen some improvements over the last release cycle.
  • Requests that do not explicitly allow prefetching in the ORB are no longer rejected out of hand (although the kernel may still do so, if you run a pre-5.7 version.) The rationale behind this is that most device drivers never modify their channel programs dynamically, and the one common code path that does (IPL from DASD) is already accommodated by the s390-ccw bios. While you can instruct QEMU to ignore the prefetch requirement for selected devices, this is an additional administrative complication for little benefit; it is therefore no longer required.
  • In order to be able to relay changes in channel path status to the guest, two new regions have been added: a schib region to relay real data to stsch, and a crw region to relay channel reports. If, for example, a channel path is varied off on the host, all guests using a vfio-ccw device that uses this channel path now get a proper channel report for it.

Other changes

Other than the bigger features mentioned above, there have been the usual fixes, improvements, and cleanups, both in the main s390x QEMU code and in the s390-ccw bios.

Wednesday, April 8, 2020

s390x changes in QEMU 5.0

QEMU is currently in hardfreeze, with the 5.0 release expected at the end of the month. Here's a quick list of some notable s390x changes.

  • You can finally enable Adapter Interrupt Suppression in the cpu model (ais=on) when running under KVM. This had been working under TCG for some time now, but KVM was missing an interface that was provided later -- and we finally actually check for that interface in QEMU. This is mostly interesting for PCI.
  • QEMU had been silently fixing odd memory sizes to something that can be reported via SCLP for some time. Silently changing user input is probably not such a good idea; compat machines will continue to do so to enable migration from old QEMUs for machines with odd sizes, but will print a warning now. If you have such an old machine (and you can modify it), it might be a good idea to either specify the memory size it gets rounded to or to switch to the 5.0 machine type, where memory sizes can be more finegrained due to the removal of support for memory hotplug. We may want to get rid of the code doing the fixup at some time in the future.
  • QEMU now properly performs the whole set of initial, clear, and normal cpu reset.
  • And the usual fixes, cleanups, and improvements.
For 5.1, expect more changes; support for protected virtualization will be a big item.

Wednesday, January 22, 2020

Channel Measurements: A Quick Overview

The s390 channel subsystem can gather some statistics on I/O performance for you, which might be useful if you try to figure out why something is not performing as well as you'd expect it to be. From a QEMU/KVM perspective, this is currently mainly useful on the host.

Channel monitoring for ccw devices

The first kind of channel measurements is those collected per subchannel. For a detailed overview of what actually happens there, turn to the Principles of Operation, Chapter 17 ("I/O Support Functions"), "Channel Monitoring". I'll cover here what will most likely be of interest to people running a Linux (host) system.

Enabling channel measurements

If you a running a non-vintage machine (i.e. a z990 or later), you will not need a system-wide setup. Older machines should be fine as well, if you do not want to measure more than 1024 devices.

To enable measurements for a specific ccw device (say, 0.0.1234), simply issue:

chccwdev -a cmb_enable=1 0.0.1234

Measurements collected

Under /sys/bus/ccw/device/0.0.1234/, you should now have a new subdirectory called cmf, which contains some files. For a system that has been running for some time, the contents may look something like the following:

head cmf/*
==> cmf/avg_control_unit_queuing_time <==
0
==> cmf/avg_device_active_only_time <==
0
==> cmf/avg_device_busy_time <==
0
==> cmf/avg_device_connect_time <==
829031
==> cmf/avg_device_disconnect_time <==
398526
==> cmf/avg_function_pending_time <==
142810
==> cmf/avg_initial_command_response_time <==
19170
==> cmf/avg_sample_interval <==
8401681344
==> cmf/avg_utilization <==
00.0%
==> cmf/sample_count <==
10803
==> cmf/ssch_rsch_count <==
10803

Note that all values but sample_count and ssch_rsch_count are averaged over time. We also see that samples seem to have been taken whenever the driver issued a ssch.

The device in our example shows an avg_utilization of 0%, which is consistent with a device that mostly sits idle. But what about a device where something is actually happening?

head cmf/*
==> cmf/avg_control_unit_queuing_time <==
0
==> cmf/avg_device_active_only_time <==
0
==> cmf/avg_device_busy_time <==
0
==> cmf/avg_device_connect_time <==
58454
==> cmf/avg_device_disconnect_time <==
16743818
==> cmf/avg_function_pending_time <==
99322
==> cmf/avg_initial_command_response_time <==
20284
==> cmf/avg_sample_interval <==
153014636
==> cmf/avg_utilization <==
11.0%
==> cmf/sample_count <==
1281
==> cmf/ssch_rsch_count <==
1281

Here, we see a higher avg_utilization, but actually not that many ssch invocations. Interesting is the relatively high value of avg_device_disconnect_time: It indicates that there are quite long intervals where the device and the channel subsystem do not talk to each other. That might, for example, happen if other LPARs on the same system drive a lot of I/O via the same channel paths as the device.

Help, I cannot enable channel measurements on my device!

There's one drawback when trying to enable channel measurements on a live device: It needs to execute a msch, which only can be done on an idle subchannel. For devices that execute separate ssch invocations to go about their business (e.g. dasd), the common I/O layer can squeeze in the msch between ssch invocations and all is well. However, some devices use a long-running channel program, which will not conclude during the time the device is enabled; the most prominent example are devices using QDIO, like zFCP adapters or OSA cards. In that case, the common I/O layer cannot squeeze in a msch; you might try disabling the device, but that's usually not something you want to do in a live system.

Extended channel measurements

What if you want to find out something not about an individual device, but for a channel path? There's a feature for that; you can issue
echo 1 > /sys/devices/css0/cm_enable
and will find new entries (measurement, measurement_chars) under the various chp0.xx objects.

Unfortunately, these attributes only provide some binary data, which does not seem to be publicly documented, and I'm not aware of any tool that can parse them.

Channel measurements in QEMU guests

So far, all measurements have been collected on the host; but what about measurements in the guest?

The good news: You can turn on channel measurements for ccw devices in the guest. The bad news: They are not very useful.

Consider, for example, this virtio-ccw device:
 head cmf/*
==> cmf/avg_control_unit_queuing_time <==
0
==> cmf/avg_device_active_only_time <==
0
==> cmf/avg_device_busy_time <==
0
==> cmf/avg_device_connect_time <==
0
==> cmf/avg_device_disconnect_time <==
0
==> cmf/avg_function_pending_time <==
0
==> cmf/avg_initial_command_response_time <==
0
==> cmf/avg_sample_interval <==
-1
==> cmf/avg_utilization <==
00.0%
==> cmf/sample_count <==
0
==> cmf/ssch_rsch_count <==
134

No samples, just a ssch count. Why? QEMU does not fully emulate the sampling infrastructure; only counting of ssch is done (which is very easy to implement). Moreover, virtio-ccw devices use channel programs mainly to set up queues, negotiate features, etc., so measurements here do not reflect what is going on on the virtqueues, which would be the interesting part for performance issues.

But what about a dasd passed through via vfio-ccw? That one should have more statistics, right?
head cmf/*           
==> cmf/avg_control_unit_queuing_time <==
0
==> cmf/avg_device_active_only_time <==
0
==> cmf/avg_device_busy_time <==
0
==> cmf/avg_device_connect_time <==
0
==> cmf/avg_device_disconnect_time <==
0
==> cmf/avg_function_pending_time <==
0
==> cmf/avg_initial_command_response_time <==
0
==> cmf/avg_sample_interval <==
-1
==> cmf/avg_utilization <==
00.0%
==> cmf/sample_count <==
0
==> cmf/ssch_rsch_count <==
144

No samples, just a ssch count, again. Why? Currently, vfio-ccw uses the same emulation infrastructure as the other emulated devices. In the future, we may implement some kind of passthrough for channel measurements, but that requires some work.