KVM, QEMU and Big Iron

Tuesday, November 2, 2021

This blog has moved

For various reasons (ease of writing, avoiding unwanted scripts/tracking, ...), I have migrated this blog (including all posts so far) to https://people.redhat.com/~cohuck/ and will (hopefully :) be publishing new posts there. I will keep this one up for the time being.

See you there!

Tuesday, November 24, 2020

s390x changes in QEMU 5.2

As, once again, a new QEMU release is around the corner, the time has come to list some s390x changes in there.

TCG has gained emulation support for some additional instructions that had been introduced with the z14. More enhancements needed to be able to run distributions built for the z14 will likely come in the future.
When running under KVM, QEMU now supports the diagnose 0x318 instruction. This can be used to set some diagnostic information (such as the operating system), which may be helpful when servicing the hardware. With this comes support for extended SCCBs; this is needed as the facility indication for diag318 encroaches into the control block used for reporting CPU information. A guest needs support for extended SCCBs to be able to see information for all CPUs if diag318 support is provided.
You can now use virtiofs on s390x, thanks to some endianness fixes, and a vhost-user-fs-ccw device has been added.
Up to now, both fully emulated PCI functions and PCI functions passed via vfio-pci reported the same values when the guest issued CLP instructions. However, the passed through functions may use different values for things such as the supported DMA range. If the host kernel supplies the respective capabilities for the vfio-pci device, QEMU can now provide the real values in the CLP queries.
zPCI is now also able to honour vfio DMA limits, if passed via the vfio-pci device, and can trigger the guest to flush its DMA mappings when needed.
The s390-ccw bios now tries harder to find a bootable device, if the first device is not suitable. This brings s390x booting a bit closer to what other architectures do.
And the usual fixes and cleanups.

Wednesday, July 29, 2020

Configuring mediated devices (Part 2)

In the last part of this article, I talked about configuring a mediated device directly via sysfs. This is a bit cumbersome, and you may want to make your configuration more permanent. Fortunately, there is tooling available for this.

driverctl: bind to the correct driver

driverctl is a tool to manage the driver that a device may bind to. As a device that is supposed to be used via vfio will need to be bound to a vfio driver instead of its 'normal' driver, it makes sense to add some configuration that makes sure that this binding is actually done automatically. While driverctl had originally been implemented to work with PCI devices, the css bus (for subchannel devices) supports management with driverctl as of Linux 5.3 as well. (The ap bus for crypto devices does not support setting driver overrides, as it implements a different mechanism.)

Example (vfio-ccw)

Let's reuse the example from the last post, where we wanted to assign the device behind subchannel 0.0.0313 to the guest. In order to set a driver override, use

[root@host ~]# driverctl -b css set-override 0.0.0313 vfio_ccw

If the subchannel is not already bound to the vfio_ccw driver, it will be unbound from its current driver and bound to vfio_ccw. Moreover, a udev rule is added that binds the subchannel to vfio_ccw automatically in the future.

Unfortunately, a word of caution regarding the udev rule is in order: As uevents on the css bus for I/O subchannels are delayed until after device recognition has been performed, automatic binding may not work out as desired. We plan to address that in the future by reworking the way the css bus handles uevents; until then, you may have to trigger a rebind manually. Also, keep in mind that the subchannel id for a device may not be stable (as mentioned previously); automation should be used cautiously in that case.
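If you do end up having to rebind manually, the unbind/bind steps from the previous post can be wrapped in a small helper. This is only a sketch: the function name ensure_vfio_ccw and the SYSFS_ROOT override (handy for dry runs against a scratch directory) are my own inventions and not part of driverctl.

```shell
#!/bin/sh
# Sketch: make sure a subchannel is bound to vfio_ccw, rebinding if needed.
# SYSFS_ROOT is a hypothetical override for dry runs; it defaults to /sys.
ensure_vfio_ccw() (
    schid=$1
    root=${SYSFS_ROOT:-/sys}
    dev=$root/bus/css/devices/$schid
    want=vfio_ccw

    [ -d "$dev" ] || { echo "no such subchannel: $schid" >&2; exit 1; }

    # The 'driver' symlink is absent if no driver is bound at all.
    if [ -e "$dev/driver" ]; then
        current=$(basename "$(readlink "$dev/driver")")
        # Nothing to do if the right driver is already bound.
        [ "$current" = "$want" ] && exit 0
        echo "$schid" > "$dev/driver/unbind"
    fi
    echo "$schid" > "$root/bus/css/drivers/$want/bind"
)
```

Calling ensure_vfio_ccw 0.0.0313 as root does nothing if the binding is already correct. As with the manual steps, make sure the device is not in use before you run it.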

mdevctl: manage mediated devices

The more tedious part of configuring a passthrough setup is configuring and managing mediated devices. To help with that, mdevctl has been written. It can create, modify, and remove mediated devices (and optionally make those changes persistent), work with configurations and devices created via other means, and list mediated devices and the different types that are supported.

Creating a mediated device

In order to create a mediated device, you need a uuid. You can either provide your own (as in the manual case), or let mdevctl pick one for you. In order to get the same configuration as in the manual configuration examples, let's create a vfio-ccw device with the same uuid as before.

The following command defines the same mediated device as in the manual example:

[root@host ~]# mdevctl define -u 7e270a25-e163-4922-af60-757fc8ed48c6 -p 0.0.0313 -t vfio_ccw-io -a

Note the '-a', which instructs mdevctl to start the device automatically from now on.

After you've created the device, you can check which devices mdevctl is now aware of:

[root@host ~] # mdevctl list -d
7e270a25-e163-4922-af60-757fc8ed48c6 0.0.0313 vfio_ccw-io

Note that '-d' instructs mdevctl to show defined devices, including those that have not been started.

Let's start the device:

[root@host ~] # mdevctl start -u 7e270a25-e163-4922-af60-757fc8ed48c6
[root@host ~] # mdevctl list -d
7e270a25-e163-4922-af60-757fc8ed48c6 0.0.0313 vfio_ccw-io auto (active)

The mediated device is now ready to be used and can be passed to a guest.
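If you want to check the state of a device from a script, you can filter the list output. The following is a rough sketch that assumes the output format shown in the examples above (uuid in the first column, "(active)" as the last field for running devices); other mdevctl versions may print something different. The name mdev_state is hypothetical.

```shell
#!/bin/sh
# mdev_state UUID: read `mdevctl list -d` style output on stdin and print
# "active", "defined", or "unknown" for the given uuid.
# Assumption: running devices carry "(active)" as their last field.
mdev_state() {
    awk -v uuid="$1" '
        $1 == uuid { found = 1; state = ($NF == "(active)") ? "active" : "defined" }
        END { print (found ? state : "unknown") }
    '
}
```

Usage would be something like: mdevctl list -d | mdev_state 7e270a25-e163-4922-af60-757fc8ed48c6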

Making your configuration persistent

If you already created a mediated device manually, you may want to reuse the existing configuration and make it persistent, instead of starting from scratch.

So, let's create another vfio-ccw device the manual way:

[root@host ~] # uuidgen
b29e4ca9-5cdb-4ee1-a01b-79085b9ab237
[root@host ~] # echo "b29e4ca9-5cdb-4ee1-a01b-79085b9ab237" > /sys/bus/css/drivers/vfio_ccw/0.0.0314/mdev_supported_types/vfio_ccw-io/create

mdevctl now actually knows about the active device (in addition to the device we configured before):

[root@host ~] # mdevctl list
b29e4ca9-5cdb-4ee1-a01b-79085b9ab237 0.0.0314 vfio_ccw-io
7e270a25-e163-4922-af60-757fc8ed48c6 0.0.0313 vfio_ccw-io (defined)

But it obviously does not have a definition for the manually created device:

[root@host ~] # mdevctl list -d
7e270a25-e163-4922-af60-757fc8ed48c6 0.0.0313 vfio_ccw-io auto (active)

On a restart, the new device would be gone again; but we can make it persistent:

[root@host ~] # mdevctl define -u b29e4ca9-5cdb-4ee1-a01b-79085b9ab237
[root@host ~] # mdevctl list
b29e4ca9-5cdb-4ee1-a01b-79085b9ab237 0.0.0314 vfio_ccw-io (defined)
7e270a25-e163-4922-af60-757fc8ed48c6 0.0.0313 vfio_ccw-io (defined)

If you check under /etc/mdevctl.d/, you will find that an appropriate JSON file has been created:

[root@host ~] # cat /etc/mdevctl.d/0.0.0314/b29e4ca9-5cdb-4ee1-a01b-79085b9ab237
{
  "mdev_type": "vfio_ccw-io",
  "start": "manual",
  "attrs": []
}

(Note that this device is not automatically started by default.)

Modifying an existing device

There are good reasons to modify an existing device: you may want to modify your setup, or, in the case of vfio-ap, you need to modify some attributes before being able to use the device in the first place.

Let's first create the device. This command creates the same device as created manually in the last post:

[root@host ~] # mdevctl define -u "669d9b23-fe1b-4ecb-be08-a2fabca99b71" --parent matrix --type vfio_ap-passthrough
[root@host ~] # mdevctl list -d
669d9b23-fe1b-4ecb-be08-a2fabca99b71 matrix vfio_ap-passthrough manual

This device is not yet very useful, as you still need to assign some queues to it. It now looks like this:

[root@host ~] # mdevctl list -d -u 669d9b23-fe1b-4ecb-be08-a2fabca99b71 --dumpjson
{
  "mdev_type": "vfio_ap-passthrough",
  "start": "manual"
}

Let's modify the device and add some queues:

[root@host ~] # mdevctl modify -u 669d9b23-fe1b-4ecb-be08-a2fabca99b71 --addattr=assign_adapter --value=5
[root@host ~] # mdevctl modify -u 669d9b23-fe1b-4ecb-be08-a2fabca99b71 --addattr=assign_domain --value=4
[root@host ~] # mdevctl modify -u 669d9b23-fe1b-4ecb-be08-a2fabca99b71 --addattr=assign_domain --value=0xab

The device's JSON now looks like this:

[root@host ~] # mdevctl list -d -u 669d9b23-fe1b-4ecb-be08-a2fabca99b71 --dumpjson
{
  "mdev_type": "vfio_ap-passthrough",
  "start": "manual",
  "attrs": [
    {
      "assign_adapter": "5"
    },
    {
      "assign_domain": "4"
    },
    {
      "assign_domain": "0xab"
    }
  ]
}

This is now exactly what we had defined manually in the last post.
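If you need to assign more than a handful of domains, typing one modify call per attribute gets old quickly. Here is a small sketch that only prints the commands (using the same flags as above) so that you can review them before running anything; the function name ap_assign_commands is made up for this example.

```shell
#!/bin/sh
# ap_assign_commands UUID ADAPTER DOMAIN...: print (but do not run) the
# mdevctl calls that assign one adapter and a list of domains to a
# vfio-ap mediated device definition.
ap_assign_commands() (
    uuid=$1; adapter=$2; shift 2
    printf 'mdevctl modify -u %s --addattr=assign_adapter --value=%s\n' "$uuid" "$adapter"
    for domain in "$@"; do
        printf 'mdevctl modify -u %s --addattr=assign_domain --value=%s\n' "$uuid" "$domain"
    done
)
```

For the example device, ap_assign_commands 669d9b23-fe1b-4ecb-be08-a2fabca99b71 5 4 0xab prints the three commands used above; once you are happy with the output, you can pipe it to sh.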

But what if you notice that you want domain 0x42 instead of domain 4? Just modify the definition. To make it easier to figure out how to specify the attribute to manipulate, use this output:

[root@host ~] # mdevctl list -dv -u 669d9b23-fe1b-4ecb-be08-a2fabca99b71
669d9b23-fe1b-4ecb-be08-a2fabca99b71 matrix vfio_ap-passthrough manual
Attrs:
@{0}: {"assign_adapter":"5"}
@{1}: {"assign_domain":"4"}
@{2}: {"assign_domain":"0xab"}

You want to remove attribute 1, and add a new value:

[root@host ~] # mdevctl modify -u 669d9b23-fe1b-4ecb-be08-a2fabca99b71 --delattr --index=1
[root@host ~] # mdevctl modify -u 669d9b23-fe1b-4ecb-be08-a2fabca99b71 --addattr=assign_domain --value=0x42

Let's check that it now looks as desired:

[root@host ~] # mdevctl list -dv -u 669d9b23-fe1b-4ecb-be08-a2fabca99b71
669d9b23-fe1b-4ecb-be08-a2fabca99b71 matrix vfio_ap-passthrough manual
Attrs:
@{0}: {"assign_adapter":"5"}
@{1}: {"assign_domain":"0xab"}
@{2}: {"assign_domain":"0x42"}

Future development

While mdevctl works perfectly fine for managing individual mediated devices, it does not maintain a view of the complete system. This means you notice conflicts between two devices only when you try to activate the second one. In the case of vfio-ap, the rules to be considered are complex, and there is quite some potential for conflict. In order to be able to catch that kind of problem early, we plan to add callouts to mdevctl, which would, for example, allow a validation tool to be invoked when a new device is added, but before it is activated. This is potentially useful for other device types as well.

Monday, July 27, 2020

Configuring mediated devices (Part 1)

vfio-mdev has become popular over the last few years for assigning certain classes of devices to guests. On the s390x side, vfio-ccw and vfio-ap are using the vfio-mdev framework for making channel devices and crypto adapters accessible to guests.
This and a follow-up article aim to give an overview of the infrastructure, how to set up and manage devices, and how to use tooling for this.

What is a mediated device?

A general overview

Mediated devices grew out of the need to build upon the existing vfio infrastructure in order to support more fine grained management of resources. Some of the initial use cases included GPUs and (maybe somewhat surprisingly) s390 channel devices.

When using the mediated device (mdev) API, common tasks are performed in the mdev core driver (like device management), while device-specific tasks are done in a vendor driver. Current in-kernel examples of vendor drivers are the Intel vGPU driver, vfio-ccw, and vfio-ap.

Examples on s390

vfio-ccw

vfio-ccw can be used to assign channel devices. It is pretty straightforward: vfio-ccw is an alternative driver for I/O subchannels, and a single mediated device per subchannel is supported.

vfio-ap

vfio-ap can be used to assign crypto cards/queues (APQNs). It is a bit more involved, requiring prior setup on the ap bus level and configuration of a 'matrix' device. Complex relationships between the resources that can be assigned to different guests exist. Configuration-wise, this is probably the most complex mediated device available today.

Configuring a mediated device: the manual way

Mediated devices can be configured manually via sysfs operations. This is a good way to see what actually happens, but probably not what you want to do as a general administration task. Tools to help here will be introduced in part 2 of this article.

I will show the steps for both vfio-ccw and vfio-ap, just to show two different approaches. (Both examples are also used in the QEMU documentation, in case this looks familiar.)

Binding to the correct driver

vfio-ccw

Assume you want to use a DASD with the device bus ID 0.0.2b09. As vfio-ccw operates on the subchannel level, you first need to locate the subchannel for this device:

[root@host ~]# lscss | grep 0.0.2b09 | awk '{print $2}'
0.0.0313

(A word of caution: a device is not guaranteed to use the same subchannel at all times; on LPARs, the subchannel number will usually be stable, but z/VM -- and QEMU -- assign subchannel numbers in a consecutive order. If you don't get any hotplug events for a device, the subchannel number will stay stable for at least as long as the guest is running, though.)
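One small caveat with the grep above: the dots match any character, and the pattern is not tied to the device column, so it could in theory also hit a line where the subchannel column happens to look like the device ID. A slightly stricter lookup, assuming (as the awk call above does) that the device ID is the first column and the subchannel the second, could look like this. The name subchannel_for is made up.

```shell
#!/bin/sh
# subchannel_for DEVNO: read lscss output on stdin and print the subchannel
# of the line whose first column is exactly DEVNO.
# Assumption: column 1 is the device bus ID, column 2 the subchannel.
subchannel_for() {
    awk -v dev="$1" '$1 == dev { print $2 }'
}
```

You would call it as lscss | subchannel_for 0.0.2b09; an empty result means that no such device was found.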

Now you need to unbind the subchannel device from the default I/O subchannel driver and bind it to the vfio-ccw driver (make sure the device is not in use!):

[root@host ~]# echo 0.0.0313 > /sys/bus/css/devices/0.0.0313/driver/unbind
[root@host ~]# echo 0.0.0313 > /sys/bus/css/drivers/vfio_ccw/bind

vfio-ap

You need to perform some preliminary configuration of your crypto adapters before you can use any of them with vfio-ap. If nothing different has been set up, a crypto adapter will only bind to the default device drivers, and you cannot use it via vfio-ap. In order to be able to bind an adapter to vfio-ap, you first need to modify the /sys/bus/ap/apmask and /sys/bus/ap/aqmask entries. Both are basically bitmasks: a set bit indicates that the matching adapter ID (apmask) or queue index (aqmask) can only be bound to the default drivers. If you want to use a certain APQN via vfio-ap, you need to unset the respective bits.

Let's assume you want to assign the APQNs (5, 4) and (5, ab). First, you need to make the adapter and the domains available to non-default drivers:

[root@host ~]# echo -5 > /sys/bus/ap/apmask
[root@host ~]# echo -4,-0xab > /sys/bus/ap/aqmask

This should result in the devices being bound to the vfio_ap driver (you can verify this by looking for them under /sys/bus/ap/drivers/vfio_ap/).
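If you want to double-check the masks themselves, you can test individual bits in the hex string that apmask/aqmask report. The sketch below assumes that the masks are printed as one long hex number and that bits are counted from the left, i.e. bit 0 is the most significant bit (the usual s390 convention); please verify this against the kernel documentation for your version. The name ap_mask_bit is hypothetical.

```shell
#!/bin/sh
# ap_mask_bit MASK N: print 1 if bit N is set in an apmask/aqmask style hex
# string, 0 otherwise.
# Assumption: bits are numbered from the left (most significant bit first).
ap_mask_bit() (
    hex=${1#0x}
    n=$(($2))
    # Each hex digit covers four bits; pick the digit that holds bit N.
    nibble=$(printf '%s' "$hex" | cut -c "$((n / 4 + 1))")
    [ -n "$nibble" ] || { echo "bit $n out of range" >&2; exit 1; }
    echo $(( (0x$nibble >> (3 - n % 4)) & 1 ))
)
```

A call like ap_mask_bit "$(cat /sys/bus/ap/apmask)" 5 should then print 0 once adapter 5 has been taken away from the default drivers.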

Create a mediated device

The basic workflow is "pick a uuid, create a mediated device identified by it".

vfio-ccw

For vfio-ccw, the two steps of the basic workflow are enough:

[root@host ~]# uuidgen
7e270a25-e163-4922-af60-757fc8ed48c6
[root@host ~]# echo "7e270a25-e163-4922-af60-757fc8ed48c6" > \
/sys/bus/css/devices/0.0.0313/mdev_supported_types/vfio_ccw-io/create
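The "pick a uuid, write it to create" workflow is the same for every mdev type, so it can be wrapped in a tiny helper. This is just a sketch; create_mdev is a made-up name, and the parent directory and type are whatever your device provides.

```shell
#!/bin/sh
# create_mdev PARENT TYPE [UUID]: create a mediated device of TYPE below the
# sysfs directory PARENT and print the uuid that was used.
# If no uuid is given, one is generated via uuidgen.
create_mdev() (
    parent=$1; type=$2
    uuid=${3:-$(uuidgen)}
    create=$parent/mdev_supported_types/$type/create
    [ -e "$create" ] || { echo "type $type not supported by $parent" >&2; exit 1; }
    echo "$uuid" > "$create" && echo "$uuid"
)
```

For the example above, this would be create_mdev /sys/bus/css/devices/0.0.0313 vfio_ccw-io (run as root).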

vfio-ap

For vfio-ap, you need a more involved approach. The uuid is used to create a mediated device under the 'matrix' device:

[root@host ~] # uuidgen
669d9b23-fe1b-4ecb-be08-a2fabca99b71
[root@host ~]# echo "669d9b23-fe1b-4ecb-be08-a2fabca99b71" > /sys/devices/vfio_ap/matrix/mdev_supported_types/vfio_ap-passthrough/create

This mediated device will need to collect all APQNs that you want to pass to a specific guest. For that, you need to use the assign_adapter, assign_domain, and possibly assign_control_domain attributes (we'll ignore control domains for simplicity's sake.) All attributes have a companion unassign_ attribute to remove adapters/domains from the mediated device again. You can only assign adapters/domains that you removed from apmask/aqmask in the previous step. To follow up on our example again:

[root@host ~]# echo 5 > /sys/devices/vfio_ap/matrix/mdev_supported_types/vfio_ap-passthrough/669d9b23-fe1b-4ecb-be08-a2fabca99b71/assign_adapter
[root@host ~]# echo 4 > /sys/devices/vfio_ap/matrix/mdev_supported_types/vfio_ap-passthrough/669d9b23-fe1b-4ecb-be08-a2fabca99b71/assign_domain
[root@host ~]# echo 0xab > /sys/devices/vfio_ap/matrix/mdev_supported_types/vfio_ap-passthrough/669d9b23-fe1b-4ecb-be08-a2fabca99b71/assign_domain

If you want to make sure that the mediated device is set up correctly, check via

[root@host ~]# cat /sys/devices/vfio_ap/matrix/mdev_supported_types/vfio_ap-passthrough/669d9b23-fe1b-4ecb-be08-a2fabca99b71/matrix
05.0004
05.00ab
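The matrix attribute lists every combination of the assigned adapters and domains. If you want to compare it against what you intended to configure, you can generate the expected list yourself. The sketch below assumes the two-digit/four-digit hex format seen in the output above; expected_matrix is a made-up name.

```shell
#!/bin/sh
# expected_matrix "ADAPTERS" "DOMAINS": print all adapter/domain combinations
# (APQNs) in the format shown by the matrix attribute.
# Both arguments are whitespace-separated lists; decimal and 0x-prefixed
# hex values are accepted. Avoid leading zeros (they would be read as octal).
expected_matrix() (
    adapters=$1; domains=$2
    for a in $adapters; do
        for d in $domains; do
            printf '%02x.%04x\n' "$a" "$d"
        done
    done
)
```

For our example, expected_matrix 5 "4 0xab" prints 05.0004 and 05.00ab, which you can diff against the contents of the matrix attribute.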

Configuring QEMU/libvirt

Your mediated device is now ready to be passed to a guest.

vfio-ccw

Let's assume you want the device to show up as device 0.0.1234 in the guest.

For the QEMU command line, use

-device vfio-ccw,devno=fe.0.1234,sysfsdev=\
/sys/bus/mdev/devices/7e270a25-e163-4922-af60-757fc8ed48c6

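For libvirt, use the following XML snippet in the <devices> section:

<hostdev mode='subsystem' type='mdev' model='vfio-ccw'>
  <source>
    <address uuid='7e270a25-e163-4922-af60-757fc8ed48c6'/>
  </source>
  <address type='ccw' cssid='0xfe' ssid='0x0' devno='0x1234'/>
</hostdev>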

vfio-ap

Any APQNs will show up in the guest exactly as they show up in the host (i.e., no remapping is possible.)

For the QEMU command line, use

-device vfio-ap,sysfsdev=/sys/devices/vfio_ap/matrix/669d9b23-fe1b-4ecb-be08-a2fabca99b71

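For libvirt, use the following XML snippet in the <devices> section:

<hostdev mode='subsystem' type='mdev' model='vfio-ap'>
  <source>
    <address uuid='669d9b23-fe1b-4ecb-be08-a2fabca99b71'/>
  </source>
</hostdev>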

Tooling

All this manual setup is a bit tedious; the next article in this series will look at some of the tooling that is available for mediated devices.

Friday, July 10, 2020

s390x changes in QEMU 5.1

QEMU has entered softfreeze for 5.1, so it is time to summarize the s390x changes in that version.

Protected virtualization

One of the biggest features on the s390/KVM side in Linux 5.7 had been protected virtualization aka secure execution, which basically restricts the (untrusted) hypervisor from accessing all of the guest's memory and delegates many tasks to the (trusted) ultravisor. QEMU 5.1 introduces the QEMU part of the feature.

In order to be able to run protected guests, you need to run on a z15 or a LinuxONE III, with at least a 5.7 kernel. You also need an up-to-date s390-tools installation. Some details are available in the QEMU documentation. For more information about what protected virtualization is, watch this talk from KVM Forum 2019 and this talk from 36C3.

vfio-ccw

vfio-ccw has also seen some improvements over the last release cycle.

Requests that do not explicitly allow prefetching in the ORB are no longer rejected out of hand (although the kernel may still do so, if you run a pre-5.7 version.) The rationale behind this is that most device drivers never modify their channel programs dynamically, and the one common code path that does (IPL from DASD) is already accommodated by the s390-ccw bios. While you can instruct QEMU to ignore the prefetch requirement for selected devices, this is an additional administrative complication for little benefit; it is therefore no longer required.
In order to be able to relay changes in channel path status to the guest, two new regions have been added: a schib region to relay real data to stsch, and a crw region to relay channel reports. If, for example, a channel path is varied off on the host, all guests using a vfio-ccw device that uses this channel path now get a proper channel report for it.

Other changes

Other than the bigger features mentioned above, there have been the usual fixes, improvements, and cleanups, both in the main s390x QEMU code and in the s390-ccw bios.

Wednesday, April 8, 2020

s390x changes in QEMU 5.0

QEMU is currently in hardfreeze, with the 5.0 release expected at the end of the month. Here's a quick list of some notable s390x changes.

You can finally enable Adapter Interrupt Suppression in the cpu model (ais=on) when running under KVM. This had been working under TCG for some time now, but KVM was missing an interface that was provided later -- and we finally actually check for that interface in QEMU. This is mostly interesting for PCI.
QEMU had been silently fixing odd memory sizes to something that can be reported via SCLP for some time. Silently changing user input is probably not such a good idea; compat machines will continue to do so to enable migration from old QEMUs for machines with odd sizes, but will print a warning now. If you have such an old machine (and you can modify it), it might be a good idea to either specify the memory size it gets rounded to or to switch to the 5.0 machine type, where memory sizes can be more fine-grained due to the removal of support for memory hotplug. We may want to get rid of the code doing the fixup at some time in the future.
QEMU now properly performs the whole set of initial, clear, and normal cpu reset.
And the usual fixes, cleanups, and improvements.

For 5.1, expect more changes; support for protected virtualization will be a big item.

Wednesday, January 22, 2020

Channel Measurements: A Quick Overview

The s390 channel subsystem can gather some statistics on I/O performance for you, which might be useful if you try to figure out why something is not performing as well as you'd expect it to be. From a QEMU/KVM perspective, this is currently mainly useful on the host.

Channel monitoring for ccw devices

The first kind of channel measurements is those collected per subchannel. For a detailed overview of what actually happens there, turn to the Principles of Operation, Chapter 17 ("I/O Support Functions"), "Channel Monitoring". I'll cover here what will most likely be of interest to people running a Linux (host) system.

Enabling channel measurements

If you are running a non-vintage machine (i.e. a z990 or later), you will not need a system-wide setup. Older machines should be fine as well, if you do not want to measure more than 1024 devices.

To enable measurements for a specific ccw device (say, 0.0.1234), simply issue:

chccwdev -a cmb_enable=1 0.0.1234

Measurements collected

Under /sys/bus/ccw/devices/0.0.1234/, you should now have a new subdirectory called cmf, which contains some files. For a system that has been running for some time, the contents may look something like the following:

head cmf/*
==> cmf/avg_control_unit_queuing_time <==
0
==> cmf/avg_device_active_only_time <==
0
==> cmf/avg_device_busy_time <==
0
==> cmf/avg_device_connect_time <==
829031
==> cmf/avg_device_disconnect_time <==
398526
==> cmf/avg_function_pending_time <==
142810
==> cmf/avg_initial_command_response_time <==
19170
==> cmf/avg_sample_interval <==
8401681344
==> cmf/avg_utilization <==
00.0%
==> cmf/sample_count <==
10803
==> cmf/ssch_rsch_count <==
10803

Note that all values but sample_count and ssch_rsch_count are averaged over time. We also see that samples seem to have been taken whenever the driver issued a ssch.
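If you prefer one line per value over the output of head (for example, to feed it into another tool), a small loop over the cmf directory does the trick. This is a minimal sketch; cmf_dump is a made-up name, and it simply prints whatever attributes it finds.

```shell
#!/bin/sh
# cmf_dump DEVDIR: print every attribute in DEVDIR/cmf as "name value",
# one per line. DEVDIR is the sysfs directory of a ccw device.
cmf_dump() (
    dir=$1/cmf
    [ -d "$dir" ] || { echo "no cmf directory (measurements not enabled?)" >&2; exit 1; }
    for f in "$dir"/*; do
        printf '%s %s\n' "$(basename "$f")" "$(cat "$f")"
    done
)
```

Called from within the device's sysfs directory, this would be cmf_dump . for the example above.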

The device in our example shows an avg_utilization of 0%, which is consistent with a device that mostly sits idle. But what about a device where something is actually happening?

head cmf/*
==> cmf/avg_control_unit_queuing_time <==
0
==> cmf/avg_device_active_only_time <==
0
==> cmf/avg_device_busy_time <==
0
==> cmf/avg_device_connect_time <==
58454
==> cmf/avg_device_disconnect_time <==
16743818
==> cmf/avg_function_pending_time <==
99322
==> cmf/avg_initial_command_response_time <==
20284
==> cmf/avg_sample_interval <==
153014636
==> cmf/avg_utilization <==
11.0%
==> cmf/sample_count <==
1281
==> cmf/ssch_rsch_count <==
1281

Here, we see a higher avg_utilization, but actually not that many ssch invocations. Interesting is the relatively high value of avg_device_disconnect_time: It indicates that there are quite long intervals where the device and the channel subsystem do not talk to each other. That might, for example, happen if other LPARs on the same system drive a lot of I/O via the same channel paths as the device.

Help, I cannot enable channel measurements on my device!

There's one drawback when trying to enable channel measurements on a live device: It needs to execute a msch, which can only be done on an idle subchannel. For devices that execute separate ssch invocations to go about their business (e.g. dasd), the common I/O layer can squeeze in the msch between ssch invocations and all is well. However, some devices use a long-running channel program, which will not conclude during the time the device is enabled; the most prominent examples are devices using QDIO, like zFCP adapters or OSA cards. In that case, the common I/O layer cannot squeeze in a msch; you might try disabling the device, but that's usually not something you want to do in a live system.
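When enabling measurements from a script, it is worth checking whether the device actually accepted the request. The sketch below writes to the device's cmb_enable attribute directly and reads it back; I am assuming here that the attribute sits in the device's sysfs directory and reads back as 1 when measurements are on. The function name enable_cmf and the SYSFS_ROOT override (for dry runs) are made up.

```shell
#!/bin/sh
# enable_cmf DEVNO: try to switch on channel measurements for a ccw device
# and report whether that worked.
# SYSFS_ROOT is a hypothetical override for dry runs; it defaults to /sys.
enable_cmf() (
    attr=${SYSFS_ROOT:-/sys}/bus/ccw/devices/$1/cmb_enable
    [ -e "$attr" ] || { echo "$1: no cmb_enable attribute" >&2; exit 1; }
    # The write fails, or the value stays unchanged, if the msch could
    # not be executed.
    if echo 1 > "$attr" 2>/dev/null && [ "$(cat "$attr")" = 1 ]; then
        echo "$1: channel measurements enabled"
    else
        echo "$1: could not enable channel measurements (subchannel busy?)" >&2
        exit 1
    fi
)
```

For the example device, you would run enable_cmf 0.0.1234 as root.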

Extended channel measurements

What if you want to find out something not about an individual device, but for a channel path? There's a feature for that; you can issue

echo 1 > /sys/devices/css0/cm_enable

and will find new entries (measurement, measurement_chars) under the various chp0.xx objects.

Unfortunately, these attributes only provide some binary data, which does not seem to be publicly documented, and I'm not aware of any tool that can parse them.

Channel measurements in QEMU guests

So far, all measurements have been collected on the host; but what about measurements in the guest?

The good news: You can turn on channel measurements for ccw devices in the guest. The bad news: They are not very useful.

Consider, for example, this virtio-ccw device:

head cmf/*
==> cmf/avg_control_unit_queuing_time <==
0
==> cmf/avg_device_active_only_time <==
0
==> cmf/avg_device_busy_time <==
0
==> cmf/avg_device_connect_time <==
0
==> cmf/avg_device_disconnect_time <==
0
==> cmf/avg_function_pending_time <==
0
==> cmf/avg_initial_command_response_time <==
0
==> cmf/avg_sample_interval <==
-1
==> cmf/avg_utilization <==
00.0%
==> cmf/sample_count <==
0
==> cmf/ssch_rsch_count <==
134

No samples, just a ssch count. Why? QEMU does not fully emulate the sampling infrastructure; only counting of ssch is done (which is very easy to implement). Moreover, virtio-ccw devices use channel programs mainly to set up queues, negotiate features, etc., so measurements here do not reflect what is going on on the virtqueues, which would be the interesting part for performance issues.

But what about a dasd passed through via vfio-ccw? That one should have more statistics, right?

head cmf/*
==> cmf/avg_control_unit_queuing_time <==
0
==> cmf/avg_device_active_only_time <==
0
==> cmf/avg_device_busy_time <==
0
==> cmf/avg_device_connect_time <==
0
==> cmf/avg_device_disconnect_time <==
0
==> cmf/avg_function_pending_time <==
0
==> cmf/avg_initial_command_response_time <==
0
==> cmf/avg_sample_interval <==
-1
==> cmf/avg_utilization <==
00.0%
==> cmf/sample_count <==
0
==> cmf/ssch_rsch_count <==
144

No samples, just a ssch count, again. Why? Currently, vfio-ccw uses the same emulation infrastructure as the other emulated devices. In the future, we may implement some kind of passthrough for channel measurements, but that requires some work.