Wednesday, July 29, 2020

Configuring mediated devices (Part 2)

In the last part of this article, I talked about configuring a mediated device directly via sysfs. This is a bit cumbersome, and you may want to make your configuration more permanent. Fortunately, there is tooling available for this.

driverctl: bind to the correct driver

driverctl is a tool to manage the driver that a device may bind to. As a device that is supposed to be used via vfio will need to be bound to a vfio driver instead of its 'normal' driver, it makes sense to add some configuration that makes sure that this binding is actually done automatically. While driverctl had originally been implemented to work with PCI devices, the css bus (for subchannel devices) supports management with driverctl as of Linux 5.3 as well. (The ap bus for crypto devices does not support setting driver overrides, as it implements a different mechanism.)

Example (vfio-ccw)

Let's reuse the example from the last post, where we wanted to assign the device behind subchannel 0.0.0313 to the guest. In order to set a driver override, use

[root@host ~]# driverctl -b css set-override 0.0.0313 vfio_ccw

If the subchannel is not currently bound to the vfio-ccw driver already, it will be unbound from its driver and bound to vfio_ccw. Moreover, a udev rule to bind the subchannel to vfio_ccw automatically in the future will be added.

Unfortunately, a word of caution regarding the udev rule is in order: As uevents on the css bus for I/O subchannels are delayed until after device recognition has been performed, automatic binding may not work out as desired. We plan to address that in the future by reworking the way the css bus handles uevents; until then, you may have to trigger a rebind manually. Also, keep in mind that the subchannel id for a device may not be stable (as mentioned previously); automation should be used cautiously in that case.

mdevctl: manage mediated devices

The more tedious part of configuring a passthrough setup is configuring and managing mediated devices. To help with that, mdevctl has been written. It can create, modify, and remove mediated devices (and optionally make those changes persistent), work with configurations and devices created via other means, and list mediated devices and the different types that are supported.

Creating a mediated device

In order to create a mediated device, you need a uuid. You can either provide your own (as in the manual case), or let mdevctl pick one for you. In order to get the same configuration as in the manual configuration examples, let's create a vfio-ccw device with the same uuid as before.

The following command defines the same mediated device as in the manual example:
  
 [root@host ~]# mdevctl define -u 7e270a25-e163-4922-af60-757fc8ed48c6 -p 0.0.0313 -t vfio_ccw-io -a

Note the '-a', which instructs mdevctl to start the device automatically from now on.

After you've created the device, you can check which devices mdevctl is now aware of:

  [root@host ~] # mdevctl list -d
 7e270a25-e163-4922-af60-757fc8ed48c6 0.0.0313 vfio_ccw-io

Note that the '-d' instructs mdevctl to show defined, but not started devices.

Let's start the device:

  [root@host ~] # mdevctl start -u 7e270a25-e163-4922-af60-757fc8ed48c6
 [root@host ~] # mdevctl list -d
  7e270a25-e163-4922-af60-757fc8ed48c6 0.0.0313 vfio_ccw-io auto (active)

The mediated device is now ready to be used and can be passed to a guest.

Making your configuration persistent

If you already created a mediated device manually, you may want to reuse the existing configuration and make it persistent, instead of starting from scratch.

So, let's create another vfio-ccw the manual way:

 [root@host ~] # uuidgen
  b29e4ca9-5cdb-4ee1-a01b-79085b9ab237
 [root@host ~] # echo "b29e4ca9-5cdb-4ee1-a01b-79085b9ab237" > /sys/bus/css/drivers/vfio_ccw/0.0.0314/mdev_supported_types/vfio_ccw-io/create

mdevctl now actually knows about the active device (in addition to the device we configured before):

  [root@host ~] # mdevctl list
  b29e4ca9-5cdb-4ee1-a01b-79085b9ab237 0.0.0314 vfio_ccw-io
  7e270a25-e163-4922-af60-757fc8ed48c6 0.0.0313 vfio_ccw-io (defined)

But it obviously does not have a definition for the manually created device:

  [root@host ~] # mdevctl list -d
  7e270a25-e163-4922-af60-757fc8ed48c6 0.0.0313 vfio_ccw-io auto (active)

On a restart, the new device would be gone again; but we can make it persistent:

  [root@host ~] # mdevctl define -u b29e4ca9-5cdb-4ee1-a01b-79085b9ab237
  [root@host ~ ] mdevctl list
  b29e4ca9-5cdb-4ee1-a01b-79085b9ab237 0.0.0314 vfio_ccw-io (defined)
  7e270a25-e163-4922-af60-757fc8ed48c6 0.0.0313 vfio_ccw-io (defined)

If you check under /etc/mdevctl.d/, you will find that an appropriate JSON file has been created:

  [root@host ~] # cat /etc/mdevctl.d/0.0.0314/b29e4ca9-5cdb-4ee1-a01b-79085b9ab237 
  {
    "mdev_type": "vfio_ccw-io",
    "start": "manual",
    "attrs": []
  }

(Note that this device is not automatically started by default.)

Modifying an existing device

There are good reasons to modify an existing device: you may want to modify your setup, or, in the case of vfio-ap, you need to modify some attributes before being able to use the device in the first place.

Let's first create the device. This command creates the same device as created manually in the last post:

  [root@host ~] # mdevctl define -u "669d9b23-fe1b-4ecb-be08-a2fabca99b71" --parent matrix --type vfio_ap-passthrough
 [root@host ~] # mdevctl list -d
  669d9b23-fe1b-4ecb-be08-a2fabca99b71 matrix vfio_ap-passthrough manual

This device is not yet very useful, as you still need to assign some queues to it. It now looks like this:

  [root@host ~]  # mdevctl list -d -u 669d9b23-fe1b-4ecb-be08-a2fabca99b71 --dumpjson
  {
    "mdev_type": "vfio_ap-passthrough",
    "start": "manual"
  }

Let's modify the device and add some queues:

  [root@host ~] # mdevctl modify -u 669d9b23-fe1b-4ecb-be08-a2fabca99b71 --addattr=assign_adapter --value=5
 [root@host ~] # mdevctl modify -u 669d9b23-fe1b-4ecb-be08-a2fabca99b71 --addattr=assign_domain --value=4
 [root@host ~] # mdevctl modify -u 669d9b23-fe1b-4ecb-be08-a2fabca99b71 --addattr=assign_domain --value=0xab

The device's JSON now looks like this:

  [root@host ~] # mdevctl list -d -u 669d9b23-fe1b-4ecb-be08-a2fabca99b71 --dumpjson
{
  "mdev_type": "vfio_ap-passthrough",
  "start": "manual",
  "attrs": [
    {
      "assign_adapter": "5"
    },
    {
      "assign_domain": "4"
    },
    {
      "assign_domain": "0xab"
    }
  ]
}

This is now exactly what we had defined manually in the last post.

But what if you notice that you want domain 0x42 instead of domain 4? Just modify the definition. To make it easier to figure out how to specify the attribute to manipulate, use this output:

  [root@host ~] # devctl list -dv -u 669d9b23-fe1b-4ecb-be08-a2fabca99b71
669d9b23-fe1b-4ecb-be08-a2fabca99b71 matrix vfio_ap-passthrough manual
  Attrs:
    @{0}: {"assign_adapter":"5"}
    @{1}: {"assign_domain":"4"}
    @{2}: {"assign_domain":"0xab"}

You want to remove attribute 1, and add a new value:

  [root@host ~] # mdevctl modify -u 669d9b23-fe1b-4ecb-be08-a2fabca99b71 --delattr --index=1
  [root@host ~] # mdevctl modify -u 669d9b23-fe1b-4ecb-be08-a2fabca99b71 --addattr=assign_domain --value=0x42

Let's check that it now looks as desired:

  [root@host ~] # mdevctl list -dv -u 669d9b23-fe1b-4ecb-be08-a2fabca99b71
669d9b23-fe1b-4ecb-be08-a2fabca99b71 matrix vfio_ap-passthrough manual
  Attrs:
    @{0}: {"assign_adapter":"5"}
    @{1}: {"assign_domain":"0xab"}
    @{2}: {"assign_domain":"0x42"}

Future development

While mdevctl works perfectly fine for managing individual mediated devices, it does not maintain a view of the complete system. This means you notice conflicts between two devices only when you try to activate the second one. In the case of vfio-ap, the rules to be considered are complex, and there is quite some potential for conflict. In order to be able to catch that kind of problem early, we plan to add callouts to mdevctl, which would e.g. allow to invoke a tool for validation when a new device is added, but before it is activated. This is potentially useful for other device types as well.

Monday, July 27, 2020

Configuring mediated devices (Part 1)

vfio-mdev has become popular over the last few years for assigning certain classes of devices to guests. On the s390x side, vfio-ccw and vfio-ap are using the vfio-mdev framework for making channel devices and crypto adapters accessible to guests.
This and a follow-up article aim to give an overview of the infrastructure, how to set up and manage devices, and how to use tooling for this.

What is a mediated device?

A general overview

Mediated devices grew out of the need to build upon the existing vfio infrastructure in order to support more fine grained management of resources. Some of the initial use cases included GPUs and (maybe somewhat surprisingly) s390 channel devices.

When using the mediated device (mdev) API, common tasks are performed in the mdev core driver (like device management), while device-specific tasks are done in a vendor driver. Current in-kernel examples of vendor drivers are the Intel vGPU driver, vfio-ccw, and vfio-ap.

Examples on s390

vfio-ccw

vfio-ccw can be used to assign channel devices. It is pretty straightforward: vfio-ccw is an alternative driver for I/O subchannels, and a single mediated device per subchannel is supported.

vfio-ap

vfio-ap can be used to assign crypto cards/queues (APQNs). It is a bit more involved, requiring prior setup on the ap bus level and configuration of a 'matrix' device. Complex relationships between the resources that can be assigned to different guests exist. Configuration-wise, this is probably the most complex mediated device available today.

Configuring a mediated device: the manual way

Mediated devices can be configured manually via sysfs operations. This is a good way to see what actually happens, but probably not what you want to do as a general administration task. Tools to help here will be introduced in part 2 of this article.

I will show the steps for both vfio-ccw and vfio-ap, just to show two different approaches. (Both examples are also used in the QEMU documentation, in case this looks familiar.)

Binding to the correct driver

vfio-ccw

Assume you want to use a DASD with the device bus ID 0.0.2b09. As vfio-ccw operates on the subchannel level, you first need to locate the subchannel for this device:

   [root@host ~]# lscss | grep 0.0.2b09 | awk '{print $2}'
  0.0.0313

(A word of caution: a device is not guaranteed to use the same subchannel at all times; on LPARs, the subchannel number will usually be stable, but z/VM -- and QEMU -- assign subchannel numbers in a consecutive order. If you don't get any hotplug events for a device, the subchannel number will stay stable for at least as long as the guest is running, though.)

Now you need to unbind the subchannel device from the default I/O subchannel driver and bind it to the vfio-ccw driver (make sure the device is not in use!):

    [root@host ~]# echo 0.0.0313 > /sys/bus/css/devices/0.0.0313/driver/unbind
    [root@host ~]# echo 0.0.0313 > /sys/bus/css/drivers/vfio_ccw/bind

vfio-ap

You need to perform some preliminary configuration of your crypto adapters before you can use any of them with vfio-ap. If nothing different has been set up, a crypto adapter will only bind to the default device drivers, and you cannot use it via vfio-ap. In order to be able to bind an adapter to vfio-ap, you first need to modify the /sys/bus/ap/apmask and /sys/bus/ap/aqmask entries. Both are basically bitmasks that indicate that the matching adapter IDs respectively queue indices can only be bound to the default drivers. If you want to use a certain APQN via vfio-ap, you need to unset the respective bits.

Let's assume you want to assign the APQNs (5, 4) and (5, ab). First, you need to make the adapter and the domains available to non-default drivers:

  [root@host ~]#  echo -5 > /sys/bus/ap/apmask
  [root@host ~]#  echo -4, -0xab > /sys/bus/ap/aqmask

This should result in the devices being bound to the vfio_ap driver (you can verify this by looking for them under /sys/bus/ap/drivers/vfio_ap/).

Create a mediated device

The basic workflow is "pick a uuid, create a mediated device identified by it".

vfio-ccw

For vfio-ccw, the two steps of the basic workflow are enough:

  [root@host ~]# uuidgen
  7e270a25-e163-4922-af60-757fc8ed48c6
  [root@host ~]# echo "7e270a25-e163-4922-af60-757fc8ed48c6" > \
    /sys/bus/css/devices/0.0.0313/mdev_supported_types/vfio_ccw-io/create

vfio-ap

For vfio-ap, you need a more involved approach. The uuid is used to create a mediated device under the 'matrix' device:

  [root@host ~] # uuidgen
  669d9b23-fe1b-4ecb-be08-a2fabca99b71
 [root@host ~]# echo "669d9b23-fe1b-4ecb-be08-a2fabca99b71" > /sys/devices/vfio_ap/matrix/mdev_supported_types/vfio_ap-passthrough/create

This mediated device will need to collect all APQNs that you want to pass to a specific guest. For that, you need to use the assign_adapter, assign_domain, and possibly assign_control_domain attributes (we'll ignore control domains for simplicity's sake.) All attributes have a companion unassign_ attribute to remove adapters/domains from the mediated device again. You can only assign adapters/domains that you removed from apmask/aqmask in the previous step. To follow up on our example again:

  [root@host ~]# echo 5 > /sys/devices/vfio_ap/matrix/mdev_supported_types/vfio_ap-passthrough/669d9b23-fe1b-4ecb-be08-a2fabca99b71/assign_adapter
 [root@host ~]# echo 4 > /sys/devices/vfio_ap/matrix/mdev_supported_types/vfio_ap-passthrough/669d9b23-fe1b-4ecb-be08-a2fabca99b71/assign_domain
 [root@host ~]# echo 0xab > /sys/devices/vfio_ap/matrix/mdev_supported_types/vfio_ap-passthrough/669d9b23-fe1b-4ecb-be08-a2fabca99b71/assign_domain

If you want to make sure that the mediated device is set up correctly, check via

  [root@host ~]# cat /sys/devices/vfio_ap/matrix/mdev_supported_types/vfio_ap-passthrough/669d9b23-fe1b-4ecb-be08-a2fabca99b71/matrix
  05.0004
  05.00ab

Configuring QEMU/libvirt

Your mediated device is now ready to be passed to a guest.

vfio-ccw

Let's assume you want the device to show up as device 0.0.1234 in the guest.

For the QEMU command line, use

-device vfio-ccw,devno=fe.0.1234,sysfsdev=\
    /sys/bus/mdev/devices/7e270a25-e163-4922-af60-757fc8ed48c6

For libvirt, use the following XML snippet in the <devices> section:

<hostdev mode='subsystem' type='mdev' managed='no' model='vfio-ccw'>
  <source>
    <address uuid='7e270a25-e163-4922-af60-757fc8ed48c6'/>
  </source>
  <address type='ccw' cssid='0xfe' ssid='0x0' devno='0x1234'/>
</hostdev>

vfio-ap

Any APQNs will show up in the guest exactly as they show up in the host (i.e., no remapping is possible.)

For the QEMU command line, use

-device vfio-ap,sysfsdev=/sys/devices/vfio_ap/matrix/669d9b23-fe1b-4ecb-be08-a2fabca99b71

For libvirt, use the following XML snippet in the <devices> section:

<hostdev mode='subsystem' type='mdev' managed='no' model='vfio-ap'>
  <source>
    <address uuid='669d9b23-fe1b-4ecb-be08-a2fabca99b71'/>
  </source>
</hostdev>

Tooling

All this manual setup is a bit tedious; the next article in this series will look at some of the tooling that is available for mediated devices.

Friday, July 10, 2020

s390x changes in QEMU 5.1

QEMU has entered softfreeze for 5.1, so it is time to summarize the s390x changes in that version.

Protected virtualization

One of the biggest features on the s390/KVM side in Linux 5.7 had been protected virtualization aka secure execution, which basically restricts the (untrusted) hypervisor from accessing all of the guest's memory and delegates many tasks to the (trusted) ultravisor. QEMU 5.1 introduces the QEMU part of the feature.
In order to be able to run protected guests, you need to run on a z15 or a Linux One III, with at least a 5.7 kernel. You also need an up-to-date s390-tools installation. Some details are available in the QEMU documentation. For more information about what protected virtualization is, watch this talk from KVM Forum 2019 and this talk from 36C3.

vfio-ccw

vfio-ccw has also seen some improvements over the last release cycle.
  • Requests that do not explicitly allow prefetching in the ORB are no longer rejected out of hand (although the kernel may still do so, if you run a pre-5.7 version.) The rationale behind this is that most device drivers never modify their channel programs dynamically, and the one common code path that does (IPL from DASD) is already accommodated by the s390-ccw bios. While you can instruct QEMU to ignore the prefetch requirement for selected devices, this is an additional administrative complication for little benefit; it is therefore no longer required.
  • In order to be able to relay changes in channel path status to the guest, two new regions have been added: a schib region to relay real data to stsch, and a crw region to relay channel reports. If, for example, a channel path is varied off on the host, all guests using a vfio-ccw device that uses this channel path now get a proper channel report for it.

Other changes

Other than the bigger features mentioned above, there have been the usual fixes, improvements, and cleanups, both in the main s390x QEMU code and in the s390-ccw bios.