Monday, February 19, 2018

Notes on PCI on s390x

As QEMU 2.12 will finally support PCI devices under s390x/tcg, I thought now is a good time to talk about some of the peculiarities of PCI on the mainframe.

Oddities of PCI on s390x architecture

Oddity #1: No MMIO, but instructions

Everywhere else, you use MMIO when interacting with PCI devices. Not on s390x; you have a set of instructions instead. For example, if you want to read or write memory, you will need to use the PCILG or PCISTG instructions, and for refreshing translations, you will need to use the RPCIT instruction. Fortunately, these instructions can be matched to the primitives in the Linux kernel; unfortunately, all those instructions are privileged, which leads us to

Oddity #2: No user space I/O

As any interaction with PCI devices needs to be done through privileged instructions, Linux user space can't interact with the devices directly; the Linux kernel needs to be involved in every case. This means that there are none of the PCI user space implementations popular on other platforms available on s390x.

Oddity #3: No topology, but FID and UID

Usually, you'll find busses, slots and functions when you identify a certain PCI function. The PCI instructions on s390x, however, don't expose any topology to the caller. This means that an operating system will get a simple list of functions, each with a function id (FID), which can be mapped to a physical slot, and a UID, which the Linux kernel will map to a domain number. A PCI identifier under Linux on s390x will therefore always be of the form <domain>:00:00.0.

Implications for the QEMU implementation of PCI on s390x

In order to support PCI on s390x in QEMU, some specialties had to be implemented.

Instruction handlers

Under KVM, every PCI instruction intercepts and is routed to user space. QEMU does the heavy lifting of emulating the operations and mapping to generic PCI code. This also implied that PCI under tcg did not work until the instructions had been wired up; this has now finally happened and will be in the 2.12 release.

Modelling and (lack of) topology

QEMU PCI code expects the normal topology present on other platforms. However, this (made-up) topology will be invisible to guests, as the PCI instructions do not relay it. Instead, there is a special "zpci" device with "fid" and "uid" properties that can be linked to a normal PCI device. If no zpci device is specified, QEMU will autogenerate the FID and the UID.

How can I try this out?

If you do not have a real mainframe with a real PCI card, you can use virtio-pci devices as of QEMU 2.12 (or current git as of the time of this writing). If you do have a mainframe and a PCI card, you can use vfio-pci (but not yet via libvirt).
Here's an example of how to specify a virtio-net-pci device for s390x, using tcg:
s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu qemu,zpci=on (...) -device zpci,uid=12,fid=2,target=vpci02,id=zpci2 -device virtio-net-pci,id="vpci02",addr=0x2
Some notes on this:
  • You need to explicitly enable the "zpci" feature in the qemu cpu model. Two other features, "aen" and "ais", are enabled by default. ("aen" and "zpci" are mandatory; "ais" is needed for Linux guest kernels prior to 4.15. If you use KVM, the host kernel also needs support for "ais".)
  • The zpci device is joined with the PCI device via the "target" property. The virtio-net-pci device does not know anything about zpci devices.
  • Only virtio-pci devices using MSI-X will work on s390x.
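To make the pairing between the zpci device and the PCI device explicit, here is a minimal Python sketch that assembles the two -device arguments from the example above. The helper name zpci_device_args and the "zpci<fid>" naming of the zpci id are assumptions made for illustration; only the properties shown in the example command line (uid, fid, target, id, addr) are used.

```python
def zpci_device_args(uid, fid, pci_id, pci_device="virtio-net-pci", addr=None):
    """Build the two -device arguments that pair a zpci device with a PCI device.

    The zpci device points at the PCI device through its "target" property;
    the PCI device itself carries no reference to the zpci device.
    """
    # Hypothetical naming convention for the zpci id, mirroring the example.
    zpci = f"zpci,uid={uid},fid={fid},target={pci_id},id=zpci{fid}"
    pci = f"{pci_device},id={pci_id}"
    if addr is not None:
        pci += f",addr={addr:#x}"
    return ["-device", zpci, "-device", pci]


args = zpci_device_args(uid=12, fid=2, pci_id="vpci02", addr=0x2)
print(" ".join(args))
```

The printed arguments can be appended to the rest of the QEMU command line (machine, accelerator and cpu options stay as in the example).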
In the guest, this device will show up in lspci -v as
000c:00:00.0 Ethernet controller: Red Hat, Inc. Virtio network device
Subsystem: Red Hat, Inc. Device 0001
Physical Slot: 00000002
Note how the uid of 12 shows up as domain 000c and the fid of 2 as physical slot 00000002.
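The mapping can be reproduced with a few lines of Python. This is only a sketch: the function names are made up, and the field widths (four hex digits for the domain, eight for the physical slot) are simply read off the lspci output above.

```python
def guest_pci_address(uid):
    """Format the PCI address a Linux guest shows for a zpci uid.

    The uid becomes the domain; bus, slot and function are always 00:00.0
    because no topology is exposed to the guest.
    """
    return f"{uid:04x}:00:00.0"


def physical_slot(fid):
    """Format the fid the way lspci prints it as 'Physical Slot'."""
    return f"{fid:08x}"


print(guest_pci_address(12), physical_slot(2))
```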

Thursday, November 9, 2017

Notes from KVM Forum 2017

KVM Forum 2017 took place in Prague Oct 25 - 27 and I had the pleasure of attending. Let me share some of my notes and observations (not exhaustive in any way).

General notes

KVM Forum this year was quite large, but with enough space to sit down and talk to others (or do some hacking). As always, the hallway track was great to meet some people (both old acquaintances and folks you never met in real life before) and to discuss things face-to-face that would take more time done via the mailing lists or IRC.

The first day featured a single track (shared with OSS Europe), and also the invitation-only QEMU summit in the morning (minutes will be posted to qemu-devel). Days two and three were dual-track except for the first sessions. Obviously, this means I was only able to see a subset of sessions (and one also needs a break sometimes...); fortunately, videos are slowly making their way onto YouTube (check here for updates).


Christian Bornträger presented the KVM status, Paolo Bonzini the QEMU status and Peter Krempa the libvirt status. We seem to have a healthy development community, and interesting new topics still come up.


I listened with interest to Christoffer Dall's talks about KVM on ARM: Reducing hypervisor overhead, and nested virtualization. Seeing the virtualization architecture on ARM makes me really glad to work on s390, and it makes the efforts of the ARM folks all the more impressive (writing code for an architecture revision for which no hardware yet exists... oh dear).


Virtio is currently moving towards the 1.1 revision of the standard, with one of the biggest changes a new ring layout. Jens Freimann (who took over last minute from Michael S. Tsirkin, who unfortunately could not make it) presented about this ongoing work, and also gave some tips on how to get changes included in the standard (let me point to the OASIS virtio TC here). There was also a presentation about virtio-crypto, which I unfortunately was not able to attend.

VFIO (and the mediated device framework) continues to be a topic of interest. There were talks about enabling migration, buffer sharing, and adding support for a new platform bus. On a related note, we (the s390 maintainers) were able to sit down with Alex Williamson and Daniel Berrange to discuss the vfio-ap proposal for s390 crypto cards; this approach has a good chance of being workable.

Hannes Reinecke and Paolo Bonzini presented about the challenges of virtualizing SCSI Fibre Channel (NPIV): a good overview of what the challenges are and how we might be able to solve them.

Also of note: Improving virtio-blk performance with polling, introducing a paravirtualized RDMA device, and vhost-user-scsi and SPDK.


I'm currently trying to learn more about tcg and used the opportunity to attend two talks.

Alessandro Di Federico started his talk with a good general introduction to tcg. His work to split out a libtcg is interesting, but it might be done at the wrong level for usage with QEMU (or so I understood from the discussion.)

I also enjoyed Alex Bennée's talk about handling vectors in tcg (complete with an historical overview of vector instructions).


David Gilbert presented on the various reasons a migration might fail. Most migrations don't fail, so people tend to do them more and more in an automated way. If you have the misfortune of having a migration actually fail on you, his talk gives a lot of pointers on how to find out why it happened (and, hopefully, how to avoid the problem in the future).

Markus Armbruster gave an overview of the current QEMU command line infrastructure and how to improve it via QAPIfication (we probably need to sacrifice some rubber chickens, or rather some backward compatibility).

Also: KubeVirt, running large deployments via libvirt, and the effects of changed defaults on users.


KVM Forum 2017 was really interesting (if somewhat exhausting) for learning about new developments and discussing with people; if you are working in the area, I can only recommend trying to attend KVM Forum.

Tuesday, August 8, 2017

Channel I/O: More about channel paths

A recent discussion on qemu-devel touched upon some aspects of channel paths and their handling (or not-handling) in QEMU. I will try to summarize and give some further information here.

I previously published some information on channel paths here. This post will concentrate a bit more on aspects that are not yet relevant in QEMU, but may become so in the future.

To recap: Channel paths represent the means by which the mainframe talks to the device - they (somewhat) correspond to the actual cabling. Let's take a look at the output of lscss on a z/VM guest as an actual example:

Device   Subchan.  DevType CU Type Use  PIM PAM POM  CHPIDs
0.0.0150 0.0.0000  0000/00 3088/08      80  80  ff   08000000 00000000
0.0.0151 0.0.0001  0000/00 3088/08      80  80  ff   08000000 00000000
0.0.8000 0.0.0002  1732/01 1731/01 yes  80  80  ff   00000000 00000000
0.0.8001 0.0.0003  1732/01 1731/01 yes  80  80  ff   00000000 00000000
0.0.8002 0.0.0004  1732/01 1731/01 yes  80  80  ff   00000000 00000000
0.0.8003 0.0.0005  1732/01 1731/01      80  80  ff   01000000 00000000
0.0.8004 0.0.0006  1732/01 1731/01      80  80  ff   01000000 00000000
0.0.8005 0.0.0007  1732/01 1731/01      80  80  ff   01000000 00000000
0.0.0191 0.0.0008  3390/0a 3990/e9      e0  e0  ff   2a3a0900 00000000
0.0.208f 0.0.0009  3390/0c 3990/e9 yes  e0  e0  ff   3a2a1a00 00000000
0.0.218f 0.0.000a  3390/0c 3990/e9 yes  e0  e0  ff   2a3a0900 00000000
0.0.228f 0.0.000b  3390/0c 3990/e9 yes  e0  e0  ff   2a3a1a00 00000000
0.0.238f 0.0.000c  3390/0c 3990/e9 yes  e0  e0  ff   093a2a00 00000000
0.0.000c 0.0.000d  0000/00 2540/00      80  80  ff   08000000 00000000
0.0.000d 0.0.000e  0000/00 2540/00      80  80  ff   08000000 00000000
0.0.000e 0.0.000f  0000/00 1403/00      80  80  ff   08000000 00000000
0.0.0009 0.0.0010  0000/00 3215/00 yes  80  80  ff   08000000 00000000
0.0.0190 0.0.0011  3390/0a 3990/e9      e0  e0  ff   3a2a1a00 00000000
0.0.019d 0.0.0012  3390/0a 3990/e9      e0  e0  ff   093a2a00 00000000
0.0.019e 0.0.0013  3390/0a 3990/e9      e0  e0  ff   093a2a00 00000000
0.0.0592 0.0.0014  3390/0a 3990/e9      e0  e0  ff   3a2a1a00 00000000
0.0.ffff 0.0.0015  9336/10 6310/80      80  80  ff   08000000 00000000

A couple of interesting observations with regard to channel paths can be made here:
  • Devices 0.0.0150/0.0.0151, 0.0.000c/0.0.000d, 0.0.000e, 0.0.0009, and 0.0.ffff all share the same channel path, 08, despite being of different types (virtual CTC, virtual card punch/card reader/printer, virtual console, and virtual FBA DASD). This is because they are all emulated devices, and z/VM chooses to use the same virtual channel path for them.
  • Devices 0.0.8000 - 0.0.8002 use channel path 00 as their only channel path, as can be seen by the PIM being 80.
  • Devices 0.0.8000 - 0.0.8002 and 0.0.8003-0.0.8005 use the same channel path, respectively; that is because they make up the device triplet for an OSA device.
  • The remaining devices (all ECKD DASD) use several channel paths (09, 1a, 2a, 3a), but only three at a time (as evidenced by the PIM of e0), and also in different combinations. This is probably a quirk of the individual setup for this guest.
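The relationship between a path mask and the CHPIDs column can be sketched in Python. The function name is made up for illustration; the sketch assumes the usual convention that the leftmost bit of the mask (0x80) belongs to the first of the eight CHPID slots, which matches the observations above.

```python
def installed_chpids(pim, chpids):
    """Return the CHPIDs selected by a path mask.

    pim    -- path mask as printed by lscss, e.g. "e0"
    chpids -- CHPIDs column as printed by lscss, e.g. "2a3a0900 00000000"
    """
    mask = int(pim, 16)
    digits = chpids.replace(" ", "")
    # Eight slots of two hex digits each.
    slots = [digits[i:i + 2] for i in range(0, 16, 2)]
    # Bit 0 of the mask is the leftmost bit (0x80) and maps to slot 0.
    return [chpid for bit, chpid in enumerate(slots) if mask & (0x80 >> bit)]


print(installed_chpids("80", "08000000 00000000"))  # virtual devices
print(installed_chpids("e0", "2a3a0900 00000000"))  # an ECKD DASD
```

A mask of 80 selects only the first slot, while e0 selects the first three, which is why the ECKD DASD each show three channel paths.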
The output of lschp of the same guest looks like this:

CHPID  Vary  Cfg.  Type  Cmg  Shared  PCHID
0.00   1     1     11    -    -      (ff00)
0.01   1     1     11    -    -      (ff01)
0.08   1     1     1a    -    -       0598 
0.09   1     1     1a    -    -       0599 
0.0a   1     1     1a    -    -       059c 
0.0b   1     1     25    -    -       059d 
0.0c   1     1     1a    -    -       05ac 
0.17   1     1     11    -    -       05b4 
0.18   1     1     1a    -    -       05a0 
0.19   1     1     1a    -    -       05a1 
0.1a   1     1     1a    -    -       05a4 
0.1b   1     1     25    -    -       05a5 
0.1c   1     1     1a    -    -       05ad 
0.28   1     1     1a    -    -       05d8 
0.29   1     1     1a    -    -       05d9 
0.2a   1     1     1a    -    -       05dc 
0.2b   1     1     25    -    -       05dd 
0.2c   1     1     1a    -    -       05d0 
0.34   1     1     11    -    -       05ec 
0.35   1     1     11    -    -       05a8 
0.38   1     1     1a    -    -       05e0 
0.39   1     1     1a    -    -       05e1 
0.3a   1     1     1a    -    -       05e4 
0.3b   1     1     25    -    -       05e5 
0.3c   1     1     1a    -    -       05d1 
0.60   1     1     24    -    -      (070c)
0.61   1     1     24    -    -      (070d)
0.62   1     1     24    -    -      (070e)
0.63   1     1     24    -    -      (070f)

Here we find the various channel paths again, together with more:
  • There are several channel paths that are available to the guest, but not in use by any device currently available to the guest (and therefore not turning up in the output of lscss).
  • Channel paths 00 and 01 (used by the OSA cards) use an internal channel (the numbers in the last column are in brackets) - we can therefore conclude that the cards are virtualized by z/VM.
  • The channel path 08 (which is referenced by all virtual devices) is actually backed by a physical path (0598). I frankly have no idea why z/VM is doing that.
  • The channel paths used by the ECKD DASD (09, 1a, 2a, 3a) all are of the same type (1a - FICON, IIRC) and are backed by different physical paths (last column).
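For scripting, a line of lschp output can be split into its columns with a small Python helper. This is a sketch only: the ChannelPath type and the parse_lschp_line name are hypothetical, and treating a bracketed PCHID as "internal" follows the reading given above rather than any documented output format.

```python
from typing import NamedTuple


class ChannelPath(NamedTuple):
    chpid: str        # e.g. "0.3a"
    vary: bool        # logical state within Linux
    configured: bool  # configuration state
    chp_type: str     # hex type code, e.g. "1a"
    pchid: str        # last column, without brackets
    internal: bool    # True if the last column was printed in brackets


def parse_lschp_line(line):
    """Parse one data line of lschp output (not the header line)."""
    fields = line.split()
    pchid = fields[-1]
    internal = pchid.startswith("(") and pchid.endswith(")")
    return ChannelPath(
        chpid=fields[0],
        vary=fields[1] == "1",
        configured=fields[2] == "1",
        chp_type=fields[3],
        pchid=pchid.strip("()"),
        internal=internal,
    )


osa = parse_lschp_line("0.00   1     1     11    -    -      (ff00)")
dasd = parse_lschp_line("0.3a   1     1     1a    -    -       05e4 ")
print(osa.internal, dasd.internal)
```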
Various modifications can be done to the channel paths; under Linux, the chchp tool is useful for that. Let's try to vary off a path:

chchp -v 0 0.3a
Vary offline 0.3a... done.

lschp shows the changed state for the path:

0.3a   0     1     1a    -    -       05e4

The lscss output remains unchanged - which isn't surprising as doing a vary off only affects the state of the channel path within Linux: Linux will no longer use the path for I/O, but the path masks as managed by the hardware and z/VM are not changed.

Let's try to configure off another path:

chchp -c 0 0.2a
Configure standby 0.2a... failed - attribute value not as expected

That did not work as expected. Why? This is supposed to issue a SCLP command to set the channel path to standby - but my guest apparently does not have the rights or ability to do so. Which is a pity, as I would have liked to show the effects of configuring a channel path to standby:
  • It (unsurprisingly) changes the state in lschp.
  • It also changes the path masks, as shown in lscss.
  • It may generate a machine check with a channel report word (CRW) that informs the OS that something has happened to the channel path - this is dependent upon the environment, however.
So let's stop here. I'll continue with another setup, once I have it.

Thursday, June 1, 2017

Linux 4.12 and QEMU 2.10 will have basic support for vfio-ccw

If you want to pass through some channel devices to your guest, you will be able to do so with a host kernel >= 4.12 and a QEMU >= 2.10.

For some hints about configuration and restrictions, see this entry in the QEMU wiki.

Wednesday, May 10, 2017

Channel I/O: Types of devices

The last posts in this series tried to examine some basic principles of channel I/O. But what kinds of devices are actually available?

I'll focus on the device types that are available to a guest running under QEMU.

The most important channel devices for QEMU guests (and often, the only ones present in a guest) are virtio-ccw devices. These have been used in the previous examples. Think of them as the channel I/O equivalent of virtio-pci devices: that is, a device that is discoverable in the guest and acts as a means to access the virtio device.
All virtio-ccw devices share the following characteristics:
  • Fully virtual (i.e., fully emulated). There is no "real hardware" virtio-ccw device.
  • A control unit type of 0x3832.
  • One virtual channel path, type 0x32.

New (well, to QEMU) and just recently added (will be in 2.10) are 3270 devices (the channel-attached variety). These are the classic green-screen consoles; some details about what works (and what doesn't yet) and how to set this up may be found in the QEMU wiki.
3270 devices have the following characteristics:
  • Fully virtual (emulated). While you could pass through 3270 devices while running under z/VM, this depends on the not-yet-merged vfio-ccw infrastructure (see below) and does not really make much sense.
  • A control unit type of 0x3270.
  • One virtual channel path, type 0x1a.

Still being worked on, but on a good track, is the vfio-ccw infrastructure. The kernel part has been merged for 4.12, the QEMU part will hopefully be merged soon. vfio-ccw brings the same functionality to channel devices that vfio-pci brought to PCI devices: giving hardware devices to the guest to use. This is still quite experimental, and has only really been tested with one device type so far: ECKD DASD.
'DASD' basically refers to disks; this wikipedia article explains more. 'ECKD' refers to the data recording format; this wikipedia article probably explains more than you ever wanted to know. Linux accesses ECKD DASD as block devices with some minor oddities.
If you pass through an ECKD DASD, you can expect the following to show up in your guest:
  • A device that corresponds one-to-one to a device on the host, although it might have a different device number (depending on how it was configured).
  • A control unit and device type corresponding to ECKD DASD.  A control unit type of 0x3990 and a device type of 0x3390 are the most likely.
  • One to eight channel paths, corresponding to real channel paths.
vfio-ccw opens the way to expose all kinds of channel devices to QEMU/KVM guests: FBA DASD, channel-attached tapes - basically everything that is supported by Linux on the host.
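The control unit types listed in this post can be collected into a small lookup, for example to label the "CU Type" column of lscss output. This is a sketch: the table contains only the three types named here, the function name is made up, and 0x3990 is merely the most likely value for ECKD DASD, not the only possible one.

```python
# Control unit types mentioned in this post; anything else is reported as unknown.
CU_TYPES = {
    0x3832: "virtio-ccw",
    0x3270: "3270",
    0x3990: "ECKD DASD",
}


def describe_cu(cu_field):
    """Map an lscss 'CU Type' field such as '3990/e9' to a device family."""
    cu_type = int(cu_field.split("/")[0], 16)
    return CU_TYPES.get(cu_type, "unknown")


print(describe_cu("3832/01"), describe_cu("3990/e9"), describe_cu("3088/08"))
```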

Wednesday, April 12, 2017

SELinux vs. QEMU/KVM

Trying to run QEMU/KVM under an older version of z/VM? Make sure to read Thomas' hint.

Thursday, March 30, 2017

Oldies but Goldies: Channel I/O KVM Forum 2012 talk

Some of the information has been superseded in the meantime, but the slides from my talk at the 2012 KVM Forum contain some information that may still be interesting. (Sadly, no video of the talk was recorded.)