Email: email@example.com • Twitter: @benhutchingsuk • Debian: benh • Gitweb: git.decadent.org.uk • Github: github.com/bwhacks
This conference only has a single track, so I attended almost all the talks. This time I didn't take notes but I've summarised all the talks I attended. This is the second and last part of that; see part 1 if you missed it.
Speaker: Jesper Dangaard Brouer
Details and slides: https://kernel-recipes.org/en/2019/xdp-closer-integration-with-network-stack/
The speaker introduced XDP and how it can improve network performance.
The Linux network stack is extremely flexible and configurable, but this comes at some performance cost. The kernel has to generate a lot of metadata about every packet and check many different control hooks while handling it.
The eXpress Data Path (XDP) was introduced a few years ago to provide a standard API for doing some receive packet handling earlier, in a driver or in hardware (where possible). XDP rules can drop unwanted packets, forward them, pass them directly to user-space, or allow them to continue through the network stack as normal.
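As a rough illustration of the four outcomes listed above, here is a toy model of an early per-packet decision. It is only a sketch: real XDP programs are eBPF code attached to a network device, and the rule sets, names and addresses below are all invented for the example.

```python
from enum import Enum


class Verdict(Enum):
    """The four outcomes described above (toy names, not the kernel's)."""
    DROP = "drop"
    FORWARD = "forward"
    TO_USER_SPACE = "to_user_space"
    PASS_TO_STACK = "pass_to_stack"


# Hypothetical rule sets, invented for this example.
BLOCKED_SOURCES = {"203.0.113.7"}
FORWARDED_PORTS = {8080}
USER_SPACE_PORTS = {5353}


def early_verdict(src_ip: str, dst_port: int) -> Verdict:
    """Decide what to do with a packet before the full stack sees it."""
    if src_ip in BLOCKED_SOURCES:
        return Verdict.DROP
    if dst_port in FORWARDED_PORTS:
        return Verdict.FORWARD
    if dst_port in USER_SPACE_PORTS:
        return Verdict.TO_USER_SPACE
    # Anything not matched continues through the normal network stack.
    return Verdict.PASS_TO_STACK
```

The point of making the decision this early is that dropped or redirected packets never incur the per-packet metadata and hook overhead mentioned above.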
He went on to talk about how recent and proposed future extensions to XDP allow re-using parts of the standard network stack selectively.
This talk was meant for kernel developers in general, but I don't think it would be understandable without some prior knowledge of the Linux network stack.
Speaker: Jens Axboe
Details and slides: https://kernel-recipes.org/en/2019/talks/faster-io-through-io_uring/
Video: Youtube. (This is part way through the talk, but the earlier part is missing audio.)
The normal APIs for file I/O are blocking, i.e. they make the calling thread sleep until I/O is complete. There is a separate kernel API and library for asynchronous I/O (AIO), but it is very restricted; in particular it only supports direct (uncached) I/O. It also requires two system calls per operation, whereas blocking I/O only requires one.
The io_uring API was introduced as an entirely new API for asynchronous I/O. It uses ring buffers, similar to hardware DMA rings, to communicate operations and completion status between user-space and the kernel, which is far more efficient. It also removes most of the restrictions of the current AIO API.
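The ring-buffer idea can be sketched with a toy model: one ring carries requests towards the "kernel" and another carries completions back. This is not the io_uring API itself; the `Ring` class, the `toy_kernel` function and the request tuple layout are assumptions made up for illustration.

```python
class Ring:
    """Fixed-size single-producer/single-consumer ring (power-of-two size)."""

    def __init__(self, entries: int) -> None:
        if entries <= 0 or entries & (entries - 1):
            raise ValueError("entries must be a power of two")
        self.slots = [None] * entries
        self.mask = entries - 1
        self.head = 0  # advanced only by the consumer
        self.tail = 0  # advanced only by the producer

    def push(self, item) -> bool:
        if self.tail - self.head == len(self.slots):
            return False  # ring is full
        self.slots[self.tail & self.mask] = item
        self.tail += 1
        return True

    def pop(self):
        if self.head == self.tail:
            return None  # ring is empty
        item = self.slots[self.head & self.mask]
        self.head += 1
        return item


def toy_kernel(sq: Ring, cq: Ring, files: dict) -> None:
    """Stand-in for the kernel side: drain submissions, post completions."""
    while (req := sq.pop()) is not None:
        user_data, name, offset, length = req
        cq.push((user_data, files[name][offset:offset + length]))


sq, cq = Ring(4), Ring(4)
files = {"a.txt": b"hello world"}
sq.push((1, "a.txt", 0, 5))
sq.push((2, "a.txt", 6, 5))
toy_kernel(sq, cq, files)  # a single "call" handles both queued requests
completions = [cq.pop(), cq.pop()]
```

Because each side only advances its own index, several requests can be queued and collected without a separate call per operation, which is where the efficiency gain comes from.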
The speaker went into the details of this API and showed performance comparisons.
Speaker: Bradley Kuhn
The speaker talked about the importance of the GNU GPL to the development of Linux, in particular the ability of individual developers to get complete source code and to modify it to their local needs.
He described how, for a large proportion of devices running Linux, the complete source for the kernel is not made available, even though this is required by the GPL. So there is a need for GPL enforcement—demanding full sources from distributors of Linux and other works covered by GPL, and if necessary suing to obtain them. This is one of the activities of his employer, Software Freedom Conservancy, and has been carried out by others, particularly Harald Welte.
In one notable case, the Linksys WRT54G, the release of source after a lawsuit led to the creation of the OpenWRT project. This is still going many years later and supports a wide range of networking devices. He proposed that the Conservancy's enforcement activity should, in the short term, concentrate on a particular class of device where there would likely be interest in creating a similar project.
Speaker: Eric Leblond
Details and slides: https://kernel-recipes.org/en/2019/talks/suricata-and-xdp/
The speaker described briefly how an Intrusion Detection System (IDS) interfaces to a network, and why it's important to be able to receive and inspect all relevant packets.
He then described how the Suricata IDS uses eXpress Data Path (XDP, explained in an earlier talk) to filter and direct packets, improving its ability to handle very high packet rates.
Speaker: Greg Kroah-Hartman
Details and slides: https://kernel-recipes.org/en/2019/talks/cves-are-dead-long-live-the-cve/
Common Vulnerabilities and Exposures Identifiers (CVE IDs) are a standard, compact way to refer to specific software and hardware security flaws.
The speaker explained problems with the way CVE IDs are currently assigned and described, including assignments for bugs that don't impact security, lack of assignment for many bugs that do, incorrect severity scores, and missing information about the changes required to fix the issue. (My work on CIP's kernel CVE tracker addresses some of these problems.)
The average time between assignment of a CVE ID and a fix being published is apparently negative for the kernel, because most such IDs are being assigned retrospectively.
He proposed to replace CVE IDs with "change IDs" (i.e. abbreviated git commit hashes) identifying bug fixes.
Speaker: Enric Balletbo i Serra
The speaker talked about how the Chrome OS developers have tried to reduce the difference between the kernels running on Chromebooks and the upstream kernel versions they are based on. This has succeeded to the point that it is possible to run a current mainline kernel on at least some Chromebooks (which he demonstrated).
Speaker: Daniel Bristot de Oliveira
Details and slides: https://kernel-recipes.org/en/2019/talks/formal-modeling-made-easy/
The speaker explained how formal modelling of (parts of) the kernel could be valuable. A formal model will describe how some part of the kernel works, in a way that can be analysed and proven to have certain properties. It is also necessary to verify that the model actually matches the kernel's implementation.
He explained the methodology he used for modelling the real-time scheduler provided by the PREEMPT_RT patch set. The model used a number of finite state machines (automata), with conditions on state transitions that could refer to other state machines. He added (I think) tracepoints for all state transitions in the actual code and a kernel module that verified that at each such transition the model's conditions were met.
In the process of this he found a number of bugs in the scheduler.
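To give a feel for the approach, here is a minimal sketch of checking a trace of events against a model made of two automata, where one transition has a condition on the other automaton's state. The automata, event names and the condition are hypothetical and much simpler than anything in the speaker's actual model.

```python
class ModelViolation(Exception):
    """Raised when a trace event is not allowed by the model."""


class Automaton:
    def __init__(self, name, initial, transitions):
        self.name = name
        self.state = initial
        self.transitions = transitions  # {(state, event): next_state}

    def step(self, event):
        key = (self.state, event)
        if key not in self.transitions:
            raise ModelViolation(f"{self.name}: {event} in state {self.state}")
        self.state = self.transitions[key]


def check_trace(events):
    """Replay events on a fresh model; return index of first violation or None."""
    preempt = Automaton("preempt", "enabled", {
        ("enabled", "preempt_disable"): "disabled",
        ("disabled", "preempt_enable"): "enabled",
    })
    sched = Automaton("sched", "running", {
        ("running", "schedule_entry"): "scheduling",
        ("scheduling", "schedule_exit"): "running",
    })
    for index, event in enumerate(events):
        try:
            if event.startswith("preempt_"):
                preempt.step(event)
            else:
                # Condition referring to another automaton (assumed rule):
                # the scheduler is only entered with preemption disabled.
                if event == "schedule_entry" and preempt.state != "disabled":
                    raise ModelViolation("schedule_entry with preemption enabled")
                sched.step(event)
        except ModelViolation:
            return index
    return None
```

In the setup described in the talk the events would come from tracepoints in the running kernel; here they are just a list of strings.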
Speaker: Jonathan Corbet
Details and slides: https://kernel-recipes.org/en/2019/kernel-documentation-past-present-and-future/
The speaker is the maintainer of the Linux kernel's in-tree documentation. He spoke about how the documentation has been reorganised and reformatted in the past few years, and what work is still to be done.
Speaker: Jose E Marchesi
The speaker introduced and demonstrated his project, the "poke" binary editor, which he thinks is approaching a first release. It has a fairly powerful and expressive language which is used for both interactive commands and scripts. Type definitions are somewhat C-like, but poke adds constraints, offset/size types with units, and types of arbitrary bit width.
The expected usage seems to be that you write a script ("pickle") that defines the structure of a binary file format, use poke interactively or through another script to map the structures onto a specific file, and then read or edit specific fields in the file.
I was assigned 20 hours of work by Freexian's Debian LTS initiative and worked all those hours this month.
I prepared and, after review, released Linux 3.16.74, including various security and other fixes. I then rebased the Debian package onto that. I uploaded that with a small number of other fixes and issued DLA-1930-1.
I backported the latest security update for Linux 4.9 from stretch to jessie and issued DLA-1940-1 for that.
This conference only has a single track, so I attended almost all the talks. This time I didn't take notes but I've summarised all the talks I attended.
Updated: Noted slides are available for all talks. Added links to the video streams.
Speaker: Steven Rostedt
This talk explains how the kernel's function tracing mechanism (ftrace) works, and describes some of its development history.
It was quite interesting, but you probably don't need to know this stuff unless you're touching the ftrace implementation.
Speakers: Dodji Seketeli, Jessica Yu, Matthias Männich
The upstream kernel does not have a stable ABI (or API) for use by modules, but OS distributors often want to support the use of out-of-tree modules by ensuring that at least some subset of the kernel ABI remains stable within a given OS release.
Currently the kernel build process generates a "version" or "CRC" for each exported symbol by parsing the relevant type definitions. There is a load-time ABI check based on comparing these, and distributors can compare them at build time to detect ABI breaks. However this doesn't work that well and it's hard to work out what caused a change.
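A toy version of the checksum idea shows both why it detects ABI changes and why it is hard to diagnose them. This is only a sketch: it uses `zlib.crc32` over normalised text, which is not how the kernel's own tool computes its values, and the function and struct names are made up.

```python
import zlib


def symbol_crc(signature, type_definitions):
    """Checksum a symbol's prototype plus every type definition it depends on."""
    parts = [signature, *type_definitions]
    # Collapse whitespace so that pure formatting changes do not matter.
    text = " ".join(" ".join(part.split()) for part in parts)
    return zlib.crc32(text.encode("utf-8"))


old = symbol_crc("int frob(struct widget *w)",
                 ["struct widget { int id; }"])
reformatted = symbol_crc("int  frob(struct widget *w)",
                         ["struct widget {  int id; }"])
changed = symbol_crc("int frob(struct widget *w)",
                     ["struct widget { int id; long flags; }"])
```

Comparing `old` with `changed` reveals that something differs, but the two numbers alone say nothing about which type or member changed.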
The speaker develops the "libabigail" library and tools. These can extract ABI definitions from standard debug information (DWARF), and then analyse and compare ABIs for different versions of a shared library, or of the Linux kernel and modules. They are likely to replace the kernel's current symbol versioning approach at some point. He talked about the capabilities of libabigail, plans for improving it, and some limitations of C ABI checkers.
Speaker: Alexei Starovoitov
Details and slides: https://kernel-recipes.org/en/2019/talks/bpf-at-facebook/
The Berkeley Packet Filter (BPF) is a simple virtual machine implemented by several kernels. It allows user-space to add code that runs in kernel context, without compromising the integrity of the kernel.
In recent years Linux has extended this virtual machine architecture to create eBPF, which is expressive enough to be targeted by general-purpose compilers such as Clang and (in the near future) gcc. eBPF can be used for filtering network packets (the original purpose of BPF), tracing events, and many other purposes.
The speaker talked about practical experiences using eBPF with tracing at Facebook. These mainly involved investigating performance problems. He also talked about the difficulties of doing this on production servers without developer tools installed, and how this is being addressed.
Speaker: Thomas Gleixner
Details and slides: https://kernel-recipes.org/en/2019/talks/kernel-hacking-behind-closed-doors/
The speaker talked about how kernel developers and hardware vendors have been handling speculative execution vulnerabilities, and the friction between the vendors' preferred process and the usual kernel development processes.
He described the mailing list manager he wrote to support discussion of security issues with a long embargo period, which sends and receives encrypted messages in both S/MIME and PGP/MIME formats (depending on the subscriber).
Finally he talked about the process that has been settled on for handling future issues of this type with minimal legal paperwork.
This was somewhat marred by a lawyer joke and a generally combative attitude to hardware vendors.
Speaker: Rafael Wysocki
The Linux device model represents all devices as a simple hierarchy. Driver binding and unbinding (probe/remove), and power management operations, are sequenced based on the assumption that a device only depends on its parent in the device model.
On PCs, additional dependencies are often hidden behind abstractions such as ACPI, so that Linux does not need to be aware of them. On most embedded systems, however, such abstractions are usually missing and Linux does need to be aware of additional dependencies.
(A few years ago, the device driver core gained support for an error code from probe (-EPROBE_DEFER) that indicates that some dependency is not yet bound, and causes the device to be re-probed later. But this is an incomplete, stop-gap solution.)
The speaker described the new "device links" API which provides a way to record additional dependencies in the device model. The device driver core will use this information to sequence operations on multiple devices correctly.
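Once the extra dependencies are recorded, sequencing becomes an ordering problem on a graph rather than a walk of the parent/child tree. The sketch below uses Python's `graphlib` to show the idea; the device names and their dependencies are invented, and this is not how the driver core is implemented.

```python
from graphlib import TopologicalSorter

# Hypothetical devices: each key lists the suppliers it depends on.
# The parent/child hierarchy alone would not capture the panel's need
# for its regulator and backlight.
suppliers = {
    "soc-bus": [],
    "i2c-controller": ["soc-bus"],
    "regulator": ["i2c-controller"],
    "backlight": ["regulator"],
    "display-panel": ["soc-bus", "regulator", "backlight"],
}

# Suppliers come before the devices that consume them.
probe_order = list(TopologicalSorter(suppliers).static_order())

# Assumed here: operations such as suspend run in the opposite order,
# so consumers stop before the suppliers they rely on.
suspend_order = list(reversed(probe_order))
```

With the ordering known in advance there is no need to probe a device, fail, and retry later.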
Speaker: Aurélien Rougemont
Details and slides: https://kernel-recipes.org/en/2019/metrics-are-money/
The speaker talked about several instances from his experience where system metrics were used to justify buying or rejecting new hardware. In some cases, these metrics were not accurate or consistent, which could lead to bad decisions. He made a plea for better documentation of metrics reported by the Linux kernel.
Speaker: Julien Thierry
Linux typically uses Non-Maskable Interrupts (NMIs) for Performance Monitoring Unit (PMU) interrupts. NMIs are (almost) never disabled, so this allows interrupt handlers and other code that runs with interrupts disabled to be profiled accurately. On architectures that do not have NMIs, typically Linux can use the highest interrupt priority for this instead, and only mask the lower priorities.
On the Arm architecture, there is no NMI but there are two architectural interrupt priority levels (IRQ and FIQ). However on 64-bit Arm systems FIQ is typically reserved to system firmware so Linux only uses IRQ. This results in inaccurate profiling.
The speaker described the implementation of a pseudo-NMI for 64-bit Arm. This is done by leaving IRQs enabled on the CPU and masking them selectively on the Arm generic interrupt controller (GIC), which supports many more priority levels. However this effectively requires GIC v3 or v4 because these operations are prohibitively slow on earlier versions.
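The masking trick can be modelled in a few lines. In this sketch, "disabling interrupts" raises a priority threshold instead of blocking everything, so an interrupt configured with a more urgent priority still gets through. The numeric values, the convention that smaller numbers are more urgent, and the class and method names are all assumptions chosen for the example.

```python
# Toy priority values; smaller numbers are treated as more urgent.
PRIO_NMI = 0x20
PRIO_IRQ = 0xA0

MASK_ALLOW_ALL = 0xF0   # both priorities get through
MASK_BLOCK_IRQS = 0x60  # only priorities more urgent than 0x60 get through


class ToyCpu:
    def __init__(self):
        self.priority_mask = MASK_ALLOW_ALL

    def irq_disable(self):
        """'Disable interrupts' by raising the bar, not by blocking all."""
        self.priority_mask = MASK_BLOCK_IRQS

    def irq_enable(self):
        self.priority_mask = MASK_ALLOW_ALL

    def is_delivered(self, priority: int) -> bool:
        return priority < self.priority_mask
```

Every enable/disable therefore becomes a write to the interrupt controller's mask rather than a CPU flag change, which is why the cost of that operation matters so much.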
Speaker: Jean Delvare
The speaker talked about the history of standardised DRAM modules (SIMMs and DIMMs) and how system firmware can detect them and find out their size and timing requirements.
DIMMs expose this information through Serial Presence Detect (SPD) which until recently used standard 256-byte I²C EEPROMs.
For the latest generation of DIMMs (DDR4), the configuration information can be larger than 256 bytes and a new interface was required. Jean described and criticised this interface.
He also talked about the Linux drivers and utilities that can be used to read the SPD EEPROMs.
Speaker: Peter Zijlstra
LWN article: https://lwn.net/Articles/799454/
This was about restricting which tasks share a core on CPUs with SMT/hyperthreading. There is current interest in doing this as a mitigation for speculation leaks, instead of disabling SMT altogether.
SMT also makes single-thread processing speed quite unpredictable, which is bad for RT, so it would be useful to prevent scheduling any other tasks on the same core as an RT task.
Speakers: Jim Hull and Betty Dall of HPE
Connections are point-to-point between "components". Switch components provide fan-out.
Components can be subdivided into "resources" and also have "interfaces".
No requirement for a single root (like typical PCIe) and there can be redundant connections forming a mesh.
Fabric can span multiple logical computers (OS instances). Fabric manager assigns components and resources to them, and configures routing.
Protocol is reliable; all writes are acknowledged (by default). However it is not ordered by default.
Components have single control space (like config space?) and single data space (up to 2⁶⁴ bytes). Control space has a fixed header and then additional structures for optional and per-interface registers.
Each component has 12-bit component ID (CID) which may be combined with 16-bit subnet ID (SID) for 28-bit global component ID (GCID).
Coherence is managed by software.
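The ID arithmetic mentioned above (12-bit CID plus 16-bit SID giving a 28-bit GCID) can be sketched as simple bit packing. The bit layout used here, with the SID in the upper bits and the CID in the lower 12, is an assumption for illustration, as are the function names.

```python
CID_BITS = 12
SID_BITS = 16
CID_MASK = (1 << CID_BITS) - 1
SID_MASK = (1 << SID_BITS) - 1


def make_gcid(sid: int, cid: int) -> int:
    """Pack a subnet ID and component ID into one 28-bit value."""
    if not 0 <= sid <= SID_MASK:
        raise ValueError("SID out of range")
    if not 0 <= cid <= CID_MASK:
        raise ValueError("CID out of range")
    # Assumed layout: SID in the upper 16 bits, CID in the lower 12.
    return (sid << CID_BITS) | cid


def split_gcid(gcid: int) -> tuple:
    """Return (sid, cid) for a packed value."""
    return gcid >> CID_BITS, gcid & CID_MASK
```

For example, `make_gcid(0x0001, 0x002)` gives `0x1002`, and the largest possible value fits in 28 bits.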
Bridge from CPU to Gen-Z needs MMUs to map between local physical address space and fabric address space. Normally also has DMA engines ("data movers") that can send and receive all types of Gen-Z packets and not just read/write. These bridges are configured by the local OS instance, not the fabric manager.
Should behave similarly to PCI and USB, so far as possible. Leave policy to user-space. Deal with the fact that most features are optional.
The Gen-Z subsystem needs to provide APIs for tracking PASIDs in IOMMU and ZMMU. Similar requirements in PCIe; should this be generic?
How can Gen-Z device memories be mapped with huge pages?
Undecided whether a generic kernel API for data movers is desirable. This would help kernel I/O drivers but not user-space I/O (like RDMA).
Interrupts work very differently from MSI. Bridge may generate interrupts for explicit interrupt packets, data mover completions, and Unsolicited Event Packets (link change, hotplug, …).
All nodes run local management services. On Linux these will be in user-space (LLaMaS).
(This means LLaMaS will need to be included in the initramfs if the boot device is attached through Gen-Z.)
Manager will use netlink to announce when resource has been assigned to the local node. Kernel then creates kernel device for it.
Moderator: Joe Lawrence
Speaker: Dmitry Vyukov
Dmitry outlined how the current kernel development processes are failing:
It takes a long time for new developers to become productive, or for developers to contribute to unfamiliar subsystems.
(None of this was new to me, but spelling out all these issues definitely had an impact.)
He advocates more consolidation and consistency, so that:
There was further discussion of this at the Kernel Maintainer Summit, reported in https://lwn.net/Articles/799134/.
Here's the second chunk of notes I took at Linux Plumbers Conference earlier this month. Part 1 covered the Distribution kernels track.
Moderators: George Wilson and Serapheim Dimitropoulos from Delphix; Omar Sandoval from Facebook
Problem: ability to easily analyse failures in production (live system) or post-mortem (crash dump).
Debuggers need to:
Most people present use crash; one mentioned crash-python (aka pycrash) and one uses kgdb.
crash-python is a Python layer on top of a gdb fork. Uses libkdumpfile to decode compressed crash-dumps.
drgn (aka Dragon) is a debugger-as-a-library. Excels in introspection of live systems and crash-dumps, and covers both kernel and user-space. It can be extended through Python. As a library it can be imported and used from the Python REPL.
sdb is Delphix's front-end to drgn, providing a more shell-like interactive interface. Example of syntax:
> modules | filter obj.refcnt.counter > 10 | member name
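The pipeline above reads naturally as a chain of Python generators. The sketch below mimics its three stages on invented sample data; it does not use the drgn or sdb APIs, and the helper names and module list are made up.

```python
from types import SimpleNamespace


def make_module(name, refcnt):
    """Build a stand-in object with the fields the pipeline touches."""
    return SimpleNamespace(name=name, refcnt=SimpleNamespace(counter=refcnt))


def modules():
    # Invented sample data standing in for the kernel's module list.
    yield make_module("ext4", 14)
    yield make_module("loop", 2)
    yield make_module("nf_tables", 31)


def filter_objs(objs, predicate):
    """Pass through only the objects for which the predicate holds."""
    return (obj for obj in objs if predicate(obj))


def member(objs, field):
    """Replace each object by one of its members."""
    return (getattr(obj, field) for obj in objs)


names = list(
    member(filter_objs(modules(), lambda obj: obj.refcnt.counter > 10), "name")
)
```

Each stage consumes objects lazily from the previous one, which is the same shape as the shell-style syntax.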
Currently it doesn't always have good type information for memory. A raw virtual address can be typed using the "cast" command in a pipeline. Hoping that BTF will allow doing better.
Allows defining pretty-print functions, though it appears these have to be explicitly invoked.
Answering tough questions:
Some discussion around the fact that drgn has a lot of code that's dependent on kernel version, as internal structures change. How can it be kept in sync with the kernel? Could some of that code be moved into the kernel tree?
Omar (I think) said that his approach was to make drgn support multiple versions of structure definitions.
Q: How does this scale to the many different kernel branches that are used in different distributions and different hardware platforms?
A: drgn will pick up BTF structure definitions. When BTF is available the code only needs to handle addition/removal of members it accesses.
Brendan Gregg made a plea to distro maintainers to enable BTF.
Moderator: Hans de Goede of Red Hat
Pain points and missing pieces with Wayland, or specifically GNOME Shell:
ssh -X. Pipewire goes some way to the solution. The whole desktop can be remoted over RDP which can be tunnelled over SSH.
Speaker: Peter Robinson of Red Hat
Can now use u-boot with UEFI support on most Arm hardware. Much easier to use a common kernel on multiple hardware platforms, and UEFI boot can be assumed.
"Enterprise" and "industrial" IoT is not a Raspberry Pi. Problems result from a lot of user-space assuming the world is an RPi.
Is bluez still maintained? No user-space releases for 15 months! Upstream not convinced this is a problem, but distributions now out of synch as they have to choose between last release and arbitrary git snapshot.
Wi-fi and Bluetooth firmware fixes (including security fixes) missing from linux-firmware.git. RPi Foundation has improved Bluetooth firmware for the chip they use but no-one else can redistribute it.
Lots of user-space uses /sys/class/gpio, which is now deprecated and can be disabled in kconfig. libgpiod would abstract this, but has poor documentation. Most other GPIO libraries don't work with new GPIO
Similar issues with IIO - a lot of user-space doesn't use it but uses user-space drivers banging GPIOs etc. libiio exists but again has poor documentation.
For some drivers, even newly added drivers, the firmware has not been added to linux-firmware.git. Isn't there a policy that it should be? It seems to be an unwritten rule at present.
Speaker: Kees Cook of Google
LWN article: https://lwn.net/Articles/798913/
Speaker: Dodji Seketeli of Red Hat
Speaker: Maciej Rozycki of WDC
LWN article: https://lwn.net/Articles/799331/