What's new in the Linux kernel
and what's missing in Debian
Ben Hutchings
Ben Hutchings
-
Professional software engineer by day, Debian developer by night
(or sometimes the other way round)
-
Regular Linux contributor in both roles since 2008
-
Working on various drivers and kernel code in my day job
-
Debian kernel team member, now doing most of the unstable
maintenance aside from ports
-
Maintaining Linux 3.2.y stable update series on
kernel.org
Linux releases early and often
-
Linux is released about 5 times a year (plus stable updates
every week or two)
-
...though some features aren't ready to use when they first
appear in a release
-
Since my talk last year, Linus has made 6 releases (3.11-3.16)
-
Good news: we have lots of new kernel features in testing/unstable
-
Bad news: some of them won't really work without new userland
Recap of last year's features (1)
-
Team device driver: userland package (libteam) was uploaded in
October
-
Transcendent memory: frontswap, zswap and Xen tmem will be
enabled in next kernel upload
-
New KMS drivers: should all work with current Xorg drivers
-
Module signing: still not enabled, but probably will be if we
do Secure Boot
Recap of last year's features (2)
-
More support for discard: still not enabled at install time
(#690977)
-
More support for containers: XFS was fixed, and user namespaces
have been enabled
-
bcache: userland package (bcache-tools) still not quite ready
(#708132)
-
ARMv7 multiplatform: d-i works on some platforms but
I'm still not sure which. Some progress on GPU drivers, but not
in Debian yet.
Unnamed temporary files [3.11]
-
Open directory with option O_TMPFILE to create an
unnamed temporary file on that filesystem
-
As with tmpfile(), the file disappears on
last close()
-
File can be linked into the filesystem using
linkat(..., AT_EMPTY_PATH), allowing for 'atomic'
creation of file with complete contents and metadata
-
Not supported on all filesystem types, so you will usually need
a fallback
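The open-plus-fallback pattern above can be sketched in Python (os.O_TMPFILE is exposed on Linux from Python 3.4; the directory path and the mkstemp() fallback are illustrative choices, not part of the kernel interface):

```python
import os
import tempfile

def open_unnamed(dir_path):
    """Open an unnamed temporary file in dir_path, falling back to
    mkstemp()+unlink() where O_TMPFILE is unsupported (a sketch)."""
    try:
        # O_TMPFILE needs Linux >= 3.11 and filesystem support
        return os.open(dir_path, os.O_TMPFILE | os.O_RDWR, 0o600)
    except (AttributeError, OSError):
        # Fallback: create a named file and immediately unlink it
        fd, name = tempfile.mkstemp(dir=dir_path)
        os.unlink(name)
        return fd

fd = open_unnamed("/tmp")
os.write(fd, b"scratch data")
os.lseek(fd, 0, os.SEEK_SET)
data = os.read(fd, 64)
os.close(fd)
```

To give the file a name afterwards, linkat() with AT_EMPTY_PATH is the kernel interface; from Python one can hard-link the /proc/self/fd/N path instead, but only when the file really was opened with O_TMPFILE, not with the unlinked fallback.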
Network busy-polling [3.11] (1)
A conventional network request/response process looks like:
-
Task calls send(); network stack constructs a
packet; driver adds it to hardware Tx queue
-
Task calls poll() or recv(), which blocks;
kernel puts it to sleep and possibly idles the CPU
-
Network adapter receives response and generates IRQ, waking
up CPU
-
Driver's IRQ handler schedules polling of the hardware Rx
queue (NAPI)
-
Kernel runs the driver's NAPI poll function, which passes
the response packet into the network stack
-
Network stack decodes packet headers and adds packet to
the task's socket
-
Network stack wakes up sleeping task; scheduler switches
to it and the socket call returns
Network busy-polling [3.11] (2)
-
If the driver supports busy-polling, it tags each received packet
with its NAPI context, and the kernel copies that tag to the
receiving socket
-
When busy-polling is enabled, poll()
and recv() call the driver's busy poll function to
check for packets synchronously (up to some time limit)
-
If the response usually arrives quickly, this reduces overall
request/response latency as there are no context switches and
power transitions
-
Time limit set by sysctl (net.core.busy_poll,
net.core.busy_read) or socket option (SOL_SOCKET,
SO_BUSY_POLL); requires tuning
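A minimal sketch of the socket-option side, assuming Linux: SO_BUSY_POLL has the value 46, which is used as a fallback because older Python versions do not export the constant, and setting the option needs CAP_NET_ADMIN, so the set is guarded:

```python
import socket

# SO_BUSY_POLL is 46 on Linux; older Python versions may not export it
SO_BUSY_POLL = getattr(socket, "SO_BUSY_POLL", 46)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

# Read the current per-socket busy-poll time in microseconds;
# 0 means the socket falls back to the net.core.busy_read sysctl
current = sock.getsockopt(socket.SOL_SOCKET, SO_BUSY_POLL)

try:
    # Setting the option requires CAP_NET_ADMIN, so this may
    # fail with EPERM when run unprivileged
    sock.setsockopt(socket.SOL_SOCKET, SO_BUSY_POLL, 50)
except PermissionError:
    pass

sock.close()
```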
Lustre filesystem [3.12]
-
A distributed filesystem, popular for cluster computing
applications
-
Developed out-of-tree since 1999, but now added to Linux staging
directory
-
Was included in squeeze but dropped from wheezy as it didn't
support Linux 3.2
-
Userland is now missing from Debian
Btrfs offline dedupe [3.12]
-
Btrfs generally copies and frees blocks, rather than updating
in-place
-
This allows snapshots and file copies to copy-by-reference,
deferring the real copying until changes are made
-
Filesystems may still end up with multiple copies of the same
file content
-
Btrfs doesn't actively merge these duplicates, but userland can
tell it to do so
-
Many file dedupe tools are packaged for Debian, but none of them
uses this Btrfs feature; those that do, e.g. bedup, are not yet
packaged
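The actual merging is done through Btrfs's same-extent ioctl, which tools like bedup drive; the scanning half that such tools need first - finding candidate duplicates by content hash - can be sketched in Python and works on any filesystem (whole-file hashing here, where a real tool would compare extent-sized ranges):

```python
import hashlib
import os

def find_duplicates(paths):
    """Group files by content hash; any group with more than one
    member is a candidate for Btrfs's same-extent ioctl (a sketch:
    hashes whole files, not extent-sized blocks)."""
    by_hash = {}
    for path in paths:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        by_hash.setdefault(h.hexdigest(), []).append(path)
    return [group for group in by_hash.values() if len(group) > 1]
```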
nftables [3.13]
-
Linux has several firewall APIs - iptables, ip6tables, arptables
and ebtables
-
All are limited to a single protocol, and need a kernel module
for each match type and each action
-
Kernel's internal netfilter API is more flexible
-
nftables exposes more of this flexibility, allowing userland
to provide firewall code for a specialised VM (similar to BPF)
-
nftables userland tool uses this API and is already packaged
-
Eventually, the old APIs will be removed and the old userland
tools will have to be ported to nftables
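As a hedged illustration of the flexibility, a minimal ruleset for the nft tool might look like this - the inet family covers IPv4 and IPv6 in a single table, something the iptables family of tools cannot express (ports and policy here are arbitrary example choices):

```
table inet filter {
	chain input {
		type filter hook input priority 0; policy drop;
		iif lo accept
		ct state established,related accept
		tcp dport 22 accept
	}
}
```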
User-space lockdep [3.14]
-
Kernel threads and interrupts all run in the same address space,
using several different synchronisation mechanisms
-
Easy to introduce bugs that can result in deadlock, but hard to
reproduce them
-
Kernel's 'lockdep' system dynamically tracks locking operations
and detects potential deadlocks
-
Now available as a userland library! But we still need to package
it (it is built from the linux-tools source package)
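liblockdep itself wraps pthread programs (linked against it or loaded via LD_PRELOAD); the idea behind it can be sketched with a toy single-threaded lock-order tracker - all names here are hypothetical, and unlike the real lockdep it only catches direct ABBA inversions, not longer cycles:

```python
class OrderTracker:
    """Toy sketch of lockdep's idea: record the observed ordering of
    lock acquisitions and flag an inversion as a potential deadlock."""

    def __init__(self):
        self.after = {}   # lock -> set of locks ever taken while holding it
        self.held = []    # locks currently held

    def acquire(self, lock):
        for h in self.held:
            # Have we ever seen 'h' taken while 'lock' was held?
            # If so, taking 'lock' while holding 'h' is an inversion.
            if h in self.after.get(lock, set()):
                raise RuntimeError(
                    "potential deadlock: %r -> %r inverts %r -> %r"
                    % (h, lock, lock, h))
            self.after.setdefault(h, set()).add(lock)
        self.held.append(lock)

    def release(self, lock):
        self.held.remove(lock)
```

The pay-off, as in the kernel, is that the inversion is reported as soon as the conflicting order is *observed*, even if the timing needed for an actual deadlock never occurs in the test run.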
arm64 and ppc64el ports
-
'arm64' architecture was added in Linux 3.7, but was not yet
usable, and no real hardware was available at the time
-
Upstream Linux arm64 kernel, and Debian packages, should now run
on emulators and real hardware
-
'powerpc' architecture has been available for many years,
but didn't support kernel running little-endian
-
Linux 3.13 added little-endian kernel support, along with new
userland ELF ABI variant - we call it ppc64el
-
Both ports now being bootstrapped in unstable and are candidates
for jessie release
File-private locking [3.15]
-
POSIX says that closing a file descriptor removes
the process's locks on that file
-
What if process has multiple file descriptors for the same
file? It loses all locks obtained through any descriptor!
-
Multithreaded processes may require serialisation around
file open/close to ensure they open each file exactly once
-
Hard and symbolic links can hide that two files are really the
same
-
Linux now provides file-private locks, associated with a
specific open file and removed when last descriptor for the
open file is closed
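The new behaviour can be demonstrated from Python; F_OFD_SETLK is exported by the fcntl module since Python 3.9 (value 37 on Linux is used as a fallback), and the struct flock layout packed below is an assumption valid for 64-bit Linux:

```python
import fcntl
import os
import struct
import tempfile

# Exported by the fcntl module since Python 3.9; 37 on Linux
F_OFD_SETLK = getattr(fcntl, "F_OFD_SETLK", 37)

def ofd_lock(fd, lock_type):
    """Try a non-blocking open-file-description lock on the whole
    file. struct flock layout assumed for 64-bit Linux:
    l_type, l_whence, l_start, l_len, l_pid (must be 0 for OFD)."""
    flock = struct.pack("hhqqq", lock_type, os.SEEK_SET, 0, 0, 0)
    fcntl.fcntl(fd, F_OFD_SETLK, flock)

fd_tmp, path = tempfile.mkstemp()
fd1 = os.open(path, os.O_RDWR)
fd2 = os.open(path, os.O_RDWR)

ofd_lock(fd1, fcntl.F_WRLCK)
try:
    # Unlike classic POSIX locks, the two descriptors conflict
    # even though they belong to the same process
    ofd_lock(fd2, fcntl.F_WRLCK)
    conflicted = False
except OSError:
    conflicted = True

for fd in (fd_tmp, fd1, fd2):
    os.close(fd)
os.unlink(path)
```

With classic F_SETLK the second request would have succeeded (same process), and closing either descriptor would have dropped both locks; with OFD locks each open file description keeps its own.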
Multiqueue block devices [3.16]
-
Each block device has a command queue (possibly shared with
other devices)
-
Queue may be partly implemented by hardware (NCQ) or only
in software
-
A single queue means that command initiation is serialised and
completion involves an IPI - this can be a bottleneck for fast
devices
-
High-end SSDs support multiple queues, but kernel needed changes
to use them
-
mtip32xx driver now supports multiqueue, but SCSI
drivers don't yet - may be backportable?
Questions?
Credits
-
Linux 'Tux' logo © Larry Ewing, Simon Budig.
- Modified by Ben to add Debian open-ND logo
-
Debian open-ND logo © Software in the Public Interest, Inc.