Capturing System Core Dumps

AHL Tech Article 'Capturing System Core Dumps'.

We recently published an article on process core dumping inside Docker containers. This article is a follow-up on the related topic of capturing system core dumps (also known as vmcores or crashdumps).

System core dumping

The goal of capturing system core dumps is the same as for process core dumps: they allow us to debug issues with our systems. In contrast to process core files, however, system core files are generated by the Linux kernel in response to the kernel itself crashing. They are invaluable in debugging kernel crashes, as they let you investigate what the system looked like at the time of the crash.

Traditionally, system core dumps are written to local disk in a special partition, or to a remote NFS share. The potential maximum size of a system core is the amount of RAM and swap space in a server, so these files can get pretty big. As with process core dumps, we don’t want to maintain a large local partition for system core dumps, and we also don’t have NFS in our production environment, so what can we do to enable us to capture system cores?

As with process core files, we use FTP as the transport mechanism. FTP is very fast and easy to script; however, the native RedHat/Centos crash utilities do not support FTP, so we’ve engineered a solution that does.

The use of FTP as a transport might sound somewhat controversial, but it has several good qualities:

  • It is a very fast protocol, approaching theoretical speed limits. SSH, by comparison, is significantly slower, which matters when system core files get large.
  • It does not need any special keys or secrets to work for us. We have an anonymous, write-only FTP server that we use to receive core dumps (inside our own network, of course).
  • Unlike NFS, we do not need a lot of ports opened up in firewalls, and there is no dependency on RPC.

You can find all files related to this article at our GitHub repo.

What happens when we crash?

To understand our approach to capturing system core dumps, it is first necessary to understand how system core generation works in RedHat/Centos.

When Linux first boots, the boot loader will usually instruct the Linux kernel to reserve an area of RAM (typically 128MB or more) as crashkernel space. This is an area of memory that the kernel will not touch during its normal course of operation. For example, on RedHat/Centos 7.x, grub usually passes the crashkernel=auto parameter to the kernel (see /boot/grub2/grub.cfg).
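You can see both the parameter and the size of the reserved area on a running system (the output below is illustrative; the exact kernel version and reservation size will differ per host):

```shell
# The crashkernel= parameter appears on the kernel command line:
$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-1160.el7.x86_64 ro crashkernel=auto ...

# The amount of memory actually reserved for the crashkernel, in bytes:
$ cat /sys/kernel/kexec_crash_size
169869312
```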

Systemd starts the kdump.service on boot. The kdump.service calls /usr/bin/kdumpctl, which loads a ‘dump-capture’ kernel into the reserved memory space. It does this by calling the /usr/sbin/kexec command line tool (see the kexec(8) man page).

kexec in turn calls the kexec_load() system call with the KEXEC_ON_CRASH flag. In addition to loading the dump-capture kernel and dump-capture initrd into the reserved memory area, this also causes the kernel to automatically start the dump-capture kernel if the system crashes (see the kexec_load(2) man page).
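To illustrate, loading a dump-capture kernel by hand looks roughly like this; kdumpctl does the equivalent for you, and the exact --append flags vary by distribution:

```shell
# Load the dump-capture kernel into the reserved area. -p ("load panic
# kernel") is what makes kexec use KEXEC_ON_CRASH. Requires root.
kexec -p /boot/vmlinuz-$(uname -r) \
      --initrd=/boot/initramfs-$(uname -r)kdump.img \
      --append="irqpoll nr_cpus=1 reset_devices"
```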

When the kernel panics (i.e. something calls the panic() function) it will print a stack trace and then call the crash_kexec() function, which boots the dump-capture kernel we loaded previously.

The dump-capture kernel will then start the /usr/sbin/kdump process, which actually creates the system core.

The dump-capture disk image contents

Like the real kernel, the dump-capture kernel has an associated image file with its initial file system. We can add files to this image, and they will then be available to us during dump capture.

The initrd image is built when the /usr/bin/kdumpctl script is called from kdump.service. It reads /etc/kdump.conf and creates an image file with the appropriate content (using mkdumprd and dracut), and stores it in the /boot file system.
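In practice, a configuration change is picked up by restarting the service, which triggers a rebuild of the image:

```shell
# Rebuild the dump-capture initrd after editing /etc/kdump.conf:
touch /etc/kdump.conf
systemctl restart kdump.service

# The resulting image lives alongside the normal initramfs in /boot:
ls /boot/initramfs-$(uname -r)kdump.img
```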

Our dump-capture image is a vanilla image, which we extend. We do this by instructing kdump to add extra files into the image (e.g. ncftpput, our pre-crash script, etc.). We use Ansible to create a custom “pre-hook” script per server from a template, which allows us to propagate IP (and other) information to the crashkernel.
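As a sketch, the relevant /etc/kdump.conf directives look something like this; extra_bins and kdump_pre are standard kdump directives, while the script path and binary choice are our own, shown here illustratively:

```shell
# /etc/kdump.conf (fragment)

# Pull extra binaries (and the libraries they need) into the image:
extra_bins /usr/bin/ncftpput

# Run our script via the first available hook; it takes over from kdump
# and never returns control to it:
kdump_pre /usr/local/bin/pre-crash.sh
```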

Propagating state to the crashkernel

It is important to realise that the dump-capture kernel does not “know” anything about the state of the previously running system. However, as it has access to the whole memory of the server, it can run a program to dump this memory to the crashdump file. In fact, the crashdump is a memory image with some bits stripped out.

As a consequence of not having state from the previously running OS, the network setup is lost, for example, and even local filesystems have to be remounted if we want to write a crashdump locally.

In order to send files over FTP, we obviously have to have a working network configuration. If we only had one subnet, we could imagine configuring the crashkernel to run with a specific, well-defined IP address. Kernel crashes are quite rare, and we’d be unlucky to have two kernels crash at the same time, so the fact that both crashkernels used the same IP would be unlikely to cause problems in the real world.

Unfortunately, we have many different subnets across our estate, and we don’t want to dedicate a specific IP per subnet for ‘crashing’; it would also create extra work every time we changed network configurations.

Luckily the kdump process inside the crashkernel has a set of hooks it runs at various stages during its operation. We start our own script using the first available hook, and our script then takes over from kdump (and we never return control back to it).

As we have control of the contents of the image, we can pre-populate the script with the IP address of the individual server, so that it has the requisite information to set up its IP configuration.
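A minimal sketch of such a per-server template, using the standard ansible_default_ipv4 facts (the template name and variable layout are hypothetical, not our production code):

```shell
# templates/pre-crash.sh.j2 (fragment) - rendered per host by Ansible
IFACE="{{ ansible_default_ipv4.interface }}"
STATIC_IP="{{ ansible_default_ipv4.address }}"
NETMASK="{{ ansible_default_ipv4.netmask }}"
GATEWAY="{{ ansible_default_ipv4.gateway }}"
```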

The pre-crash script

When the crash kernel starts, the kdump process inside it executes our pre-crash.sh script, which performs the following steps:

  1. Attempt to configure IP with DHCP.
  2. If the previous step fails, use a statically set IP / network route / hostname, as determined and configured when kdump built the unique disk image.
  3. Dump the stripped memory contents to our FTP server using anonymous FTP. We strip the image to make it smaller, in order to produce a “fast” crash dump. At this point, the server can be rebooted as we can get most of the required information out of the stripped crash dump.
  4. After the initial dump has completed, we produce a full dump. This is usually not required for kernel debugging, but it can be useful in some instances.
  5. Reboot the system.
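The steps above can be sketched as a shell script. This is a heavily simplified, hypothetical version, assuming ncftpput and makedumpfile are present in the image; the server name, interface, paths and dump levels are illustrative, not our production values:

```shell
#!/bin/sh
# Hypothetical pre-crash.sh sketch. Guarded so it is a no-op outside
# the crash kernel (/proc/vmcore only exists in the dump-capture kernel).
FTP_SERVER="ftp.example.internal"   # assumed anonymous, write-only server
IFACE="eth0"                        # values below are baked in per server
STATIC_IP="10.0.0.42/24"
GATEWAY="10.0.0.1"

configure_network() {
    # Step 1: try DHCP; step 2: fall back to the baked-in static config.
    if ! dhclient -1 "$IFACE" 2>/dev/null; then
        ip link set "$IFACE" up
        ip addr add "$STATIC_IP" dev "$IFACE"
        ip route add default via "$GATEWAY"
    fi
}

send_dumps() {
    # Step 3: a stripped "fast" dump first (-d 31 drops zero, free and
    # cache pages), then step 4: a full dump (-d 0) for the rare cases
    # that need it. -c compresses the output.
    for level in 31 0; do
        makedumpfile -c -d "$level" /proc/vmcore "/tmp/vmcore-d$level"
        ncftpput -u anonymous "$FTP_SERVER" /incoming "/tmp/vmcore-d$level"
    done
}

if [ -s /proc/vmcore ]; then
    configure_network
    send_dumps
    reboot -f    # step 5: reboot once the dumps are off the box
fi
```

The fast/full split means the server can be returned to service as soon as the small stripped dump is safely off the box.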

How we roll this out system wide

We use Ansible extensively, and we have created a role which:

  • Configures kdump.
  • Drops the required scripts + binaries on the box.
  • Rebuilds the crashkernel image.
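In outline, the role's tasks look something like this; the module names are standard Ansible, while the file names and handler are hypothetical:

```yaml
# roles/kdump/tasks/main.yml (sketch)
- name: Configure kdump
  template:
    src: kdump.conf.j2
    dest: /etc/kdump.conf
  notify: rebuild crashkernel image   # handler restarts kdump.service

- name: Drop the pre-crash script on the box
  template:
    src: pre-crash.sh.j2
    dest: /usr/local/bin/pre-crash.sh
    mode: "0755"
  notify: rebuild crashkernel image
```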

The Ansible role is part of our base playbook, so it gets rolled out automatically to any new systems or images that we build.

Sounds great, but does it work in practice?

Yes, it works very well! Kernel crashes are thankfully rare, but we have successfully captured crashdumps from every kernel crash we have experienced since rolling this out. Previously, crashdump generation was hit-and-miss due to the local disk requirement, but now there is no such problem. The server where the crashdump files end up has a lot of disk space as well, so we can keep crashdumps around for a long time for analysis of common issues and trends.

If your current setup relies on local disk, and/or NFS or SSH based captures, we think that the FTP based solution we have developed is well worth a look.

A fun fact about reboot()

It is also possible to start the dump-capture kernel by calling the reboot() system call with the LINUX_REBOOT_CMD_KEXEC flag (reboot.c), which in turn calls kernel_kexec() to start the dump-capture kernel that was previously loaded.

The reboot() system call has to be called with two magic parameters set to the correct values before it will actually cause a reboot.

We see this in the function signature:

SYSCALL_DEFINE4(reboot, int, magic1, int, magic2, unsigned int, cmd, void __user *, arg)

The first parameter must be set to the value 4276215469, and the second parameter can be set to one of 672274793, 85072278, 369367448 or 537993216. These are all defined in reboot.h.

You may wonder what the significance of these numbers is. If we convert them to hexadecimal notation using some Python, we see:

>>> [hex(x) for x in [4276215469, 672274793, 85072278,
...                   369367448, 537993216]]
['0xfee1dead', '0x28121969', '0x5121996', '0x16041998', '0x20112000']
>>>

The first magic number is now obvious. The remaining numbers represent the birthday of Linus Torvalds and those of his three children. Whoever said we nerds can’t have fun? :-)

 
