AHL Tech Article 'Capturing System Core Dumps'.
November 2018
We recently published an article on process core dumping inside Docker containers. This article is a follow-up on the related topic of capturing system core dumps (also known as vmcores or crashdumps).
System core dumping
The goal of capturing system core dumps is the same as for process core dumps: they allow us to debug issues with our systems. In contrast to process core files, however, system core files are generated by the Linux kernel in response to the kernel itself crashing. They are invaluable in debugging kernel crashes, as they let you investigate what the system looked like at the moment of the crash.
Traditionally, system core dumps are written to local disk in a special partition, or to a remote NFS share. The potential maximum size of a system core is the amount of RAM and swap space in a server, so these files can get pretty big. As with process core dumps, we don’t want to maintain a large local partition for system core dumps, and we also don’t have NFS in our production environment, so what can we do to enable us to capture system cores?
As with process core files, we use FTP as the transport mechanism. FTP is very fast and easy to script; however, the native RedHat/CentOS crash utilities do not support FTP, so we’ve engineered a solution that does.
The use of FTP as a transport might sound somewhat controversial, but it has several good qualities:
- It is a very fast protocol, approaching theoretical speed limits (unlike e.g. SSH, which is significantly slower; this matters when system core files get large).
- It does not need any special keys / secrets to work for us. We have an anonymous write-only FTP server that we use to receive core dumps (inside our own network, of course).
- Unlike NFS, we do not need a lot of ports opened up in firewalls, and there is no dependency on RPC.
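As a concrete illustration, uploading a core with ncftpput (which we also bundle into our capture image, as described below) over anonymous FTP looks roughly like this; the host and directory names here are hypothetical:

```
# Illustrative only: push a vmcore to an anonymous, write-only FTP server.
# -u anonymous / -p '' selects anonymous login.
ncftpput -u anonymous -p '' ftp.example.internal /incoming vmcore-myhost
```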
You can find all files related to this article at our GitHub repo.
What happens when we crash?
To understand our approach to capturing system core dumps, it is first necessary to understand how system core generation works in RedHat/CentOS.
When Linux first boots, the boot loader will usually instruct the Linux kernel to reserve 128MB of RAM as crashkernel space. This is an area of memory that the kernel will not touch during its normal course of operation. E.g. on RedHat/CentOS 7.x, grub usually passes the crashkernel=auto parameter to the kernel (see /etc/grub2.cfg).
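To check the reservation on a running system, you can look for the crashkernel= parameter in /proc/cmdline. A minimal sketch of that check in Python (the helper name is ours, not part of any tooling mentioned here):

```python
# Minimal sketch: pull the crashkernel= reservation out of a kernel
# command line, as seen in /proc/cmdline on a running system.
def crashkernel_param(cmdline):
    """Return the value of the crashkernel= boot parameter, or None."""
    for token in cmdline.split():
        if token.startswith("crashkernel="):
            return token.split("=", 1)[1]
    return None

cmdline = "BOOT_IMAGE=/vmlinuz-3.10.0 root=/dev/sda1 ro crashkernel=auto rhgb quiet"
print(crashkernel_param(cmdline))  # prints: auto
```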
Systemd starts kdump.service on boot. The kdump.service calls /usr/bin/kdumpctl, which loads a ‘dump-capture’ kernel into the reserved memory space. It does this by calling the /usr/sbin/kexec command line tool (see the kexec(8) man page).
kexec in turn calls the kexec_load() system call with the KEXEC_ON_CRASH flag. In addition to loading the dump-capture kernel and dump-capture initrd into the reserved memory area, this also causes the kernel to automatically start the dump-capture kernel if the system crashes (see the kexec_load(2) man page).
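Concretely, the panic-kernel load performed by kdumpctl is roughly equivalent to a command along these lines (the paths follow CentOS 7 naming conventions; the exact append string varies by setup):

```
# Load a dump-capture kernel into the reserved crashkernel area.
# -p (--load-panic) marks it to be booted on a crash rather than now.
kexec -p /boot/vmlinuz-$(uname -r) \
      --initrd=/boot/initramfs-$(uname -r)kdump.img \
      --append="irqpoll nr_cpus=1 reset_devices"
```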
When the kernel panics (i.e. something calls the panic() function) it will print a stack trace and then call the crash_kexec() function, which boots the dump-capture kernel we loaded previously.
The dump-capture kernel will then start the /usr/sbin/kdump process, which actually creates the system core.
The dump-capture disk image contents
Like the real kernel, the dump-capture kernel has an associated image file with its initial file system. We can add files to this image, and they will then be available to us during dump capture.
The initrd image is built when the /usr/bin/kdumpctl script is called from kdump.service. It reads /etc/kdump.conf and creates an image file with the appropriate content (using mkdumprd and dracut), and stores it in the /boot file system.
Our dump-capture image is a vanilla image, which we extend. We do this by instructing kdump to add extra files into the image (e.g. ncftpput, our pre-crash script, etc.). We use Ansible to create a custom “pre-hook” script per server using a template, which allows us to propagate IP (and other) information to the crashkernel.
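For illustration, the relevant /etc/kdump.conf directives might look something like the fragment below. The paths and script name are our assumptions, not the exact AHL configuration:

```
# /etc/kdump.conf fragment (illustrative)
# Bundle extra tools into the dump-capture initrd:
extra_bins /usr/bin/ncftpput /usr/local/sbin/pre-crash.sh
# Run our script as an early hook, before the default capture logic:
kdump_pre /usr/local/sbin/pre-crash.sh
```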
Propagating state to the crashkernel
It is important to realise that the dump-capture kernel does not “know” anything about the state of the previously running system. However, as it has access to the whole memory of the server, it can run a program to dump this memory to the crashdump file. In fact, the crashdump is a memory image with some bits stripped out.
As a consequence of not inheriting state from the previously running OS, the network setup is lost, and even local filesystems have to be remounted if we want to write a crashdump locally.
In order to send files over FTP, we obviously have to have a working network configuration. If we only had one subnet, we could imagine configuring the crashkernel to run with a specific, well-defined IP address. Kernel crashes are quite rare, and we’d be unlucky to have two kernels crash at the same time, so the fact that both crashkernels used the same IP would be unlikely to cause problems in the real world.
Unfortunately, we have many different subnets across our estate, and we don’t want to dedicate a specific IP per subnet for ‘crashing’; it would also create extra work every time we changed the network configuration.
Luckily the kdump process inside the crashkernel has a set of hooks it runs at various stages during its operation. We start our own script using the first available hook, and our script then takes over from kdump (and we never return control back to it).
As we have control of the contents of the image, we can pre-populate the script with the IP address of the individual server, so that it has the requisite information to set up its IP configuration.
The pre-crash script
When the crash kernel starts, the kdump process inside it executes our pre-crash.sh script, which performs the following steps:
- Attempt to configure IP with DHCP.
- If the previous step fails, use a statically set IP / network route / hostname, as determined and configured when kdump built the unique disk image.
- Dump the stripped memory contents to our FTP server using anonymous FTP. We strip the image to make it smaller, in order to produce a “fast” crash dump. At this point, the server can be rebooted as we can get most of the required information out of the stripped crash dump.
- After the initial dump has completed, we produce a full dump. This is usually not required for kernel debugging, but it can be useful in some instances.
- Reboot the system.
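A heavily simplified sketch of such a pre-crash script is shown below. Every name, address and flag here is an illustrative assumption; the real script is generated per server by Ansible, as described above.

```shell
#!/bin/sh
# Illustrative sketch of a pre-crash hook script; not the actual
# AHL implementation.

# Static fallbacks templated in by Ansible at image-build time.
STATIC_IP="10.1.2.42/24"
STATIC_GW="10.1.2.1"
FTP_SERVER="coredumps.example.com"

configure_network() {
    # Try DHCP first; fall back to the statically templated address.
    if ! dhclient -1 eth0 2>/dev/null; then
        ip addr add "$STATIC_IP" dev eth0
        ip link set eth0 up
        ip route add default via "$STATIC_GW"
    fi
}

dump_and_send() {
    # makedumpfile -d 31 strips zero, cache, free and user pages,
    # giving the small "fast" dump that we upload first.
    makedumpfile -c -d 31 /proc/vmcore /tmp/vmcore-small
    ncftpput -u anonymous "$FTP_SERVER" /incoming /tmp/vmcore-small
}
```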
How we roll this out system-wide
We use Ansible extensively, and we have created a role which:
- Configures kdump.
- Drops the required scripts and binaries on the box.
- Rebuilds the crashkernel image.
The Ansible role is part of our base playbook, so it gets rolled out automatically to any new systems or images that we build.
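Sketched as Ansible tasks, the role might contain steps along these lines; the task names, paths and handler are illustrative, not the actual role:

```
# Illustrative Ansible tasks (not the actual AHL role)
- name: Template the per-server pre-crash script into place
  template:
    src: pre-crash.sh.j2
    dest: /usr/local/sbin/pre-crash.sh
    mode: "0755"
  notify: rebuild crashkernel image

- name: Ensure kdump is enabled and running
  service:
    name: kdump
    state: started
    enabled: yes
```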
Sounds great, but does it work in practice?
Yes, it works very well! Kernel crashes are thankfully rare, but we have successfully captured crashdumps from every kernel crash we have experienced since rolling this out. Previously, crashdump generation was hit-and-miss due to the local disk requirement, but now there is no such problem. The server where the crashdump files end up has a lot of disk space as well, so we can keep crashdumps around for a long time for analysis of common issues and trends.
If your current setup relies on local disk, and/or NFS or SSH based captures, we think that the FTP based solution we have developed is well worth a look.
A fun fact about reboot()
It is also possible to start the dump-capture kernel by calling the reboot() system call with the LINUX_REBOOT_CMD_KEXEC flag (reboot.c), which in turn calls kernel_kexec() to start the dump-capture kernel that was previously loaded.
The reboot() system call has to be called with two magic parameters set to the correct values before it will actually cause a reboot.
We see this in the function signature:
SYSCALL_DEFINE4(reboot, int, magic1, int, magic2, unsigned int, cmd, void __user *, arg)
The first parameter must be set to the value 4276215469, and the second parameter can be set to either 672274793, 85072278, 369367448 or 537993216. These are all defined in reboot.h.
You may wonder what the significance of these numbers is; if we convert them to hexadecimal notation using some Python, we see:
>>> [hex(x) for x in [4276215469, 672274793, 85072278, 369367448, 537993216]]
['0xfee1dead', '0x28121969', '0x5121996', '0x16041998', '0x20112000']
>>>
The first magic number is now obvious. The remaining numbers represent the birthday of Linus Torvalds and those of his three children. Whoever said us nerds can’t have fun? :-)