Kernel panic on a Debian NAS: recovering a truncated libc without rebooting into the void

15 min read · Updated on May 24, 2026

Who this is for

You administer a Debian, Ubuntu, Proxmox, or any other APT-based system in personal production (NAS, homelab, VPS). You want to understand why a simple unattended-upgrade can brick a machine, and how to fix it without reinstalling.

Required level: comfortable with the shell, chroot, dpkg, debugfs.

One morning, the NAS is dead

Morning routine. I check my monitoring dashboard. A red widget. My NAS isn't reachable. OK. I try ping, nothing. SSH, nothing. I walk to the room where it lives, plug in a keyboard and a monitor. And then, GRUB throws me a line I didn't want to see:

Boot Option Restored

Followed by a lonely grub>. Charming. I type exit. The normal GRUB menu comes back. But with a single kernel entry. No old kernel. No recovery mode. I select. And then:

Kernel panic - not syncing: Attempted to kill init! exitcode=0x00007f00

Followed by a beautiful stack trace:

panic → do_exit.cold → do_group_exit → __x64_sys_exit_group
     → do_syscall_64 → handle_mm_fault → do_user_addr_fault

PID 1 is dead. Systemd dies on the very first userspace exec, and the kernel has no choice: it panics. Same in single-user mode (which I don't even have in the menu, I'll come back to that). Same after manually stripping the hardening options in the GRUB editor. The problem isn't in the kernel line. It's right after.
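The exitcode field in that panic line is worth decoding before anything else. It is a raw wait status (low 7 bits: terminating signal, bits 8 to 15: exit status), so a few lines of shell can tell a death by signal from a plain exit. A minimal sketch; the helper name decode_exitcode is mine, not a standard tool:

```shell
# Decode the "exitcode=0x..." value from "Attempted to kill init!".
# Assumption: the value uses the usual wait-status layout
# (bits 0-6 = signal number, bits 8-15 = exit status).
decode_exitcode() {
    status=$(( $1 ))
    sig=$(( status & 0x7f ))
    code=$(( (status >> 8) & 0xff ))
    if [ "$sig" -ne 0 ]; then
        echo "killed by signal $sig"
    else
        echo "exited with status $code"
    fi
}

decode_exitcode 0x00007f00   # the value from my panic
decode_exitcode 0x0000000b   # what a SIGSEGV (signal 11) would look like
```

Read that way, 0x00007f00 is an exit with status 127 rather than a fatal signal. Status 127 is what the glibc dynamic loader returns when it cannot load a shared library, which is consistent with a userspace problem right after the kernel hands over to init.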

OK. Coffee. We're going to need it.

The context (because I went looking for it)

My NAS is a Terramaster F6-424 on which I replaced the stock TOS with a Debian 13 (Trixie) installed on the 4 GB internal USB key. The 6 bays host (among other things) old Synology disks that still contain RAID5 + LVM + Btrfs structures (vg1-volume_1) mounted on /mnt/data.

The system is hardened along a CIS / ANSSI baseline:

  • AppArmor in enforce mode
  • auditd active, sudo-io logs, PCP
  • Kernel options: oops=panic slab_nomerge init_on_alloc=1 init_on_free=1 page_alloc.shuffle=1 randomize_kstack_offset=on pti=on vsyscall=none debugfs=off
  • GRUB_DISABLE_RECOVERY="true" - no recovery mode in the menu

I'm writing this now as if it were neutral. Spoiler: almost every one of those choices is going to backfire on me in the hours that follow.

Diagnosis: live USB boot and debugfs

I burn a Debian 13.5 XFCE on another stick. Boot into it.

The disk inventory (lsblk -f) confirms the system is on /dev/sdf (USB Flash 3.75 GiB):

  • sdf1: 243 MB FAT32 → /boot/efi
  • sdf2: 3.5 GB ext4 → /

First reflex: fsck -n -f /dev/sdf2. Clean ext4 structure. No filesystem errors. So the corruption isn't in the structure. It's in the contents of files.

Out comes debugfs, which lets you read an ext4 FS without mounting it (so strictly read-only):

debugfs -R 'cat /etc/debian_version' /dev/sdf2
debugfs -R 'ls -l /boot' /dev/sdf2
debugfs -R 'stat /lib/systemd/systemd' /dev/sdf2
debugfs -R 'cat /var/log/apt/history.log' /dev/sdf2

Surprise in /var/log/apt/: history.log and term.log are 0 bytes. Truncated. The archived .gz contains the previous history. That's where we start seeing the timeline that smells bad.

I finally mount the partition to be able to chroot:

mkdir -p /mnt/sysroot
mount /dev/sdf2 /mnt/sysroot
mount /dev/sdf1 /mnt/sysroot/boot/efi
chroot /mnt/sysroot /lib/systemd/systemd

Reply:

error while loading shared libraries:
/lib/x86_64-linux-gnu/libc.so.6: invalid ELF header

And then, file:

$ file /mnt/sysroot/usr/lib/x86_64-linux-gnu/libc.so.6
ISO-8859 text, with very long lines

libc isn't an ELF anymore. It still weighs 1,999,312 bytes, which is roughly the right size. But its blocks are random text. The kernel loads fine and execs init (= systemd); the dynamic loader tries to map libc and fails, init dies, panic.
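When one library turns out to be text, the obvious next question is whether it is the only one. A minimal sketch that checks the 4-byte ELF magic (0x7f 'E' 'L' 'F') on every real shared object in a directory; is_elf and scan_libs are hypothetical helper names:

```shell
# Return 0 if the file starts with the ELF magic bytes 7f 45 4c 46.
is_elf() {
    magic=$(od -An -tx1 -N4 "$1" 2>/dev/null | tr -d ' \n')
    [ "$magic" = "7f454c46" ]
}

# Print every regular *.so* file in directory $1 that is not an ELF.
# Symlinks are skipped: only their targets need checking.
scan_libs() {
    for f in "$1"/*.so*; do
        [ -f "$f" ] && [ ! -L "$f" ] || continue
        is_elf "$f" || echo "NOT ELF: $f"
    done
}

# Usage from the live environment:
#   scan_libs /mnt/sysroot/usr/lib/x86_64-linux-gnu
```

A few legitimate .so files are plain-text linker scripts (libc.so from the development packages, for example), so a hit is a lead to verify with file, not proof of corruption.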

At that moment, I just want to go on vacation.

The reconstructed timeline

history.log.1.gz tells the whole story:

  • February 21, 12:45: first saturation during an apt purge. Error: mandb: cannot write to /var/cache/man/...: No space left on device.
  • February 27, 06:45: unattended-upgrade upgrades libnss3, which fires the libc-bin triggers.
  • February 28, 23:45: history.log and term.log truncated to zero.

Conclusion: the second saturation happened during the libc-bin trigger. dpkg partially wrote libc.so.6. The file keeps its apparent size but some blocks are unwritten or overwritten by something else. Invalid ELF.
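Rebuilding that timeline by eye works for one log, less so for a year of archives. A rough sketch that condenses APT history records into one line per transaction and flags the suspicious ones. It assumes the usual history.log layout (Start-Date, Commandline, optional Error, End-Date); the function name apt_timeline and the bracketed labels are mine:

```shell
# Read APT history records on stdin, print one line per transaction.
# A record with an "Error:" line or without "End-Date:" is flagged.
apt_timeline() {
    awk '
        function flush() {
            printf "%s  %s%s%s\n", start, cmd, err, (fin ? "" : "  [no End-Date]")
            start = ""
        }
        /^Start-Date:/  { if (start != "") flush()
                          start = $2 " " $3; cmd = ""; err = ""; fin = 0 }
        /^Commandline:/ { sub(/^Commandline: /, ""); cmd = $0 }
        /^Error:/       { err = "  [dpkg error]" }
        /^End-Date:/    { fin = 1; flush() }
        END             { if (start != "") flush() }
    '
}

# Usage on the mounted target (zcat -f also passes plain files through):
#   zcat -f /mnt/sysroot/var/log/apt/history.log* | apt_timeline
```

A transaction with no End-Date is one that never finished, which is the kind of entry to look at first after a disk-full event.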

Why the crash was inevitable

I'd accumulated six decisions, each reasonable in isolation and catastrophic in combination:

  1. 3.5 GB system disk. A 4 GB USB stick for an OS that downloads several hundred MB of updates a month.
  2. unattended-upgrade was running unattended, no mail alert, no pre-transaction disk check.
  3. GRUB_DISABLE_RECOVERY="true": no recovery mode in the menu. Can't boot single-user to fix things.
  4. Single kernel installed (apt autoremove reflex). No fallback.
  5. oops=panic: any kernel oops = immediate full panic.
  6. Aggressive hardening: reduced failure tolerance. Logical: you optimize for the attacker, not for the operator panicking at 7 AM.

None of these choices is bad individually. Combined, they create a system that doesn't know how to catch itself.

The resolution (the chicken-and-egg trap)

The whole difficulty fits in one sentence: apt and dpkg depend on libc. As long as libc is broken, I can't chroot and run apt-get install --reinstall. Every dynamically linked binary fails at load.

The solution is to extract the .deb by hand from the live USB environment (which has its own intact libc).

1. Prepare the bind mounts

R=/mnt/sysroot
mount --bind /dev      "$R/dev"
mount --bind /dev/pts  "$R/dev/pts"
mount -t proc proc     "$R/proc"
mount -t sysfs sys     "$R/sys"
mount --bind /run      "$R/run"
cp -L /etc/resolv.conf "$R/etc/resolv.conf"

2. Make some room

On a hardened system, many large logs can go without functional risk:

R=/mnt/sysroot
find "$R/var/cache/apt/archives" -name '*.deb' -delete
rm -rf "$R/var/cache/swcatalog/"*
rm -rf "$R/var/log/installer"
rm -f  "$R/var/log/"*.gz  "$R/var/log/"*.old
rm -rf "$R/var/log/journal/"*
find "$R/var/log/sudo-io" -mindepth 1 -delete
find "$R/var/log/pcp" -type f -delete
rm -f  "$R/var/log/audit/audit.log."*
truncate -s 0 "$R/var/log/audit/audit.log"

Goal: ≥ 500 MB free so the reinstalls can write without re-saturating.
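If that list doesn't get you to the target, it helps to see where the space actually went before deleting anything else. A small sketch, staying on one filesystem so the bind mounts aren't counted; biggest_dirs is a hypothetical helper name:

```shell
# Print the N largest directories (sizes in KiB) under $1, largest first.
# -x keeps du on a single filesystem, so /dev, /proc, /sys and /run
# bind-mounted under the target are not walked.
biggest_dirs() {
    du -xk -- "$1" 2>/dev/null | sort -rn | head -n "${2:-15}"
}

# Usage from the live environment:
#   biggest_dirs /mnt/sysroot/var 20
#   df -h /mnt/sysroot        # confirm the headroom afterwards
```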

3. Download the right .deb in the live environment, not in the chroot

cd /tmp
apt-get update
apt-get download libc6 libc-bin libc-l10n

4. Back up the corrupted libc (useful for forensic analysis)

cp /mnt/sysroot/usr/lib/x86_64-linux-gnu/libc.so.6 \
   /tmp/libc.so.6.corrupted.bak

5. Extract the .deb directly onto the target

This is the key step. dpkg-deb -x just overwrites the files, without touching the dpkg database:

dpkg-deb -x /tmp/libc6_*.deb     /mnt/sysroot
dpkg-deb -x /tmp/libc-bin_*.deb  /mnt/sysroot
dpkg-deb -x /tmp/libc-l10n_*.deb /mnt/sysroot

Check that libc.so.6 is executable again:

$ /mnt/sysroot/usr/lib/x86_64-linux-gnu/libc.so.6
GNU C Library (Debian GLIBC 2.41-12+deb13u3) stable release version 2.41.

Boom. It's alive. First real smile of the morning.
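One library answering is encouraging, but libc6 ships many more files than libc.so.6. Each .deb carries an md5sums control file with paths relative to the root, so the whole extraction can be checked from the live environment. A sketch under that assumption (GNU md5sum; verify_against_md5sums and the /tmp/libc6-control directory are names I made up):

```shell
# Check every file listed in an md5sums file against a target root.
# $1 = target root, $2 = md5sums file whose paths are relative to that root.
# Prints only the mismatches and returns non-zero if there is any.
verify_against_md5sums() {
    root=$1
    case $2 in
        /*) sums=$2 ;;
        *)  sums=$PWD/$2 ;;
    esac
    ( cd "$root" && md5sum -c --quiet "$sums" )
}

# Usage (the glob must match exactly one archive):
#   dpkg-deb -e /tmp/libc6_*.deb /tmp/libc6-control
#   verify_against_md5sums /mnt/sysroot /tmp/libc6-control/md5sums
```

Silence means every file written by step 5 matches what the package says it should contain.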

6. Chroot, and reinstall cleanly via apt

chroot "$R" /bin/bash
/bin/true && echo "userspace OK"

apt-get update
DEBIAN_FRONTEND=noninteractive apt-get install --reinstall -y \
    libc6 libc-bin libc-l10n

apt sees the files are already there (step 5 extraction), but this time it records the version in the dpkg database and runs the postinst scripts. Consistent state recovered.

7. Full audit via debsums

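debsums checks every installed file against the MD5 checksums shipped in its .deb; -a includes conffiles and -c prints only what changed. dpkg --verify offers a comparable check from the dpkg database without installing anything extra. Here I went with debsums: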

apt-get install -y debsums
debsums -ac 2>&1 | tee /tmp/debsums.out

Interpretation:

  • Files in /etc/* modified → normal, these are conffiles modified by the admin (hardening, custom configs). Don't restore.
  • Files in /usr/lib/modules/* failing or missing → critical, reinstall the linux-image-X package.
  • Files in /usr/bin, /usr/lib, /usr/sbin failing → critical, reinstall the owning package.
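Those three rules are mechanical enough to script. A rough sketch that sorts /tmp/debsums.out into the buckets above. It assumes changed files appear as bare paths and missing files as "debsums: missing file PATH (from PKG package)"; triage_debsums and the bucket labels are hypothetical:

```shell
# Classify debsums output lines read on stdin.
triage_debsums() {
    while IFS= read -r line; do
        # Strip the wrapper debsums puts around missing files, if present.
        path=${line#"debsums: missing file "}
        path=${path%%" (from "*}
        case "$path" in
            /etc/*)
                echo "conffile  $path" ;;
            /usr/lib/modules/*|/lib/modules/*)
                echo "kernel    $path" ;;
            /usr/bin/*|/usr/sbin/*|/usr/lib/*|/bin/*|/sbin/*|/lib/*)
                echo "critical  $path" ;;
            *)
                echo "other     $path" ;;
        esac
    done
}

# Usage inside the chroot:
#   triage_debsums < /tmp/debsums.out | sort
#   dpkg -S /path/flagged/as/critical    # find the owning package
```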

In my case: 7 corrupted kernel modules + btrfs.ko.xz MISSING. The saturation had also hit the kernel modules. If I'd rebooted thinking I was done with libc, the system would have panicked again on the Btrfs mount. Fix:

apt-get install --reinstall -y linux-image-6.12.73+deb13-amd64

This also regenerates the initramfs automatically.

8. Regenerate GRUB and initramfs (safety net)

update-initramfs -u -k all
update-grub

No need for grub-install if the EFI partition is intact: the /EFI/BOOT/BOOTX64.EFI fallback path and shim's fbx64.efi fallback loader recreate the NVRAM entry on next boot.

9. Clean unmount and reboot

exit
sync
umount -R /mnt/sysroot/sys /mnt/sysroot/proc /mnt/sysroot/dev
umount /mnt/sysroot/run /mnt/sysroot/boot/efi /mnt/sysroot
fsck.ext4 -f /dev/sdf2
fsck.vfat -a /dev/sdf1
reboot

The NAS reboots. Login. uname -r. Everything is there. I've won. Or rather, I've survived.

The plot twist (May 2026)

Three months later. End of day. I check my dashboard for a NAS health follow-up. First SSH check:

/dev/sde2  3.4G  3.1G  107M  97% /    ← root at 97 %
load average: 8.54

Disk at 97%, load at 8. And a quick ps shows what's running:

PID 4595   root  /usr/bin/python3 /usr/bin/unattended-upgrade
PID 17266  root  /usr/bin/dpkg --unpack ... /tmp/apt-dpkg-install-dv8QKx
PID 343    md2_raid5     38.9% CPU
PID 820    md2_resync    16.7% CPU

unattended-upgrade is in progress. dpkg is unpacking. The RAID resync is hammering I/O. Many processes in D state (uninterruptible). This is exactly the February scenario replaying.

Thirty seconds later, SSH refuses new connections. A minute after, no TCP port is reachable. ping still works, so the kernel is alive. But userspace is frozen. I see on the console (photo taken by my colleague next to the NAS):

EXT4-fs (sde2): failed to convert unwritten extents to written extents
                -- potential data loss! (inode XXXX, error -5)
EXT4-fs error (device sde2) in ext4_reserve_inode_write:5980: IO failure
[FAILED] Failed unmounting boot-efi.mount
[FAILED] Failed to start docker.service (×6)

error -5 = EIO. Hardware I/O error. It wasn't just the disk filling up. It was the USB key starting to rot. The February incident wasn't caused by the full disk. It was caused by flash fatigue. The full disk had just been the last straw.

I'd spent three months blaming unattended-upgrade. The real culprit was hiding behind.
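With no usable SMART on a USB key, the kernel log is about the only place where this kind of evidence shows up. A small sketch of the filter I could have run months earlier; the pattern list is my own guess at useful markers, not an exhaustive one, and io_error_lines is a hypothetical name:

```shell
# Keep only kernel log lines that hint at storage I/O trouble.
# Returns non-zero when nothing matches.
io_error_lines() {
    grep -E 'I/O error|IO failure|error -5|EXT4-fs error'
}

# Usage:
#   dmesg | io_error_lines
#   journalctl -k -b -1 --no-pager | io_error_lines   # previous boot, needs a persistent journal
```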

No choice: hard power-off at the button. Massive risk of reproducing the libc corruption. Reboot. And then, miracle:

$ file /usr/lib/x86_64-linux-gnu/libc.so.6
ELF 64-bit LSB shared object [...]
$ /usr/lib/x86_64-linux-gnu/libc.so.6
GNU C Library (Debian GLIBC 2.41-12+deb13u3) stable release version 2.41.

libc OK. FS clean. The dpkg transaction running at the moment of the crash: gpg / gpg-agent / dirmngr / gnupg-utils. Not a critical package. Massive luck. If it had been libc or systemd, we were back to the live USB.

2026 best practices I applied right after

Faced with the evidence that I'd dodged the bullet twice in three months, I applied a checklist immediately, without waiting for the hardware migration. All these measures are free, take ten minutes, and reproduce what every serious sysadmin used to do before unattended-upgrades got so transparent that we forgot about it.

1. Disable unattended-upgrades (the trigger)

sudo systemctl stop unattended-upgrades.service
sudo systemctl disable unattended-upgrades.service
sudo systemctl mask apt-daily.timer apt-daily-upgrade.timer

On a fragile system (small disk, RAID resyncing, aggressive hardening), unattended-upgrades is a risk, not a protection. I'd rather do updates by hand, on weekends, while watching. It's my machine, I know when it has headroom.

2. apt-mark hold on critical packages

sudo apt-mark hold linux-image-6.12.73+deb13-amd64 linux-image-amd64 \
    libc6 libc-bin libc-l10n systemd systemd-sysv grub-efi-amd64

Even if I run apt upgrade one day, these packages won't move until I explicitly do apt-mark unhold. For a kernel, that means I test the new version on another machine first, keep the old one as fallback, then migrate.

3. APT pre-invoke disk space check

The immediate trigger, both times, was a nearly full disk. Here's a hook that refuses any transaction if the headroom isn't there:

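#!/bin/bash
# /usr/local/bin/apt-disk-check.sh (chmod 755)
# Refuse an APT or dpkg transaction when free space is below the thresholds.
set -e
MIN_ROOT_MB=300
MIN_VAR_MB=300
MIN_CACHE_MB=500
PHASE="${1:-dpkg}"

# Free space in MB on the filesystem holding $1 (empty if df fails).
free_mb() { df -BM --output=avail "$1" 2>/dev/null | tail -1 | tr -d ' M'; }

if [ "$PHASE" = "update" ]; then
  c=$(free_mb /var/cache/apt)
  [ -n "$c" ] && [ "$c" -lt "$MIN_CACHE_MB" ] && {
    echo "REFUSE APT update: ${c}M free on /var/cache/apt" >&2; exit 1; }
else
  r=$(free_mb /); v=$(free_mb /var)
  [ -n "$r" ] && [ "$r" -lt "$MIN_ROOT_MB" ] && {
    echo "REFUSE dpkg: ${r}M free on /" >&2; exit 1; }
  [ -n "$v" ] && [ "$v" -lt "$MIN_VAR_MB" ] && {
    echo "REFUSE dpkg: ${v}M free on /var" >&2; exit 1; }
fi
# Without this, the script would return the status of the last failed test
# (non-zero when space is fine) and APT would abort every transaction.
exit 0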

And the matching APT config:

// /etc/apt/apt.conf.d/99-disk-space-check
APT::Update::Pre-Invoke   { "/usr/local/bin/apt-disk-check.sh update"; };
DPkg::Pre-Invoke          { "/usr/local/bin/apt-disk-check.sh dpkg"; };

Side note: I first tried inlining the awk check directly into the .conf file with nested \". The APT parser doesn't like that. Better an external script: testable, debuggable, and you don't embed exotic syntax in a config file.

4. Disk space alert < 25% free

A cron every 15 minutes:

#!/bin/bash
# /usr/local/bin/disk-alert.sh (chmod 755)
# Log (and mail, if an MTA is available) every filesystem above THRESHOLD % used.
THRESHOLD="${THRESHOLD:-75}"
MAIL_TO="${MAIL_TO:-root}"
HOST=$(hostname)
df --output=source,pcent,target -x tmpfs -x devtmpfs -x squashfs -x overlay -x ecryptfs \
  | tail -n +2 \
  | while read -r fs use mnt; do
      pct="${use%\%}"
      [ -z "$pct" ] && continue
      # Skip anything that is not a plain number (e.g. "-").
      case "$pct" in (*[!0-9]*) continue;; esac
      if [ "$pct" -ge "$THRESHOLD" ]; then
        msg="[disk-alert] ${HOST}: ${fs} at ${pct}% (mount ${mnt})"
        logger -t disk-alert -p user.warning "$msg"
        command -v mail >/dev/null 2>&1 && echo "$msg" | mail -s "$msg" "$MAIL_TO" || true
      fi
    done

And the cron entry:

# /etc/cron.d/disk-alert
*/15 * * * * root THRESHOLD=75 MAIL_TO=admin@example.com /usr/local/bin/disk-alert.sh

Always push at least locally via logger. That way, even without an MTA configured, the info ends up in journald and can be checked later. Mail is a luxury.

5. Re-enable recovery mode in GRUB

sudo sed -i 's/^GRUB_DISABLE_RECOVERY=.*/GRUB_DISABLE_RECOVERY="false"/' /etc/default/grub
sudo update-grub

Recovery mode gives you root single-user shell access without starting any service. When the system won't boot normally but the root filesystem and its core binaries are still usable, that's the difference between 5 minutes of repair and an evening with a live USB.

6. oops=panic → panic=30

On a critical secured server (datacenter, banking, kernel exploitation context), oops=panic turns any kernel oops into a full panic, and a reboot once the panic= timeout expires, to limit a potential compromise. On a personal NAS, that's counterproductive paranoia: every survivable oops (capricious USB driver, a fault in a non-essential module) becomes a full panic.

# Edit /etc/default/grub:
# BEFORE: ...debugfs=off oops=panic audit_backlog_limit=8192 panic=10
# AFTER : ...debugfs=off audit_backlog_limit=8192 panic=30
sudo update-grub

panic=30 keeps the essential benefit: auto-reboot after a real panic, leaving 30 seconds to see context on the console.
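After the next reboot, it's worth confirming that the running kernel really got the new command line rather than trusting /etc/default/grub. A minimal sketch; cmdline_has is a hypothetical helper that matches whole tokens only:

```shell
# Return 0 if the exact token $1 (e.g. "panic=30") is on the kernel command line.
# $2 is optional: a command-line string to test instead of the running kernel's.
cmdline_has() {
    line=${2-$(cat /proc/cmdline)}
    for tok in $line; do
        [ "$tok" = "$1" ] && return 0
    done
    return 1
}

# Usage after reboot:
#   cmdline_has panic=30   && echo "panic=30 active"
#   cmdline_has oops=panic || echo "oops=panic is gone"
```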

7. Protect kernels against apt autoremove

// /etc/apt/apt.conf.d/01autoremove-kernels
APT::NeverAutoRemove {
  "^linux-image-.*";
  "^linux-headers-.*";
  "^linux-modules-.*";
};

apt autoremove is still useful for cleaning up orphan packages, but it'll never be able to remove a kernel. Even the one you don't use anymore. You remove those by hand when you know what you're doing.
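A simulated run is a cheap way to check that the rule took effect. The sketch below reads the output of apt-get -s autoremove, where each planned removal appears as a line starting with "Remv", and prints any kernel package still on the list; kernels_in_autoremove is a name I made up:

```shell
# Read "apt-get -s autoremove" output on stdin and print the kernel
# packages it would remove. No output means the kernels are protected.
kernels_in_autoremove() {
    awk '$1 == "Remv" && $2 ~ /^linux-(image|headers|modules)-/ { print $2 }'
}

# Usage (simulation only, nothing is removed):
#   apt-get -s autoremove | kernels_in_autoremove
```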

What's left to do (and was obvious from the start)

All the measures above are band-aids. The real fix is to ditch the USB key. A 4 GB USB key as system disk is playing with fire: wearing flash, no usable SMART, cheap controller, zero margin for updates.

  • Migrate to an M.2 or SATA SSD ≥ 64 GB (Samsung 870 EVO 250 GB at €30 does the job, endurance ×100). The F6-424 has a dedicated M.2 slot, it'd be a crime not to use it.
  • Before migration: full dd of the current USB key (dd if=/dev/sdX of=/mnt/data/backup/nas-usb-$(date +%F).img bs=64M conv=fsync + gzip --best). Guaranteed restore point.
  • Btrfs Snapper snapshots on /mnt/data: free in space thanks to CoW, and btrfs send --proto 3 (new in Debian 13) speeds up incremental send ×3.
  • Real 3-2-1 backup: /mnt/data is a RAID, not a backup. Borgmatic + offsite (rsync.net, Hetzner Storage Box, Backblaze B2). Monthly verification borg check --verify-data.
  • SMART monitoring on the HDDs (smartd + mails). A RAID5 with a disk dying during resync = total loss. That's why RAID6 / mirror+stripe are preferable on disks over 5 years old.
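For the pre-migration image, a restore point is only guaranteed if the copy matches the source. A minimal sketch that wraps the dd from the list above with a checksum comparison (GNU dd and sha256sum assumed; image_and_verify is a hypothetical helper). Run it from the live USB, or at least with the key unmounted, otherwise the source changes between the copy and the check:

```shell
# Copy $1 (device or file) to $2, then compare SHA-256 of both.
# Returns 0 only if the image is byte-identical to the source.
image_and_verify() {
    src=$1
    dest=$2
    dd if="$src" of="$dest" bs=64M conv=fsync status=none || return 1
    a=$(sha256sum < "$src"  | cut -d' ' -f1)
    b=$(sha256sum < "$dest" | cut -d' ' -f1)
    [ "$a" = "$b" ]
}

# Usage (replace sdX with the real device):
#   image_and_verify /dev/sdX "/mnt/data/backup/nas-usb-$(date +%F).img" \
#     && gzip --best "/mnt/data/backup/nas-usb-$(date +%F).img"
```

The second read of the source doubles the load on an already tired key, which seems a fair price for knowing the image is usable.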

What I'm taking away

Corrupted libc.so.6 = immediate kernel panic with no clear signature in the panic message. The stack trace shows page-fault frames and PID 1 dying, but doesn't say WHICH library is at fault. You have to inspect by hand.

CIS hardening protects against attackers, but reduces failure tolerance. On a personal NAS, a softer compromise (recovery mode enabled, older kernels kept, oops=panic removed) is probably wiser than the strict Level 2 baseline.

APT hooks to pre-check disk space are underused even though they're trivial to set up. If you only remember one thing from this article, set one up tonight.

debsums is the reference tool for post-incident audit on Debian. I now install it by default on every server; it's my new reflex.

A 4 GB USB key as system disk isn't a choice, it's a gamble. The cost/risk ratio of a real SSD is unbeatable.

And above all: what saved me in May was having a written memory of the February incident. When the scenario replayed (same late hour, same fatigue, same panic tunnel), I didn't have to reinvent the diagnosis. Document your incidents. Turn them into blog posts. Future-you will thank you when the NAS crashes again.
