Kernel Hardening

Capsem builds a custom Linux kernel from allnoconfig: every option starts disabled, and only what the VM needs is enabled. The result is a ~5 MB kernel with no loadable modules, no debugfs, no IPv6, and the compile-time exploit mitigations described on this page.

An unhardened guest kernel gives a malicious agent multiple escalation paths:

| Vector | Risk without hardening |
| --- | --- |
| Loadable modules | Agent loads a .ko to hijack kernel functions |
| /dev/mem, /dev/port | Direct physical memory and I/O port access from userspace |
| debugfs | Kernel internals exposed to guest processes |
| BPF, io_uring | High-CVE-count subsystems reachable via syscall |
| 32-bit compat syscalls | Legacy ABI with known exploitation primitives |
| /proc/kallsyms | Kernel symbol addresses defeat KASLR |

Capsem eliminates all of these at compile time.

graph TB
    subgraph "Kernel Hardening Stack"
        A["Minimal config (allnoconfig base)"] --> B["Disabled subsystems"]
        B --> C["Memory mitigations"]
        C --> D["Architecture-specific hardening"]
        D --> E["Boot cmdline params"]
        E --> F["Runtime validation (capsem-doctor)"]
    end

Every disabled subsystem removes code from the kernel binary. No runtime flag can re-enable it.

| Subsystem | Config | Why disabled |
| --- | --- | --- |
| Loadable modules | MODULES=n | Prevents loading .ko files; even root cannot extend the kernel |
| /dev/mem | DEVMEM=n | Blocks direct physical memory access from userspace |
| /dev/port | DEVPORT=n | Blocks I/O port access |
| debugfs | DEBUG_FS=n | Kernel debug info leak vector |
| Kernel symbols | KALLSYMS=n | Hides kernel addresses, preserves KASLR effectiveness |
| io_uring | IO_URING=n | High CVE count; unnecessary in a sandboxed VM |
| BPF syscall | BPF_SYSCALL=n | Exploitation vector for privilege escalation |
| userfaultfd | USERFAULTFD=n | Used in race condition exploits |
| 32-bit compat | COMPAT=n / IA32_EMULATION=n | Eliminates entire legacy syscall attack surface |
| kexec | KEXEC=n, KEXEC_FILE=n | No kernel hot-swap |
| Hibernation | HIBERNATION=n | No suspend-to-disk (memory dump vector) |
| Magic SysRq | MAGIC_SYSRQ=n | No emergency keyboard commands |
| IPv6 | IPV6=n | Unnecessary in air-gapped VM; reduces IP stack surface |
| Multicast | IP_MULTICAST=n | No multicast traffic |
| nftables | NF_TABLES=n | Use iptables-legacy only (simpler, smaller) |
| USB | USB_SUPPORT=n | No USB devices in VM |
| Sound | SOUND=n | No audio hardware |
| DRM/GPU | DRM=n | No graphics hardware |
| WiFi/Bluetooth | WLAN=n, WIRELESS=n, BT=n | No wireless hardware |
| Keyboard/Mouse | INPUT_KEYBOARD=n, INPUT_MOUSE=n | No HID devices |
| NFS | NFS_FS=n, NETWORK_FILESYSTEMS=n | No remote filesystems |
| SCSI/ATA | SCSI=n, ATA=n | VirtIO only; no legacy block drivers |
| Ethernet | ETHERNET=n, NET_VENDOR_VIRTIO=n | Air-gapped; only dummy NIC |

Memory mitigations enabled at compile time:

| Mitigation | Config | Effect |
| --- | --- | --- |
| Heap zeroing | INIT_ON_ALLOC_DEFAULT_ON=y | Page and slab allocations are zeroed on allocation; prevents leaks of stale data |
| Slab freelist randomization | SLAB_FREELIST_RANDOM=y | Randomizes the order of free objects in new slabs; makes heap layout harder to predict for heap spraying |
| Slab freelist hardening | SLAB_FREELIST_HARDENED=y | Validates freelist metadata; detects heap corruption |
| Page allocator shuffle | SHUFFLE_PAGE_ALLOCATOR=y | Randomizes page allocation order |
| Hardened usercopy | HARDENED_USERCOPY=y | Validates copy_to_user/copy_from_user bounds |
| Strict kernel RWX | STRICT_KERNEL_RWX=y | Enforces W^X on kernel memory pages |
| Virtual mapped stacks | VMAP_STACK=y | Kernel stacks live in vmalloc space with guard pages; an overflow faults instead of silently corrupting memory |
| KASLR | RANDOMIZE_BASE=y | Randomizes kernel load address |
| Stack protector | STACKPROTECTOR=y, STACKPROTECTOR_STRONG=y | Stack canaries on functions with local arrays or address-taken local variables |
| FORTIFY_SOURCE | FORTIFY_SOURCE=y | Compile-time and run-time buffer overflow detection in common string and memory functions |
| dmesg restriction | SECURITY_DMESG_RESTRICT=y | Only privileged users (CAP_SYSLOG) can read the kernel log |
| Heap ASLR | COMPAT_BRK=n | Enables full heap randomization |
| Seccomp | SECCOMP=y, SECCOMP_FILTER=y | Userspace syscall filtering (defense in depth) |
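
The two option lists above lend themselves to an automated audit of the generated .config. The following is a minimal sketch, not Capsem's actual tooling: the names `MUST_BE_OFF`, `MUST_BE_ON`, `parse_kconfig`, and `audit` are hypothetical, and the lists are illustrative subsets of the tables. It assumes the standard .config syntax, where a disabled option appears as `# CONFIG_FOO is not set` or is missing altogether.

```python
import re

# Illustrative subsets of the tables above (hypothetical names, not Capsem code).
MUST_BE_OFF = ("MODULES", "DEVMEM", "DEVPORT", "DEBUG_FS", "KALLSYMS",
               "IO_URING", "BPF_SYSCALL", "USERFAULTFD", "COMPAT_BRK")
MUST_BE_ON = ("STRICT_KERNEL_RWX", "VMAP_STACK", "RANDOMIZE_BASE",
              "HARDENED_USERCOPY", "SECCOMP", "SECCOMP_FILTER")

_ASSIGNMENT = re.compile(r"^CONFIG_([A-Za-z0-9_]+)=(.*)$")


def parse_kconfig(text):
    """Return {option: value} for every option assigned in a .config.

    Comment lines such as '# CONFIG_FOO is not set' do not match the
    assignment pattern, so disabled options are simply absent.
    """
    options = {}
    for line in text.splitlines():
        match = _ASSIGNMENT.match(line.strip())
        if match:
            options[match.group(1)] = match.group(2)
    return options


def audit(text):
    """Return a sorted list of violations; an empty list means the config passes."""
    options = parse_kconfig(text)
    problems = []
    for name in MUST_BE_OFF:
        # Absent and explicit '=n' both count as disabled; 'y' or 'm' do not.
        if options.get(name, "n") != "n":
            problems.append(f"{name} should be disabled, found {options[name]}")
    for name in MUST_BE_ON:
        if options.get(name) != "y":
            problems.append(f"{name} should be =y")
    return sorted(problems)
```

Calling `audit(open(".config").read())` after the kernel build would then list any option that drifted from the intended state, for example because a dependency silently re-enabled it.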

The kernel includes different hardware mitigations depending on the target architecture.

| Mitigation | arm64 | x86_64 | Purpose |
| --- | --- | --- | --- |
| Branch Target Identification | ARM64_BTI=y | - | Restricts indirect branch targets to marked landing pads; hinders JOP-style attacks |
| Pointer Authentication | ARM64_PTR_AUTH=y, ARM64_PTR_AUTH_KERNEL=y | - | Signs return addresses; defeats ROP chains |
| Kernel unmapping at EL0 | UNMAP_KERNEL_AT_EL0=y | - | Removes kernel pages from userspace page tables |
| Branch predictor hardening | HARDEN_BRANCH_PREDICTOR=y | - | Flushes branch predictor on context switch |
| Page Table Isolation (KPTI) | - | PAGE_TABLE_ISOLATION=y | Meltdown mitigation; separate kernel/user page tables |
| Retpoline | - | RETPOLINE=y | Spectre v2 mitigation; replaces indirect branches with return trampolines |
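
On both architectures, Linux reports the effective state of CPU-level mitigations under /sys/devices/system/cpu/vulnerabilities, one file per issue, with contents that start with "Not affected", "Vulnerable", or "Mitigation:". The sketch below reads that directory and flags anything not reported as mitigated. It assumes the guest exposes this sysfs directory; the function names are hypothetical and this check is not one of the capsem-doctor tests described on this page.

```python
from pathlib import Path

SYSFS_VULN_DIR = "/sys/devices/system/cpu/vulnerabilities"


def read_vulnerabilities(directory=SYSFS_VULN_DIR):
    """Map each vulnerability file name to the status string it contains."""
    status = {}
    for entry in sorted(Path(directory).iterdir()):
        if entry.is_file():
            status[entry.name] = entry.read_text().strip()
    return status


def unmitigated(status):
    """Return names whose status is neither 'Not affected' nor 'Mitigation: ...'."""
    ok_prefixes = ("Not affected", "Mitigation")
    return sorted(name for name, text in status.items()
                  if not text.startswith(ok_prefixes))
```

On an x86_64 guest with KPTI active, the `meltdown` entry would typically read "Mitigation: PTI"; `unmitigated(read_vulnerabilities())` returning an empty list is the passing case.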

Runtime hardening parameters passed via kernel cmdline:

console={hvc0|ttyS0} root=/dev/vda ro init_on_alloc=1 slab_nomerge page_alloc.shuffle=1

| Parameter | Rationale |
| --- | --- |
| ro | Mount rootfs read-only; squashfs is structurally immutable |
| init_on_alloc=1 | Runtime enforcement of heap zeroing (belt-and-suspenders with INIT_ON_ALLOC_DEFAULT_ON) |
| slab_nomerge | Prevents kernel from merging slab caches; isolates allocations by type |
| page_alloc.shuffle=1 | Randomizes page allocator at boot (complements SHUFFLE_PAGE_ALLOCATOR) |

Console device varies by architecture: hvc0 for ARM64 (Apple VZ), ttyS0 for x86_64 (KVM).
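
Checking these parameters from inside the guest comes down to tokenizing /proc/cmdline. A minimal sketch follows; the function name `missing_cmdline_params` is hypothetical. It matches whole tokens rather than substrings, because `ro` also occurs inside `root=/dev/vda`, and it assumes none of the required parameters carry quoted values containing spaces.

```python
REQUIRED_TOKENS = ("ro", "init_on_alloc=1", "slab_nomerge", "page_alloc.shuffle=1")


def missing_cmdline_params(cmdline, required=REQUIRED_TOKENS):
    """Return the required parameters that are not present as whole tokens."""
    tokens = set(cmdline.split())
    return [param for param in required if param not in tokens]
```

In the guest this could be called as `missing_cmdline_params(open("/proc/cmdline").read())`; an empty list means every hardening parameter was passed at boot.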

Every hardening property is verified at runtime by capsem-doctor tests. If any test fails, the VM is not considered healthy.

| Property | capsem-doctor test | What it checks |
| --- | --- | --- |
| No kernel modules | test_no_kernel_modules | modprobe fails |
| No /dev/mem | test_no_dev_mem | File does not exist |
| No /dev/port | test_no_dev_port | File does not exist |
| No /proc/kcore | test_no_proc_kcore | File absent or unreadable |
| No /proc/modules | test_proc_modules_empty | File absent or empty |
| No debugfs | test_no_debugfs | Not mounted |
| No IPv6 | test_no_ipv6 | /proc/net/if_inet6 absent |
| No kernel symbols | test_no_kallsyms | /proc/kallsyms absent or empty |
| Read-only rootfs | test_kernel_cmdline_has_ro | ro token in /proc/cmdline |
| Heap zeroing | test_init_on_alloc | init_on_alloc=1 in /proc/cmdline |
| Slab isolation | test_slab_nomerge | slab_nomerge in /proc/cmdline |
| Page shuffle | test_page_alloc_shuffle | page_alloc.shuffle=1 in /proc/cmdline |
| Seccomp available | test_seccomp_available | Seccomp: line in /proc/self/status |
| Squashfs rootfs | test_squashfs_is_immutable | /dev/vda filesystem type is squashfs |
| Overlay configured | test_overlay_configured | Root mount is overlay with lowerdir and upperdir |
| No real NICs | test_no_real_nics | Only lo and dummy0 in /sys/class/net/ |
| No setuid binaries | test_no_setuid_binaries | find / -perm -4000 returns empty |
| No setgid binaries | test_no_setgid_binaries | find / -perm -2000 returns empty |
| Guest binaries read-only | test_guest_binary_not_writable | All capsem binaries are chmod 555 |
| No sshd | test_no_sshd | sshd process not running |
| No cron | test_no_cron | cron process not running |
| No systemd | test_no_systemd | systemd process not running |
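
To make the table concrete, here is a rough sketch of how two of these checks could be written. The capsem-doctor source is not shown on this page, so the helper names `unexpected_nics` and `special_bit_files` are hypothetical. The file walk is a simplification of `find / -perm -4000`: it inspects regular files only, whereas find also matches other file types.

```python
import os
import stat

ALLOWED_NICS = {"lo", "dummy0"}


def unexpected_nics(sys_class_net="/sys/class/net"):
    """Return interface names other than lo and dummy0."""
    return sorted(set(os.listdir(sys_class_net)) - ALLOWED_NICS)


def special_bit_files(root="/", bit=stat.S_ISUID):
    """Return regular files under root that have the given mode bit set."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mode = os.lstat(path).st_mode  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or cannot be inspected; skip it
            if stat.S_ISREG(mode) and mode & bit:
                hits.append(path)
    return sorted(hits)
```

A test in the style of the names above would then reduce to a single assertion, such as `assert unexpected_nics() == []` or `assert special_bit_files("/", stat.S_ISGID) == []`. Taking the paths as parameters keeps the helpers testable outside the guest.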

The kernel config follows the principle of minimum viable surface: start from allnoconfig (everything off), then enable only what the VM requires. This is the opposite of a typical distro kernel, which starts from a broad default and disables selectively.

graph LR
    subgraph "Typical distro kernel"
        D1["~8000 options enabled"] --> D2["Selective disable"]
        D2 --> D3["Still large attack surface"]
    end
    subgraph "Capsem kernel"
        C1["allnoconfig (0 options)"] --> C2["Enable only needed"]
        C2 --> C3["~200 options, ~5 MB binary"]
    end

The two defconfig files (defconfig.arm64, defconfig.x86_64) are applied with make olddefconfig and produce equivalent security properties on both architectures; only the architecture-specific mitigations differ.