Boosting Linux Per-Core I/O Performance: A Developer's Guide

Introduction

In the fast-evolving world of Linux storage, achieving maximum per-core I/O throughput is a constant challenge. Recent work by Jens Axboe, the creator of io_uring and maintainer of the Linux block layer, demonstrated a roughly 60% increase in per-core I/O performance through targeted kernel patches. Inspired by a presentation at the Linux Storage, Filesystem, Memory Management and BPF Summit (LSFMM) in Croatia, where the I/O overhead of Linux was compared unfavorably to the Storage Performance Development Kit (SPDK), Axboe set out to close the gap. This guide walks through the methodology he used to identify, analyze, and implement optimizations that dramatically improve per-core I/O. Whether you are a kernel hacker, a storage engineer, or a performance enthusiast, these steps will help you understand the process behind such impactful changes.


What You Need

Before diving in, ensure you have the following prerequisites (a typical setup, inferred from the tools used throughout this guide; adjust to your environment):

- A recent Linux kernel source tree and a working kernel build environment (compiler, make, git)
- A fast local NVMe SSD you can dedicate to destructive benchmarking
- fio for benchmarking and perf for profiling
- A machine you can reboot into test kernels
- Optionally, an SPDK build for the userspace baseline comparison in Step 6

Step-by-Step Guide

Step 1: Analyze Current I/O Overhead

Begin by measuring existing per-core I/O performance to establish a baseline. Use a tool such as fio, comparing synchronous I/O (the psync engine), Linux AIO (libaio), and io_uring. Focus on metrics such as IOPS, latency, and CPU utilization per core. At LSFMM, the comparison with SPDK revealed that Linux's I/O stack introduces significant overhead, especially in the path from user space through the block layer to the device. Document the bottlenecks: look at syscall overhead, lock contention, and data copying. Axboe's motivation came from seeing that SPDK could achieve much lower overhead by bypassing the kernel entirely. Your goal is to identify where the kernel spends extra time: is it in the completion path, in memory barriers, or on the submission side?
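As a sketch of such a baseline run (the device path, runtime, and core pinning are assumptions; adjust them, and note that the target device's data will be destroyed):

```shell
#!/bin/bash
# Baseline per-core I/O benchmark: compare psync, libaio, and io_uring
# engines with a 4k random-read workload pinned to a single core.
DEV=/dev/nvme0n1   # assumption: a scratch NVMe device (contents destroyed!)
command -v fio >/dev/null || { echo "fio not installed; skipping" >&2; exit 0; }
[ -e "$DEV" ] || { echo "no test device at $DEV; skipping" >&2; exit 0; }
[ "$(id -u)" -eq 0 ] || { echo "raw device access needs root; skipping" >&2; exit 0; }

for engine in psync libaio io_uring; do
  # --cpus_allowed=0 pins the job to core 0 to expose per-core behavior;
  # iodepth is ignored by the synchronous psync engine.
  fio --name=baseline --filename="$DEV" --direct=1 --rw=randread \
      --bs=4k --ioengine="$engine" --iodepth=128 --numjobs=1 --thread \
      --cpus_allowed=0 --runtime=10 --time_based --group_reporting
done
```

Record the IOPS, completion latency, and CPU utilization from each engine's summary; these numbers are the baseline every later patch is measured against.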

Step 2: Profile Kernel I/O Paths

Use perf to trace the I/O path for both submission and completion. Pay special attention to the io_submit and io_getevents syscalls for AIO, or the ring-buffer handling in io_uring. Look for cache misses, branch mispredictions, and spinlock contention. In Axboe's case, he likely found that per-core operations were serialized by global locks or inefficient data structures; the block layer's request queue lock, for instance, might have been a bottleneck. Profile with perf record -e cycles -c 100000, sampling both user and kernel cycles (a user-only event would miss the kernel I/O path, where most of the per-I/O cost lives), while running a low-queue-depth workload to amplify per-core effects. Record the key functions consuming CPU time per core.
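A minimal profiling sketch along these lines (device path, sample period, and durations are assumptions; system-wide sampling requires root):

```shell
#!/bin/bash
# Profile the kernel I/O path on core 0 while a QD=1 io_uring workload runs.
DEV=/dev/nvme0n1   # assumption: scratch test device
command -v perf >/dev/null && command -v fio >/dev/null \
  || { echo "perf/fio missing; skipping" >&2; exit 0; }
[ -e "$DEV" ] || { echo "no test device; skipping" >&2; exit 0; }
[ "$(id -u)" -eq 0 ] || { echo "needs root; skipping" >&2; exit 0; }

# Queue depth 1 amplifies fixed per-I/O cost; pin the workload to core 0.
fio --name=probe --filename="$DEV" --direct=1 --rw=randread --bs=4k \
    --ioengine=io_uring --iodepth=1 --numjobs=1 --thread \
    --cpus_allowed=0 --runtime=20 --time_based >/dev/null &
FIO_PID=$!

# Sample user AND kernel cycles (-e cycles, not cycles:u) on core 0 only.
perf record -e cycles -c 100000 -C 0 -o io.data -- sleep 15 \
  || { kill "$FIO_PID" 2>/dev/null; echo "perf record failed; skipping" >&2; exit 0; }
wait "$FIO_PID"

# Top symbols tell you where submission/completion time is going.
perf report -i io.data --stdio | head -40
```

Look in the report for block-layer and driver symbols dominating the profile; those are the candidates for the per-core analysis in the next step.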

Step 3: Identify Per-Core Bottlenecks

Compare the profiled data against an ideal model. For per-core I/O, the ideal is that each core can submit and complete I/O without interference from other cores. Check for shared structures such as blk-mq tags, software queues, and completion queues. In the original work, Axboe likely observed that even with io_uring's per-task file table, there were still global points of contention. For example, the io_uring CQ (completion queue) and SQ (submission queue) rings may have lacked proper cache-line padding or used memory-ordering barriers unnecessarily. Your step: list every shared resource touched during a single I/O operation and assess whether it can be made per-core.
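Cross-core cache-line contention of this kind can be hunted with perf's c2c (cache-to-cache) mode; a sketch, assuming hardware memory-sampling support and root privileges:

```shell
#!/bin/bash
# Record system-wide memory accesses while the benchmark runs, then report
# contended cache lines: falsely shared structures show up near the top.
command -v perf >/dev/null || { echo "perf missing; skipping" >&2; exit 0; }
[ "$(id -u)" -eq 0 ] || { echo "needs root; skipping" >&2; exit 0; }

perf c2c record -a -o c2c.data -- sleep 10 \
  || { echo "perf c2c unsupported here (needs HW memory events)" >&2; exit 0; }
perf c2c report -i c2c.data --stdio | head -30
```

Any kernel symbol that appears with HITM (hit-modified) events from multiple CPUs is a shared resource worth putting on your per-core candidate list.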

Step 4: Design Per-Core Optimizations

Based on your analysis, design patches that minimize cross-core communication. Key techniques include (a representative list, inferred from the kinds of changes discussed in this guide):

- Converting shared state to per-CPU data structures wherever semantics allow
- Batching submissions and completions to amortize fixed per-I/O costs
- Cache-line-aligning and padding hot structures to avoid false sharing
- Removing unnecessary memory barriers and atomic operations from fast paths
- Preferring lockless single-producer/single-consumer rings over spinlock-protected queues

Axboe’s specific patches (as reported) focused on streamlining the io_uring completion path and reducing overhead in the block layer’s multi-queue (blk-mq) mapping. For instance, he might have introduced a per-core completion queue that bypasses the global shared ring. Document your intended changes with clear rationale.

Step 5: Implement and Test Patches

Write the kernel patches using the usual git format-patch workflow. Apply them to a clean kernel tree and rebuild. Test incrementally: after each patch, run your baseline benchmark again. Look for improvements in per-core IOPS and reductions in CPU usage per I/O. Axboe reported a ~60% increase, so your patches should ideally show a significant jump. Monitor for regressions in other workloads (e.g., multi-core high concurrency). Use fio with --thread --numjobs=1 to isolate per-core behavior, and also test with --numjobs=16 to ensure scalability.
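The apply-build-measure loop above can be sketched as follows (the kernel tree location and patch directory are assumptions; installing and rebooting into each test kernel is environment-specific and left as a comment):

```shell
#!/bin/bash
# Incremental patch testing: apply one patch at a time, rebuild, and prompt
# for a benchmark run, so each patch's contribution is measured in isolation.
set -e
shopt -s nullglob
KSRC="$HOME/linux"                 # assumption: your kernel source tree
PATCHES=("$HOME/patches"/*.patch)  # assumption: your patch series

[ -d "$KSRC" ] && [ "${#PATCHES[@]}" -gt 0 ] \
  || { echo "kernel tree or patches missing; skipping" >&2; exit 0; }

cd "$KSRC"
for p in "${PATCHES[@]}"; do
  git am "$p"                      # apply exactly one patch
  make -j"$(nproc)" bzImage modules
  # Install the kernel, reboot into it, then re-run the Step 1 fio baseline
  # with --numjobs=1 (per-core) and --numjobs=16 (scalability check).
  echo "built with $(basename "$p"): install, reboot, and re-benchmark"
done
```

Keeping one benchmark result per patch makes it obvious which change delivered the gain and which one quietly regressed another workload.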

Step 6: Validate Against SPDK Baseline

For a reference point, run the same benchmarks using SPDK (if possible). The original presentation showed that SPDK had lower overhead. Your optimized kernel should approach that performance. If not, return to Step 2 and refine. Measure the reduction in syscall latency, context switching, and data copying. A key insight from Axboe’s work is that even after removing system call overhead (via io_uring), further gains come from reducing per-command overhead within the kernel.
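A sketch of the SPDK reference run (the checkout location is an assumption; SPDK takes the NVMe device away from the kernel driver, so use a dedicated test machine):

```shell
#!/bin/bash
# Run SPDK's nvme perf example with parameters matching the fio baseline:
# 4k random reads, queue depth 128, 30 seconds, pinned to core 0 (mask 0x1).
SPDK_DIR="$HOME/spdk"   # assumption: a built SPDK checkout
[ -x "$SPDK_DIR/build/examples/perf" ] \
  || { echo "SPDK perf tool not built; skipping" >&2; exit 0; }

# Rebind NVMe devices from the kernel driver to vfio/uio for userspace access.
sudo "$SPDK_DIR/scripts/setup.sh"
sudo "$SPDK_DIR/build/examples/perf" -q 128 -o 4096 -w randread -t 30 -c 0x1
```

Matching block size, queue depth, and core pinning to the kernel-side fio run is what makes the per-core IOPS numbers directly comparable.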

Step 7: Tune and Merge

Once your patches achieve the desired improvement, polish them: add proper comments, ensure they align with kernel coding style, and submit them to the relevant mailing lists (e.g., linux-block, io-uring). Axboe’s patches were likely merged quickly because they addressed a clear performance gap. Monitor community feedback—there might be corner cases where your optimizations hurt fairness or latency under certain loads. Above all, this step is about review and iteration.
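A sketch of the polish-and-submit workflow (patch count and output directory are examples; the mailing-list send is left commented because it actually mails people):

```shell
#!/bin/bash
# Format the series, run the kernel style checker, then mail it for review.
command -v git >/dev/null || { echo "git missing; skipping" >&2; exit 0; }
git rev-parse HEAD >/dev/null 2>&1 \
  || { echo "not inside a git repository; skipping" >&2; exit 0; }

mkdir -p outgoing
git format-patch -1 --cover-letter -o outgoing/ \
  || { echo "format-patch failed; skipping" >&2; exit 0; }

# In a kernel tree, checkpatch.pl flags style problems before reviewers do.
[ -x scripts/checkpatch.pl ] && ./scripts/checkpatch.pl outgoing/*.patch \
  || echo "checkpatch.pl not found (run from a kernel tree)" >&2

# git send-email --to=io-uring@vger.kernel.org \
#     --cc=linux-block@vger.kernel.org outgoing/*.patch
echo "series prepared in outgoing/"
```

A clean checkpatch run and a cover letter that states the measured per-core gain make the review cycle considerably shorter.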

Conclusion

By following these steps, you can replicate the approach that led to a 60% increase in per-core I/O performance. Remember that such gains are not magical; they come from meticulous analysis, targeted changes, and rigorous testing. With the right methodology, every Linux developer can contribute to making the kernel’s I/O path faster and more scalable.
