Boosting Linux Per-Core I/O Performance: A Developer's Guide

Introduction

In the fast-evolving world of Linux storage, achieving maximum per-core I/O throughput is a constant challenge. Recent work by Jens Axboe, the creator of io_uring and maintainer of the Linux block layer, demonstrated a roughly 60% increase in per-core I/O performance through targeted kernel patches. Inspired by a presentation at the Linux Storage, Filesystem, Memory Management and BPF Summit (LSFMM) in Croatia, where the I/O overhead of Linux was compared unfavorably to the Storage Performance Development Kit (SPDK), Axboe set out to close the gap. This guide walks through the methodology he used to identify, analyze, and implement optimizations that dramatically improve per-core I/O. Whether you are a kernel hacker, a storage engineer, or a performance enthusiast, these steps will help you understand the process behind such impactful changes.


What You Need

Before diving in, ensure you have the following prerequisites (a typical setup, inferred from the tools used throughout this guide; adjust to your environment):

- A recent Linux kernel source tree and a working kernel build environment (compiler, make, git)
- A fast local NVMe SSD you can dedicate to destructive benchmarking
- fio for benchmarking and perf for profiling
- A machine you can reboot into test kernels
- Optionally, an SPDK build for the userspace baseline comparison in Step 6

Step-by-Step Guide

Step 1: Analyze Current I/O Overhead

Begin by measuring existing per-core I/O performance to establish a baseline. Use a tool such as fio, comparing synchronous I/O (the psync engine), Linux AIO (libaio), and io_uring. Focus on metrics such as IOPS, latency, and CPU utilization per core. At LSFMM, the comparison with SPDK revealed that Linux's I/O stack introduces significant overhead, especially in the path from user space through the block layer to the device. Document the bottlenecks: look at syscall overhead, lock contention, and data copying. Axboe's motivation came from seeing that SPDK could achieve much lower overhead by bypassing the kernel entirely. Your goal is to identify where the kernel spends extra time: is it in the completion path, in memory barriers, or on the submission side?
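As a sketch of such a baseline run (the device path, runtime, and core pinning are assumptions; adjust them, and note that the target device's data will be destroyed):

```shell
#!/bin/bash
# Baseline per-core I/O benchmark: compare psync, libaio, and io_uring
# engines with a 4k random-read workload pinned to a single core.
DEV=/dev/nvme0n1   # assumption: a scratch NVMe device (contents destroyed!)
command -v fio >/dev/null || { echo "fio not installed; skipping" >&2; exit 0; }
[ -e "$DEV" ] || { echo "no test device at $DEV; skipping" >&2; exit 0; }
[ "$(id -u)" -eq 0 ] || { echo "raw device access needs root; skipping" >&2; exit 0; }

for engine in psync libaio io_uring; do
  # --cpus_allowed=0 pins the job to core 0 to expose per-core behavior;
  # iodepth is ignored by the synchronous psync engine.
  fio --name=baseline --filename="$DEV" --direct=1 --rw=randread \
      --bs=4k --ioengine="$engine" --iodepth=128 --numjobs=1 --thread \
      --cpus_allowed=0 --runtime=10 --time_based --group_reporting
done
```

Record the IOPS, completion latency, and CPU utilization from each engine's summary; these numbers are the baseline every later patch is measured against.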

Step 2: Profile Kernel I/O Paths

Use perf to trace the I/O path for both submission and completion. Pay special attention to the io_submit and io_getevents syscalls for AIO, or the ring-buffer handling in io_uring. Look for cache misses, branch mispredictions, and spinlock contention. In Axboe's case, he likely found that per-core operations were serialized by global locks or inefficient data structures; the block layer's request queue lock, for instance, might have been a bottleneck. Profile with perf record -e cycles -c 100000, sampling both user and kernel cycles (a user-only event would miss the kernel I/O path, where most of the per-I/O cost lives), while running a low-queue-depth workload to amplify per-core effects. Record the key functions consuming CPU time per core.
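A minimal profiling sketch along these lines (device path, sample period, and durations are assumptions; system-wide sampling requires root):

```shell
#!/bin/bash
# Profile the kernel I/O path on core 0 while a QD=1 io_uring workload runs.
DEV=/dev/nvme0n1   # assumption: scratch test device
command -v perf >/dev/null && command -v fio >/dev/null \
  || { echo "perf/fio missing; skipping" >&2; exit 0; }
[ -e "$DEV" ] || { echo "no test device; skipping" >&2; exit 0; }
[ "$(id -u)" -eq 0 ] || { echo "needs root; skipping" >&2; exit 0; }

# Queue depth 1 amplifies fixed per-I/O cost; pin the workload to core 0.
fio --name=probe --filename="$DEV" --direct=1 --rw=randread --bs=4k \
    --ioengine=io_uring --iodepth=1 --numjobs=1 --thread \
    --cpus_allowed=0 --runtime=20 --time_based >/dev/null &
FIO_PID=$!

# Sample user AND kernel cycles (-e cycles, not cycles:u) on core 0 only.
perf record -e cycles -c 100000 -C 0 -o io.data -- sleep 15 \
  || { kill "$FIO_PID" 2>/dev/null; echo "perf record failed; skipping" >&2; exit 0; }
wait "$FIO_PID"

# Top symbols tell you where submission/completion time is going.
perf report -i io.data --stdio | head -40
```

Look in the report for block-layer and driver symbols dominating the profile; those are the candidates for the per-core analysis in the next step.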

Step 3: Identify Per-Core Bottlenecks

Compare the profiled data against an ideal model. For per-core I/O, the ideal is that each core can submit and complete I/O without interference from other cores. Check for shared structures such as blk-mq tags, software queues, and completion queues. In the original work, Axboe likely observed that even with io_uring's per-task file table, there were still global points of contention. For example, the io_uring CQ (completion queue) and SQ (submission queue) rings may have lacked proper cache-line padding or used memory-ordering barriers unnecessarily. Your step: list every shared resource touched during a single I/O operation and assess whether it can be made per-core.
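Cross-core cache-line contention of this kind can be hunted with perf's c2c (cache-to-cache) mode; a sketch, assuming hardware memory-sampling support and root privileges:

```shell
#!/bin/bash
# Record system-wide memory accesses while the benchmark runs, then report
# contended cache lines: falsely shared structures show up near the top.
command -v perf >/dev/null || { echo "perf missing; skipping" >&2; exit 0; }
[ "$(id -u)" -eq 0 ] || { echo "needs root; skipping" >&2; exit 0; }

perf c2c record -a -o c2c.data -- sleep 10 \
  || { echo "perf c2c unsupported here (needs HW memory events)" >&2; exit 0; }
perf c2c report -i c2c.data --stdio | head -30
```

Any kernel symbol that appears with HITM (hit-modified) events from multiple CPUs is a shared resource worth putting on your per-core candidate list.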

Step 4: Design Per-Core Optimizations

Based on your analysis, design patches that minimize cross-core communication. Key techniques include (a representative list, inferred from the kinds of changes discussed in this guide):

- Converting shared state to per-CPU data structures wherever semantics allow
- Batching submissions and completions to amortize fixed per-I/O costs
- Cache-line-aligning and padding hot structures to avoid false sharing
- Removing unnecessary memory barriers and atomic operations from fast paths
- Preferring lockless single-producer/single-consumer rings over spinlock-protected queues

Axboe’s specific patches (as reported) focused on streamlining the io_uring completion path and reducing overhead in the block layer’s multi-queue (blk-mq) mapping. For instance, he might have introduced a per-core completion queue that bypasses the global shared ring. Document your intended changes with clear rationale.

Step 5: Implement and Test Patches

Write the kernel patches using the usual git format-patch workflow. Apply them to a clean kernel tree and rebuild. Test incrementally: after each patch, run your baseline benchmark again. Look for improvements in per-core IOPS and reductions in CPU usage per I/O. Axboe reported a ~60% increase, so your patches should ideally show a significant jump. Monitor for regressions in other workloads (e.g., multi-core high concurrency). Use fio with --thread --numjobs=1 to isolate per-core behavior, and also test with --numjobs=16 to ensure scalability.
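The apply-build-measure loop above can be sketched as follows (the kernel tree location and patch directory are assumptions; installing and rebooting into each test kernel is environment-specific and left as a comment):

```shell
#!/bin/bash
# Incremental patch testing: apply one patch at a time, rebuild, and prompt
# for a benchmark run, so each patch's contribution is measured in isolation.
set -e
shopt -s nullglob
KSRC="$HOME/linux"                 # assumption: your kernel source tree
PATCHES=("$HOME/patches"/*.patch)  # assumption: your patch series

[ -d "$KSRC" ] && [ "${#PATCHES[@]}" -gt 0 ] \
  || { echo "kernel tree or patches missing; skipping" >&2; exit 0; }

cd "$KSRC"
for p in "${PATCHES[@]}"; do
  git am "$p"                      # apply exactly one patch
  make -j"$(nproc)" bzImage modules
  # Install the kernel, reboot into it, then re-run the Step 1 fio baseline
  # with --numjobs=1 (per-core) and --numjobs=16 (scalability check).
  echo "built with $(basename "$p"): install, reboot, and re-benchmark"
done
```

Keeping one benchmark result per patch makes it obvious which change delivered the gain and which one quietly regressed another workload.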

Step 6: Validate Against SPDK Baseline

For a reference point, run the same benchmarks using SPDK (if possible). The original presentation showed that SPDK had lower overhead. Your optimized kernel should approach that performance. If not, return to Step 2 and refine. Measure the reduction in syscall latency, context switching, and data copying. A key insight from Axboe’s work is that even after removing system call overhead (via io_uring), further gains come from reducing per-command overhead within the kernel.
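A sketch of the SPDK reference run (the checkout location is an assumption; SPDK takes the NVMe device away from the kernel driver, so use a dedicated test machine):

```shell
#!/bin/bash
# Run SPDK's nvme perf example with parameters matching the fio baseline:
# 4k random reads, queue depth 128, 30 seconds, pinned to core 0 (mask 0x1).
SPDK_DIR="$HOME/spdk"   # assumption: a built SPDK checkout
[ -x "$SPDK_DIR/build/examples/perf" ] \
  || { echo "SPDK perf tool not built; skipping" >&2; exit 0; }

# Rebind NVMe devices from the kernel driver to vfio/uio for userspace access.
sudo "$SPDK_DIR/scripts/setup.sh"
sudo "$SPDK_DIR/build/examples/perf" -q 128 -o 4096 -w randread -t 30 -c 0x1
```

Matching block size, queue depth, and core pinning to the kernel-side fio run is what makes the per-core IOPS numbers directly comparable.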

Step 7: Tune and Merge

Once your patches achieve the desired improvement, polish them: add proper comments, ensure they align with kernel coding style, and submit them to the relevant mailing lists (e.g., linux-block, io-uring). Axboe’s patches were likely merged quickly because they addressed a clear performance gap. Monitor community feedback—there might be corner cases where your optimizations hurt fairness or latency under certain loads. Above all, this step is about review and iteration.
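A sketch of the polish-and-submit workflow (patch count and output directory are examples; the mailing-list send is left commented because it actually mails people):

```shell
#!/bin/bash
# Format the series, run the kernel style checker, then mail it for review.
command -v git >/dev/null || { echo "git missing; skipping" >&2; exit 0; }
git rev-parse HEAD >/dev/null 2>&1 \
  || { echo "not inside a git repository; skipping" >&2; exit 0; }

mkdir -p outgoing
git format-patch -1 --cover-letter -o outgoing/ \
  || { echo "format-patch failed; skipping" >&2; exit 0; }

# In a kernel tree, checkpatch.pl flags style problems before reviewers do.
[ -x scripts/checkpatch.pl ] && ./scripts/checkpatch.pl outgoing/*.patch \
  || echo "checkpatch.pl not found (run from a kernel tree)" >&2

# git send-email --to=io-uring@vger.kernel.org \
#     --cc=linux-block@vger.kernel.org outgoing/*.patch
echo "series prepared in outgoing/"
```

A clean checkpatch run and a cover letter that states the measured per-core gain make the review cycle considerably shorter.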

Conclusion

By following these steps, you can replicate the approach that led to a 60% increase in per-core I/O performance. Remember that such gains are not magical; they come from meticulous analysis, targeted changes, and rigorous testing. With the right methodology, every Linux developer can contribute to making the kernel’s I/O path faster and more scalable.
