Latency Engineering / Runtime Isolation

NullRing

The project asks a narrow question and answers it honestly: once the handoff path is reduced to the essentials, the remaining latency belongs as much to the machine as it does to the code.

Observed Floor

92ns

best observed end-to-end sample, cache-warm

Median Latency

142ns

producer -> ring -> evaluator path

Evaluator Throughput

52.7M/s

tight-loop evaluator benchmark, not full pipeline

Master Schematic

System Control & Data Flow Under Windows User-Space

NullRing System Architecture and Data Flow

Expand Full Detail

Complete architectural flow showing L1-L3 cache interactions, thread affinity, and the branchless evaluator execution path.

The Premise

Isolating the lower bound of user-space risk evaluation.

NullRing exists to answer a very narrow question: once locks, dynamic allocation, and unnecessary sharing are stripped out of the hot path, how much latency belongs to the machine, and how much belongs to the code?

Low-latency infrastructure is widely discussed in vague terms, but modern systems work increasingly depends on understanding exactly where software ends and hardware behavior begins. Without a design like this, latency budgets disappear into allocator work and scheduler wake-ups long before business logic becomes the real constraint.

The durability of NullRing comes from this underlying need. Engineers will continue to care about cache behavior and honest tail latency long after specific APIs or frameworks change.

8-Phase Optimization Journey

Naive Baseline

800ns

RDTSC Integration

800ns

RDTSCP Serialization

800ns

Thread Affinity

180ns

Cache Line Alignment

155ns

VirtualLock + TLB

155ns

REALTIME Priority

150ns

Branchless Evaluator

142ns

INTERACTIVE SIMULATION

Core 2 (Producer)

MODIFIED

L3 Cache Arbiter

50-150 cycles

Core 3 (Consumer)

SHARED

Animation showing the physical delay of transferring a cache line from Core 2's L1 cache through the L3 arbiter into Core 3's L1 cache. This physical movement is the true floor of the latency budget.

A Deliberately Narrow Handoff

The architecture is intentionally small.

One producer core publishes into a power-of-two SPSC ring, and one consumer core evaluates the event. Narrowness is a design choice, not a missing feature.

This removes avoidable hot-path costs by fixing the payload shape, using acquire/release atomics, and keeping the ownership transfer model strictly contained. Shared mutable state is entirely avoided, making the synchronization contract small and the ownership boundaries explicit.

The defensibility comes from engineering judgment: knowing which abstractions to remove, how to measure honestly, and which caveats to keep visible even when they weaken the headline.

single-producer single-consumer ring
acquire on consume, release on publish
cache-line-separated head and tail indices

Memory Layout Details

The budget is smaller than a routine abstraction cost.

At this latency scale, a heap allocation, stray branch, or false-sharing mistake can cost more than the entire target budget. The implementation detail *is* the story.

The strongest differentiator isn't just the ring buffer. It is the combination of fixed-point evaluation, cache-line separation, explicit thread affinity, and benchmark transparency.

By enforcing strict 64-byte padding alignments on the ring indices, we guarantee that the producer and consumer cores never contest the same cache line during concurrent operations, eliminating the false sharing penalties that typically plague multi-threaded queues.

False Sharing Simulation

Core 0 (Head)

Core 1 (Tail)

SAME 64B CACHE LINE

HEAD

TAIL

Hardware invalidates the entire cache line whenever either core writes, causing a massive latency spike across the interconnect.

Implementation Scope

The Full Pipeline

Expand Full Detail

The complete implementation boundary, showing exactly where user-space software interacts with hardware buffers.

Optimization Journey

~800ns

std::chrono, no affinity

Naive Baseline

A simple producer-consumer pipeline. Median latency was ~500-800ns with enormous variance. std::chrono resolution is too coarse, hiding the true shape of the latency curve.

Implementation

auto start = std::chrono::high_resolution_clock::now();
// ... execute ...
auto end = std::chrono::high_resolution_clock::now();

Source

RDTSC Integration

Replaced std::chrono with __rdtsc() for cycle-accurate measurement. The measurements became precise enough to reveal a bimodal distribution: events hitting L1 vs events triggering inter-core cache transfers.

Implementation

inline uint64_t rdtsc() {
    unsigned int lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}

Source

RDTSCP Serialization

__rdtsc can be speculatively executed, artificially lowering latency readings. __rdtscp acts as a serialization barrier, adding ~30–50 cycles of overhead but ensuring the measurement is honest.

Implementation

inline uint64_t rdtscp() {
    unsigned int lo, hi, aux;
    __asm__ __volatile__ ("rdtscp" : "=a" (lo), "=d" (hi), "=c" (aux));
    return ((uint64_t)hi << 32) | lo;
}

Source

Core Affinity

By binding the producer and consumer to specific logical cores via sched_setaffinity, we eliminated OS thread migration. This caused the biggest absolute drop in median latency.

Implementation

cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(core_id, &cpuset);
pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);

Source

Padding & False Sharing

The head and tail pointers of the ring buffer were occupying the same cache line. Every time the producer wrote, it invalidated the consumer's cache. Aligning to 64-byte boundaries fixed the ping-ponging.

Implementation

struct alignas(64) PaddedEvent {
    Event data;
    // ... padding implicitly added by compiler
};

Source

Relaxed Atomics

Replaced sequential consistency (seq_cst) with acquire/release semantics. The CPU is now allowed to reorder non-dependent instructions around the atomic operations.

Implementation

tail_.store(next_tail, std::memory_order_release);
// ... consumer side ...
auto current_tail = tail_.load(std::memory_order_acquire);

Source

Branchless Evaluation

Eliminated early-exits in the evaluator. The CPU evaluates all branches using bitwise ops and CMOV (conditional move) to avoid expensive branch misprediction penalties entirely.

Implementation

// Branchless assignment
uint64_t mask = -(val > threshold); // 0 or 0xFFF...
result = (a & mask) | (b & ~mask);

Source

Fixed-Point Math

Floats cause variable latency depending on the ALU state (e.g., denormal penalties). Migrating the evaluator engine to 64-bit integer fixed-point math provided the final, deterministic ceiling.

Implementation

// Fixed-point scale: 1.0 = 1 << 16
int64_t scaled_val = (val * MULTIPLIER) >> 16;

Source

Proof of Execution

Interpreting the Absolute Truth

Latency Distribution Simulation

MEDIAN

142ns

p99.9 TAIL

2346ns

OS Scheduler Context Switch Domain (Hardware Interrupt)

Raw Output

╔══════════════════════════════════════════════════════════╗
║ NullRing — Latency Benchmarking Suite                    ║
╠══════════════════════════════════════════════════════════╣
║ Events:    1000000 ║  Ring Size: 65536 ║  Warm-up: 10000 ║
╚══════════════════════════════════════════════════════════╝

── Benchmark 2: End-to-end (Producer → Ring → Evaluator) ──
  Average         478 cycles   150 ns
  Median (p50)    453 cycles   142 ns
  p95             517 cycles   162 ns
  p99             549 cycles   172 ns
  p99.9          7493 cycles  2346 ns ← OS scheduler domain
  Min             293 cycles    92 ns

What do these numbers mean?

142ns Median

A nanosecond is one billionth of a second. For context, light travels about one foot in a single nanosecond. This system processes a full risk event across cores in 142ns. A standard memory allocation (\`malloc\`) alone can easily take longer than this entire pipeline.

p99.9 Tail Latency

Out of 1,000 events, 999 are processed faster than 2,346ns. This rare "tail latency" is what kills performance in high-frequency environments, and measuring it honestly is the hardest part of systems benchmarking.

The OS Scheduler Domain

At these microscopic speeds, our code is no longer the bottleneck. The tail latency spikes are caused by the operating system pausing our thread to handle hardware interrupts. We have hit the physical limits of user-space software.

Next Case

AccretionDB

Return to Index Open Next Case