
Optimizing Multithreaded Performance: Avoiding False Sharing


Two Almost Identical Programs

Let’s start with two simple multithreaded C programs.

test1.c

This version defines a global structure and spawns two threads. Thread 1 updates data.a, and thread 2 updates data.b.

#include <pthread.h>  // POSIX threads
#include <stddef.h>   // NULL

struct {
    int a;
    int b;
} data;  // shared by both threads

void* thread_1(void* arg) {
    for (int i = 0; i < 100000000; i++)
        data.a++;
    return NULL;
}

void* thread_2(void* arg) {
    for (int i = 0; i < 100000000; i++)
        data.b++;
    return NULL;
}
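
The listings above leave out the driver. A minimal main that spawns and joins the two threads (a sketch assuming POSIX threads, with error checking omitted; both test programs use the same driver):

int main(void) {
    pthread_t t1, t2;

    // Run the two update loops concurrently, ideally on separate cores.
    pthread_create(&t1, NULL, thread_1, NULL);
    pthread_create(&t2, NULL, thread_2, NULL);

    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

Link the pthread library when compiling (e.g., gcc -pthread test1.c -o test1), and build without optimization: at -O2 the compiler is likely to collapse each loop into a single addition and hide the effect entirely.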

test2.c

In this version, we change only one thing: we insert a 64-byte pad between the structure's fields:

struct {
    int a;
    char pad[64];  // pushes b onto a different cache line
    int b;
} data;

That’s it. Same logic, same loops, same number of threads; the only difference is one padding field in the struct.
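
You can verify the layout with offsetof (a standalone sketch; the named struct layout is just a copy of the anonymous one above so offsetof can refer to it):

#include <stdio.h>
#include <stddef.h>

struct layout {
    int a;
    char pad[64];
    int b;
};

int main(void) {
    // Typically prints offsets 0 and 68; two addresses more than
    // 63 bytes apart can never share a 64-byte cache line.
    printf("a at offset %zu, b at offset %zu\n",
           offsetof(struct layout, a), offsetof(struct layout, b));
    return 0;
}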

The Astonishing Performance Gap

When both programs are compiled and timed with the time command, the results look like this:

test1: 4.85 seconds
test2: 1.25 seconds

That’s almost a 4× difference, just from adding one padding field! Why does such a trivial change cause a massive performance gap?


Understanding the Real Cause: CPU Cache Behavior

To explain this, we need to understand how modern CPUs handle memory.

CPU Caches

Modern CPUs are much faster than main memory. To bridge that gap, CPUs use a hierarchy of caches — typically L1, L2, and L3.

When the CPU reads or writes data, it first checks whether the value exists in its cache:

  • Cache hit → Fast access.
  • Cache miss → Fetch from main memory (slow).

Cache Lines

Cache and memory communicate in fixed-size chunks called cache lines — usually 64 bytes each.

That means when the CPU reads one variable, it actually pulls a whole 64-byte block around it into cache.
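
You can check the line size on your own machine. On Linux with glibc, sysconf exposes it at run time (a minimal sketch; _SC_LEVEL1_DCACHE_LINESIZE is glibc-specific and may report 0 on some systems):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    // L1 data cache line size in bytes; 64 on most x86-64 CPUs.
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    printf("L1 data cache line: %ld bytes\n", line);
    return 0;
}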


Cache Coherency in Multicore Systems

In multi-core systems, each core has its own local cache. When multiple cores work on shared memory, the same variable may exist in several caches.

To keep everything consistent, CPUs use a cache coherency protocol (like MESI). Whenever one core modifies a cache line, all other cores with that line must invalidate their copies.

This invalidation is what causes performance trouble.


The Hidden Culprit: False Sharing

Even if two threads modify different variables, problems arise when those variables happen to be on the same cache line.

That’s called false sharing.

Whenever one thread updates its variable, the entire cache line becomes “dirty,” forcing the other thread’s core to invalidate its local cache copy — even though it’s working on a different variable in that same line.

The result: Both threads keep invalidating and reloading cache lines, creating a performance bottleneck.


Applying This to Our Two Programs

In test1.c, the fields a and b are adjacent in memory. They sit on the same cache line, so two threads writing to them on different cores constantly invalidate each other’s caches.

In test2.c, the 64 bytes of padding guarantee that a and b land on different cache lines. Now each thread can update its variable independently: no more invalidations, no more slowdown.

That’s why test2 runs almost four times faster.


How to Detect and Avoid False Sharing

Detection

On Linux, you can analyze false sharing with the perf c2c (cache-to-cache) tool: capture a run with perf c2c record ./test1, then inspect it with perf c2c report. It highlights which memory regions and variables suffer unnecessary cache line contention.

Prevention

To avoid false sharing:

  1. Separate frequently updated variables so they fall on different cache lines.
  2. Use padding (e.g., a 64-byte filler, as in test2.c) or C11 alignas(64) to force separation.
  3. Group read-only data together, but isolate write-heavy data per thread.

For example, C11's alignas (from <stdalign.h>) forces each field onto its own cache line without a manual pad array:

#include <stdalign.h>

struct {
    alignas(64) int a;  // starts at offset 0
    alignas(64) int b;  // starts at offset 64, on the next cache line
} data;

This places a and b on different cache lines, eliminating the false sharing problem. The trade-off is memory: the struct grows to 128 bytes so that each hot field owns a full line.
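
The same idea extends to per-thread data (point 3 above). A sketch, with hypothetical names, giving each thread its own padded slot so no two slots ever share a line:

#include <stdalign.h>

// alignas(64) rounds sizeof(struct padded_counter) up to a full
// cache line, so adjacent array elements never share one.
struct padded_counter {
    alignas(64) long count;
};

struct padded_counter counters[2];  // one slot per thread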


Final Thoughts

False sharing is one of those performance killers that hides in plain sight — invisible at the source code level, but devastating in runtime performance.

By understanding how CPU caches work and being mindful of cache line boundaries, you can unlock major performance gains in multithreaded applications.
