Introduction to Memory Hierarchy #
Modern computer architectures use a memory hierarchy, where different types of memory vary in capacity and latency (the time a processor waits for data). In general, faster memory sits closer to the processor and has limited capacity, while larger memory is farther away and slower.
Types of CUDA Memory #
Global Memory #
- Accessible by all threads in a kernel, but not the same as C++ global variables.
- Commonly used for passing data between host and device or between devices.
- Resides off-chip, which means high latency and lower performance compared to on-chip memory.
- Allocated with cudaMalloc and freed with cudaFree; a minimal round trip is sketched below.
- Lifetime is determined by host-side memory management, not by the kernel.
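A minimal sketch of that round trip (the buffer size and the names h_x/d_x are illustrative, not from the text above):

#include <stdlib.h>
#include <cuda_runtime.h>

int main(void) {
    const int N = 1024;
    const size_t bytes = N * sizeof(float);

    float* h_x = (float*)malloc(bytes);                   // host buffer
    for (int i = 0; i < N; ++i) h_x[i] = 1.0f;

    float* d_x = NULL;
    cudaMalloc(&d_x, bytes);                              // allocate global memory on the device
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);  // host -> device
    // ... launch kernels that read and write d_x ...
    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);  // device -> host
    cudaFree(d_x);                                        // lifetime ends here, independent of any kernel
    free(h_x);
    return 0;
}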
Static Global Memory #
Defined outside any function using __device__:
__device__ T x; // single variable
__device__ T y[N]; // fixed-size array
- Visible to all threads, but cannot be accessed directly by the host.
- Host ↔ device transfers use cudaMemcpyToSymbol and cudaMemcpyFromSymbol, as the sketch below shows.
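A minimal sketch, assuming a hypothetical static variable d_scale; the host reaches it through its symbol, not through a pointer:

#include <cuda_runtime.h>

__device__ float d_scale;   // visible to all device threads, not directly to the host

int main(void) {
    float h_scale = 2.0f;
    cudaMemcpyToSymbol(d_scale, &h_scale, sizeof(float));   // host -> device symbol
    float check = 0.0f;
    cudaMemcpyFromSymbol(&check, d_scale, sizeof(float));   // device symbol -> host
    return 0;
}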
Constant Memory #
- A read-only global memory space with a total size of 64 KB.
- Cached, so access is much faster than global memory if threads in a warp read the same value.
- Declared with __constant__ and initialized using cudaMemcpyToSymbol.
- Useful for storing parameters or lookup tables shared across threads (see the example after this list).
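A small sketch of the pattern; the coefficient array and its values are made up for illustration:

__constant__ float c_coeffs[4];   // lives in the 64 KB constant space

__global__ void scale(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Every thread in a warp reads the same address, so the constant
    // cache can serve the value as a single broadcast.
    if (i < n) x[i] *= c_coeffs[0];
}

void set_coeffs(void) {
    float h_coeffs[4] = {2.0f, 0.0f, 0.0f, 0.0f};           // illustrative values
    cudaMemcpyToSymbol(c_coeffs, h_coeffs, sizeof(h_coeffs));
}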
Texture and Surface Memory #
- Also cached global memory.
- Texture memory is read-only from kernels; surface memory additionally supports writes.
- Often used for graphics, image processing, or irregular memory access patterns; a texture-object sketch follows.
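One way to use it is a texture object bound to an existing device buffer. The following sketch assumes a float array d_x of n elements already allocated with cudaMalloc; names and launch configuration are illustrative:

__global__ void read_through_tex(cudaTextureObject_t tex, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tex1Dfetch<float>(tex, i);   // cached, read-only fetch
}

void launch_via_texture(float* d_x, float* d_out, int n) {
    // Describe the existing linear buffer as a texture resource.
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeLinear;
    res.res.linear.devPtr = d_x;
    res.res.linear.desc = cudaCreateChannelDesc<float>();
    res.res.linear.sizeInBytes = n * sizeof(float);

    cudaTextureDesc td = {};
    td.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, NULL);
    read_through_tex<<<(n + 255) / 256, 256>>>(tex, d_out, n);
    cudaDestroyTextureObject(tex);
}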
Registers #
- Default storage for thread-local variables.
- Fastest memory in CUDA, but limited in number.
- Each thread has its own set of registers.
- Example:
__global__ void add(const float* x, const float* y, float* z) {
    const int n = blockDim.x * blockIdx.x + threadIdx.x;
    z[n] = x[n] + y[n];
}
Here n is a register variable, private to each thread.
Local Memory #
- Used when registers overflow or when array indexing cannot be resolved at compile time.
- Despite the name, local memory resides in off-chip global memory, so it has high latency (see the sketch below).
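A sketch of the second case; whether the array is actually demoted to local memory depends on the compiler and optimization level:

__global__ void local_demo(const int* idx, float* out) {
    // A per-thread array indexed with a value unknown at compile time
    // usually cannot be kept in registers, so the compiler places it
    // in local memory.
    float buf[32];
    for (int i = 0; i < 32; ++i) buf[i] = (float)i;
    out[threadIdx.x] = buf[idx[threadIdx.x] % 32];   // runtime index forces local memory
}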
Shared Memory #
- On-chip memory accessible by all threads in a block.
- Much faster than global memory, second only to registers.
- Each block has its own shared memory space.
- Ideal for data sharing within a block, as in the example below.
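A classic illustration, assuming blocks of exactly 256 threads, reverses each block's slice of an array through shared memory:

__global__ void reverse_block(float* x) {
    __shared__ float tile[256];                 // one copy per block, on-chip
    const int t = threadIdx.x;
    const int base = blockIdx.x * blockDim.x;
    tile[t] = x[base + t];
    __syncthreads();                            // wait until every thread has written
    x[base + t] = tile[blockDim.x - 1 - t];     // read a value written by another thread
}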
L1 and L2 Cache #
- Introduced with the Fermi architecture.
- L1 cache: per Streaming Multiprocessor (SM).
- L2 cache: shared across the device.
- Helps reduce latency for global and local memory access.
Streaming Multiprocessors (SMs) #
Each GPU consists of multiple SMs, each containing:
- Registers and shared memory
- Constant and texture caches
- L1 cache
- Warp schedulers for instruction execution
- Execution cores (INT32, FP32, FP64, SFUs, Tensor cores)
SM Occupancy #
Occupancy is the ratio of active warps (equivalently, threads) on an SM to the maximum the SM supports.
- High occupancy can improve performance but is not always required.
- A common target is at least 25% occupancy; the occupancy achievable by a concrete kernel can be estimated at runtime, as sketched below.
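A sketch of that estimate using the occupancy API; the dummy kernel and block size are illustrative:

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void dummy(float* x) { x[threadIdx.x] += 1.0f; }

int main(void) {
    const int block_size = 256;
    int max_blocks = 0;
    // How many blocks of this kernel fit on one SM at this block size?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks, dummy, block_size, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    double occupancy = (double)(max_blocks * block_size) / prop.maxThreadsPerMultiProcessor;
    printf("Estimated occupancy: %.0f%%\n", occupancy * 100.0);
    return 0;
}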
Querying Device Properties with CUDA Runtime API #
Example program to check GPU properties:
#include "stdio.h"
#define CHECK(call) \
do { \
const cudaError_t error_code = call; \
if (error_code != cudaSuccess) { \
printf("CUDA Error: %s\n", \
cudaGetErrorString(error_code)); \
exit(1); \
} \
} while (0)
int main(int argc, char* argv[]) {
int device_id = 0;
if (argc > 1) device_id = atoi(argv[1]);
CHECK(cudaSetDevice(device_id));
cudaDeviceProp prop;
CHECK(cudaGetDeviceProperties(&prop, device_id));
printf("Device name: %s\n", prop.name);
printf("Compute capability: %d.%d\n", prop.major, prop.minor);
printf("Global memory: %g GB\n", prop.totalGlobalMem / (1024.0 * 1024 * 1024));
printf("Shared memory per block: %g KB\n", prop.sharedMemPerBlock / 1024.0);
printf("Registers per SM: %d K\n", prop.regsPerMultiprocessor / 1024);
printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
return 0;
}
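Assuming the file is saved as query_device.cu (the name is arbitrary), it can be compiled with nvcc and run with an optional device index:

nvcc query_device.cu -o query_device
./query_device 0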