
CUDA Memory Explained: Global, Shared, Constant, Registers, and Caching


Introduction to Memory Hierarchy

Modern computer architectures use a memory hierarchy: different types of memory trade capacity against latency (the time a processor waits for data). In general, memory closer to the processor is faster but smaller, while larger memory farther from the processor has higher latency.

[Figure: the CUDA memory hierarchy]

Types of CUDA Memory

Global Memory

  • Accessible by all threads in a kernel, but not the same as C++ global variables.
  • Commonly used for passing data between host and device or between devices.
  • Resides off-chip, which means high latency and lower performance compared to on-chip memory.
  • Allocated with cudaMalloc and freed with cudaFree, as sketched below.
  • Lifetime is determined by host-side memory management, not by the kernel.
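
A minimal sketch of the usual allocate → copy → launch → copy back → free pattern (the add kernel, sizes, and launch configuration here are illustrative):

#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void add(const float* x, const float* y, float* z, int n) {
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) z[i] = x[i] + y[i];
}

int main(void) {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float* h_x = (float*)malloc(bytes);             // host buffers
    float* h_y = (float*)malloc(bytes);
    float* h_z = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

    float *d_x, *d_y, *d_z;                         // global-memory buffers
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);
    cudaMalloc(&d_z, bytes);

    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

    add<<<(n + 127) / 128, 128>>>(d_x, d_y, d_z, n);

    cudaMemcpy(h_z, d_z, bytes, cudaMemcpyDeviceToHost);  // also synchronizes

    cudaFree(d_x); cudaFree(d_y); cudaFree(d_z);
    free(h_x); free(h_y); free(h_z);
    return 0;
}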

Static Global Memory

Defined outside any function using __device__:

__device__ T x;     // single variable
__device__ T y[N];  // fixed-size array
  • Visible to all threads, but cannot be accessed directly by the host.
  • Host ↔ device transfers use cudaMemcpyToSymbol and cudaMemcpyFromSymbol, as in the sketch below.
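
A minimal sketch of both transfer directions, using a hypothetical d_scale variable (note that the symbol is named directly rather than passed as a pointer):

#include <cuda_runtime.h>

__device__ float d_scale;   // static global memory, lives on the device

int main(void) {
    float h_scale = 2.0f;
    // host → device
    cudaMemcpyToSymbol(d_scale, &h_scale, sizeof(float));
    // device → host
    float check = 0.0f;
    cudaMemcpyFromSymbol(&check, d_scale, sizeof(float));
    return 0;
}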

Constant Memory

  • A read-only global memory space with a total size of 64 KB.
  • Cached, so access is much faster than global memory if threads in a warp read the same value.
  • Declared with __constant__ and initialized using cudaMemcpyToSymbol; see the sketch after this list.
  • Useful for storing parameters or lookup tables shared across threads.
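
A minimal sketch, using hypothetical polynomial coefficients as the shared lookup table:

__constant__ float coeffs[4];   // lives in the 64 KB constant space

__global__ void poly(const float* x, float* y, int n) {
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        // every thread in the warp reads the same element → a cached broadcast
        y[i] = coeffs[0] + x[i] * (coeffs[1] + x[i] * (coeffs[2] + x[i] * coeffs[3]));
}

void upload_coeffs(void) {   // host side
    const float h_coeffs[4] = {1.0f, 0.5f, 0.25f, 0.125f};
    cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));
}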

Texture and Surface Memory

  • Also cached global memory.
  • Texture memory is read-only within a kernel; surface memory can also be written.
  • Often used for graphics, image processing, or irregular memory access patterns; a minimal sketch follows this list.
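
A minimal sketch of reading a linear device buffer through a texture object; d_data is assumed to be an existing float array in global memory:

__global__ void read_tex(cudaTextureObject_t tex, float* out, int n) {
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) out[i] = tex1Dfetch<float>(tex, i);   // read via the texture cache
}

cudaTextureObject_t make_texture(float* d_data, size_t bytes) {
    cudaResourceDesc res = {};                       // describe the backing memory
    res.resType = cudaResourceTypeLinear;
    res.res.linear.devPtr = d_data;
    res.res.linear.desc = cudaCreateChannelDesc<float>();
    res.res.linear.sizeInBytes = bytes;

    cudaTextureDesc desc = {};                       // describe how it is sampled
    desc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &desc, NULL);
    return tex;
}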

Registers

  • Default storage for thread-local variables.
  • Fastest memory in CUDA, but limited in number.
  • Each thread has its own set of registers.
  • Example:
const int n = blockDim.x * blockIdx.x + threadIdx.x;
z[n] = x[n] + y[n];

Here n is a register variable, private to the thread.

Local Memory

  • Used when registers overflow or when array indexing cannot be resolved at compile time (see the sketch below).
  • Despite the name, local memory is stored in global memory, so it has high latency.
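
A minimal sketch of a pattern that typically spills to local memory, because the array index depends on a runtime value:

__global__ void spill(const int* idx, float* out, int n) {
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    float tmp[32];                        // per-thread array
    for (int k = 0; k < 32; ++k) tmp[k] = (float)k;
    if (i < n)
        out[i] = tmp[idx[i] & 31];        // index unknown at compile time, so tmp
                                          // usually ends up in local memory
}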

[Figure: local memory vs. register memory]

Shared Memory

  • On-chip memory accessible by all threads in a block.
  • Much faster than global memory, second only to registers.
  • Each block has its own shared memory space.
  • Ideal for data sharing within a block, as in the reduction example below.
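
A minimal block-sum reduction sketch: each block loads its slice into shared memory, then reduces it cooperatively (the kernel assumes it is launched with 128 threads per block):

__global__ void block_sum(const float* x, float* partial, int n) {
    __shared__ float cache[128];             // one copy per block, on-chip
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    cache[threadIdx.x] = (i < n) ? x[i] : 0.0f;
    __syncthreads();                         // wait until every thread has written

    // tree reduction inside the block
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = cache[0];   // one result per block
}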

L1 and L2 Cache

  • Introduced with the Fermi architecture.
  • L1 cache: per Streaming Multiprocessor (SM).
  • L2 cache: shared across the device.
  • Helps reduce latency for global and local memory access.

Streaming Multiprocessors (SMs)

Each GPU consists of multiple SMs, each containing:

  • Registers and shared memory
  • Constant and texture caches
  • L1 cache
  • Warp schedulers for instruction execution
  • Execution cores (INT32, FP32, FP64, SFUs, Tensor cores)

SM Occupancy

Occupancy is the ratio of active warps on an SM to the maximum number of warps the SM can hold.

  • High occupancy can improve performance but is not always required.
  • A common target is at least 25% occupancy; the sketch below shows how to query a kernel's theoretical occupancy.
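
The runtime can estimate a kernel's theoretical occupancy at a given block size; a minimal sketch with an illustrative dummy kernel:

#include <stdio.h>

__global__ void dummy(float* x) { x[threadIdx.x] *= 2.0f; }

int main(void) {
    const int block_size = 128;
    int max_blocks = 0;
    // how many blocks of this kernel can be resident on one SM at once
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks, dummy, block_size, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    const float occ = (float)(max_blocks * block_size) / prop.maxThreadsPerMultiProcessor;
    printf("Theoretical occupancy: %.0f%%\n", occ * 100.0f);
    return 0;
}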

Querying Device Properties with CUDA Runtime API

Example program to check GPU properties:

#include "stdio.h"

#define CHECK(call)                                   \
do {                                                  \
    const cudaError_t error_code = call;              \
    if (error_code != cudaSuccess) {                  \
        printf("CUDA Error: %s\n",                    \
            cudaGetErrorString(error_code));          \
        exit(1);                                      \
    }                                                 \
} while (0)

int main(int argc, char* argv[]) {
    int device_id = 0;
    if (argc > 1) device_id = atoi(argv[1]);
    CHECK(cudaSetDevice(device_id));

    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, device_id));

    printf("Device name: %s\n", prop.name);
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("Global memory: %g GB\n", prop.totalGlobalMem / (1024.0 * 1024 * 1024));
    printf("Shared memory per block: %g KB\n", prop.sharedMemPerBlock / 1024.0);
    printf("Registers per SM: %d K\n", prop.regsPerMultiprocessor / 1024);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}

[Figure: sample device-properties output]
