Introduction to Memory Hierarchy #
Modern computer architectures use a memory hierarchy, where different types of memory vary in capacity and latency (the time a processor waits for data). In general, faster memory sits closer to the processor and has limited capacity, while larger memory is farther away and slower.
Types of CUDA Memory #
Global Memory #
- Accessible by all threads in a kernel, but not the same as C++ global variables.
- Commonly used for passing data between host and device or between devices.
- Resides off-chip, which means high latency and lower performance compared to on-chip memory.
- Allocated with cudaMalloc and freed with cudaFree; a minimal round trip is sketched below.
- Lifetime is determined by host-side memory management, not by the kernel.
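A minimal sketch of that round trip (the buffer size and the names h_x/d_x are illustrative, not from the text above):

#include <stdlib.h>
#include <cuda_runtime.h>

int main(void) {
    const int N = 1024;
    const size_t bytes = N * sizeof(float);

    float* h_x = (float*)malloc(bytes);                   // host buffer
    for (int i = 0; i < N; ++i) h_x[i] = 1.0f;

    float* d_x = NULL;
    cudaMalloc(&d_x, bytes);                              // allocate global memory on the device
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);  // host -> device
    // ... launch kernels that read and write d_x ...
    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);  // device -> host
    cudaFree(d_x);                                        // lifetime ends here, independent of any kernel
    free(h_x);
    return 0;
}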
Static Global Memory #
Defined outside any function using __device__:
__device__ T x; // single variable
__device__ T y[N]; // fixed-size array
- Visible to all threads, but cannot be accessed directly by the host.
- Host ↔ device transfers use cudaMemcpyToSymbol and cudaMemcpyFromSymbol, as the sketch below shows.
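A minimal sketch, assuming a hypothetical static variable d_scale; the host reaches it through its symbol, not through a pointer:

#include <cuda_runtime.h>

__device__ float d_scale;   // visible to all device threads, not directly to the host

int main(void) {
    float h_scale = 2.0f;
    cudaMemcpyToSymbol(d_scale, &h_scale, sizeof(float));   // host -> device symbol
    float check = 0.0f;
    cudaMemcpyFromSymbol(&check, d_scale, sizeof(float));   // device symbol -> host
    return 0;
}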
Constant Memory #
- A read-only global memory space with a total size of 64 KB.
- Cached, so access is much faster than global memory if threads in a warp read the same value.
- Declared with __constant__ and initialized using cudaMemcpyToSymbol.
- Useful for storing parameters or lookup tables shared across threads (see the example after this list).
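A small sketch of the pattern; the coefficient array and its values are made up for illustration:

__constant__ float c_coeffs[4];   // lives in the 64 KB constant space

__global__ void scale(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Every thread in a warp reads the same address, so the constant
    // cache can serve the value as a single broadcast.
    if (i < n) x[i] *= c_coeffs[0];
}

void set_coeffs(void) {
    float h_coeffs[4] = {2.0f, 0.0f, 0.0f, 0.0f};           // illustrative values
    cudaMemcpyToSymbol(c_coeffs, h_coeffs, sizeof(h_coeffs));
}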
Texture and Surface Memory #
- Also cached global memory.
- Texture memory is read-only from kernels; surface memory additionally supports writes.
- Often used for graphics, image processing, or irregular memory access patterns; a texture-object sketch follows.
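One way to use it is a texture object bound to an existing device buffer. The following sketch assumes a float array d_x of n elements already allocated with cudaMalloc; names and launch configuration are illustrative:

__global__ void read_through_tex(cudaTextureObject_t tex, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tex1Dfetch<float>(tex, i);   // cached, read-only fetch
}

void launch_via_texture(float* d_x, float* d_out, int n) {
    // Describe the existing linear buffer as a texture resource.
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeLinear;
    res.res.linear.devPtr = d_x;
    res.res.linear.desc = cudaCreateChannelDesc<float>();
    res.res.linear.sizeInBytes = n * sizeof(float);

    cudaTextureDesc td = {};
    td.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, NULL);
    read_through_tex<<<(n + 255) / 256, 256>>>(tex, d_out, n);
    cudaDestroyTextureObject(tex);
}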
Registers #
- Default storage for thread-local variables.
- Fastest memory in CUDA, but limited in number.
- Each thread has its own set of registers.
- Example:
__global__ void add(const float* x, const float* y, float* z) {
    const int n = blockDim.x * blockIdx.x + threadIdx.x;
    z[n] = x[n] + y[n];
}
Here n is a register variable, private to each thread.
Local Memory #
- Used when registers overflow or when array indexing cannot be resolved at compile time.
- Despite the name, local memory resides in off-chip global memory, so it has high latency (see the sketch below).
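A sketch of the second case; whether the array is actually demoted to local memory depends on the compiler and optimization level:

__global__ void local_demo(const int* idx, float* out) {
    // A per-thread array indexed with a value unknown at compile time
    // usually cannot be kept in registers, so the compiler places it
    // in local memory.
    float buf[32];
    for (int i = 0; i < 32; ++i) buf[i] = (float)i;
    out[threadIdx.x] = buf[idx[threadIdx.x] % 32];   // runtime index forces local memory
}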
Shared Memory #
- On-chip memory accessible by all threads in a block.
- Much faster than global memory, second only to registers.
- Each block has its own shared memory space.
- Ideal for data sharing within a block, as in the example below.
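A classic illustration, assuming blocks of exactly 256 threads, reverses each block's slice of an array through shared memory:

__global__ void reverse_block(float* x) {
    __shared__ float tile[256];                 // one copy per block, on-chip
    const int t = threadIdx.x;
    const int base = blockIdx.x * blockDim.x;
    tile[t] = x[base + t];
    __syncthreads();                            // wait until every thread has written
    x[base + t] = tile[blockDim.x - 1 - t];     // read a value written by another thread
}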
L1 and L2 Cache #
- Introduced with the Fermi architecture.
- L1 cache: per Streaming Multiprocessor (SM).
- L2 cache: shared across the device.
- Helps reduce latency for global and local memory access.
Streaming Multiprocessors (SMs) #
Each GPU consists of multiple SMs, each containing:
- Registers and shared memory
- Constant and texture caches
- L1 cache
- Warp schedulers for instruction execution
- Execution cores (INT32, FP32, FP64, SFUs, Tensor cores)
SM Occupancy #
Occupancy is the ratio of active warps (equivalently, threads) on an SM to the maximum the SM supports.
- High occupancy can improve performance but is not always required.
- A common target is at least 25% occupancy; the occupancy achievable by a concrete kernel can be estimated at runtime, as sketched below.
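A sketch of that estimate using the occupancy API; the dummy kernel and block size are illustrative:

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void dummy(float* x) { x[threadIdx.x] += 1.0f; }

int main(void) {
    const int block_size = 256;
    int max_blocks = 0;
    // How many blocks of this kernel fit on one SM at this block size?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks, dummy, block_size, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    double occupancy = (double)(max_blocks * block_size) / prop.maxThreadsPerMultiProcessor;
    printf("Estimated occupancy: %.0f%%\n", occupancy * 100.0);
    return 0;
}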
Querying Device Properties with CUDA Runtime API #
Example program to check GPU properties:
#include "stdio.h"
#define CHECK(call) \
do { \
const cudaError_t error_code = call; \
if (error_code != cudaSuccess) { \
printf("CUDA Error: %s\n", \
cudaGetErrorString(error_code)); \
exit(1); \
} \
} while (0)
int main(int argc, char* argv[]) {
int device_id = 0;
if (argc > 1) device_id = atoi(argv[1]);
CHECK(cudaSetDevice(device_id));
cudaDeviceProp prop;
CHECK(cudaGetDeviceProperties(&prop, device_id));
printf("Device name: %s\n", prop.name);
printf("Compute capability: %d.%d\n", prop.major, prop.minor);
printf("Global memory: %g GB\n", prop.totalGlobalMem / (1024.0 * 1024 * 1024));
printf("Shared memory per block: %g KB\n", prop.sharedMemPerBlock / 1024.0);
printf("Registers per SM: %d K\n", prop.regsPerMultiprocessor / 1024);
printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
return 0;
}
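Assuming the file is saved as query_device.cu (the name is arbitrary), it can be compiled with nvcc and run with an optional device index:

nvcc query_device.cu -o query_device
./query_device 0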