Memory
Honestly, the diagrams that I wish to reproduce already exist here. This page is currently under construction and probably will be until I finish my doctorate.
“Memory is the mother of all wisdom.” — Aeschylus
Babbage's Big Brain
Memory as a Hierarchy — Not a Monolith
Hierarchy exists for two intertwined reasons:
- Physics – Smaller structures are faster and nearer to ALUs but hold less data; larger structures store more but are farther away and thus slower.
- Economics – Fast memory costs disproportionately more per byte.
An efficient system arranges multiple layers so that the majority of accesses hit the small, fast part, while the bulk of bytes reside in the large, cheap part.
The Register File – Your Fastest Scratchpad
Modern x86-64 cores provide 16 general-purpose architectural registers (`%rax`, `%rbx`, …, `%r15`), 16–32 vector registers, and special-purpose ones (the instruction pointer `%rip`, the stack pointer `%rsp`, flags, etc.). Access is effectively single-cycle, and superscalar cores rename these architectural names onto a much larger physical register file to squeeze out more parallelism.
Practical take-away: keep live data small and register-resident; compilers often need aliasing help via `restrict` and `const` (the old `register` keyword is little more than a suggestion nowadays), as in the sketch below.
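A minimal sketch of that point (the function and names are illustrative): with `restrict` promising the compiler that `out` and `a` never alias, the running sum can live in a register for the whole loop instead of being stored and reloaded through memory on every iteration.

```c
#include <stddef.h>

/* Without restrict the compiler must assume *out may alias a[], forcing a
 * store/reload per iteration; with it, `acc` can stay in one register. */
double sum_into(double *restrict out, const double *restrict a, size_t n) {
    double acc = 0.0;                 /* likely kept in a single XMM register */
    for (size_t i = 0; i < n; ++i)
        acc += a[i];
    *out = acc;                       /* single store at the end */
    return acc;
}
```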
Caches: The Illusion of Speed
Locality (temporal & spatial) is the hinge on which caches swing.
Level | Typical Size | Line / Block | Hit Latency (≈ cycles) | Miss Goes To |
---|---|---|---|---|
L1d | 32 – 64 KiB | 64 B | 3–5 | L2 |
L2 | 256 KiB–2 MiB | 64 B | 10–15 | L3 |
L3 | 4 – 64 MiB (shared) | 64 B | 30–50 | DRAM |
Policies
- Write-back keeps writes local until eviction
- Replacement often PLRU or variants, not true LRU
- Coherence (MESI, MOESI) keeps multi-core views in sync
Compiler flags like `-O3 -march=native` align loops, unroll, vectorise, and prefetch to exploit these caches.
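To make the locality point concrete, here is a toy sketch (array size and element type are arbitrary): traversing a row-major C array in i-then-j order streams through memory one cache line at a time, whereas swapping the loops strides a full row per access and misses far more often.

```c
#include <stddef.h>

#define N 1024   /* arbitrary size for the sketch */

/* Row-major layout: a[i][j] and a[i][j+1] sit in the same 64 B cache line,
 * so the i-then-j order below is sequential and prefetch-friendly.
 * Swapping the loops would jump N * sizeof(double) bytes per access. */
double sum_rows(const double a[N][N]) {
    double acc = 0.0;
    for (size_t i = 0; i < N; ++i)
        for (size_t j = 0; j < N; ++j)
            acc += a[i][j];
    return acc;
}
```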
Main Memory – DRAM, Rows and Banks
Dynamic RAM stores bits as charge in capacitors that must be refreshed roughly every 64 ms. It is organised into channels → DIMMs → ranks → banks → rows → columns. Parallelism across banks hides activate/precharge delays.
💡 Row buffer hits are the DRAM analogue of a cache hit.
Virtual Memory and the Page Table
Every process uses a flat 64-bit address space, translated by hardware page table walkers into physical frames.
- Typical page = 4 KiB; huge pages = 2 MiB or 1 GiB
- TLB (Translation Lookaside Buffer) caches recent mappings – a miss costs ≳ 30 cycles even before any DRAM latency.
`mmap`, `malloc`, stacks and shared libraries are simply differently-protected regions in that same virtual space.
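As a rough, Linux-specific illustration (flags per `mmap(2)`), this is essentially what `malloc` does under the hood for large allocations: ask the kernel for an anonymous virtual region, which is only backed by physical frames on first touch, one page fault at a time. Adding `MAP_HUGETLB` (with huge pages pre-reserved by the administrator) would back the same region with 2 MiB pages and cut TLB pressure.

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 4u << 20;   /* 4 MiB of virtual address space */

    /* Anonymous, private mapping: pure virtual memory, no file behind it. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    memset(p, 0, len);   /* first touch: page faults pull in physical frames */
    munmap(p, len);      /* returns the region to the kernel */
    return 0;
}
```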
Stack vs Heap – Two Growth Patterns
Property | Stack | Heap |
---|---|---|
Lifespan | Automatic (scope-bound) | Manual / GC |
Growth direction | Down in x86-64 | Up (allocator-dependent) |
Allocation cost | 1 instruction (`sub rsp`) | `malloc` – usually `O(1)` average, but slower |
Typical use-case | Local variables, return PCs | Dynamic data structures |
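A toy contrast of the two lifetimes (the size of 64 ints is arbitrary):

```c
#include <stdlib.h>

/* Same 64 ints, two lifetimes: the stack array vanishes when the function
 * returns, the heap block lives until free() -- or leaks if we forget it. */
void stack_vs_heap(void) {
    int local[64];                        /* stack: part of this call frame  */
    int *dyn = malloc(64 * sizeof *dyn);  /* heap: allocator bookkeeping     */
    if (!dyn) return;

    local[0] = 1;
    dyn[0]   = local[0];                  /* both are used the same way      */

    free(dyn);                            /* manual end of the heap lifetime */
}   /* `local` is reclaimed here just by moving %rsp back */
```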
Anatomy of a Function Call
The stack pointer is typically adjusted once in the prologue and once in the epilogue; the `call` itself pushes only the 8-byte return address. High-performance code keeps frames shallow and small to stay inside L1.
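For a leaf function like the one below, an optimising compiler targeting the x86-64 System V ABI typically needs no frame at all; the commented assembly is representative output under those assumptions, not guaranteed, and varies by compiler and flags.

```c
/* A leaf function: the argument arrives in %edi, the result leaves in %eax,
 * and %rsp never has to move beyond the 8 B return address pushed by `call`.
 * Deeper call chains add a return address plus spilled locals per frame, so
 * shallow, small frames are what keep the active stack inside L1.
 *
 *   square:              ; representative -O2 output (compiler-dependent)
 *       imul  edi, edi
 *       mov   eax, edi
 *       ret
 */
int square(int x) {
    return x * x;
}
```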
CPU vs GPU Memory — Different Beasts
Where CPUs favour latency, GPUs are built for throughput.
Aspect | CPU | GPU |
---|---|---|
Core count | 8 – 64 “fat” cores | Hundreds–thousands of “thin” cores |
On-chip L1 per core / SM | 32–64 KiB (data) | 64–128 KiB, split between shared memory and L1 cache |
Global memory | DDR4/5, ~50–100 GB/s | GDDR6/HBM ≈ 0.5–1 TB/s |
Access granularity | 64-byte cache lines | 32–128 B transactions per 32-thread warp; coalescing crucial |
Cache coherence | Hardware-coherent within a socket | Largely software-managed (fences, `__syncthreads`-style barriers) |
Host interaction | Unified virtual memory (recent) | PCIe/NVLink latency; explicit copies still common |
Key GPU concepts
- Warp divergence – threads in a warp that take different branches stall the others.
- Shared memory – a programmer-managed scratchpad; think “user-controlled L1”.
- Memory coalescing – adjacent threads should access adjacent addresses (see the sketch below).
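Since there are no kernels on this page, here is a plain-C stand-in (a warp width of 32 and 4-byte elements are assumptions) that just prints the byte offsets 32 lanes would touch: coalesced indexing covers one contiguous 128 B span, while a strided pattern scatters the warp across dozens of separate memory segments.

```c
#include <stdio.h>

#define WARP 32   /* lanes per warp -- NVIDIA's current width, assumed here */

int main(void) {
    const long elem   = 4;    /* bytes per float, assumed                   */
    const long stride = 33;   /* hypothetical bad stride, e.g. a padded row */

    for (long lane = 0; lane < WARP; ++lane) {
        long coalesced = lane * elem;            /* data[lane]              */
        long strided   = lane * stride * elem;   /* data[lane * stride]     */
        printf("lane %2ld: coalesced +%4ld B   strided +%5ld B\n",
               lane, coalesced, strided);
    }
    return 0;
}
```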