Multiprocessor Systems
Computer architects maintained performance growth consistent with Moore’s Law through the early 2000s by developing single processors with progressively more sophisticated instruction-level parallelism methods and higher clock frequencies. However, further increases in CPU clock speeds beyond this point would have resulted in prohibitively high power consumption. This constraint ushered in the modern age of multicore and multithreaded processor designs, which depend on programmers explicitly writing parallel code to accelerate individual program execution. A multiprocessor system consists of multiple processors working together. The primary goal of multiprocessor architecture is to enhance overall system execution speed. These systems are categorized based on two key factors: the mechanism processors use to access memory, and whether the system uses identical processors or a mix of different processor types.
Shared Memory Multiprocessor Systems
Shared memory multiprocessor systems are parallel computing architectures where multiple processors share access to a common physical memory space. In these systems, all processors can directly read from and write to the same memory locations, enabling communication and data sharing through shared variables rather than explicit message passing. Because each processor has its own separate, private cache, these systems must introduce cache coherency mechanisms to ensure all processors see a consistent view of memory (i.e., they all eventually see the same, correct values). Shared memory multiprocessor systems come in two varieties: UMA (Uniform Memory Access) systems and NUMA (Non-Uniform Memory Access) systems.
Uniform Memory-Access (UMA) System
In a UMA system, all processors have equal access time to all memory. UMA systems are typically implemented as Symmetric Multiprocessing (SMP), which refers to an architecture with multiple identical processors running under a single operating system with shared access to centralized main memory. Each processor can execute different programs and operate on different data while sharing common resources—such as memory, I/O devices, and the interrupt system—through an interconnecting system bus. Each processor also maintains its own local cache memory that serves as a fast intermediary between the processor and main memory.
The critical challenge in SMP systems is maintaining cache coherency: ensuring all processors see a consistent view of memory despite each having its own private cache. When one processor modifies data, other processors may have stale copies in their caches, creating inconsistency. SMP systems solve this through bus snooping, where each processor monitors a shared bus for cache events from other processors and updates its cache accordingly.
Cache coherency is managed through hardware protocols, with MESI being the most common (other protocols include MOESI and MESIF). MESI stands for the four states a cache line can be in: Modified, Exclusive, Shared, or Invalid.
When a processor reads data not in any other cache, it loads the data and marks it “Exclusive”. If another cache already has it, both mark it as “Shared”—a read-only state that allows multiple processors to safely read from their local caches without coordination. When a processor writes to a shared cache line, it must first send an invalidate message on the bus to all other caches, which mark their copies Invalid while the writing cache marks its copy Modified. Modified data is written back to main memory either immediately (write-through) or when evicted from the cache (write-back), depending on the cache write policy.
However, cache coherency mechanisms introduce two notable performance problems. False sharing occurs when different variables that happen to reside on the same cache line are modified by different processors. Since cache coherency operates at cache line granularity, modifications to logically independent variables trigger unnecessary invalidations across processors. For example, if Thread A frequently updates counter_a and Thread B frequently updates counter_b, but both counters are adjacent in memory and share the same cache line, each write causes the cache line to bounce between processors even though the threads aren’t actually sharing data. This ping-pong effect can cause severe performance degradation in multithreaded code. False sharing is solved through careful memory layout: padding structures to ensure frequently-modified variables occupy separate cache lines, or using alignment directives to force cache line boundaries.
Write contention occurs when multiple processors genuinely write to the same or nearby shared data, triggering constant invalidation messages as cache coherency protocols maintain consistency. Each write to shared data requires broadcasting invalidations to all other caches holding that line, marking their copies Invalid while the writing processor marks its copy Modified. When multiple processors compete to write, the cache line repeatedly bounces between processors’ caches, creating significant overhead from the constant coherency traffic. Unlike false sharing (where the contention is artificial), write contention reflects a fundamental limitation: cache coherency protocols become increasingly expensive as the number of processors writing to shared data increases. Minimizing write contention requires algorithmic approaches such as using per-thread local data that’s only merged periodically, employing lock-free data structures that reduce synchronization points, or redesigning algorithms to reduce the frequency of shared writes.

Non-Uniform Memory-Access (NUMA) System
SMP systems with bus snooping scale effectively up to around 8-16 processors, but beyond this point they encounter fundamental limitations. Excessive bus traffic from constant snooping and invalidation messages saturates the shared bus as processor count increases. Physical constraints compound the problem: at high clock speeds, wire lengths and signal propagation delays (ultimately bounded by the speed of light) become significant bottlenecks. Additionally, a single shared memory controller cannot provide sufficient bandwidth for many processors simultaneously accessing memory.
NUMA architectures address these scalability limitations by fundamentally changing the memory organization. Instead of a single shared memory accessible uniformly by all processors, NUMA systems physically distribute memory across multiple nodes, with each node containing one or more processors and its own local memory. Processors can still access any memory location in the system (maintaining a shared address space), but access times are now non-uniform: accessing local memory attached to the same node is fast, while accessing remote memory on another node requires traversing inter-node interconnects and is significantly slower. This distributed organization eliminates the single shared bus bottleneck and allows NUMA systems to scale to hundreds or thousands of processors.
NUMA systems maintain cache coherency through directory-based protocols rather than bus snooping. Each NUMA node contains directory hardware, typically integrated with or located near the memory controller, that tracks the state of memory blocks physically residing in that node. For each cache line, the directory maintains a valid bit per processor (indicating whether that processor currently has the line cached) and a dirty bit (indicating whether any processor has modified the data). These distributed directories communicate with each other over the interconnect network to coordinate cache coherency across all nodes in the system.
When a processor requests data, the request is routed to the home directory of the node where that memory physically resides. The directory checks its metadata: if the data is clean (i.e., the dirty bit is not set), it provides the data directly and sets the valid bit for the requesting processor. If the data is dirty, the directory retrieves the updated value from the processor currently holding it, writes it back to main memory, and then provides it to the requester. When a processor writes to a cache line, the directory sends invalidate messages only to processors with valid bits set—those that actually have the line cached. This targeted invalidation contrasts with SMP systems, where processors broadcast invalidate messages to all processors on the shared bus, requiring every processor to check if the message is relevant to its own cache. The directory-based approach eliminates broadcast traffic, enabling NUMA systems to scale to hundreds or thousands of processors. For very large systems, hierarchical directory schemes employ multiple directory levels communicating over general-purpose interconnect networks rather than shared CPU buses.
For systems programmers, NUMA awareness matters primarily for performance optimization. Remote memory accesses can be 2-3x slower than local accesses, making memory locality critical. Tools like numactl allow binding processes and memory allocations to specific nodes, maximizing fast local memory accesses while minimizing expensive cross-node traffic.

Distributed Memory Multiprocessor Systems
Distributed memory multiprocessor systems are architectures in which each processor has its own private memory and there is no shared address space. Processors cannot directly read or write remote memory, so communication requires explicit message passing (e.g., MPI, gRPC).
This architecture scales to thousands or millions of nodes, making it well-suited for embarrassingly parallel workloads with high computation-to-communication ratios. However, network communication sits far below RAM in the memory hierarchy, making it poorly suited for tightly coupled algorithms requiring frequent data exchange.