## Multiprocessor Systems



COMP3231 04s1

#### Multiprocessor System

- · We will look at shared-memory multiprocessors
  - More than one processor sharing the same memory
- · A single CPU can only go so fast
  - Use more than one CPU to improve performance
  - Assumes
    - · Workload can be parallelised
    - Workload is not I/O-bound or memory-bound
- · Disks and other hardware can be expensive
  - Can share hardware between CPUs



COMP3231 04s1

2

#### Types of Multiprocessors (MPs)

- UMA MP
  - Uniform Memory Access
    - Access to all memory occurs at the same speed for all processors.
- NUMA MP
  - Non-uniform memory access
    - Access to some parts of memory is faster for some processors than other parts of memory
- · We will focus on UMA



COMP3231 04s1

# Bus Based UMA Simplest MP is more than one processor on a single bus connect to memory (a) - Bus bandwidth becomes a bottleneck with more than just a few CPUs



#### **Bus Based UMA**

- Each processor has a cache to reduce its need for access to memory (b)
  - Hope is most accesses are to the local cache
  - Bus bandwidth still becomes a bottleneck with many CPUs



#### **Cache Consistency**

 What happens if one CPU writes to address 0x1234 (and it is stored in its cache) and another CPU reads from the same address (and gets what is in its cache)?



### **Cache Consistency** · Cache consistency is usually handled by the hardware. - Writes to one cache propagate to, or invalidate appropriate entries on other caches - Cache transactions also consume bus bandwidth



#### **Bus Based UMA**

- · With only a single shared bus, scalability is limited by the bus bandwidth of the single
  - Caching only helps so much
- Alternative bus architectures do exist on high-end machines





#### **UMA Crossbar Switch** • Pro - Any CPU can access any available memory without blocking - Number of switches required scales with $n^2$ . • 1000 CPUs need 1000000 switches COMP3231 04s1 THE UNIVERSITY OF NEW SOUTH WALES

#### **Summary** · Multiprocessors can Increase computation power beyond that available from a single CPU - Share resources such as disk and memory - Shared buses (bus bandwidth) limit scalability • Can be reduced via hardware design Can be reduced by carefully crafted software behaviour - Good cache locality together with private data where possible - How do we construct an OS for a multiprocessor? · What are the issues? COMP3231 04s1

However

Question

THE UNIVERSITY OF NEW SOUTH WALES















#### Symmetric Multiprocessors (SMP) · OS kernel run on all processors Load and resource are balance between all processors · Including kernel execution · Issue: Real concurrency in the kernel - Need carefully applied synchronisation primitives to avoid disaster CPU 3 CPU 4 CPU 1 CPU 2 I/O Runs Runs Runs users and shared OS hared OS shared OS shared OS











#### Real life Scalability Example

• Students + assignment deadline = machine unusable



COMP3231 04s1

25

#### Real life Scalability Example

- To fix the problem, the tenderer supplied more CPUs to improve performance (number increased to 8)
  - No change????
- · Eventually, machine was replaced with
  - Three 2-CPU pizza-box-sized machines, each with 256M RAM
  - Cheaper overall
  - Performance was dramatically improved!!!!!
  - Why?



COMP3231 04s1

26

#### Real life Scalability Example

- · Paper:
  - Ramesh Balan and Kurt Gollhardt, "A Scalable Implementation of Virtual Memory HAT Layer for Shared Memory Multiprocessor Machines", Proc. 1992 Summer USENIX conference
- The 4-8 CPU machine hit a bottleneck in the single threaded VM code
  - Adding more CPUs simply added them to the wait queue for the VM locks
- · The 2 CPU machines did not generate that much lock contention and performed proportionally better.



COMP3231 04s1

27

#### Lesson Learned

- · Building scalable multiprocessor kernels is
- · Lock contention can limit overall system performance



THE UNIVERSITY OF NEW SOUTH WALES

COMP3231 04s1

#### **Multiprocessor Synchronisation**

- · Given we need synchronisation, how can we achieve it on a multiprocessor machine?
  - Unlike a uniprocessor, disabling interrupts does not work.
    - It does not prevent other CPUs from running in parallel
  - Need special hardware support



COMP3231 04s1

29

#### **Recall Mutual Exclusion** with Test-and-Set

enter\_region: TSL REGISTER,LOCK CMP REGISTER,#0

copy lock to register and set lock to 1 was lock zero? if it was non zero, lock was set, so loop

JNE enter\_region | if it |
RET | return to caller; critical region entered

leave\_region: MOVE LOCK,#0 RET | return to caller

store a 0 in lock

Entering and leaving a critical region using the TSL instruction

THE UNIVERSITY OF NEW SOUTH WALES



31

COMP3231 04s1

THE UNIVERSITY OF NEW SOUTH WALES











#### **Reducing Bus Contention** If lock fails, use exponential back-off algorithm start: - Wait a longer and longer time r = TSL(lock)before trying again · 2 instructions, 4, 8, 16 etc $if (r == 1) {$ - Up to a maximum exp\_wait(); - High maximum gives lower bus contention goto start; - Low maximum gives more responsive locks Cache Cache THE UNIVER: NEW SOUTH



#### Other Hardware Provided SMP Synchronisation Primitives

- · Atomic Add/Subtract
  - Can be used to implement counting semaphores
- Exchange
- · Compare and Exchange
- · Load linked; Store conditionally
  - Two separate instructions
    - Load value using load linked
    - · Modify, and store using store conditionally
    - If value changed by another processor, or an interrupt occurred, then store conditionally failed
  - Can be used to implement all of the above primitives
  - Implemented without bus locking



COMP3231 04s1

39

41

#### Spinning versus Switching

- Remember spinning (busy-waiting) on a lock made little sense on a uniprocessor
  - The was no other running process to release the lock
  - Blocking and (eventually) switching to the lock holder is the only option.
- On SMP systems, the decision to spin or block is not as clear.
  - The lock is held by another running CPU and will be freed without necessarily blocking the requestor



COMP3231 04s1

#### Spinning versus Switching

- Blocking and switching
  - · to another process takes time
    - Save context and restore another
    - Cache contains current process not new
      - » Adjusting the cache working set also takes time
    - TLB is similar to cache
  - Switching back when the lock is free encounters the same again
- Spinning wastes CPU time directly
- · Trade off
  - If lock is held for less time than the overhead of switching to and back
  - ⇒ It's more efficient to spin



COMP3231 04s1

Spinning versus Switching

- · The general approaches taken are
  - Spin forever
  - Spin for some period of time, if the lock is not acquired, block and switch
    - The spin time can be
      - Fixed (related to the switch overhead)
      - Dynamic
        - » Based on previous observations of the lock acquisition time



COMP3231 04s1

ls1 42

#### **Preemption and Spinlocks**

- Critical sections synchronised via spinlocks are expected to be short
  - Avoid other CPUs wasting cycles spinning
- What happens if the spinlock holder is preempted at end of holder's timeslice
  - Mutual exclusion is still guaranteed
  - Other CPUs will spin until the holder is scheduled again!!!!!
- ⇒ Spinlock implementations disable interrupts in addition to acquiring locks to avoid lock-holder preemption



COMP3231 04s1

43

#### Multiprocessor Scheduling

- Given X processes (or threads) and Y CPUs.
  - how do we allocate them to the CPUs



COMP3231 04s1

1 04s1

#### A Single Shared Ready Queue

 When a CPU goes idle, it take the highest priority process from the shared ready queue



#### Single Shared Ready Queue

- Pros
  - Simple
  - Automatic load balancing
- Cons
  - Lock contention on the ready queue can be a major bottleneck
    - Due to frequent scheduling or many CPUs or both
  - Not all CPUs are equal
    - The last CPU a process ran on is likely to have more related entries in the cache.



COMP3231 04s1

**Affinity Scheduling** 

- · Basic Idea
  - Try hard to run a process on the CPU it ran on last time
- · One approach: Two-level scheduling



COMP3231 04s1

47

#### Two-level Scheduling

- · Each CPU has its own ready queue
- Top-level algorithm assigns process to CPUs

   Defines their affinity, and roughly balances the load
- · The bottom-level scheduler:
  - Is the frequently invoked scheduler (e.g. on blocking on I/O, a lock, or exhausting a timeslice)
  - Runs on each CPU and selects from its own ready queue
    - Ensures affinity
  - If nothing is available from the local ready queue, it runs a process from another CPUs ready queue rather than go idle



COMP3231 04s1

48

#### Two-level Scheduling

- Pros
  - No lock contention on per-CPU ready queues in the (hopefully) common case
  - Load balancing to avoid idle queues
  - Automatic affinity to a single CPU for more cache friendly behaviour



COMP3231 04s1

49

51



#### **Gang Scheduling**

- · The approach
  - Groups of related threads are scheduled together as a gang
  - All members of the gang run simultaneously, on different CPUs
  - All gang members start and end their time slice together



COMP3231 04s1

**Gang Scheduling** CPU 0 2 3 5 A₁  $A_{2}$  $A_4$ A<sub>5</sub> A<sub>3</sub> B В, B  $C_0$ C, C, D<sub>1</sub>  $D_2$  $D_3$  $D_4$ Do  $E_0$ E, E, E<sub>5</sub> E  $E_3$  $E_4$ Time 3 slot A<sub>o</sub> A, A,  $A_3$  $A_4$ A<sub>5</sub> Bo B<sub>1</sub> В Co C, C<sub>2</sub>  $D_0$ D<sub>1</sub>  $D_2$  $D_3$  $D_4$  $E_0$ 6 E, E<sub>2</sub>  $E_3$  $E_4$ E<sub>5</sub> E<sub>6</sub>