Note for "Operating Systems: Three Easy Pieces", Part 2: Concurrency

[TOC]

These are study notes for the concurrency part of Operating Systems: Three Easy Pieces by Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-Dusseau.

Chapter 26 Concurrency: An Introduction

Chapter overview

  • Introduces threads and how they differ from processes.
  • Covers the factors that cause concurrency problems (shared data, uncontrolled scheduling, waiting for another) and defines the relevant terms.
  • Explains why concurrency is studied in an OS course.

26.0 Intro

  • How to think about a thread

    • Each thread is very much like a separate process, except for one difference: they share the same address space and thus can access the same data.
    • If there are two threads that are running on a single processor, when switching from running one (T1) to running the other (T2), a context switch must take place.
  • Differences between threads and processes

    • Instead of a single process control block (PCB), we now need one or more thread control blocks (TCBs) to store the state of each thread of a process.

    • One other major difference between threads and processes concerns the stack.

      • any stack-allocated variables, parameters, return values, and other things that we put on the stack will be placed in what is sometimes called thread-local storage.
      • stacks do not generally have to be very large (the exception being in programs that make heavy use of recursion).

26.1 An Example: Thread Creation

C code

```c
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

#include "common.h"
#include "common_threads.h"

void *mythread(void *arg) {
    printf("%s\n", (char *) arg);
    return NULL;
}

int main(int argc, char *argv[]) {
    if (argc != 1) {
        fprintf(stderr, "usage: main\n");
        exit(1);
    }

    pthread_t p1, p2;
    printf("main: begin\n");
    Pthread_create(&p1, NULL, mythread, "A");
    Pthread_create(&p2, NULL, mythread, "B");
    // join waits for the threads to finish
    Pthread_join(p1, NULL);
    Pthread_join(p2, NULL);
    printf("main: end\n");
    return 0;
}
```

Three possible orderings:

  1. Thread Trace (1)
$$ \begin{array}{c|c|c} \text { main } & \text { Thread 1 } & \text { Thread2 } \\ \hline \text { starts running } & & \\ \text { prints “main: begin” } & & \\ \text { creates Thread 1 } & & \\ \text { creates Thread 2 } & & \\ \text { waits for T1 } & &\\ & \text { runs } &\\ & \text { prints “A” } &\\ & \text { returns }& \\ \text { waits for T2 } & & \\ &&\text { runs } \\ &&\text { prints “B” }\\ && \text { returns }\\ \text{prints “main: end”} && \end{array} $$
  2. Thread Trace (2)
$$ \begin{array}{c|c|c} \text { main } & \text { Thread 1 } & \text { Thread2 } \\ \hline \text { starts running } & & \\ \text { prints “main: begin” } & & \\ \text { creates Thread 1 } & & \\ & \text { runs } &\\ & \text { prints “A” } & \\ & \text { returns } & \\ \text { creates Thread 2 } & & \\ & & \text { runs } \\ & & \text { prints “B” } \\ & & \text { returns } \\ \text { waits for T1 } & & \\ \quad \textit{returns immediately; T1 is done} & & \\ \text { waits for T2 } & & \\ \quad \textit{returns immediately; T2 is done} & & \\ \text { prints “ main:end”} & & \\ \end{array} $$
  3. Thread Trace (3)
$$ \begin{array}{c|c|c} \text { main } & \text { Thread 1 } & \text { Thread2 } \\ \hline \text { starts running } & & \\ \text { prints “main: begin” } & & \\ \text { creates Thread 1 } & & \\ \text { creates Thread 2 } & & \\ & & \text { runs } \\ & & \text { prints “B” } \\ & & \text { returns } \\ \text { waits for T1 } & & \\ & \text { runs } &\\ & \text { prints “A” } & \\ & \text { returns } & \\ \text { waits for T2 } & & \\ \quad \textit{returns immediately; T2 is done} & & \\ \text { prints “ main:end”} & & \\ \end{array} $$

26.2 Why It Gets Worse: Shared Data

Two threads each wish to update a global shared variable:

```c
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

#include "common.h"
#include "common_threads.h"

int max;
volatile int counter = 0; // shared global variable

void *mythread(void *arg) {
    char *letter = arg;
    int i; // stack (private per thread)
    printf("%s: begin [addr of i: %p]\n", letter, &i);
    for (i = 0; i < max; i++) {
        counter = counter + 1; // shared: only one
    }
    printf("%s: done\n", letter);
    return NULL;
}

int main(int argc, char *argv[]) {
    if (argc != 2) {
        fprintf(stderr, "usage: main-first <loopcount>\n");
        exit(1);
    }
    max = atoi(argv[1]);

    pthread_t p1, p2;
    printf("main: begin [counter = %d] [%x]\n", counter,
           (unsigned int) &counter);
    Pthread_create(&p1, NULL, mythread, "A");
    Pthread_create(&p2, NULL, mythread, "B");
    // join waits for the threads to finish
    Pthread_join(p1, NULL);
    Pthread_join(p2, NULL);
    printf("main: done\n [counter: %d]\n [should: %d]\n",
           counter, max * 2);
    return 0;
}
```

Sample outcome:

```
prompt> ./main
main: begin (counter = 0)
A: begin
B: begin
A: done
B: done
main: done with both (counter = 19345221)
# expected: main: done with both (counter = 20000000)
```

26.3 The Heart of the Problem: Uncontrolled Scheduling

Some basic concepts (see the sketch after this list for the machine-level reason a line like `counter = counter + 1` races):

  • race condition: the results depend on the timing of the code's execution.
  • instead of a nice deterministic computation (which we are used to from computers), we call this result indeterminate, where it is not known what the output will be and it is indeed likely to be different across runs.
  • A critical section is a piece of code that accesses a shared variable (or more generally, a shared resource) and must not be concurrently executed by more than one thread.
  • mutual exclusion: This property guarantees that if one thread is executing within the critical section, the others will be prevented from doing so.
  • Virtually all of these terms, by the way, were coined by Edsger Dijkstra, who was a pioneer in the field and indeed won the Turing Award because of this and other work.
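To see why uncontrolled scheduling bites, here is a C-level rendering of the three machine steps hidden inside `counter = counter + 1` (a sketch; the exact instructions the compiler emits vary by platform):

```c
int counter = 0;

void increment(void) {
    int reg = counter; // step 1: load counter from memory into a register
    reg = reg + 1;     // step 2: increment the register
    counter = reg;     // step 3: store the register back to memory
}
```

If a timer interrupt lands after step 1 but before step 3 in one thread, another thread can load the same old value; both then store, and one increment is lost.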

26.4 The Wish For Atomicity

It could not be interrupted mid-instruction, because that is precisely the guarantee we receive from the hardware: when an interrupt occurs, either the instruction has not run at all, or it has run to completion; there is no in-between state.
what we will instead do is ask the hardware for a few useful instructions upon which we can build a general set of what we call synchronization primitives.

The hardware provides instructions upon which synchronization primitives can be built; for example, a mutual-exclusion lock.

THE CRUX:
HOW TO PROVIDE SUPPORT FOR SYNCHRONIZATION
What support do we need from the hardware in order to build useful synchronization primitives? What support do we need from the OS? How can we build these primitives correctly and efficiently? How can programs use them to get the desired results?

26.5 One More Problem: Waiting For Another

The problems to solve:

  • Support for synchronization primitives to support atomicity: one type of interaction occurs between threads, that of accessing shared variables and the need to support atomicity for critical sections.
  • Sleeping/waking interaction: there is another common interaction that arises, where one thread must wait for another to complete some action before it continues.

26.6 Summary: Why in OS Class?

The answer: The OS was the first concurrent program, and many techniques were created for use within the OS.

TIP: USE ATOMIC OPERATIONS

  • The idea behind making a series of actions atomic is simply expressed with the phrase “all or nothing”; it should either appear as if all of the actions you wish to group together occurred, or that none of them occurred, with no in-between state visible.
  • the grouping of many actions into a single atomic action is called a transaction.

Chapter 27 Interlude: Thread API

Chapter overview:

  • How to create and complete threads under the POSIX standard.
  • How to create and use locks and condition variables under POSIX.

27.0 Intro

CRUX: HOW TO CREATE AND CONTROL THREADS
What interfaces should the OS present for thread creation and control? How should these interfaces be designed to enable ease of use as well as utility?

27.1 Thread Creation (POSIX)

  • Interface:

    ```c
    #include <pthread.h>
    int pthread_create(pthread_t *thread,
                       const pthread_attr_t *attr,
                       void *(*start_routine)(void *),
                       void *arg);
    ```
  • Parameters

      1. thread: a pointer to a structure of type pthread_t.
      2. attr: used to specify any attributes this thread might have.
      • An attribute is initialized with a separate call to pthread_attr_init().
      3. start_routine: a function pointer naming which function this thread should start running in, i.e., the task to be run.
      4. arg: exactly the argument to be passed to the function where the thread begins execution. Two design points:
      • Having a void pointer as an argument to the function start_routine allows us to pass in any type of argument.
      • Having it as a return value allows the thread to return any type of result.

27.2 Thread Completion

  • join: to wait for a thread to complete.

  • Interface:

    ```c
    int pthread_join(pthread_t thread, void **value_ptr);
    ```
  • Parameters

    • pthread_t is used to specify which thread to wait for.

    • a pointer to the return value you expect to get back.

      • Why a double pointer (void **):
        Because the routine can return anything, it is defined to return a pointer to void; because the pthread_join() routine changes the value of the passed-in argument, you need to pass in a pointer to that value, not just the value itself.
  • Implementation example:

    ```c
    #include <stdio.h>
    #include <stdlib.h>
    #include <assert.h>
    #include <pthread.h>
    #include "common_threads.h"

    typedef struct {
        int a;
        int b;
    } myarg_t;

    typedef struct {
        int x;
        int y;
    } myret_t;

    void *mythread(void *arg) {
        myarg_t *args = (myarg_t *) arg;
        printf("args %d %d\n", args->a, args->b);
        myret_t *rvals = malloc(sizeof(myret_t));
        assert(rvals != NULL);
        rvals->x = 1;
        rvals->y = 2;
        return (void *) rvals;
    }

    int main(int argc, char *argv[]) {
        pthread_t p;
        myret_t *rvals;
        myarg_t args = { 10, 20 };
        Pthread_create(&p, NULL, mythread, &args);
        Pthread_join(p, (void **) &rvals);
        printf("returned %d %d\n", rvals->x, rvals->y);
        free(rvals);
        return 0;
    }
    ```

Note:

  1. If we just create a thread with no arguments, we can pass NULL in as an argument when the thread is created.

  2. If we are just passing in a single value (e.g., an int), we don’t have to package it up as an argument.

  3. We should note that one has to be extremely careful with how values are returned from a thread. In particular, never return a pointer which refers to something allocated on the thread's call stack. ==> it would be a dangling pointer.

  4. Not all code that is multi-threaded uses the join routine.

    • For example, a multi-threaded web server might create a number of worker threads, and then use the main thread to accept requests and pass them to the workers, indefinitely. Such long-lived programs thus may not need to join. => detaching is enough.

    • C++ also offers another way to finish with a thread: detach. The calling thread keeps running; the detached thread runs in the background, and the caller can no longer obtain control over it. When the program ends, the runtime library is responsible for cleaning up the resources associated with the detached thread.
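A minimal POSIX sketch of the detach pattern (error checking omitted; the worker body is illustrative):

```c
#include <stdio.h>
#include <pthread.h>

void *worker(void *arg) {
    printf("worker running\n");
    return NULL; // resources are reclaimed automatically once detached
}

int main(void) {
    pthread_t p;
    pthread_create(&p, NULL, worker, NULL);
    pthread_detach(p);  // no join needed; the thread cleans up on exit
    // ... main thread continues its own work, e.g., accepting requests ...
    pthread_exit(NULL); // end main without tearing down still-running threads
}
```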

27.3 Locks

Interface:

```c
int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);
```

Creating a lock

When you have a region of code you realize is a critical section, it needs to be protected by locks in order to operate as desired.

A problematic pattern:

```c
pthread_mutex_t lock;
pthread_mutex_lock(&lock);
x = x + 1; // or whatever your critical section is
pthread_mutex_unlock(&lock);
```

This code is broken in two important ways.

  1. The first problem is a lack of proper initialization. All locks must be properly initialized in order to guarantee that they have the correct values to begin with and thus work as desired when lock and unlock are called.

With POSIX threads, there are two ways to initialize locks.

  • One way to do this is to use PTHREAD_MUTEX_INITIALIZER, as follows:
```c
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
```

This sets the lock to the default values and thus makes the lock usable.

  • The dynamic way to do it (i.e., at run time) is to make a call to pthread_mutex_init().
```c
int rc = pthread_mutex_init(&lock, NULL);
assert(rc == 0); // always check success!
```

Notes

  • The first argument to this routine is the address of the lock itself, whereas the second is an optional set of attributes.
  • Note that a corresponding call to pthread_mutex_destroy() should also be made when you are done with the lock.

  2. The second problem with the code above is that it fails to check error codes when calling lock and unlock.

PS. C++ provides std::lock_guard, std::unique_lock, and std::scoped_lock to manage all of this, which saves a lot of trouble.

lock acquisition

Interface:

```c
int pthread_mutex_trylock(pthread_mutex_t *mutex);
int pthread_mutex_timedlock(pthread_mutex_t *mutex,
                            struct timespec *abs_timeout);
```

Notes (a usage sketch follows this list):

  • The trylock version returns failure if the lock is already held.
  • The timedlock version of acquiring a lock returns after a timeout or after acquiring the lock, whichever happens first.
  • If no other thread holds the lock when pthread_mutex_lock() is called, the thread will acquire the lock and enter the critical section. If another thread does indeed hold the lock, the thread trying to grab the lock will not return from the call until it has acquired the lock.
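A minimal sketch of both acquisition variants (illustrative usage, not from the book; note that pthread_mutex_timedlock() takes an absolute CLOCK_REALTIME deadline, not a relative interval):

```c
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <time.h>

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

void try_then_wait(void) {
    if (pthread_mutex_trylock(&m) == 0) {   // got it without blocking
        pthread_mutex_unlock(&m);
        return;
    }
    // EBUSY: someone holds it; block, but give up after ~1 second
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += 1;
    int rc = pthread_mutex_timedlock(&m, &deadline);
    if (rc == 0)
        pthread_mutex_unlock(&m);
    else if (rc == ETIMEDOUT)
        printf("gave up waiting for the lock\n");
}
```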

27.4 Condition Variables

Interface:

```c
int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_signal(pthread_cond_t *cond);
```

To use a condition variable, one has to in addition have a lock that is associated with this condition. When calling either of the above routines, this lock should be held.

  • pthread_cond_wait(), puts the calling thread to sleep, and thus waits for some other thread to signal it.

A typical usage:

```c
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t init = PTHREAD_COND_INITIALIZER;

Pthread_mutex_lock(&lock);
while (initialized == 0)
    Pthread_cond_wait(&init, &lock);
Pthread_mutex_unlock(&lock);
```

The code to wake a thread:

```c
Pthread_mutex_lock(&lock);
initialized = 1;
Pthread_cond_signal(&init); // one could also unlock first and then signal, for a small efficiency gain
Pthread_mutex_unlock(&lock);
```

Notes

  • First, when signaling (as well as when modifying the global variable initialized), we always make sure to have the lock held.
  • Second, you might notice that the wait call takes a lock as its second parameter, whereas the signal call only takes a condition. The reason for this difference is that the wait call, in addition to putting the calling thread to sleep, releases the lock when putting said caller to sleep.
  • One last oddity: the waiting thread rechecks the condition in a while loop, instead of a simple if statement. The reason is explained later.

What not to do:

Note that sometimes it is tempting to use a simple flag to signal between two threads, instead of a condition variable and associated lock. Don’t ever do this, for the following reasons.

```c
while (initialized == 0)
    ; // spin
```
  • First, it performs poorly in many cases (spinning for a long time just wastes CPU cycles).
  • Second, it is error prone.

27.5 Compiling and Running

  • Include the header pthread.h.
  • Add the -pthread flag on the compile line.
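A typical compile line (the book's own example):

```
prompt> gcc -o main main.c -Wall -pthread
```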

27.6 Summary

Type `man -k pthread` on a Linux system to see over one hundred APIs that make up the entire interface.

Chapter 28 Locks

Chapter overview:

  • How to build a usable lock, covering both hardware support (atomic instruction primitives) and implementation strategies.
  • Hardware atomic-operation support includes test-and-set, compare-and-swap, load-linked/store-conditional, and fetch-and-add; the OS additionally offers primitives such as yield() and park()/unpark().
  • Implementation strategies include spinning (busy-waiting in a while loop), ticket locks, queue-based sleeping locks, and two-phase locks.
  • How to evaluate and analyze a lock's behavior.

28.1 Locks: The Basic Idea

  • Example:

    ```c
    lock_t mutex; // some globally-allocated lock 'mutex'
    ...
    lock(&mutex);
    balance = balance + 1;
    unlock(&mutex);
    ```
  • Lock state: a lock is just a variable. It is either available (or unlocked or free), meaning no thread holds it, or acquired (or locked or held).

  • What a lock really is: locks give programmers some minimal amount of control over scheduling. In general, we view threads as entities created by the programmer but scheduled by the OS, in any fashion that the OS chooses.

28.2 Pthread Locks

  • Naming: a mutex, as it is used to provide mutual exclusion between threads.

POSIX threads interface

```c
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

Pthread_mutex_lock(&lock); // wrapper for pthread_mutex_lock()
balance = balance + 1;
Pthread_mutex_unlock(&lock);
```

The POSIX version passes a variable to lock and unlock, as we may be using different locks to protect different variables.

  • Lock granularity: instead of one big lock that is used any time any critical section is accessed (a coarse-grained locking strategy), one will often protect different data and data structures with different locks, thus allowing more threads to be in locked code at once (a more fine-grained approach). => The cross-lock dependencies this introduces must be designed carefully; a small sketch follows.
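A tiny illustration of the two strategies (the account names are hypothetical):

```c
#include <pthread.h>

int checking, savings;

/* Coarse-grained: one big lock would serialize access to both balances. */
pthread_mutex_t biglock = PTHREAD_MUTEX_INITIALIZER;

/* Fine-grained: one lock per account, so threads touching different
 * accounts no longer contend with each other. */
pthread_mutex_t checking_lock = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t savings_lock  = PTHREAD_MUTEX_INITIALIZER;

void deposit_checking(int amt) {
    pthread_mutex_lock(&checking_lock);
    checking += amt;
    pthread_mutex_unlock(&checking_lock);
}

void deposit_savings(int amt) {
    pthread_mutex_lock(&savings_lock);
    savings += amt;
    pthread_mutex_unlock(&savings_lock);
}
```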

28.3 Building A Lock

The Crux: HOW TO BUILD A LOCK
How can we build an efficient lock?
Efficient locks provide mutual exclusion at low cost, and might also attain a few other properties we discuss below.
What hardware support is needed?
What OS support?

28.4 Evaluating Locks

Some basic criteria:

  1. correctness: to provide mutual exclusion.
  2. fairness: Does each thread contending for the lock get a fair shot at acquiring it once it is free?
  3. performance: specifically the time overheads added by using the lock. Consider three cases:
  • 3.1 No contention: a single thread is running and grabs and releases the lock.
  • 3.2 Contention on one CPU: multiple threads are contending for the lock on a single CPU.
  • 3.3 Contention across CPUs: how does the lock perform when there are multiple CPUs involved, with threads on each contending for the lock?

28.5 Controlling Interrupts

This section introduces the simplest, bluntest method: disable interrupts for critical sections. This solution was invented for single-processor systems.

Pseudocode:

```c
void lock() {
    DisableInterrupts();
}

void unlock() {
    EnableInterrupts();
}
```

Pros and cons

  • main positive: simplicity
  • negatives
  1. this approach requires us to allow any calling thread to perform a privileged operation (turning interrupts on and off), and thus trust that this facility is not abused.
    • 1.1. a greedy program could call lock() at the beginning of its execution and thus monopolize the processor;
    • 1.2. an errant or malicious program could call lock() and go into an endless loop.
    • 1.3. the OS never regains control of the system, and there is only one recourse: restart the system.
  2. the approach does not work on multiprocessors.
    If multiple threads are running on different CPUs, and each try to enter the same critical section, it does not matter whether interrupts are disabled; threads will be able to run on other processors, and thus could enter the critical section.
  3. Third, and probably least important, this approach can be inefficient.

Where it is still used

Turning off interrupts is only used in limited contexts as a mutual-exclusion primitive.

For example, in some cases an operating system itself will sometimes use interrupt masking to guarantee atomicity when accessing its own data structures

28.6 Test And Set (Atomic Exchange)

This section introduces a fairly simple first attempt.

test-and-set instruction, also known as atomic exchange.

The idea

C pseudocode:

```c
typedef struct __lock_t { int flag; } lock_t;

void init(lock_t *mutex) {
    // 0 -> lock is available, 1 -> held
    mutex->flag = 0;
}

void lock(lock_t *mutex) {
    while (mutex->flag == 1) // TEST the flag
        ;                    // spin-wait (do nothing)
    mutex->flag = 1;         // now SET it!
}

void unlock(lock_t *mutex) {
    mutex->flag = 0;
}
```

Main idea: use a simple variable to indicate whether some thread has possession of the lock.

  • The first thread that enters the critical section will call lock(), which tests whether the flag is equal to 1 (in this case, it is not), and then sets the flag to 1 to indicate that the thread now holds the lock. When finished with the critical section, the thread calls unlock() and clears the flag.
  • If another thread happens to call lock() while that first thread is in the critical section, it will simply spin-wait in the while loop for that thread to call unlock() and clear the flag.

Problems

Two problems:

  1. correctness

As the trace below shows, two threads can end up inside the critical section at the same time:

$$ \begin{array}{c|c} \text { Thread 1 } & \text { Thread 2 } \\ \hline \text { call lock }() & \\ \text { while (flag ==1)} & \\ \text { interrupt: switch to Thread } 2 & \\ & \text { call lock () } \\ & \text { while (flag == 1)} \\ & \text { flag }=1 ; \\ \text { interrupt: switch to Thread } 1 \end{array} $$
  2. performance
    Spin-waiting wastes time waiting for another thread to release a lock.

TIP: THINK ABOUT CONCURRENCY AS A MALICIOUS SCHEDULER.
Although the exact sequence of interrupts may be improbable, it is possible.

28.7 Building A Working Spin Lock

This section turns the previous idea into a working spin lock, with hardware support.

Real examples: on SPARC, it is the load/store unsigned byte instruction (ldstub); on x86, it is the atomic exchange instruction (xchg). Both do basically the same thing, and the idea is generally referred to as test-and-set.

C pseudocode:

```c
int TestAndSet(int *ptr, int new) {
    int old = *ptr; // fetch old value at ptr
    *ptr = new;     // store 'new' into ptr
    return old;     // return the old value
}
```

Main idea: the key is that this sequence of operations is performed atomically. The reason it is called "test and set" is that it enables you to "test" the old value (which is what is returned) while simultaneously "setting" the memory location to a new value; as it turns out, this slightly more powerful instruction is enough to build a simple spin lock.

Implementation

A simple spin lock using test-and-set, in C pseudocode:

```c
typedef struct __lock_t {
    int flag;
} lock_t;

void init(lock_t *lock) {
    // 0 indicates that lock is available, 1 that it is held
    lock->flag = 0;
}

void lock(lock_t *lock) {
    while (TestAndSet(&lock->flag, 1) == 1)
        ; // spin-wait (do nothing)
}

void unlock(lock_t *lock) {
    lock->flag = 0;
}
```

How it plays out:

  1. first the case where a thread calls lock() and no other thread currently holds the lock; thus, flag should be 0. When the thread then calls TestAndSet(flag,1), the routine will return the old value of flag, which is 0; thus, the calling thread, which is testing the value of flag, will not get caught spinning in the while loop and will acquire the lock. The thread will also atomically set the value to 1, thus indicating that the lock is now held.

  2. The second case we can imagine arises when one thread already has the lock held (i.e., flag is 1). In this case, this thread will call lock() and then call TestAndSet(flag, 1) as well. This time, TestAndSet() will return the old value at flag, which is 1 (because the lock is held), while simultaneously setting it to 1 again. As long as the lock is held by another thread, TestAndSet() will repeatedly return 1, and thus this thread will spin and spin until the lock is finally released.

Characteristics

  • It is the simplest type of lock to build: it simply spins, using CPU cycles, until the lock becomes available.
  • It requires a preemptive scheduler (i.e., one that will interrupt a thread via a timer from time to time, in order to run a different thread). Without preemption, spin locks don't make much sense on a single CPU, since a spinning thread will never relinquish it.

28.8 Evaluating Spin Locks

  • correctness: yes

  • fairness

    • A thread spinning may spin forever, under contention.
  • performance

    • In the single CPU case, performance overheads can be quite painful.

      • Imagine the case where the thread holding the lock is pre-empted within a critical section. The scheduler might then run every other thread (imagine there are $N − 1$ others), each of which tries to acquire the lock. In this case, each of those threads will spin for the duration of a time slice before giving up the CPU, a waste of CPU cycles.
    • On multiple CPUs, spin locks work reasonably well (if the number of threads roughly equals the number of CPUs).

      • Imagine Thread A on CPU 1 and Thread B on CPU 2, both contending for a lock. If Thread A (CPU 1) grabs the lock, and then Thread B tries to, B will spin (on CPU 2). However, presumably the critical section is short, and thus soon the lock becomes available, and is acquired by Thread B. Spinning to wait for a lock held on another processor doesn’t waste many cycles in this case, and thus can be quite effective.

Further notes (beyond the book)

  • Implementing test-and-set
    A CPU may use a test-and-set instruction offered by another electronic component, such as dual-port RAM; a CPU itself may also offer a test-and-set instruction.

  • Test and test-and-set
    This variant addresses a problem with plain test-and-set: it can lead to resource contention on a busy lock, caused by the bus locking and cache invalidation required when the test-and-set operation accesses memory atomically.

Pseudocode:

```
boolean locked := false // shared lock variable

// Entry protocol
procedure EnterCritical() {
    do {
        while (locked == true) skip      // spin until the lock seems free
    } while (TestAndSet(locked) == true) // attempt actual atomic locking
}

// Exit protocol
procedure ExitCritical() {
    locked := false
}
```

Strategy: test-and-set is only used to try to get the lock when a normal memory read says it is free. Thus the expensive atomic memory operations happen less often than in a simple spin around test-and-set.

In a language with short-circuit evaluation, this can be written as:

```
procedure EnterCritical() {
    while (locked == true or TestAndSet(locked) == true)
        skip // spin until locked
}
```

Caveat: as with the double-checked locking pattern, this is not one-hundred-percent reliable and can be an anti-pattern.

28.9 Compare-And-Swap

An improvement on the previous section: the compare-and-swap instruction (as it is called on SPARC, for example), or compare-and-exchange (as it is called on x86).

C pseudocode:

```c
int CompareAndSwap(int *ptr, int expected, int new) {
    int actual = *ptr;
    if (actual == expected)
        *ptr = new;
    return actual;
}
```

The improved usage: replace the lock() routine above with the following:

```c
void lock(lock_t *lock) {
    while (CompareAndSwap(&lock->flag, 0, 1) == 1)
        ; // spin
}
```

Further notes (beyond the book)

Uses: implementing synchronization primitives like semaphores and mutexes, as well as more sophisticated lock-free and wait-free algorithms.

Problem: the ABA problem.

This is a false-positive match: the problem arises if a new bit pattern, which looks exactly like the old value, has a different meaning: for instance, it could be a recycled address, or a wrapped version counter.

ABA solution 1:

A general solution is to use a double-length CAS (DCAS). E.g., on a 32-bit system, a 64-bit CAS can be used. The second half holds a counter. The compare part of the operation compares the previously read value of the pointer and the counter with the current pointer and counter.
The residual risk is vanishingly small: the counter is exceedingly unlikely to be the same (for a 32-bit counter, a multiple of $2^{32}$ operations would have to occur to make it wrap, and at that exact moment the pointer value would also have to be the same by chance).
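A C11 sketch of this version-counter idea, packing a 32-bit value and a 32-bit counter into one 64-bit atomic word (names and layout are illustrative, not from any particular library):

```c
#include <stdatomic.h>
#include <stdint.h>

typedef struct { uint32_t value; uint32_t count; } versioned_t;

_Atomic uint64_t slot; // holds a versioned_t packed into 64 bits

static uint64_t pack(versioned_t v) {
    return ((uint64_t)v.count << 32) | v.value;
}

int try_update(uint32_t expected_value, uint32_t new_value) {
    uint64_t old = atomic_load(&slot);
    versioned_t cur = { .value = (uint32_t)old, .count = (uint32_t)(old >> 32) };
    if (cur.value != expected_value)
        return 0; // value already changed
    versioned_t next = { .value = new_value, .count = cur.count + 1 }; // bump version
    // fails if either the value OR the counter changed in the meantime,
    // so an A->B->A change of the value is still detected via the counter
    return atomic_compare_exchange_strong(&slot, &old, pack(next));
}
```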

ABA solution 2:

One simple technique which helps alleviate this problem is to store an ABA counter in each data structure element, rather than using a single ABA counter for the whole data structure.

ABA solution 3:

A more complicated but more effective solution is to implement safe memory reclamation (SMR). This is in effect lock-free garbage collection. The advantage of using SMR is the assurance a given pointer will exist only once at any one time in the data structure, thus the ABA problem is completely solved.

28.10 Load-Linked and Store-Conditional (LL/SC)

Load-linked and store-conditional come from MIPS; Alpha, PowerPC, and ARM provide similar instructions. The pair implements a lock-free atomic read-modify-write operation.

The hardware support, in C pseudocode:

```c
int LoadLinked(int *ptr) {
    return *ptr;
}

int StoreConditional(int *ptr, int value) {
    if (no one has updated *ptr since the LoadLinked to this address) {
        *ptr = value;
        return 1; // success!
    } else {
        return 0; // failed to update
    }
}
```

Implementation example

A lock using load-linked and store-conditional:

```c
void lock(lock_t *lock) {
    while (1) {
        while (LoadLinked(&lock->flag) == 1)
            ; // spin until it's zero
        if (StoreConditional(&lock->flag, 1) == 1)
            return; // if set-it-to-1 was a success: all done
                    // otherwise: try it all over again
    }
}

void unlock(lock_t *lock) {
    lock->flag = 0;
}
```

An equivalent, shorter implementation:

```c
void lock(lock_t *lock) {
    while (LoadLinked(&lock->flag) || !StoreConditional(&lock->flag, 1))
        ; // spin
}
```

TIP: LESS CODE IS BETTER CODE (LAUER’S LAW).
“If the same people had twice as much time, they could produce as good of a system in half the code.” We’ll call this Lauer’s Law.

Further notes (beyond the book)

Sometimes known as load-reserved/store-conditional (LR/SC).

  • Comparison of LL/SC and compare-and-swap

The essential mechanistic difference:
If any update has occurred, the store-conditional is guaranteed to fail, even if the value read by the load-link has since been restored (storing the same value again still counts as an update). Thus an LL/SC pair is stronger than a read followed by a compare-and-swap (CAS), which is vulnerable to the ABA problem.

The update detection is very sensitive: any exceptional event between the two operations, such as a context switch, another load-link, or even (on many platforms) another load or store operation, will cause the store-conditional to spuriously fail. Older implementations will fail if there are any updates broadcast over the memory bus. Researchers call this weak LL/SC (an overly sensitive LL/SC), as it breaks many theoretical LL/SC algorithms. Weakness is relative, and some weak implementations can be used for some algorithms.

C++ exposes this distinction through the compare_exchange_weak() and compare_exchange_strong() member functions, whose underlying logic corresponds to the weak and strong LL/SC notions above; compare_exchange_weak() is therefore typically used inside a while loop, as sketched below.
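A minimal C11 sketch (using <stdatomic.h>, the C counterpart of the C++ members) of the retry-loop idiom around a weak CAS:

```c
#include <stdatomic.h>

_Atomic int flag = 0; // 0 = free, 1 = held

void lock(void) {
    int expected = 0;
    // The weak form may fail spuriously (mirroring weak LL/SC), so it is
    // retried in a loop. On failure the call overwrites 'expected' with
    // the value actually observed, so we reset it before trying again.
    while (!atomic_compare_exchange_weak(&flag, &expected, 1))
        expected = 0;
}

void unlock(void) {
    atomic_store(&flag, 0);
}
```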

LL/SC has two advantages over CAS when designing a load–store architecture:

In LL/SC, reads and writes are separate instructions, as required by the design philosophy (and pipeline architecture), and both instructions can be performed using only two registers (address and value), fitting naturally into common 2-operand ISAs. CAS, on the other hand, requires three registers (address, old value, new value) and a dependency between the value read and the value written. x86, being a CISC architecture, does not have this constraint; though modern chips may well translate a CAS instruction into separate LL/SC micro-operations internally.

  • Hardware LL/SC implementations typically do not allow nesting of LL/SC pairs.

28.11 Fetch-And-Add

Definition: the fetch-and-add CPU instruction (FAA) atomically increments the contents of a memory location by a specified value.
Hardware support, in C pseudocode (this version always adds 1):

```c
int FetchAndAdd(int *ptr) {
    int old = *ptr;
    *ptr = old + 1;
    return old;
}
```

A mutual-exclusion lock implementation built on top of it: the ticket lock.

```c
typedef struct __lock_t {
    int ticket;
    int turn;
} lock_t;

void lock_init(lock_t *lock) {
    lock->ticket = 0;
    lock->turn = 0;
}

void lock(lock_t *lock) {
    int myturn = FetchAndAdd(&lock->ticket);
    while (lock->turn != myturn)
        ; // spin
}

void unlock(lock_t *lock) {
    FetchAndAdd(&lock->turn);
}
```

Main idea
when a thread wishes to acquire a lock, it first does an atomic fetch-and-add on the ticket value; that value is now considered this thread’s “turn” (myturn). The globally shared lock->turn is then used to determine which thread’s turn it is; when (myturn == turn) for a given thread, it is that thread’s turn to enter the critical section. Unlock is accomplished simply by incrementing the turn such that the next waiting thread (if there is one) can now enter the critical section.

Note one important difference with this solution versus our previous attempts: it ensures progress for all threads.

Further notes (beyond the book)

  • Motivation

An operation such as

$$
x = x + a
$$

is not safe in a concurrent system, where multiple processes or threads are running concurrently (either on a multi-processor system, or preemptively scheduled onto some single-core systems). The reason is that such an operation is actually implemented as multiple machine instructions:

  • load $x$ into a register;
  • add $a$ to register;
  • store register value back into $x$.

When one process is doing $x = x + a$ and another is doing $x = x + b$ concurrently, there is a race condition. They might both fetch $x_{old}$ and operate on that, then both store their results with the effect that one overwrites the other and the stored value becomes either $x_{old} + a$ or $x_{old} + b$, not $x_{old} + a + b$ as might be expected.

  • An atomic fetch_add function appears in the C++11 standard.

std::atomic<T>::fetch_add:

```cpp
// integral specializations: T fetch_add( T arg, std::memory_order order = std::memory_order_seq_cst ) noexcept;
// partial specialization for pointer types:
T* fetch_add( std::ptrdiff_t arg,
              std::memory_order order = std::memory_order_seq_cst ) noexcept;
T* fetch_add( std::ptrdiff_t arg,
              std::memory_order order = std::memory_order_seq_cst ) volatile noexcept;
```
  • C implementation for the GCC compiler, for both 32- and 64-bit x86 Intel platforms, based on extended asm syntax:
```c
static inline int fetch_and_add(int *variable, int value)
{
    __asm__ volatile("lock; xaddl %0, %1"
                     : "+r" (value), "+m" (*variable) // input + output
                     :                                // no input-only
                     : "memory");
    return value;
}
```

A ticket lock is a first-in, first-out (FIFO) queue-based mechanism. It adds the benefit of fairness of lock acquisition.

Application: in a Non-Uniform Memory Architecture (NUMA) system it is important to have a lock implementation that guarantees some level of fairness of lock acquisition. The ticket lock is a spinlock implementation that adds this desired attribute.

  • Advantages

    • fair, thus preventing starvation.
    • prevents the thundering herd problem from occurring, since only one thread at a time tries to enter the critical section.
    • Storage is not necessarily a problem as all threads spin on one variable, unlike array-based queueing locks (ABQL).
  • Disadvantages

    • a higher uncontended latency due to the extra instructions required to read and test the value that all threads are spinning on before a lock is released.
    • not scalable: research indicates that as processors are added to a ticket-lock system, performance appears to decay exponentially.
    • Another problem comes from releasing a lock. All threads are spinning on one variable, so when the lock is released there are $\theta (p)$ invalidations (as well as $\theta (p)$ acquisitions).
  • Thundering herd problem

The thundering herd problem occurs when a large number of processes or threads waiting for an event are awoken when that event occurs, but only one process is able to handle the event. When the processes wake up, they will each try to handle the event, but only one will win; all of them compete for resources, possibly freezing the computer, until the herd is calmed down again. => The result is extreme load fluctuation.

Array-Based Queuing Locks (ABQL)

an extension to the ticket lock algorithm which ensures that, on a lock release, only one processor attempts to acquire the lock.

  • Implementation
    This is achieved with an array whose length equals the number of threads in contention for the lock. The elements of the array are all initialized to 0 except the first element, which takes the value 1, thus ensuring successful lock acquisition by the first thread in the queue. On a lock release, the hold is passed to the next thread in the queue by setting the next element of the array to 1. The requests are granted to the threads in FIFO ordering.

Pseudocode:

```c
ABQL_init(int *next_ticket, int *can_serve)
{
    *next_ticket = 0;
    for (int i = 1; i < MAXSIZE; i++)
        can_serve[i] = 0;
    can_serve[0] = 1;
}

ABQL_acquire(int *next_ticket, int *can_serve)
{
    *my_ticket = fetch_and_add(next_ticket);
    while (can_serve[*my_ticket] != 1)
        ; // spin on my own slot
}

ABQL_release(int *can_serve)
{
    can_serve[*my_ticket + 1] = 1;
    can_serve[*my_ticket] = 0; // prepare for next time
}
```

Code notes: three variables are introduced, namely can_serve, next_ticket, and my_ticket. The roles of each are described below:

  • can_serve array represents the unique memory locations that the threads waiting in the queue for the lock acquisitions spin on.

  • next_ticket represents the next available ticket number that is assigned to a new thread.

  • my_ticket represents the ticket of each unique thread in the queue.

  • Performance metrics

The following performance criteria can be used to analyse the lock implementations:

  • Uncontended lock-acquisition latency - the time taken by a thread to acquire a lock when there is no contention between threads. Because relatively more instructions are executed than in other lock implementations, the uncontended lock-acquisition latency for ABQL is high.
  • Traffic - the number of bus transactions generated, which depends on the number of threads in contention for the lock. On a lock release, only one cache block is invalidated, resulting in a single cache miss and thus much less bus traffic.
  • Fairness - It ensures that all processors waiting to acquire the lock are able to do so in the order in which their requests are issued. Due to the threads waiting in a queue to acquire a lock with each thread spinning on an individual memory location, fairness is ensured.
  • Storage - The amount of memory required for the processing of the lock mechanism. The storage requirement scales with the number of threads due to the increase in the size of the array can_serve.

Comparison of locks

The more basic locking mechanisms have lower uncontended latency than the advanced locking mechanisms.

Comparing Performance of Different Locking Mechanisms

| Criteria | [test-and-set](https://en.wikipedia.org/wiki/Test-and-set "Test-and-set") | [Test and Test-and-set](https://en.wikipedia.org/wiki/Test_and_Test-and-set "Test and Test-and-set") | [Load-link/store-conditional](https://en.wikipedia.org/wiki/Load-link/store-conditional "Load-link/store-conditional") | Ticket | [ABQL](https://en.wikipedia.org/wiki/Array_Based_Queuing_Locks "Array Based Queuing Locks") |
| --- | --- | --- | --- | --- | --- |
| Uncontended latency | Lowest | Lower | Lower | Higher | Higher |
| 1 Release max traffic | $\theta(p)$ | $\theta(p)$ | $\theta(p)$ | $\theta(p)$ | $\theta(1)$ |
| Wait traffic | High | - | - | - | - |
| Storage | $\theta(1)$ | $\theta(1)$ | $\theta(1)$ | $\theta(1)$ | $\theta(p)$ |
| Fairness guarantee | No | No | No | Yes | Yes |

28.12 Summary: So Much Spinning

Spinning is not a good approach:

  1. Any time a thread gets caught spinning in a situation like this, it wastes an entire time slice doing nothing but checking a value that isn’t going to change.
  2. The problem gets worse with $N$ threads contending for a lock; $N − 1$ time slices may be wasted in a similar manner, simply spinning and waiting for a single thread to release the lock.

THE CRUX: HOW TO AVOID SPINNING
How can we develop a lock that doesn’t needlessly waste time spinning on the CPU?

28.13 A Simple Approach: Just Yield, Baby

The simplest idea that avoids spinning

  • Our first try is a simple and friendly approach: when you are going to spin, instead give up the CPU to another thread.
  • We assume an operating system primitive yield() which a thread can call when it wants to give up the CPU and let another thread run. you can think of this as an OS system call that moves the caller from the running state to the ready state, and thus promotes another thread to running.

C pseudocode:

```c
void init() {
    flag = 0;
}

void lock() {
    while (TestAndSet(&flag, 1) == 1)
        yield(); // give up the CPU
}

void unlock() {
    flag = 0;
}
```
  • Drawbacks
    • This approach is still costly: the cost of a context switch can be substantial, and there is thus plenty of waste.
    • Fairness: worse, a thread may get caught in an endless yield loop while other threads repeatedly enter and exit the critical section.

28.14 Using Queues: Sleeping Instead Of Spinning

Background: the real problem with the previous approaches is that they leave too much to chance. The scheduler determines which thread runs next; if the scheduler makes a bad choice, a thread runs that must either spin waiting for the lock (our first approach), or yield the CPU immediately (our second approach).

The improved idea: a lock that puts a caller to sleep if it tries to acquire a held lock, and wakes it when the lock is free.

  • OS API:
    two calls: park() to put a calling thread to sleep, and unpark(threadID) to wake a particular thread as designated by threadID.

  • Implementation example

```c
typedef struct __lock_t {
    int flag;
    int guard;
    queue_t *q;
} lock_t;

void lock_init(lock_t *m) {
    m->flag = 0;
    m->guard = 0;
    queue_init(m->q);
}

void lock(lock_t *m) {
    while (TestAndSet(&m->guard, 1) == 1)
        ; // acquire guard lock by spinning
    if (m->flag == 0) {
        m->flag = 1; // lock is acquired
        m->guard = 0;
    } else {
        queue_add(m->q, gettid());
        m->guard = 0;
        park(); // yield
    }
}

void unlock(lock_t *m) {
    while (TestAndSet(&m->guard, 1) == 1)
        ; // acquire guard lock by spinning
    if (queue_empty(m->q))
        m->flag = 0; // let go of lock; no one wants it
    else
        unpark(queue_remove(m->q)); // hold lock (for next thread!)
    m->guard = 0;
}
```
  • Notes

    • Performance: the time spent spinning is quite limited (just a few instructions inside the lock and unlock code, instead of the user-defined critical section), and thus this approach may be reasonable.
    • In lock(), when a thread can not acquire the lock (it is already held), we are careful to add ourselves to a queue (by calling the gettid() call to get the thread ID of the current thread), set guard to 0, and yield the CPU.
    • The interesting fact that the flag does not get set back to 0 when another thread gets woken up. Why is this? Well, it is not an error, but rather a necessity! When a thread is woken up, it will be as if it is returning from park(); however, it does not hold the guard at that point in the code and thus cannot even try to set the flag to 1. Thus, we just pass the lock directly from the thread releasing the lock to the next thread acquiring it; flag is not set to 0 in-between.
  • A remaining problem and its solutions

Problem: there is a perceived race condition in this solution, just before the call to park(). With just the wrong timing, a thread will be about to park, assuming that it should sleep until the lock is no longer held. A switch at that time to another thread (say, a thread holding the lock) could lead to trouble: if that thread released the lock and called unpark() before the first thread actually parked, the wakeup would be lost, and the subsequent park by the first thread could then (potentially) sleep forever. This problem is sometimes called the wakeup/waiting race.

Solution 1: Solaris solves this problem by adding a third system call: setpark(). By calling this routine, a thread can indicate it is about to park. If it then happens to be interrupted and another thread calls unpark before park is actually called, the subsequent park returns immediately instead of sleeping.

```c
queue_add(m->q, gettid());
setpark(); // new code
m->guard = 0;
```

Solution 2: a different approach could pass the guard into the kernel. In that case, the kernel could take precautions to atomically release the lock and dequeue the running thread.

28.15 Different OS, Different Support

How the previous idea appears on Linux:
Linux provides something called a futex. Each futex has associated with it a specific physical memory location, as well as an in-kernel queue. Callers can use futex calls to sleep and wake as need be.

  • The call to futex_wait(address, expected) puts the calling thread to sleep.
  • The call to the routine futex_wake(address) wakes one thread that is waiting on the queue.
  • code snippet from lowlevellock.h in the nptl library.
```c
void mutex_lock(int *mutex) {
    int v;
    /* Bit 31 was clear, we got the mutex (this is the fastpath) */
    if (atomic_bit_test_set(mutex, 31) == 0)
        return;
    atomic_increment(mutex);
    while (1) {
        if (atomic_bit_test_set(mutex, 31) == 0) {
            atomic_decrement(mutex);
            return;
        }
        /* We have to wait now. First make sure the futex value
           we are monitoring is truly negative (i.e. locked). */
        v = *mutex;
        if (v >= 0)
            continue;
        futex_wait(mutex, v);
    }
}

void mutex_unlock(int *mutex) {
    /* Adding 0x80000000 to the counter results in 0 if and only if
       there are no other interested threads */
    if (atomic_add_zero(mutex, 0x80000000))
        return;

    /* There are other threads waiting for this mutex,
       wake one of them up. */
    futex_wake(mutex);
}
```

Summary of the idea: basically, it uses a single integer to track both whether the lock is held (the high bit of the integer) and the number of waiters on the lock (all the other bits). Thus, if the lock value is negative, it is held (because the high bit is set and that bit determines the sign of the integer). The code is also interesting because it shows how to optimize for the common case where there is no contention: with only one thread acquiring and releasing a lock, very little work is done (the atomic bit test-and-set to lock and an atomic add to release the lock).

28.16 Two-Phase Locks

A further refinement, the two-phase lock (sketched below):
If the lock is not acquired during the first spin phase, a second phase is entered, where the caller is put to sleep, and only woken up when the lock becomes free later. The Linux lock above is a form of such a lock, but it only spins once; a generalization of this could spin in a loop for a fixed amount of time before using futex support to sleep.
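A minimal sketch of that generalization, assuming the TestAndSet primitive from earlier plus futex_wait()/futex_wake() wrappers with the semantics described above (sleep only while the observed value still equals the expected one):

```c
#define MAX_SPINS 1000

void lock(lock_t *m) {
    while (1) {
        for (int i = 0; i < MAX_SPINS; i++)  // phase 1: spin briefly
            if (TestAndSet(&m->flag, 1) == 0)
                return;                      // got it while spinning
        futex_wait(&m->flag, 1);             // phase 2: sleep while flag == 1
    }
}

void unlock(lock_t *m) {
    m->flag = 0;
    futex_wake(&m->flag);                    // wake one sleeper, if any
}
```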

28.17 Summary

Real locks are built with some hardware support (in the form of a more powerful instruction) plus some operating system support (e.g., in the form of park() and unpark() primitives on Solaris, or futex on Linux).

Chapter 29 Lock-based Concurrent Data Structures

Chapter overview:

  • Briefly introduces the classic concurrent data structures: counter, list, queue, hash table.
  • Introduces sloppy counters and hand-over-hand locking to attack the scaling problem (ideally, concurrent performance grows in proportion to the number of cores/threads, but in practice it does not).
  • Also covers some design principles, such as making critical sections as small as possible.

29.0 Intro

CRUX: HOW TO ADD LOCKS TO DATA STRUCTURES
When given a particular data structure, how should we add locks to it, in order to make it work correctly?
Further, how do we add locks such that the data structure yields high performance, enabling many threads to access the structure at once, i.e., concurrently?

29.1 Concurrent Counters

A Counter Without Locks

```c
typedef struct __counter_t {
    int value;
} counter_t;

void init(counter_t *c) {
    c->value = 0;
}

void increment(counter_t *c) {
    c->value++;
}

void decrement(counter_t *c) {
    c->value--;
}

int get(counter_t *c) {
    return c->value;
}
```

Simple But Not Scalable: A Counter With Locks

```c
typedef struct __counter_t {
    int value;
    pthread_mutex_t lock;
} counter_t;

void init(counter_t *c) {
    c->value = 0;
    Pthread_mutex_init(&c->lock, NULL);
}

void increment(counter_t *c) {
    Pthread_mutex_lock(&c->lock);
    c->value++;
    Pthread_mutex_unlock(&c->lock);
}

void decrement(counter_t *c) {
    Pthread_mutex_lock(&c->lock);
    c->value--;
    Pthread_mutex_unlock(&c->lock);
}

int get(counter_t *c) {
    Pthread_mutex_lock(&c->lock);
    int rc = c->value;
    Pthread_mutex_unlock(&c->lock);
    return rc;
}
```

It follows a design pattern common to the simplest and most basic concurrent data structures: it simply adds a single lock.

It is similar to a data structure built with monitors.

  • Performance analysis

The performance of the synchronized counter scales poorly. Whereas a single thread can complete the million counter updates in a tiny amount of time (roughly 0.03 seconds), having two threads each update the counter one million times concurrently leads to a massive slowdown (taking over 5 seconds!). It only gets worse with more threads.

  • The goal:
    Ideally, you'd like to see the threads complete just as quickly on multiple processors as the single thread does on one. Achieving this end is called perfect scaling.

But don't overdo it: if the data structure is not too slow, you are done! No need to do something fancy if something simple will work.

Scalable Counting

The idea below has a divide-and-conquer flavor: give each core its own independent local counter, then periodically fold the local counters together to get the result we want.

The sloppy counter works by representing a single logical counter via numerous local physical counters, one per CPU core, as well as a single global counter.

  • local
    When a thread running on a given core wishes to increment the counter, it increments its local counter; access to this local counter is synchronized via the corresponding local lock. Because each CPU has its own local counter, threads across CPUs can update local counters without contention, and thus counter updates are scalable.

  • periodically transfer

    • the local values are periodically transferred to the global counter, by acquiring the global lock and incrementing it by the local counter’s value; the local counter is then reset to zero.
    • How often this local-to-global transfer occurs is determined by a threshold, which we call $S$ here (for sloppiness).
    • The smaller $S$ is, the more the counter behaves like the non-scalable counter above; the bigger $S$ is, the more scalable the counter, but the further off the global value might be from the actual count. One could simply acquire all the local locks and the global lock (in a specified order, to avoid deadlock) to get an exact value, but that is not scalable.
  • Example run (4 cores)

$$ \begin{array}{c|cccc|l} \text { Time } & L_{1} & L_{2} & L_{3} & L_{4} & G \\ \hline 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 & 1 & 0 \\ 2 & 1 & 0 & 2 & 1 & 0 \\ 3 & 2 & 0 & 3 & 1 & 0 \\ 4 & 3 & 0 & 3 & 2 & 0 \\ 5 & 4 & 1 & 3 & 3 & 0 \\ 6 & 5 \rightarrow 0 & 1 & 3 & 4 & 5\left(\text { from } L_{1}\right) \\ 7 & 0 & 2 & 4 & 5 \rightarrow 0 & 10\left(\text { from } L_{4}\right) \end{array} $$

Sloppy Counter Implementation

```c
typedef struct __counter_t {
    int global;                     // global count
    pthread_mutex_t glock;          // global lock
    int local[NUMCPUS];             // local count (per cpu)
    pthread_mutex_t llock[NUMCPUS]; // ... and locks
    int threshold;                  // update frequency
} counter_t;

// init: record threshold, init locks, init values
// of all local counts and global count
void init(counter_t *c, int threshold) {
    c->threshold = threshold;
    c->global = 0;
    pthread_mutex_init(&c->glock, NULL);
    int i;
    for (i = 0; i < NUMCPUS; i++) {
        c->local[i] = 0;
        pthread_mutex_init(&c->llock[i], NULL);
    }
}

// update: usually, just grab local lock and update local amount
// once local count has risen by 'threshold', grab global
// lock and transfer local values to it
void update(counter_t *c, int threadID, int amt) {
    pthread_mutex_lock(&c->llock[threadID]);
    c->local[threadID] += amt; // assumes amt > 0
    if (c->local[threadID] >= c->threshold) { // transfer to global
        pthread_mutex_lock(&c->glock);
        c->global += c->local[threadID];
        pthread_mutex_unlock(&c->glock);
        c->local[threadID] = 0;
    }
    pthread_mutex_unlock(&c->llock[threadID]);
}

// get: just return global amount (which may not be perfect)
int get(counter_t *c) {
    pthread_mutex_lock(&c->glock);
    int val = c->global;
    pthread_mutex_unlock(&c->glock);
    return val; // only approximate!
}
```

This accuracy/performance trade-off is what sloppy counters enable.

The threshold value $S$ matters: with four threads each incrementing the counter one million times on four CPUs, a larger $S$ yields better performance, at the cost of a global value that lags further behind the true count.

29.2 Concurrent Linked Lists

Scope: just focus on concurrent insert and lookup.

A simple but error-prone implementation:

```c
// basic node structure
typedef struct __node_t {
    int key;
    struct __node_t *next;
} node_t;

// basic list structure (one used per list)
typedef struct __list_t {
    node_t *head;
    pthread_mutex_t lock;
} list_t;

void List_Init(list_t *L) {
    L->head = NULL;
    pthread_mutex_init(&L->lock, NULL);
}

int List_Insert(list_t *L, int key) {
    pthread_mutex_lock(&L->lock);
    node_t *new = malloc(sizeof(node_t));
    if (new == NULL) {
        perror("malloc");
        pthread_mutex_unlock(&L->lock);
        return -1; // fail
    }
    new->key = key;
    new->next = L->head;
    L->head = new;
    pthread_mutex_unlock(&L->lock);
    return 0; // success
}

int List_Lookup(list_t *L, int key) {
    pthread_mutex_lock(&L->lock);
    node_t *curr = L->head;
    while (curr) {
        if (curr->key == key) {
            pthread_mutex_unlock(&L->lock);
            return 0; // success
        }
        curr = curr->next;
    }
    pthread_mutex_unlock(&L->lock);
    return -1; // failure
}
```

This kind of exceptional control flow has been shown to be quite error prone.

Modification

Considerations before modifying:

  • Assuming that malloc() itself is thread-safe, each thread can call into it without worry of race conditions or other concurrency bugs.

  • Only when updating the shared list does a lock need to be held.

  • As for the lookup routine, it is a simple code transformation to jump out of the main search loop to a single return path. Doing so again reduces the number of lock acquire/release points in the code, and thus decreases the chances of accidentally introducing bugs.

```c
void List_Init(list_t *L) {
    L->head = NULL;
    pthread_mutex_init(&L->lock, NULL);
}

void List_Insert(list_t *L, int key) {
    // synchronization not needed
    node_t *new = malloc(sizeof(node_t));
    if (new == NULL) {
        perror("malloc");
        return;
    }
    new->key = key;

    // just lock critical section
    pthread_mutex_lock(&L->lock);
    new->next = L->head;
    L->head = new;
    pthread_mutex_unlock(&L->lock);
}

int List_Lookup(list_t *L, int key) {
    int rv = -1;
    pthread_mutex_lock(&L->lock);
    node_t *curr = L->head;
    while (curr) {
        if (curr->key == key) {
            rv = 0;
            break;
        }
        curr = curr->next;
    }
    pthread_mutex_unlock(&L->lock);
    return rv; // now both success and failure
}
```

Scaling Linked Lists

hand-over-hand locking (a.k.a. lock coupling)

The idea: instead of having a single lock for the entire list, add a lock per node of the list. When traversing the list, the code first grabs the next node's lock and then releases the current node's lock (which inspires the name hand-over-hand); see the sketch below.
However, in practice it is hard to make such a structure faster than the simple single-lock approach.
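A minimal sketch of a hand-over-hand lookup, assuming each node carries its own mutex (the per-node lock layout is illustrative, not the book's code):

```c
#include <pthread.h>
#include <stddef.h>

typedef struct __node_t {
    int key;
    struct __node_t *next;
    pthread_mutex_t lock; // one lock per node
} node_t;

int List_Lookup_HoH(node_t *head, int key) {
    if (head == NULL)
        return -1;
    pthread_mutex_lock(&head->lock);
    node_t *curr = head;
    while (curr) {
        if (curr->key == key) {
            pthread_mutex_unlock(&curr->lock);
            return 0; // found
        }
        node_t *next = curr->next;
        if (next)
            pthread_mutex_lock(&next->lock); // grab the next lock first...
        pthread_mutex_unlock(&curr->lock);   // ...then release the current one
        curr = next;
    }
    return -1; // not found
}
```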

TIP: MORE CONCURRENCY ISN’T NECESSARILY FASTER.
Adding more locks and complexity can be your downfall. All of that said, there is one way to really know: build both alternatives (simple but less concurrent, and complex but more concurrent) and measure how they do.

TIP: BE WARY OF LOCKS AND CONTROL FLOW
A general design tip, which is useful in concurrent code as well as elsewhere, is to be wary of control flow changes that lead to function returns, exits, or other similar error conditions that halt the execution of a function.

29.3 Concurrent Queues

A concurrent queue designed by Michael and Scott:

```c
typedef struct __node_t {
    int value;
    struct __node_t *next;
} node_t;

typedef struct __queue_t {
    node_t *head;
    node_t *tail;
    pthread_mutex_t headLock;
    pthread_mutex_t tailLock;
} queue_t;

void Queue_Init(queue_t *q) {
    node_t *tmp = malloc(sizeof(node_t));
    tmp->next = NULL;
    q->head = q->tail = tmp;
    pthread_mutex_init(&q->headLock, NULL);
    pthread_mutex_init(&q->tailLock, NULL);
}

void Queue_Enqueue(queue_t *q, int value) {
    node_t *tmp = malloc(sizeof(node_t));
    assert(tmp != NULL);
    tmp->value = value;
    tmp->next = NULL;

    pthread_mutex_lock(&q->tailLock);
    q->tail->next = tmp;
    q->tail = tmp;
    pthread_mutex_unlock(&q->tailLock);
}

int Queue_Dequeue(queue_t *q, int *value) {
    pthread_mutex_lock(&q->headLock);
    node_t *tmp = q->head;
    node_t *newHead = tmp->next;
    if (newHead == NULL) {
        pthread_mutex_unlock(&q->headLock);
        return -1; // queue was empty
    }
    *value = newHead->value;
    q->head = newHead;
    pthread_mutex_unlock(&q->headLock);
    free(tmp);
    return 0;
}
```

There are two locks: one for the head of the queue and one for the tail. The enqueue routine only accesses the tail lock, and dequeue only the head lock.

Trick: add a dummy node (allocated in the queue-initialization code); this dummy enables the separation of head and tail operations. A short usage sketch follows.
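A minimal single-threaded usage sketch of the routines above (assumes the definitions above plus the stdio/stdlib/assert headers):

```c
int main(void) {
    queue_t q;
    int v;
    Queue_Init(&q);        // allocates the dummy node
    Queue_Enqueue(&q, 1);
    Queue_Enqueue(&q, 2);
    while (Queue_Dequeue(&q, &v) == 0)
        printf("%d\n", v); // prints 1 then 2 (FIFO)
    return 0;              // Queue_Dequeue returns -1 once empty
}
```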

29.4 Concurrent Hash Table

Scope: focus on a simple hash table that does not resize.

```c
#define BUCKETS (101)

typedef struct __hash_t {
    list_t lists[BUCKETS];
} hash_t;

void Hash_Init(hash_t *H) {
    int i;
    for (i = 0; i < BUCKETS; i++) {
        List_Init(&H->lists[i]);
    }
}

int Hash_Insert(hash_t *H, int key) {
    int bucket = key % BUCKETS;
    return List_Insert(&H->lists[bucket], key);
}

int Hash_Lookup(hash_t *H, int key) {
    int bucket = key % BUCKETS;
    return List_Lookup(&H->lists[bucket], key);
}
```

29.5 Summary

TIP: AVOID PREMATURE OPTIMIZATION (KNUTH’S LAW)
When building a concurrent data structure, start with the most basic approach, which is to add a single big lock to provide synchronized access. By doing so, you are likely to build a correct lock; if you then find that it suffers from performance problems, you can refine it, thus only making it fast if need be. As Knuth famously stated, “Premature optimization is the root of all evil.”

You also might be interested in techniques that don’t use traditional locks at all: non-blocking data structures.

Chapter 30 Condition Variables

Chapter summary:

  • Introduces condition variables, one of the ways threads communicate with one another.
  • Uses the producer/consumer problem to show how condition variables are used, along with two classic pitfalls: use a while loop instead of an if check, and note that with multiple threads/roles a single condition variable may not be enough.
  • When too many threads are involved, introduces the covering condition, which communicates by broadcast.

30.0 Intro

The most intuitive approach: Parent Waiting For Child, a spin-based approach.

volatile int done = 0;

void *child(void *arg) {
    printf("child\n");
    done = 1;
    return NULL;
}

int main(int argc, char *argv[]) {
    printf("parent: begin\n");
    pthread_t c;
    Pthread_create(&c, NULL, child, NULL); // create child
    while (done == 0)
        ; // spin
    printf("parent: end\n");
    return 0;
}

THE CRUX: HOW TO WAIT FOR A CONDITION
In multi-threaded programs, it is often useful for a thread to wait for some condition to become true before proceeding. The simple approach, of just spinning until the condition becomes true, is grossly inefficient and wastes CPU cycles, and in some cases, can be incorrect. Thus, how should a thread wait for a condition?

30.1 Definition and Routines

POSIX API

POSIX calls: wait() and signal()

pthread_cond_wait(pthread_cond_t *c, pthread_mutex_t *m);
pthread_cond_signal(pthread_cond_t *c);

The wait() call also takes a mutex as a parameter; it assumes that this mutex is locked when wait() is called. The responsibility of wait() is to release the lock and put the calling thread to sleep (atomically).

Parent Waiting For Child: Use A Condition Variable

int done = 0;
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t c = PTHREAD_COND_INITIALIZER;

void thr_exit() {
    Pthread_mutex_lock(&m);
    done = 1;
    Pthread_cond_signal(&c);
    Pthread_mutex_unlock(&m);
}

void *child(void *arg) {
    printf("child\n");
    thr_exit();
    return NULL;
}

void thr_join() {
    Pthread_mutex_lock(&m);
    while (done == 0)
        Pthread_cond_wait(&c, &m);
    Pthread_mutex_unlock(&m);
}

int main(int argc, char *argv[]) {
    printf("parent: begin\n");
    pthread_t p;
    Pthread_create(&p, NULL, child, NULL);
    thr_join();
    printf("parent: end\n");
    return 0;
}

TIP: ALWAYS HOLD THE LOCK WHILE SIGNALING

30.2 The Producer/Consumer (Bound Buffer) Problem

Overview: Producers produce data items and wish to place them in a buffer; consumers grab data items out of the buffer and consume them in some way.

Two examples:

  1. For example, in a multi-threaded web server, a producer puts HTTP requests into a work queue (i.e., the bounded buffer); consumer threads take requests out of this queue and process them.
  2. A bounded buffer is also used when you pipe the output of one program into another, e.g., grep foo file.txt | wc -l. This example runs two processes concurrently; grep writes lines from file.txt with the string foo in them to what it thinks is standard output; the UNIX shell redirects the output to what is called a UNIX pipe (created by the pipe system call). The other end of this pipe is connected to the standard input of the process wc, which simply counts the number of lines in the input stream and prints out the result. Thus, the grep process is the producer; the wc process is the consumer; between them is an in-kernel bounded buffer.

An intuitive but broken design

  • The Put and Get Routines (Version 1)

    int buffer;
    int count = 0; // initially, empty

    void put(int value) {
        assert(count == 0);
        count = 1;
        buffer = value;
    }

    int get() {
        assert(count == 1);
        count = 0;
        return buffer;
    }
  • two types of threads
    one set of which we’ll call the producer threads, and the other set which we’ll call consumer threads.

void *producer(void *arg) {
    int i;
    int loops = (int) arg;
    for (i = 0; i < loops; i++) {
        put(i);
    }
}

void *consumer(void *arg) {
    int i;
    while (1) {
        int tmp = get();
        printf("%d\n", tmp);
    }
}
  • A Broken Solution

    • just a single producer and a single consumer
    • Producer/Consumer: Single CV and If Statement
cond_t cond;
mutex_t mutex;

void *producer(void *arg) {
    int i;
    for (i = 0; i < loops; i++) {
        Pthread_mutex_lock(&mutex);           // p1
        if (count == 1)                       // p2
            Pthread_cond_wait(&cond, &mutex); // p3
        put(i);                               // p4
        Pthread_cond_signal(&cond);           // p5
        Pthread_mutex_unlock(&mutex);         // p6
    }
}

void *consumer(void *arg) {
    int i;
    for (i = 0; i < loops; i++) {
        Pthread_mutex_lock(&mutex);           // c1
        if (count == 0)                       // c2
            Pthread_cond_wait(&cond, &mutex); // c3
        int tmp = get();                      // c4
        Pthread_cond_signal(&cond);           // c5
        Pthread_mutex_unlock(&mutex);         // c6
        printf("%d\n", tmp);
    }
}

Key point:
With just a single producer and a single consumer, the code works. However, if we have more than one of these threads (e.g., two consumers), the solution has two critical problems.

Example run: Thread Trace: Broken Solution (Version 1)

$$ \begin{array}{cc|cc|cc|c|c} \mathrm{T}_{c 1} & \text { State } & \mathrm{T}_{c 2} & \text { State } & T_{p} & \text { State } & \text { Count } & \text { Comment } \\ \hline \mathrm{c} 1 & \text { Running } & & \text { Ready } & & \text { Ready } & 0 & \\ \text { c2 } & \text { Running } & & \text { Ready } & & \text { Ready } & 0 & \\ \text { c3 } & \text { Sleep } & & \text { Ready } & & \text { Ready } & 0 & \text { Nothing to get } \\ & \text { Sleep } & & \text { Ready } & \text { p1 } & \text { Running } & 0 & \\ & \text { Sleep } & & \text { Ready } & \text { p2 } & \text { Running } & 0 & \\ & \text { Sleep } & & \text { Ready } & \text { p4 } & \text { Running } & 1 & \text { Buffer now full } \\ & \text { Ready } & & \text { Ready } & \text { p5 } & \text { Running } & 1 & \mathrm{~T}_{c 1} \text { awoken } \\ & \text { Ready } & & \text { Ready } & \text { p6 } & \text { Running } & 1 & \\ & \text { Ready } & & \text { Ready } & \text { p1 } & \text { Running } & 1 & \\ & \text { Ready } & & \text { Ready } & \text { p2 } & \text { Running } & 1 & \\ & \text { Ready } & & \text { Ready } & \text { p3 } & \text { Sleep } & 1 & \text { Buffer full; sleep } \\ & \text { Ready } & c 1 & \text { Running } & & \text { Sleep } & 1 & \mathrm{~T}_{c 2} \text { sneaks in } \ldots \\ & \text { Ready } & c 2 & \text { Running } & & \text { Sleep } & 1 & \\ & \text { Ready } & c 4 & \text { Running } & & \text { Sleep } & 0 & \ldots \text { and grabs data } \\ & \text { Ready } & c 5 & \text { Running } & & \text { Ready } & 0 & \mathrm{~T}_{p} \text { awoken } \\ & \text { Ready } & c 6 & \text { Running } & & \text { Ready } & 0 & \\ \text { c4 } & \text { Running } & & \text { Ready } & & \text { Ready } & 0 & \text { Oh oh! No data } \end{array} $$

Improvement 1

Better, But Still Broken: While, Not If

Why this helps: now consumer $T_{c1}$ wakes up and (with the lock held) immediately re-checks the state of the shared variable (c2). If the buffer is empty at that point, the consumer simply goes back to sleep (c3). The if is correspondingly changed to a while in the producer as well (p2).

cond_t cond;
mutex_t mutex;

void *producer(void *arg) {
    int i;
    for (i = 0; i < loops; i++) {
        Pthread_mutex_lock(&mutex);           // p1
        while (count == 1)                    // p2
            Pthread_cond_wait(&cond, &mutex); // p3
        put(i);                               // p4
        Pthread_cond_signal(&cond);           // p5
        Pthread_mutex_unlock(&mutex);         // p6
    }
}

void *consumer(void *arg) {
    int i;
    for (i = 0; i < loops; i++) {
        Pthread_mutex_lock(&mutex);           // c1
        while (count == 0)                    // c2
            Pthread_cond_wait(&cond, &mutex); // c3
        int tmp = get();                      // c4
        Pthread_cond_signal(&cond);           // c5
        Pthread_mutex_unlock(&mutex);         // c6
        printf("%d\n", tmp);
    }
}
  • Summary of while vs. if:
    Thanks to Mesa semantics, a simple rule to remember with condition variables is to always use while loops. Sometimes you don’t have to recheck the condition, but it is always safe to do so; just do it and be happy.

  • Another needed improvement
    It has something to do with the fact that there is only one condition variable.

When the consumer has emptied the buffer, the signal should clearly wake the producer. However, if it instead wakes the consumer $T_{c2}$ (which is definitely possible, depending on how the wait queue is managed), we have a problem.
Specifically, the consumer $T_{c2}$ will wake up and find the buffer empty (c2), and go back to sleep (c3). The producer Tp, which has a value to put into the buffer, is left sleeping. The other consumer thread, $T_{c1}$ , also goes back to sleep. All three threads are left sleeping.

Example run: Thread Trace: Broken Solution (Version 2)

$$ \begin{array}{cc|cc|cc|c|c} \mathrm{T}_{c1} & \text { State } & \mathrm{T}_{c2} & \text { State } & T_{p} & \text { State } & \text { Count } & \text { Comment } \\ \hline \text { c1 } & \text { Running } & & \text { Ready } & & \text { Ready } & 0 & \\ \text { c2 } & \text { Running } & & \text { Ready } & & \text { Ready } & 0 & \\ \text { c3 } & \text { Sleep } & & \text { Ready } & & \text { Ready } & 0 & \text { Nothing to get } \\ & \text { Sleep } & \text { c1 } & \text { Running } & & \text { Ready } & 0 & \\ & \text { Sleep } & \text { c2 } & \text { Running } & & \text { Ready } & 0 & \\ & \text { Sleep } & \text { c3 } & \text { Sleep } & & \text { Ready } & 0 & \text { Nothing to get } \\ & \text { Sleep } & & \text { Sleep } & \text { p1 } & \text { Running } & 0 & \\ & \text { Sleep } & & \text { Sleep } & \text { p2 } & \text { Running } & 0 & \\ & \text { Sleep } & & \text { Sleep } & \text { p4 } & \text { Running } & 1 & \text { Buffer now full } \\ & \text { Ready } & & \text { Sleep } & \text { p5 } & \text { Running } & 1 & T_{c1} \text { awoken } \\ & \text { Ready } & & \text { Sleep } & \text { p6 } & \text { Running } & 1 & \\ & \text { Ready } & & \text { Sleep } & \text { p1 } & \text { Running } & 1 & \\ & \text { Ready } & & \text { Sleep } & \text { p2 } & \text { Running } & 1 & \\ & \text { Ready } & & \text { Sleep } & \text { p3 } & \text { Sleep } & 1 & \text { Buffer full; sleep } \\ \text { c2 } & \text { Running } & & \text { Sleep } & & \text { Sleep } & 1 & \text { Recheck condition } \\ \text { c4 } & \text { Running } & & \text { Sleep } & & \text { Sleep } & 0 & T_{c1} \text { grabs data } \\ \text { c5 } & \text { Running } & & \text { Ready } & & \text { Sleep } & 0 & \text { Oops! Woke } T_{c2} \\ \text { c6 } & \text { Running } & & \text { Ready } & & \text { Sleep } & 0 & \\ \text { c1 } & \text { Running } & & \text { Ready } & & \text { Sleep } & 0 & \\ \text { c2 } & \text { Running } & & \text { Ready } & & \text { Sleep } & 0 & \\ \text { c3 } & \text { Sleep } & & \text { Ready } & & \text { Sleep } & 0 & \text { Nothing to get } \\ & \text { Sleep } & \text { c2 } & \text { Running } & & \text { Sleep } & 0 & \\ & \text { Sleep } & \text { c3 } & \text { Sleep } & & \text { Sleep } & 0 & \text { Everyone asleep... } \end{array} $$

Improvement 2

The Single Buffer Producer/Consumer Solution

use two condition variables, in order to properly signal which type of thread should wake up when the state of the system changes.

cond_t empty, fill;
mutex_t mutex;

void *producer(void *arg) {
    int i;
    for (i = 0; i < loops; i++) {
        Pthread_mutex_lock(&mutex);
        while (count == 1)
            Pthread_cond_wait(&empty, &mutex);
        put(i);
        Pthread_cond_signal(&fill);
        Pthread_mutex_unlock(&mutex);
    }
}

void *consumer(void *arg) {
    int i;
    for (i = 0; i < loops; i++) {
        Pthread_mutex_lock(&mutex);
        while (count == 0)
            Pthread_cond_wait(&fill, &mutex);
        int tmp = get();
        Pthread_cond_signal(&empty);
        Pthread_mutex_unlock(&mutex);
        printf("%d\n", tmp);
    }
}

The Final Producer/Consumer Solution

add more buffer slots, so that multiple values can be produced before sleeping, and similarly multiple values can be consumed before sleeping. It even allows concurrent producing or consuming to take place, thus increasing concurrency.

  • The Final Put and Get Routines
int buffer[MAX];
int fill = 0;
int use = 0;
int count = 0;

void put(int value) {
    buffer[fill] = value;
    fill = (fill + 1) % MAX;
    count++;
}

int get() {
    int tmp = buffer[use];
    use = (use + 1) % MAX;
    count--;
    return tmp;
}
  • The Final Working Solution
cond_t empty, fill;
mutex_t mutex;

void *producer(void *arg) {
    int i;
    for (i = 0; i < loops; i++) {
        Pthread_mutex_lock(&mutex);            // p1
        while (count == MAX)                   // p2
            Pthread_cond_wait(&empty, &mutex); // p3
        put(i);                                // p4
        Pthread_cond_signal(&fill);            // p5
        Pthread_mutex_unlock(&mutex);          // p6
    }
}

void *consumer(void *arg) {
    int i;
    for (i = 0; i < loops; i++) {
        Pthread_mutex_lock(&mutex);            // c1
        while (count == 0)                     // c2
            Pthread_cond_wait(&fill, &mutex);  // c3
        int tmp = get();                       // c4
        Pthread_cond_signal(&empty);           // c5
        Pthread_mutex_unlock(&mutex);          // c6
        printf("%d\n", tmp);
    }
}

A producer only sleeps if all buffers are currently filled (p2); similarly, a consumer only sleeps if all buffers are currently empty (c2).

TIP: USE WHILE (NOT IF) FOR CONDITIONS

  • When checking for a condition in a multi-threaded program, using a while loop is always correct; using an if statement only might be, depending on the semantics of signaling.
  • Using while loops around conditional checks also handles the case where spurious wakeups occur: it is possible that two threads get woken up even though just a single signal has taken place.

30.3 Covering Conditions

Scenario: a simple multi-threaded memory allocation library.

// how many bytes of the heap are free?
int bytesLeft = MAX_HEAP_SIZE;

// need lock and condition too
cond_t c;
mutex_t m;

void *allocate(int size) {
    Pthread_mutex_lock(&m);
    while (bytesLeft < size)
        Pthread_cond_wait(&c, &m);
    void *ptr = ...; // get mem from heap
    bytesLeft -= size;
    Pthread_mutex_unlock(&m);
    return ptr;
}

void free(void *ptr, int size) {
    Pthread_mutex_lock(&m);
    bytesLeft += size;
    Pthread_cond_signal(&c); // whom to signal??
    Pthread_mutex_unlock(&m);
}

Problem:
the code does not work, as the thread waking other threads does not know which thread (or threads) to wake up.

Fix: since we don’t know which thread to wake up, just tell all of them.
The solution suggested by Lampson and Redell is straightforward: replace the pthread_cond_signal() call in the code above with a call to pthread_cond_broadcast(), which wakes up all waiting threads.
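In the allocator above, the change is one line in free():

void free(void *ptr, int size) {
    Pthread_mutex_lock(&m);
    bytesLeft += size;
    Pthread_cond_broadcast(&c); // wake ALL waiters; each re-checks its own condition
    Pthread_mutex_unlock(&m);
}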

The fix has a downside:
a negative performance impact, as we might needlessly wake up many other waiting threads that shouldn’t (yet) be awake.

Lampson and Redell call such a condition a covering condition, as it covers all the cases where a thread needs to wake up (conservatively); the cost, as we’ve discussed, is that too many threads might be woken.

If you really have to fall back on broadcast, there is usually a bug; in other words, this is a stopgap measure.
if you find that your program only works when you change your signals to broadcasts (but you don’t think it should need to), you probably have a bug; fix it!

Chapter 31 Semaphores

Chapter summary:

  • Introduces semaphores, a concept one level more abstract than locks and condition variables.
  • Shows how to implement locks and condition variables with semaphores.
  • Shows how to solve the producer/consumer problem and the reader-writer pattern with semaphores.
  • Covers deadlock and other problems that arise while solving the above.
  • Introduces the dining philosophers problem, i.e., concurrency under a circular dependency.

31.0 Intro

THE CRUX: HOW TO USE SEMAPHORES
How can we use semaphores instead of locks and condition variables?
What is the definition of a semaphore? What is a binary semaphore?
Is it straightforward to build a semaphore out of locks and condition variables?
What about building locks and condition variables out of semaphores?

31.1 Semaphores: A Definition

POSIX API

1
2
3
#include <semaphore.h>
sem_t s;
sem_init(&s, 0, 1);
  • argument 1: the initial integer value of the semaphore.

  • argument 2

    • 0: the semaphore is shared between threads in the same process.
    • nonzero: the semaphore can be used to synchronize access across different processes (it must then live in shared memory).
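A sketch of the cross-process case; the semaphore must live in memory shared by the processes, and mmap() is one common way to get such memory (this exact setup is an assumption, not from the book; it is a fragment, not a full program):

#include <semaphore.h>
#include <sys/mman.h>

// place the semaphore in memory shared between parent and child,
// then initialize it with pshared = 1
sem_t *s = mmap(NULL, sizeof(sem_t), PROT_READ | PROT_WRITE,
                MAP_ANONYMOUS | MAP_SHARED, -1, 0);
sem_init(s, 1, 1); // second argument nonzero: share across processes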

Two functions interact with it: sem_wait() and sem_post().

int sem_wait(sem_t *s) {
    decrement the value of semaphore s by one
    wait if value of semaphore s is negative
}

int sem_post(sem_t *s) {
    increment the value of semaphore s by one
    if there are one or more threads waiting, wake one
}

Note:

  • multiple calling threads may call into sem_wait(), and thus all be queued waiting to be woken.
  • sem_post() does not wait for some particular condition to hold like sem_wait() does.
  • the value of the semaphore, when negative, is equal to the number of waiting threads.

31.2 Binary Semaphores (Locks)

using a semaphore as a lock

sem_t m;
sem_init(&m, 0, 1);
sem_wait(&m);
// critical section here
sem_post(&m);

Thread Trace: Single Thread Using A Semaphore

$$ \begin{array}{c|c|c} \text { Value of Semaphore } & \text { Thread 0 } & \text { Thread 1 } \\ \hline 1 & \\ 1 & \text { call sem\_wait ( ) } \\ 0 & \text { sem\_wait ( ) returns } \\ 0 & \text { (crit sect) } \\ 0 & \text { call sem\_post () } \\ 1 & \text { sem\_post ( ) returns } \end{array} $$

Thread Trace: Two Threads Using A Semaphore

$$ \begin{array}{c|cc|cc} \text { Value } & \text { Thread } 0 & \text { State } & \text { Thread 1 } & \text { State } \\ \hline 1 & & \text { Running } & & \text { Ready } \\ 1 & \text { call sem\_wait ( ) } & \text { Running } & & \text { Ready } \\ 0 & \text { sem\_wait ( ) returns } & \text { Running } & & \text { Ready } \\ 0 & \text { (crit sect }: \text { begin) } & \text { Running } & & \text { Ready } \\ 0 & \text { Interrupt; Switch } \rightarrow \text { T1 } & \text { Ready } & & \text { Running } \\ 0 & & \text { Ready } & \text { call sem\_wait () } & \text { Running } \\ -1 & & \text { Ready } & \text { decrement sem } & \text { Running } \\ -1 & & \text { Ready } & \text { (sem }<0) \rightarrow \text { sleep } & \text { Sleeping } \\ -1 & & \text { Running } & \text { Switch } \rightarrow \text { T0 } & \text { Sleeping } \\ -1 & \text { (crit sect }: \text { end ) } & \text { Running } & & \text { Sleeping } \\ -1 & \text { call sem\_post ( ) } & \text { Running } & & \text { Sleeping } \\ 0 & \text { increment sem } & \text { Running } & & \text { Sleeping } \\ 0 & \text { wake ( } 1 \text { ) } & \text { Running } & & \text { Ready } \\ 0 & \text { sem\_post ( ) returns } & \text { Running } & & \text { Ready } \\ 0 & \text { Interrupt; Switch } \rightarrow \text { T1 } & \text { Ready } & & \text { Running } \\ 0 & & \text { Ready } & \text { sem\_wait () returns } & \text { Running } \\ 0 & & \text { Ready } & \text { (crit sect) } & \text { Running } \\ 0 & & \text { Ready } & \text { call sem\_post ( ) } & \text { Running } \\ 1 & & \text { Ready } & \text { sem\_post () returns } & \text { Running } \end{array} $$

31.3 Semaphores As Condition Variables

sem_t s;

void *child(void *arg) {
    printf("child\n");
    sem_post(&s); // signal here: child is done
    return NULL;
}

int main(int argc, char *argv[]) {
    sem_init(&s, 0, X); // what should X be? (answer: 0)
    printf("parent: begin\n");
    pthread_t c;
    Pthread_create(&c, NULL, child, NULL);
    sem_wait(&s); // wait here for child
    printf("parent: end\n");
    return 0;
}

First case: the parent creates the child, but the child has not run yet.

$$ \begin{array}{c|lc|cc} \text { Value } & \text { Parent } & \text { State } & \text { Child } & \text { State } \\ \hline 0 & \text { create (child) } & \text { Running } & \text { (Child exists; is runnable) } & \text { Ready } \\ 0 & \text { call sem\_wait () } & \text { Running } & & \text { Ready } \\ -1 & \text { decrement sem } & \text { Running } & & \text { Ready } \\ -1 & (\text { sem }<0) \rightarrow \text { sleep } & \text { Sleeping } & & \text { Ready } \\ -1 & \text { Switch } \rightarrow \text { Child } & \text { Sleeping } & \text { child runs } & \text { Running } \\ -1 & & \text { Sleeping } & \text { call sem\_post () } & \text { Running } \\ 0 & & \text { Sleeping } & \text { increment sem } & \text { Running } \\ 0 & & \text { Ready } & \text { wake (Parent) } & \text { Running } \\ 0 & & \text { Ready } & \text { sem\_post () returns } & \text { Running } \\ 0 & & \text { Ready } & \text { Interrupt; Switch Parent } & \text { Ready } \\ 0 & \text { sem\_wait ( ) returns } & \text { Ready } & & \text { Ready } \end{array} $$

The second case occurs when the child runs to completion before the parent gets a chance to call sem_wait().

$$ \begin{array}{c|lc|cc} \text { Value } & \text { Parent } & \text { State } & \text { Child } & \text { State } \\ \hline 0 & \text { create (Child) } & \text { Running } & \text { (Child exists; is runnable) } & \text { Ready } \\ 0 & \text { Interrupt; Switch } \rightarrow \text { Child } & \text { Ready } & \text { child runs } & \text { Running } \\ 0 & & \text { Ready } & \text { call sem\_post () } & \text { Running } \\ 1 & & \text { Ready } & \text { increment sem } & \text { Running } \\ 1 & & \text { Ready } & \text { wake (nobody) } & \text { Running } \\ 1 & & \text { Ready } & \text { sem\_post () returns } & \text { Running } \\ 1 & \text { parent runs } & \text { Running } & \text { Interrupt; Switch } \rightarrow \text { Parent } & \text { Ready } \\ 1 & \text { call sem\_wait () } & \text { Running } & & \text { Ready } \\ 0 & \text { decrement sem } & \text { Running } & & \text { Ready } \\ 0 & (\text { sem } \geq 0) \rightarrow \text { awake } & \text { Running } & & \text { Ready } \\ 0 & \text { sem\_wait () returns } & \text { Running } & & \text { Ready } \end{array} $$

31.4 The Producer/Consumer (Bounded-Buffer) Problem

First Attempt

The code for the put and get routines

int buffer[MAX];
int fill = 0;
int use = 0;

void put(int value) {
    buffer[fill] = value;    // line f1
    fill = (fill + 1) % MAX; // line f2
}

int get() {
    int tmp = buffer[use];   // line g1
    use = (use + 1) % MAX;   // line g2
    return tmp;
}

A first attempt at solving the producer/consumer problem:

sem_t empty;
sem_t full;

void *producer(void *arg) {
    int i;
    for (i = 0; i < loops; i++) {
        sem_wait(&empty); // line P1
        put(i);           // line P2
        sem_post(&full);  // line P3
    }
}

void *consumer(void *arg) {
    int i, tmp = 0;
    while (tmp != -1) {
        sem_wait(&full);  // line C1
        tmp = get();      // line C2
        sem_post(&empty); // line C3
        printf("%d\n", tmp);
    }
}

int main(int argc, char *argv[]) {
    // ...
    sem_init(&empty, 0, MAX); // MAX buffers are empty to begin with...
    sem_init(&full, 0, 0);    // ... and 0 are full
    // ...
}

Problem:
When MAX is greater than 1, we now have a problem: a race condition.

A Solution: Adding Mutual Exclusion

sem_t empty;
sem_t full;
sem_t mutex;

void *producer(void *arg) {
    int i;
    for (i = 0; i < loops; i++) {
        sem_wait(&mutex); // line p0 (NEW LINE)
        sem_wait(&empty); // line p1
        put(i);           // line p2
        sem_post(&full);  // line p3
        sem_post(&mutex); // line p4 (NEW LINE)
    }
}

void *consumer(void *arg) {
    int i;
    for (i = 0; i < loops; i++) {
        sem_wait(&mutex); // line c0 (NEW LINE)
        sem_wait(&full);  // line c1
        int tmp = get();  // line c2
        sem_post(&empty); // line c3
        sem_post(&mutex); // line c4 (NEW LINE)
        printf("%d\n", tmp);
    }
}

int main(int argc, char *argv[]) {
    // ...
    sem_init(&empty, 0, MAX); // MAX buffers are empty to begin with...
    sem_init(&full, 0, 0);    // ... and 0 are full
    sem_init(&mutex, 0, 1);   // mutex=1 because it is a lock (NEW LINE)
    // ...
}

Still a problem: deadlock.
That seems like the right idea, but it also doesn’t work. Why? Deadlock.

Avoiding Deadlock

Cause: The consumer holds the mutex and is waiting for someone to signal full. The producer could signal full but is waiting for the mutex. Thus, the producer and consumer are each stuck waiting for each other: a classic deadlock.

Finally, A Working Solution
To solve this problem, we simply must reduce the scope of the lock: move the mutex acquire and release to be just around the critical section.

sem_t empty;
sem_t full;
sem_t mutex;

void *producer(void *arg) {
    int i;
    for (i = 0; i < loops; i++) {
        sem_wait(&empty); // line p1
        sem_wait(&mutex); // line p1.5 (MOVED MUTEX HERE...)
        put(i);           // line p2
        sem_post(&mutex); // line p2.5 (... AND HERE)
        sem_post(&full);  // line p3
    }
}

void *consumer(void *arg) {
    int i;
    for (i = 0; i < loops; i++) {
        sem_wait(&full);  // line c1
        sem_wait(&mutex); // line c1.5 (MOVED MUTEX HERE...)
        int tmp = get();  // line c2
        sem_post(&mutex); // line c2.5 (... AND HERE)
        sem_post(&empty); // line c3
        printf("%d\n", tmp);
    }
}

int main(int argc, char *argv[]) {
    // ...
    sem_init(&empty, 0, MAX); // MAX buffers are empty to begin with...
    sem_init(&full, 0, 0);    // ... and 0 are full
    sem_init(&mutex, 0, 1);   // mutex=1 because it is a lock
    // ...
}

31.5 Reader-Writer Locks

The idea behind reader-writer locks is not re-introduced here; it again reflects the lazy design philosophy.

typedef struct _rwlock_t {
    sem_t lock;      // binary semaphore (basic lock)
    sem_t writelock; // used to allow ONE writer or MANY readers
    int readers;     // count of readers reading in critical section
} rwlock_t;

void rwlock_init(rwlock_t *rw) {
    rw->readers = 0;
    sem_init(&rw->lock, 0, 1);
    sem_init(&rw->writelock, 0, 1);
}

void rwlock_acquire_readlock(rwlock_t *rw) {
    sem_wait(&rw->lock);
    rw->readers++;
    if (rw->readers == 1)
        sem_wait(&rw->writelock); // first reader acquires writelock
    sem_post(&rw->lock);
}

void rwlock_release_readlock(rwlock_t *rw) {
    sem_wait(&rw->lock);
    rw->readers--;
    if (rw->readers == 0)
        sem_post(&rw->writelock); // last reader releases writelock
    sem_post(&rw->lock);
}

void rwlock_acquire_writelock(rwlock_t *rw) {
    sem_wait(&rw->writelock);
}

void rwlock_release_writelock(rwlock_t *rw) {
    sem_post(&rw->writelock);
}

Net effect: once a reader has acquired a read lock, more readers will be allowed to acquire the read lock too; however, any thread that wishes to acquire the write lock will have to wait until all readers are finished; the last one to exit the critical section calls sem_post() on “writelock” and thus enables a waiting writer to acquire the lock.
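A minimal usage sketch (the shared counter and thread bodies are assumptions for illustration; rwlock_init(&rw) must be called before any thread starts):

rwlock_t rw;
int value = 0;

void *reader(void *arg) {
    rwlock_acquire_readlock(&rw);
    printf("read %d\n", value); // many readers may run here at once
    rwlock_release_readlock(&rw);
    return NULL;
}

void *writer(void *arg) {
    rwlock_acquire_writelock(&rw);
    value++;                    // writers have exclusive access
    rwlock_release_writelock(&rw);
    return NULL;
}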

Drawbacks:
some negatives, especially when it comes to fairness. In particular, it would be relatively easy for readers to starve writers.

reader-writer locks should be used with some caution. They often add more overhead (especially with more sophisticated implementations), and thus do not end up speeding up performance as compared to just using simple and fast locking primitives.

TIP: SIMPLE AND DUMB CAN BE BETTER (HILL’S LAW)

31.6 The Dining Philosophers

A rather abstract problem:
One of the most famous concurrency problems, posed and solved by Dijkstra, is known as the dining philosophers problem; however, its practical utility is low.

Description

the basic loop of each philosopher:

while (1) {
    think();
    getforks();
    eat();
    putforks();
}
  • The key challenge
    is to write the routines getforks() and putforks() such that there is no deadlock, no philosopher starves and never gets to eat, and concurrency is high (i.e., as many philosophers can eat at the same time as possible).

  • helper functions

int left(int p)  { return p; }
int right(int p) { return (p + 1) % 5; }
  • We also need five semaphores, one for each fork: sem_t forks[5].

Broken Solution

Assume we initialize each semaphore (in the forks array) to a value of 1. Assume also that each philosopher knows its own number (p).

void getforks() {
    sem_wait(&forks[left(p)]);
    sem_wait(&forks[right(p)]);
}

void putforks() {
    sem_post(&forks[left(p)]);
    sem_post(&forks[right(p)]);
}

The problem is deadlock. If each philosopher happens to grab the fork on their left before any philosopher can grab the fork on their right, each will be stuck holding one fork and waiting for another, forever.

A Solution: Breaking The Dependency

This is how Dijkstra himself solved the problem. Specifically, let’s assume that philosopher 4 (the highest numbered one) acquires the forks in a different order.

void getforks() {
    if (p == 4) {
        sem_wait(&forks[right(p)]);
        sem_wait(&forks[left(p)]);
    } else {
        sem_wait(&forks[left(p)]);
        sem_wait(&forks[right(p)]);
    }
}

Because the last philosopher tries to grab right before left, there is no situation where each philosopher grabs one fork and is stuck waiting for another; the cycle of waiting is broken.

There are other “famous” problems like this one, e.g., the cigarette smoker’s problem or the sleeping barber problem.

31.7 How To Implement Semaphores

Implemented by wrapping locks and condition variables.
Let’s use our low-level synchronization primitives, locks and condition variables, to build our own version of semaphores called … (drum roll here) … Zemaphores.

typedef struct __Zem_t {
    int value;
    pthread_cond_t cond;
    pthread_mutex_t lock;
} Zem_t;

// only one thread can call this
void Zem_init(Zem_t *s, int value) {
    s->value = value;
    Cond_init(&s->cond);
    Mutex_init(&s->lock);
}

void Zem_wait(Zem_t *s) {
    Mutex_lock(&s->lock);
    while (s->value <= 0)
        Cond_wait(&s->cond, &s->lock);
    s->value--;
    Mutex_unlock(&s->lock);
}

void Zem_post(Zem_t *s) {
    Mutex_lock(&s->lock);
    s->value++;
    Cond_signal(&s->cond);
    Mutex_unlock(&s->lock);
}

One subtle difference between our Zemaphore and pure semaphores as defined by Dijkstra is that we don’t maintain the invariant that the value of the semaphore, when negative, reflects the number of waiting threads; indeed, the value will never be lower than zero. This behavior is easier to implement and matches the current Linux implementation.

TIP: BE CAREFUL WITH GENERALIZATION
One could view semaphores as a generalization of locks and condition variables; however, is such a generalization needed? And, given the difficulty of realizing a condition variable on top of a semaphore, perhaps this generalization is not as general as you might think.

Chapter 32 Common Concurrency Problems

Chapter summary

  • Surveys common concurrency bugs: atomicity-violation bugs, order-violation bugs, and deadlock bugs.
  • Atomicity violations are fixed with locks; order violations with condition variables.
  • Lists the four necessary conditions for deadlock and, from them, the strategies to prevent it: circular wait, hold-and-wait, no preemption, mutual exclusion. Also mentions the (not very practical) idea of replacing locks with scheduling.
  • Every strategy is a trade-off among programmer effort, hardware/API support, performance, and a small probability of failure.
  • Touches on deadlock-detection techniques: detect and recover.
  • Endorses the future of lock-free programming models such as Google’s MapReduce.

32.0 Intro

CRUX: HOW TO HANDLE COMMON CONCURRENCY BUGS
Concurrency bugs tend to come in a variety of common patterns. Knowing which ones to look out for is the first step to writing more robust, correct concurrent code.

32.1 What Types Of Bugs Exist?

Building on prior research:
We rely upon a study by Lu et al. [L+08], which analyzes a number of popular concurrent applications in great detail to understand what types of bugs arise in practice.

32.2 Non-Deadlock Bugs

Atomicity-Violation Bugs

Thread 1::
if (thd->proc_info) {
    ...
    fputs(thd->proc_info, ...);
    ...
}

Thread 2::
thd->proc_info = NULL;

Definition: the more formal definition of an atomicity violation is this: “The desired serializability among multiple memory accesses is violated.”

Fix: simply add locks around the shared-variable references.

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

Thread 1::
pthread_mutex_lock(&lock);
if (thd->proc_info) {
    ...
    fputs(thd->proc_info, ...);
    ...
}
pthread_mutex_unlock(&lock);

Thread 2::
pthread_mutex_lock(&lock);
thd->proc_info = NULL;
pthread_mutex_unlock(&lock);

Order-Violation Bugs

Example code:

Thread 1::
void init() {
    ...
    mThread = PR_CreateThread(mMain, ...);
    ...
}

Thread 2::
void mMain(...) {
    ...
    mState = mThread->State;
    ...
}

Problem: the code in Thread 2 assumes that the variable mThread has already been initialized (and is not NULL); however, if Thread 1 does not happen to run first, we are out of luck, and Thread 2 will likely crash with a NULL-pointer dereference.

Definition: the more formal definition of an order violation is this: “The desired order between two (groups of) memory accesses is flipped.”

Fix: the fix to this type of bug is generally to enforce ordering, using condition variables.

pthread_mutex_t mtLock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t mtCond = PTHREAD_COND_INITIALIZER;
int mtInit = 0;

Thread 1::
void init() {
    ...
    mThread = PR_CreateThread(mMain, ...);

    // signal that the thread has been created...
    pthread_mutex_lock(&mtLock);
    mtInit = 1;
    pthread_cond_signal(&mtCond);
    pthread_mutex_unlock(&mtLock);
    ...
}

Thread 2::
void mMain(...) {
    ...
    // wait for the thread to be initialized...
    pthread_mutex_lock(&mtLock);
    while (mtInit == 0)
        pthread_cond_wait(&mtCond, &mtLock);
    pthread_mutex_unlock(&mtLock);

    mState = mThread->State;
    ...
}

Non-Deadlock Bugs: Summary

Beyond these two patterns there are few others: a large fraction (97%) of non-deadlock bugs studied by Lu et al. are either atomicity or order violations.

32.3 Deadlock Bugs

Deadlock occurs, for example, when a thread (say Thread 1) is holding a lock (L1) and waiting for another one (L2); unfortunately, the thread (Thread 2) that holds lock L2 is waiting for L1 to be released.

deadlock does not necessarily occur; rather, it may occur.

CRUX: HOW TO DEAL WITH DEADLOCK
How should we build systems to prevent, avoid, or at least detect and recover from deadlock?
Is this a real problem in systems today?

Why Do Deadlocks Occur?

  1. circular dependencies
  2. the nature of encapsulation.
    For example, take the Java Vector class and the method AddAll().
    Vector v1, v2;
    v1.AddAll(v2);
    The routine acquires said locks in some arbitrary order (say v1 then v2) in order to add the contents of v2 to v1. If some other thread calls v2.AddAll(v1) at nearly the same time, we have the potential for deadlock, all in a way that is quite hidden from the calling application.
  • Conditions for Deadlock

    • Mutual exclusion: Threads claim exclusive control of resources that they require (e.g., a thread grabs a lock).
    • Hold-and-wait: Threads hold resources allocated to them (e.g., locks that they have already acquired) while waiting for additional resources (e.g., locks that they wish to acquire).
    • No preemption: Resources (e.g., locks) cannot be forcibly removed from threads that are holding them.
    • Circular wait: There exists a circular chain of threads such that each thread holds one or more resources (e.g., locks) that are being requested by the next thread in the chain.
    • If any of these four conditions are not met, deadlock cannot occur.

These four conditions also give the four angles for deadlock prevention.

1. Circular Wait

How? Provide a total ordering on lock acquisition.
Drawbacks:

  1. This approach requires careful design of global locking strategies and must be done with great care. Further, it is just a convention, and a sloppy programmer can easily ignore the locking protocol and potentially cause deadlock.
  2. Finally, it requires a deep understanding of the code base, and how various routines are called; just one mistake could result in the wrong ordering of lock acquisition, and hence deadlock.
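One practical way to approximate such an ordering, suggested elsewhere in the book, is to acquire locks by address; a sketch for a routine that must grab two mutexes (assumes m1 != m2):

void do_both(pthread_mutex_t *m1, pthread_mutex_t *m2) {
    // always acquire in address order, so do_both(L1, L2) and
    // do_both(L2, L1) grab the locks the same way and cannot deadlock
    if (m1 > m2) {
        pthread_mutex_lock(m1);
        pthread_mutex_lock(m2);
    } else {
        pthread_mutex_lock(m2);
        pthread_mutex_lock(m1);
    }
    // ... critical section ...
    pthread_mutex_unlock(m1);
    pthread_mutex_unlock(m2);
}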

2. Hold-and-wait

Acquire all locks at once, atomically, by adding a wrapper lock:

lock(prevention);
lock(L1);
lock(L2);
...
unlock(prevention);

Idea: By first grabbing the lock prevention, this code guarantees that no untimely thread switch can occur in the midst of lock acquisition and thus deadlock can once again be avoided.
Note that the solution is problematic for a number of reasons.

  • Encapsulation works against us.
  • Further, the approach likely decreases concurrency as all locks must be acquired early on (at once) instead of when they are truly needed.

3. No Preemption

A more flexible interface: trylock()
Many thread libraries provide a more flexible set of interfaces to help avoid this situation. Specifically, a trylock() routine will grab the lock (if it is available) or return -1 indicating that the lock is held right now and that you should try again later if you want to grab that lock.
a deadlock-free, ordering-robust lock acquisition protocol:

top:
    lock(L1);
    if (trylock(L2) == -1) {
        unlock(L1);
        goto top;
    }

Note that another thread could follow the same protocol but grab the locks in the other order (L2 then L1) and the program would still be deadlock-free.

Livelock
Using trylock() raises one new problem: livelock.

  • It is possible (though perhaps unlikely) that two threads could both be repeatedly attempting this sequence and repeatedly failing to acquire both locks. In this case, both systems are running through this code sequence over and over again (and thus it is not a deadlock), but progress is not being made, hence the name livelock.

Fix: There are solutions to the livelock problem, too:

  1. One could add a random delay before looping back and trying the entire thing over again, thus decreasing the odds of repeated interference among competing threads (see the sketch after this list).

  2. One final point about this solution: it skirts around the hard parts of using a trylock approach. The first problem that would likely arise again is due to encapsulation, as follows:

    • if one of these locks is buried in some routine that is getting called, the jump back to the beginning becomes more complex to implement.
    • If the code had acquired some resources (other than L1) along the way, it must make sure to carefully release them as well.
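A sketch of the random-delay variant, using the real POSIX pthread_mutex_trylock() (which returns 0 on success) rather than the pseudocode trylock() above:

#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

void acquire_both(pthread_mutex_t *L1, pthread_mutex_t *L2) {
    while (1) {
        pthread_mutex_lock(L1);
        if (pthread_mutex_trylock(L2) == 0)
            return;               // got both locks
        pthread_mutex_unlock(L1); // back out completely...
        usleep(rand() % 1000);    // ...and retry after a random delay
    }
}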

4. Mutual Exclusion

Avoid the need for mutual exclusion at all.
How?
One could design various data structures to be wait-free. The idea here is simple: using powerful hardware instructions, you can build data structures in a manner that does not require explicit locking.

  • an atomic instruction provided by the hardware that does the following
int CompareAndSwap(int *address, int expected, int new) {
    if (*address == expected) {
        *address = new;
        return 1; // success
    }
    return 0; // failure
}
  • atomically increment a value by a certain amount.
void AtomicIncrement(int *value, int amount) {
    int old; // must be visible to the while condition below
    do {
        old = *value;
    } while (CompareAndSwap(value, old, old + amount) == 0);
}

Idea: Instead of acquiring a lock, doing the update, and then releasing it, we have instead built an approach that repeatedly tries to update the value to the new amount and uses the compare-and-swap to do so. In this manner, no lock is acquired, and no deadlock can arise (though livelock is still a possibility).

Example: list insertion, inserting at the head of a list:

void insert(int value) {
    node_t *n = malloc(sizeof(node_t));
    assert(n != NULL);
    n->value = value;
    n->next = head;
    head = n;
}

Surrounding this code with a lock acquire and release:

void insert(int value) {
    node_t *n = malloc(sizeof(node_t));
    assert(n != NULL);
    n->value = value;
    lock(listlock);   // begin critical section
    n->next = head;
    head = n;
    unlock(listlock); // end of critical section
}

Or perform this insertion in a wait-free manner, simply using the compare-and-swap instruction:

void insert(int value) {
    node_t *n = malloc(sizeof(node_t));
    assert(n != NULL);
    n->value = value;
    do {
        n->next = head;
    } while (CompareAndSwap(&head, n->next, n) == 0); // retry until the swap succeeds
}

Deadlock Avoidance via Scheduling

Avoidance requires some global knowledge of which locks various threads might grab during their execution, and subsequently schedules said threads in a way as to guarantee no deadlock can occur.

(The book illustrates this with a scheduling diagram: a smart scheduler staggers threads that need the same locks so they never run concurrently; the figure is omitted here.)

The cost is performance.

One famous example of an approach like this is Dijkstra’s Banker’s Algorithm.

Avoidance of deadlock via scheduling is not a widely-used general-purpose solution.

TIP: DON’T ALWAYS DO IT PERFECTLY (TOM WEST’S LAW)

Detect and Recover

A deadlock detector runs periodically, building a resource graph and checking it for cycles. In the event of a cycle (deadlock), the system needs to be restarted. If more intricate repair of data structures is first required, a human being may be involved to ease the process.

32.4 Summary

Perhaps the best solution is to develop new concurrent programming models: in systems such as MapReduce (from Google), programmers can describe certain types of parallel computations without any locks whatsoever. Locks are problematic by their very nature; perhaps we should seek to avoid using them unless we truly must.

Chapter 33 Event-based Concurrency (Advanced)

Chapter summary

  • Introduces event-based concurrency and why it exists; the basic idea is the event loop.
  • Introduces the select() and poll() interfaces.
  • The benefit: no need to manage the underlying threads, and near-scheduler control over what runs when.
  • The problems: blocking system calls cannot be used, and switching between events requires paying for state management.
  • The remedies: issue I/O asynchronously, and use the continuation technique.
  • Other issues with event-based concurrency: in the multicore era, event-based servers must again consider locks, condition variables, and the like, so the advantage is less clear; page faults are hard to avoid; the exact semantics of events must be pinned down; and asynchronous I/O is still not universal.

Impression
This chapter is perhaps the closest to concurrent development with C++11: both std::packaged_task and std::async build on this kind of event-based concurrency.

33.0 Intro

Why introduce event-based concurrency?
The problem that event-based concurrency addresses is two-fold.

  1. Managing concurrency correctly in multi-threaded applications can be challenging.
  2. In a multi-threaded application, the developer has little or no control over what is scheduled at a given moment in time.

THE CRUX:
HOW TO BUILD CONCURRENT SERVERS WITHOUT THREADS
How can we build a concurrent server without using threads, and thus retain control over concurrency as well as avoid some of the problems that seem to plague multi-threaded applications?

33.1 The Basic Idea: An Event Loop

Idea: The approach is quite simple: you simply wait for something (i.e., an “event”) to occur; when it does, you check what type of event it is and do the small amount of work it requires (which may include issuing I/O requests, or scheduling other events for future handling, etc.).
Implementation: a simple construct known as the event loop.

while (1) {
    events = getEvents();
    for (e in events)
        processEvent(e);
}
  • the code that processes each event is known as an event handler.
  • deciding which event to handle next is equivalent to scheduling. This explicit control over scheduling is one of the fundamental advantages of the event-based approach.

33.2 An Important API: select() (or poll())

How do we receive events? In most systems, a basic API is available, via either the select() or poll() system calls.

  1. select()
int select(int nfds,
           fd_set *restrict readfds,
           fd_set *restrict writefds,
           fd_set *restrict errorfds,
           struct timeval *restrict timeout);

First, note that it lets you check whether descriptors can be read from as well as written to.
Second, note the timeout argument. One common usage here is to set the timeout to NULL, which causes select() to block indefinitely, until some descriptor is ready. However, more robust servers will usually specify some kind of timeout; one common technique is to set the timeout to zero, and thus use the call to select() to return immediately.
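A sketch of both timeout styles (readFDs and maxFD as set up in the example in the next section):

// block indefinitely until some descriptor is ready
select(maxFD + 1, &readFDs, NULL, NULL, NULL);

// poll: return immediately, ready or not
struct timeval tv = { 0, 0 };
select(maxFD + 1, &readFDs, NULL, NULL, &tv);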

  2. The poll() system call is quite similar.

ASIDE: BLOCKING (SYNCHRONOUS) VS. NON-BLOCKING (ASYNCHRONOUS) INTERFACES
Blocking (or synchronous) interfaces do all of their work before returning to the caller; non-blocking (or asynchronous) interfaces begin some work but return immediately, thus letting whatever work that needs to be done get done in the background.

  • The usual culprit in blocking calls is I/O of some kind. For example, if a call must read from disk in order to complete, it might block, waiting for the I/O request that has been sent to the disk to return.
  • Non-blocking interfaces can be used in any style of programming (e.g., with threads), but are essential in the event-based approach, as a call that blocks will halt all progress.
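For a socket or pipe descriptor, non-blocking mode can be enabled with fcntl(); a sketch (fd is assumed to be an already-open descriptor):

#include <fcntl.h>

// reads/writes on fd that would block now fail with EAGAIN/EWOULDBLOCK
int flags = fcntl(fd, F_GETFL, 0);
fcntl(fd, F_SETFL, flags | O_NONBLOCK);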

33.3 Using select()

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

int main(void) {
    // open and set up a bunch of sockets (not shown)
    // main loop
    while (1) {
        // initialize the fd_set to all zero
        fd_set readFDs;
        FD_ZERO(&readFDs);

        // now set the bits for the descriptors
        // this server is interested in
        // (for simplicity, all of them from min to max)
        int fd;
        for (fd = minFD; fd < maxFD; fd++)
            FD_SET(fd, &readFDs);

        // do the select
        int rc = select(maxFD + 1, &readFDs, NULL, NULL, NULL);

        // check which actually have data using FD_ISSET()
        for (fd = minFD; fd < maxFD; fd++)
            if (FD_ISSET(fd, &readFDs))
                processFD(fd);
    }
}

33.4 Why Simpler? No Locks Needed

because only one event is being handled at a time, there is no need to acquire or release locks; the event-based server cannot be interrupted by another thread because it is decidedly single threaded.

TIP: DON’T BLOCK IN EVENT-BASED SERVERS
Event-based servers enable fine-grained control over scheduling of tasks. However, to maintain such control, no call that blocks the execution of the caller can ever be made; failing to obey this design tip will result in a blocked event-based server, frustrated clients, and serious questions as to whether you ever read this part of the book.

33.5 A Problem: Blocking System Calls

The problem: what if an event requires that you issue a system call that might block?

With an event-based approach, however, there are no other threads to run: just the main event loop. And this implies that if an event handler issues a call that blocks, the entire server will do just that: block until the call completes. When the event loop blocks, the system sits idle, and thus is a huge potential waste of resources. We thus have a rule that must be obeyed in event-based systems: no blocking calls are allowed.

33.6 A Solution: Asynchronous I/O

Background: many modern operating systems have introduced new ways to issue I/O requests to the disk system, referred to generically as asynchronous I/O.
On Mac OS X (other systems have similar APIs), the interface centers on the struct aiocb, or AIO control block in common terminology.

struct aiocb {
    int aio_fildes;         /* File descriptor */
    off_t aio_offset;       /* File offset */
    volatile void *aio_buf; /* Location of buffer */
    size_t aio_nbytes;      /* Length of transfer */
};

usage

  • To issue an asynchronous read to a file, an application should first fill in this structure with the relevant information: the file descriptor of the file to be read (aio_fildes), the offset within the file (aio_offset) as well as the length of the request (aio_nbytes), and finally the target memory location into which the results of the read should be copied (aio_buf).
  • After this structure is filled in, the application must issue the asynchronous call to read the file; on Mac OS X, this API is simply the asynchronous read API: int aio_read(struct aiocb *aiocbp);
  • How can we tell when an I/O is complete, and thus that the buffer (pointed to by aio_buf) now has the requested data within it?
    int aio_error(const struct aiocb *aiocbp);
    an application can periodically poll the system via a call to aio_error() to determine whether said I/O has yet completed.
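Putting the pieces together, a minimal sketch (the file name "data.txt" is hypothetical; error checking is omitted):

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>

char buf[1024];
struct aiocb cb;
memset(&cb, 0, sizeof(cb));
cb.aio_fildes = open("data.txt", O_RDONLY); // hypothetical input file
cb.aio_offset = 0;
cb.aio_buf    = buf;
cb.aio_nbytes = sizeof(buf);

aio_read(&cb); // issue the read; returns immediately

while (aio_error(&cb) == EINPROGRESS)
    ; // in a real server, do other useful work here instead of spinning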

Problem:
It is painful to check whether an I/O has completed; if a program has tens or hundreds of I/Os issued at a given point in time, should it simply keep checking each of them repeatedly, or wait a little while first, or … ?

Remedy:
To remedy this issue, some systems provide an approach based on the interrupt. This method uses UNIX signals to inform applications when an asynchronous I/O completes, thus removing the need to repeatedly ask the system.
In systems without asynchronous I/O, the pure event-based approach cannot be implemented.

ASIDE: UNIX SIGNALS

  • At its simplest, signals provide a way to communicate with a process.
  • Specifically, a signal can be delivered to an application; doing so stops the application from whatever it is doing to run a signal handler, i.e., some code in the application to handle that signal. When finished, the process just resumes its previous behavior.
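The book’s aside gives a tiny example along these lines; a sketch:

#include <stdio.h>
#include <signal.h>

void handle(int arg) {
    printf("stop wakin' me up...\n");
}

int main(int argc, char *argv[]) {
    signal(SIGHUP, handle); // install a handler for SIGHUP
    while (1)
        ; // spin; each `kill -HUP <pid>` runs handle(), then execution resumes
    return 0;
}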

33.7 Another Problem: State Management

Another issue with the event-based approach is that such code is generally more complicated to write than traditional thread-based code.

The reason is as follows: when an event handler issues an asynchronous I/O, it must package up some program state for the next event handler to use when the I/O finally completes; this additional work is not needed in thread-based programs.

Fix: use an old programming-language construct known as a continuation. Though it sounds complicated, the idea is rather simple: basically, record the needed information to finish processing this event in some data structure; when the event happens (i.e., when the disk I/O completes), look up the needed information and process the event.
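A minimal sketch of the idea, with hypothetical names: record what the next handler will need, keyed by the outstanding I/O, and finish the request when its completion event arrives:

#include <aio.h>
#include <stdlib.h>
#include <unistd.h>

// hypothetical continuation: everything needed to finish this request
// once its asynchronous read completes
typedef struct continuation {
    struct aiocb cb;           // the outstanding I/O
    int client_fd;             // client to answer when the read is done
    struct continuation *next; // pending-list linkage
} continuation_t;

continuation_t *pending = NULL; // in-flight requests, polled by the event loop

// called from the event loop once cb is found to be complete
void finish_request(continuation_t *k) {
    write(k->client_fd, (const void *)k->cb.aio_buf, k->cb.aio_nbytes);
    free(k);
}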

33.8 What Is Still Difficult With Events

  1. when systems moved from a single CPU to multiple CPUs, some of the simplicity of the event-based approach disappeared.
    in order to utilize more than one CPU, the event server has to run multiple event handlers in parallel; when doing so, the usual synchronization problems (e.g., critical sections) arise, and the usual solutions (e.g., locks) must be employed.

  2. it does not integrate well with certain kinds of systems activity, such as paging.
    if an event-handler page faults, it will block, and thus the server will not make progress until the page fault completes. Even though the server has been structured to avoid explicit blocking, this type of implicit blocking due to page faults is hard to avoid and thus can lead to large performance problems when prevalent.

  3. event-based code can be hard to manage over time, as the exact semantics of various routines changes.

  4. though asynchronous disk I/O is now possible on most platforms, it has taken a long time to get there.

33.9 Summary

Both threads and events are likely to persist as two different approaches to the same concurrency problem for many years to come.
