Note for 《Operating Systems Three Easy Pieces》 Part 1-1 Virtualization on CPU


[TOC]

These are study notes on CPU virtualization from Part 1 of Operating Systems: Three Easy Pieces by Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-Dusseau.

Chapter 4 The Abstraction: The Process

Chapter summary

  • Introduces what a process is and what the OS does when creating one.
  • Covers process states, the APIs for manipulating processes, and so on.

4.0 Introduction

The definition of a process, informally, is quite simple: it is a running program.

THE CRUX OF THE PROBLEM: HOW TO PROVIDE THE ILLUSION OF MANY CPUS?
Although there are only a few physical CPUs available, how can the OS provide the illusion of a nearly-endless supply of said CPUs?

The basic technique is known as time sharing of the CPU. The potential cost is performance: each process runs more slowly if the CPU(s) must be shared.

OS will need both some low-level machinery as well as some high-level intelligence.

  • Mechanisms (the low-level machinery) are low-level methods or protocols that implement a needed piece of functionality; a context switch is one example.
  • Policies (the high-level intelligence) are algorithms for making some kind of decision within the OS.

4.1 The Abstraction: A Process

machine state: what a program can read or update when it is running.

  • memory: the memory that the process can address (called its address space) is part of the process.

  • registers

    • The program counter (PC), sometimes called the instruction pointer (IP), tells us which instruction of the program is currently being executed;
    • a stack pointer and associated frame pointer are used to manage the stack for function parameters, local variables, and return addresses.
  • I/O information

4.2 Process API

what must be included in any interface of an operating system.

  • Create

  • Destroy

  • Wait

    • to wait for a process to stop running.
  • Miscellaneous Control

    • e.g. suspend and resume.
  • Status

4.3 Process Creation: A Little More Detail

  • load its code and any static data (e.g., initialized variables) into memory, into the address space of the process.

    • In early (or simple) operating systems, the loading process is done eagerly, i.e., all at once before running the program; modern OSes perform the process lazily, i.e., by loading pieces of code or data only as they are needed during program execution.
  • Some memory must be allocated for the program’s run-time stack (or just stack).

  • The OS will also likely initialize the stack with arguments; specifically, it will fill in the parameters to the main() function, i.e., argc and the argv array.

  • The OS may also create some initial memory for the program’s heap (for dynamically-allocated data).

  • The OS will also do some other initialization tasks, particularly as related to input/output (I/O).

    • For example, in UNIX systems, each process by default has three open file descriptors, for standard input, output, and error;
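
The numeric values of those descriptors are not spelled out above, but they are standard POSIX (0, 1, and 2, with symbolic names in <unistd.h>); a minimal sketch:

```c
// Minimal sketch: the three file descriptors every UNIX process starts with.
// STDIN_FILENO, STDOUT_FILENO, STDERR_FILENO are the POSIX names for 0, 1, 2.
#include <stdio.h>
#include <unistd.h>

int main(void) {
    printf("stdin=%d stdout=%d stderr=%d\n",
           STDIN_FILENO, STDOUT_FILENO, STDERR_FILENO); // prints: stdin=0 stdout=1 stderr=2
    write(STDERR_FILENO, "hello from fd 2\n", 16);      // write directly to a descriptor
    return 0;
}
```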

4.4 Process States

  • Running

    • it is executing instructions.
  • Ready

    • a process is ready to run but for some reason the OS has chosen not to run it at this given moment.
  • Blocked

    • a process has performed some kind of operation that makes it not ready to run until some other event takes place.
  • Process state transitions: being scheduled moves a process from Ready to Running and being descheduled moves it back; initiating I/O moves it from Running to Blocked, and I/O completion moves it from Blocked back to Ready.
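
As a rough illustration (my sketch, not the book's code), the three states and the transitions listed above can be written as a tiny C enum plus transition helpers:

```c
// Sketch of the three coarse process states and the transitions described above:
// scheduled/descheduled between Ready and Running, I/O initiate/done for Blocked.
#include <stdio.h>

enum proc_state { RUNNING, READY, BLOCKED };

enum proc_state scheduled(enum proc_state s)   { return s == READY   ? RUNNING : s; }
enum proc_state descheduled(enum proc_state s) { return s == RUNNING ? READY   : s; }
enum proc_state io_initiate(enum proc_state s) { return s == RUNNING ? BLOCKED : s; }
enum proc_state io_done(enum proc_state s)     { return s == BLOCKED ? READY   : s; }

int main(void) {
    enum proc_state s = READY;
    s = scheduled(s);    // READY   -> RUNNING
    s = io_initiate(s);  // RUNNING -> BLOCKED (waiting for I/O)
    s = io_done(s);      // BLOCKED -> READY   (I/O completed)
    printf("final state: %d\n", s); // 1, i.e., READY
    return 0;
}
```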

4.5 Data Structures

  • OS likely will keep some kind of process list for all processes that are ready, as well as some additional information to track which process is currently running.

  • The OS must also track, in some way, blocked processes; when an I/O event completes, the OS should make sure to wake the correct process and ready it to run again.

  • initial state

    • the process is in when it is being created.
  • final state

    • it has exited but has not yet been cleaned up (in UNIX-based systems, this is called the zombie state).
    • This final state can be useful as it allows other processes (usually the parent that created the process) to examine the return code.
  • Sometimes people refer to the individual structure that stores information about a process as a Process Control Block (PCB).
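
A hedged sketch of what such a per-process structure might contain, loosely in the spirit of the xv6 proc structure shown in the chapter; the field names here are illustrative, not the book's exact code:

```c
// Illustrative process control block (PCB); field names are assumptions.
struct context {             // registers saved and restored on a context switch
    int eip, esp, ebx, ecx, edx, esi, edi, ebp;
};

enum proc_state { UNUSED, EMBRYO, SLEEPING, RUNNABLE, RUNNING, ZOMBIE };

struct proc {
    char *mem;               // start of process memory (its address space)
    unsigned int sz;         // size of process memory
    char *kstack;            // bottom of the kernel stack for this process
    enum proc_state state;   // process state
    int pid;                 // process ID
    struct proc *parent;     // parent process
    struct context context;  // saved register context when not running
    int exit_code;           // return code kept around for the parent (zombie state)
};
```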

Chapter 5 Interlude: Process API

Chapter summary

  • Introduces the OS process APIs, e.g., fork(), wait(), exec(), pipe(), kill().

5.0 Intro

CRUX: HOW TO CREATE AND CONTROL PROCESSES
What interfaces should the OS present for process creation and control?
How should these interfaces be designed to enable ease of use as well as utility?

5.1 The fork() System Call

fork() duplicates the calling process: the newly created child is an (almost) exact copy of the parent, and fork() returns the child's PID to the parent and 0 to the child.
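
A minimal sketch in the spirit of the book's p1.c, showing how the return value of fork() tells parent and child apart:

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    printf("hello world (pid:%d)\n", (int) getpid());
    int rc = fork();
    if (rc < 0) {                        // fork failed
        fprintf(stderr, "fork failed\n");
        exit(1);
    } else if (rc == 0) {                // child: fork() returned 0
        printf("hello, I am child (pid:%d)\n", (int) getpid());
    } else {                             // parent: fork() returned the child's pid
        printf("hello, I am parent of %d (pid:%d)\n", rc, (int) getpid());
    }
    return 0;
}
```

The order in which the two greetings appear is not deterministic; wait(), covered next, is one way to impose an order.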

5.2 Adding the wait() System Call

  • a parent to wait for a child process to finish what it has been doing.
  • its more complete sibling waitpid().

5.3 Finally, the exec() System Call

  • to run a program that is different from the calling program.
  • it does not create a new process; rather, it transforms the currently running program into a different running program.

5.4 Why? Motivating the API

  • it lets the shell run code after the call to fork() but before the call to exec(); this code can alter the environment of the about-to-be-run program, and thus enables a variety of interesting features to be readily built.

    • Example: prompt> wc p3.c > newfile.txt. The way the shell accomplishes this task is quite simple: when the child is created, before calling exec(), the shell closes standard output and opens the file newfile.txt.
  • The shell is just a user program (it ties all of these APIs together).

    • It shows you a prompt and then waits for you to type something into it. You then type a command (i.e., the name of an executable program, plus any arguments) into it; in most cases, the shell then figures out where in the file system the executable resides, calls fork() to create a new child process to run the command, calls some variant of exec() to run the command, and then waits for the command to complete by calling wait(). When the child completes, the shell returns from wait() and prints out a prompt again, ready for your next command.
  • UNIX pipes are implemented in a similar way, but with the pipe() system call (a short sketch appears after the code sample below).

  • Code sample (the book's p4.c: the child redirects its standard output to a file, then execs wc):

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <fcntl.h>
#include <assert.h>
#include <sys/wait.h>

int main(int argc, char *argv[])
{
    int rc = fork();
    if (rc < 0) {
        // fork failed; exit
        fprintf(stderr, "fork failed\n");
        exit(1);
    } else if (rc == 0) {
        // child: redirect standard output to a file
        close(STDOUT_FILENO);
        open("./p4.output", O_CREAT | O_WRONLY | O_TRUNC, S_IRWXU);

        // now exec "wc"...
        char *myargs[3];
        myargs[0] = strdup("wc");   // program: "wc" (word count)
        myargs[1] = strdup("p4.c"); // argument: file to count
        myargs[2] = NULL;           // marks end of array
        execvp(myargs[0], myargs);  // runs word count
    } else {
        // parent goes down this path (original process)
        int wc = wait(NULL);
        assert(wc >= 0);
    }
    return 0;
}
```

Execution output:

```
prompt> ./p4
prompt> cat p4.output
32 109 846 p4.c
prompt>
```
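
As promised above, a hedged sketch (not the book's code) of using pipe() plus dup2() to connect two children, roughly what the shell does for a command like ls | wc:

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

// Rough sketch of how a shell might run: ls | wc
int main(void) {
    int fds[2];
    if (pipe(fds) < 0) { perror("pipe"); exit(1); }

    if (fork() == 0) {                   // first child runs "ls"
        dup2(fds[1], STDOUT_FILENO);     // its stdout becomes the pipe's write end
        close(fds[0]); close(fds[1]);
        execlp("ls", "ls", (char *) NULL);
        perror("execlp"); exit(1);
    }
    if (fork() == 0) {                   // second child runs "wc"
        dup2(fds[0], STDIN_FILENO);      // its stdin becomes the pipe's read end
        close(fds[0]); close(fds[1]);
        execlp("wc", "wc", (char *) NULL);
        perror("execlp"); exit(1);
    }
    close(fds[0]); close(fds[1]);        // parent closes both ends
    wait(NULL); wait(NULL);              // wait for both children
    return 0;
}
```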

5.5 Other Parts of the API

  • kill() system call is used to send signals to a process, including directives to go to sleep, die, and other useful imperatives.
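
A hedged sketch (mine, not the book's) of the mechanism that bullet refers to: a process installs a handler with signal(), and kill() delivers a signal to it:

```c
#include <stdio.h>
#include <signal.h>
#include <unistd.h>

static void handler(int sig) {
    // printf in a signal handler is not strictly async-signal-safe,
    // but it is fine for a toy example.
    printf("caught signal %d\n", sig);
}

int main(void) {
    signal(SIGUSR1, handler);      // install a handler for SIGUSR1
    kill(getpid(), SIGUSR1);       // send SIGUSR1 to ourselves; the handler runs
    printf("back in main after the handler\n");
    return 0;
}
```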

Chapter 6 Mechanism: Limited Direct Execution

Chapter summary

  • Introduces the distinction between kernel mode and user mode, established for protection.
  • A trap allows execution to move from user mode into kernel mode.
  • Introduces the context switch used to switch between processes, and the registers involved.

6.0 Intro

THE CRUX: HOW TO EFFICIENTLY VIRTUALIZE THE CPU WITH CONTROL

6.1 Basic Technique: Limited Direct Execution

Direct Execution Protocol (Without Limits)

The most basic execution flow:

$$ \begin{array}{c|c} \textbf{OS} & \textbf{Program} \\ \hline \text{Create entry for process list} & \text{} \\ \text{Allocate memory for program} & \text{} \\ \text{Load program into memory} & \text{} \\ \text{Set up stack with argc/argv} & \text{} \\ \text{Clear registers} & \text{} \\ \text{Execute} \textbf{ call } \text{main()} & \text{} \\ \text{} & \text{Run main()} \\ \text{} &\text{Execute} \textbf{ return } \text{from main} \\ \text{Free memory of process} & \text{} \\ \text{Remove from process list} & \text{} \\ \end{array} $$
  • This approach gives rise to a few problems:
    • How can the OS make sure the program doesn’t do anything that we don’t want it to do, while still running it efficiently?
    • When we are running a process, how does the operating system stop it from running and switch to another process, thus implementing the time sharing we require to virtualize the CPU?
  • Advantage: it is fast, because the program runs natively on the hardware CPU.

6.2 Problem #1: Restricted Operations

THE CRUX: HOW TO PERFORM RESTRICTED OPERATIONS
A process must be able to perform I/O and some other restricted operations, but without giving the process complete control over the system. How can the OS and hardware work together to do so?

code that runs in user mode is restricted in what it can do.

  • For example, when running in user mode, a process can’t issue I/O requests; doing so would result in the processor raising an exception; the OS would then likely kill the process.

kernel mode, which the operating system (or kernel) runs in. In this mode, code that runs can do what it likes, including privileged operations such as issuing I/O requests and executing all types of restricted instructions.

How do we get from user mode into kernel mode?

  • To execute a system call, a program must execute a special trap instruction. This instruction simultaneously jumps into the kernel and raises the privilege level to kernel mode; When finished, the OS calls a special return-from-trap instruction.

  • The hardware needs to be a bit careful when executing a trap, in that it must make sure to save enough of the caller’s register state.

  • how does the trap know which code to run inside the OS?

    • The kernel does so by setting up a trap table at boot time. When the machine boots up, it does so in privileged (kernel) mode, and thus is free to configure machine hardware as need be. One of the first things the OS thus does is to tell the hardware what code to run when certain exceptional events occur.
    • The OS informs the hardware of the locations of these trap handlers.
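
To make the trap concrete, here is a hedged Linux-specific sketch (not from the book) that requests the same kernel service two ways: through the usual libc wrapper and through the raw syscall(2) entry point; either way, a trap instruction carries execution into the kernel and return-from-trap brings it back:

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void) {
    pid_t a = getpid();              // libc wrapper around the system call
    long  b = syscall(SYS_getpid);   // raw system-call number; still traps into the kernel
    printf("getpid()=%d syscall(SYS_getpid)=%ld\n", (int) a, b);
    return 0;
}
```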

Limited Direct Execution Protocol

$$ \begin{array}{c|c} \textbf{OS @ boot} & \textbf{Hardware} \\ \textbf{(kernel mode)} & \textbf{} \\ \hline \textbf{initialize trap table} & \text{} \\ \text{} & \text{remember address of...} \\ \text{} & \text{syscall handler} \end{array} $$ $$ \begin{array}{c|c|c} \textbf{OS @ boot} & \textbf{Hardware} & \textbf{Program}\\ \textbf{(kernel mode)} & \textbf{} & \textbf{(user mode)}\\ \hline \text{Create entry for process list} & \text{} & \text{}\\ \text{Allocate memory for program} & \text{} & \text{}\\ \text{Load program into memory} & \text{} & \text{}\\ \text{Setup user stack with argv} & \text{} & \text{}\\ \text{Fill kernel stack with reg/PC} & \text{} & \text{}\\ \textbf{return-from-trap} & \text{} & \text{}\\ \text{} & \text{restore regs from kernel stack} & \text{}\\ \text{} & \text{move to user mode} & \text{}\\ \text{} & \text{jump to main} & \text{}\\ \text{} & \text{} & \text{Run main()}\\ \text{} & \text{} & \text{...}\\ \text{} & \text{} & \text{Call system call}\\ \text{} & \text{} & \textbf{trap} \text{ into OS}\\ \text{} & \text{save regs to kernel stack} & \text{}\\ \text{} & \text{move to kernel mode} & \text{}\\ \text{} & \text{jump to trap handler} & \text{}\\ \text{Handle trap} & \text{} & \text{}\\ \text{ Do work of syscall} & \text{} & \text{}\\ \textbf{return-from-trap} & \text{} & \text{}\\ \text{} & \text{restore regs from kernel stack} & \text{}\\ \text{} & \text{move to user mode} & \text{}\\ \text{} & \text{jump to PC after trap} & \text{}\\ \text{} & \text{} & \text{...}\\ \text{} & \text{} & \text{return from main}\\ \text{} & \text{} & \textbf{trap} \text{ (via exit())}\\ \text{Free memory of process} & \text{} & \text{}\\ \text{Remove from process list} & \text{} & \text{} \end{array} $$

6.3 Problem #2: Switching Between Processes

THE CRUX: HOW TO REGAIN CONTROL OF THE CPU
How can the operating system regain control of the CPU so that it can switch between processes?

A Cooperative Approach: Wait For System Calls

the OS trusts the processes of the system to behave reasonably. Processes that run for too long are assumed to periodically give up the CPU so that the OS can decide to run some other task.
Most processes transfer control of the CPU to the OS quite frequently by making system calls.
Applications also transfer control to the OS when they do something illegal.

  • For example, if an application divides by zero, or tries to access memory that it shouldn’t be able to access, it will generate a trap to the OS.

    • in the cooperative approach, your only recourse when a process gets stuck in an infinite loop is to resort to the age-old solution to all problems in computer systems: reboot the machine.

A Non-Cooperative Approach: The OS Takes Control

timer interrupt: A timer device can be programmed to raise an interrupt every so many milliseconds; when the interrupt is raised, the currently running process is halted, and a pre-configured interrupt handler in the OS runs. At this point, the OS has regained control of the CPU, and thus can do what it pleases.

How is this done?

  • The OS must inform the hardware of which code to run when the timer interrupt occurs; thus, at boot time the OS installs a handler for the timer interrupt along with its other trap handlers.
  • during the boot sequence, the OS must start the timer, which is of course a privileged operation.
  • The timer can also be turned off.
  • Note that the hardware has some responsibility when an interrupt occurs, in particular to save enough of the state of the program that was running when the interrupt occurred such that a subsequent return-from-trap instruction will be able to resume the running program correctly.
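
The hardware timer interrupt itself belongs to the OS, but a hedged user-level analogy (my sketch, not the book's) uses setitimer() and SIGALRM: the kernel periodically interrupts the process and runs a pre-registered handler, much as the timer interrupt periodically hands control back to the OS:

```c
#include <stdio.h>
#include <signal.h>
#include <sys/time.h>

static volatile sig_atomic_t ticks = 0;

static void on_tick(int sig) { (void) sig; ticks++; }     // the "interrupt handler"

int main(void) {
    signal(SIGALRM, on_tick);                             // register the handler

    struct itimerval tv = { {0, 10000}, {0, 10000} };     // fire every 10 ms
    setitimer(ITIMER_REAL, &tv, NULL);                    // start the timer

    while (ticks < 100) { /* "user code" runs until interrupted enough times */ }
    printf("received %d timer signals\n", (int) ticks);
    return 0;
}
```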

Saving and Restoring Context

The decision of whether to keep running the currently-running process or to switch to another one is made by a part of the operating system known as the scheduler.

  • If the decision is made to switch, the OS then executes a low-level piece of code which we refer to as a context switch.
  • Limited Direct Execution Protocol (Timer Interrupt)
$$ \begin{array}{c|c} \textbf{OS @ boot} & \textbf{Hardware} \\ \textbf{(kernel mode)} & \textbf{} \\ \hline \textbf{initialize trap table} & \text{} \\ \text{} & \text{remember address of...} \\ \text{} & \text{syscall handler} \\ \text{} & \text{timer handler} \\ \textbf{start interrupt timer} & \text{} \\ \text{} & \text{start timer} \\ \text{} & \text{interrupt CPU in } X \text{ ms} \\ \end{array} $$ $$ \begin{array}{c|c|c} \textbf{OS @ boot} & \textbf{Hardware} & \textbf{Program}\\ \textbf{(kernel mode)} & \textbf{} & \textbf{(user mode)}\\ \hline \text{} & \text{} & \text{Process A}\\ \text{} & \textbf{timer interrupt} & \text{}\\ \text{ } & \text{save regs(A) to k-stack(A)} & \text{}\\ \text{} & \text{move to kernel mode} & \text{}\\ \text{} & \text{jump to trap handler} & \text{}\\ \text{Handle the trap} & \text{} & \text{}\\ \text{Call switch() routine} & \text{} & \text{}\\ \text{ save regs(A) to proc-struct(A)} & \text{} & \text{}\\ \text{ restore regs(B) from proc-struct(B)} & \text{} & \text{}\\ \text{ switch to k-stack(B)} & \text{} & \text{}\\ \textbf{return-from-trap (into B)} & \text{} & \text{...}\\ \text{} & \text{restore regs(B) from k-stack(B)} & \text{}\\ \text{} & \text{move to user mode} & \text{}\\ \text{} & \text{jump to B’s PC} & \text{}\\ \text{} & \text{} & \text{Process B}\\ \text{} & \text{} & \text{...} \end{array} $$

Note that there are two types of register saves/restores that happen during this protocol.

  1. When the timer interrupt occurs, the user register state of the running process is implicitly saved by the hardware, using the kernel stack of that process.
  2. When the OS decides to switch from A to B, the kernel register state is explicitly saved by the software (i.e., the OS), but this time into memory in the process structure of the process.

6.4 Worried About Concurrency?

  • the OS does indeed need to be concerned as to what happens if, during interrupt or trap handling, another interrupt occurs.

  • This, in fact, is the exact topic of the entire second piece of this book: concurrency.

  • HOW LONG DO CONTEXT SWITCHES TAKE? They can be measured with a tool called lmbench.

  • disable interrupts during interrupt processing; doing so ensures that when one interrupt is being handled, no other one will be delivered to the CPU.

Chapter 7 Scheduling: Introduction

Chapter summary

  • Moves from low-level mechanisms up to policies.
  • Introduces policies from simple to progressively more practical: FIFO, SJF, STCF, Round Robin.
  • Evaluates policies along three dimensions: turnaround time, fairness, and response time.
  • When I/O is involved, overlap: let another job use the CPU while one job waits for its I/O.

7.0 Intro

THE CRUX: HOW TO DEVELOP A SCHEDULING POLICY (also called a scheduling discipline)
How should we develop a basic framework for thinking about scheduling policies?
What are the key assumptions? What metrics are important?
What basic approaches have been used in the earliest of computer systems?

7.1 Workload Assumptions

Running programs (processes) are sometimes called jobs; the terms are used interchangeably.
The workload assumptions below are idealized; they will be relaxed one by one, moving toward the complexity of real systems and progressively more sophisticated policies.

  1. Each job runs for the same amount of time.
  2. All jobs arrive at the same time.
  3. Once started, each job runs to completion.
  4. All jobs only use the CPU (i.e., they perform no I/O).
  5. The run-time of each job is known.

7.2 Scheduling Metrics

  1. performance metric: The turnaround time of a job is defined as the time at which the job completes minus the time at which the job arrived in the system.
    $$
    T_{\text{turnaround}} = T_{\text{completion}} − T_{\text{arrival}}
    $$

Special case: since all jobs are assumed to arrive at the same time (at $t = 0$), $T_{\text{turnaround}} = T_{\text{completion}}$.

  2. fairness metric: Jain’s Fairness Index.

7.3 First In, First Out (FIFO)

it is clearly very simple and thus easy to implement.
Problem: the convoy effect, where a number of relatively short potential consumers of a resource get queued behind a heavyweight resource consumer (illustrated by a figure in the book).

7.4 Shortest Job First (SJF)

Under the assumption that all jobs arrive at the same time, SJF is indeed an optimal scheduling algorithm.
SJF is a non-preemptive scheduler.
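
A quick worked comparison using the book's running example (job A needs 100 seconds, B and C need 10 each, all arriving at $t = 0$): FIFO with A first suffers the convoy effect, while SJF runs B and C first:

$$
T_{\text{turnaround}}^{\text{FIFO}} = \frac{100 + 110 + 120}{3} = 110
\qquad
T_{\text{turnaround}}^{\text{SJF}} = \frac{10 + 20 + 120}{3} = 50
$$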

7.5 Shortest Time-to-Completion First (STCF)

Add preemption to SJF, known as the Shortest Time-to-Completion First (STCF) or Preemptive Shortest Job First (PSJF) scheduler.

Idea: any time a new job enters the system, the STCF scheduler determines which of the remaining jobs (including the new job) has the least time left, and schedules that one.

STCF is provably optimal.

A new metric is introduced, because none of the policies above does well on it:

  • response time metric:
    $$
    T_{\text{response}} = T_{\text{firstrun}} − T_{\text{arrival}}
    $$

7.6 Round Robin

Idea: instead of running jobs to completion, RR runs a job for a time slice (sometimes called a scheduling quantum) and then switches to the next job in the run queue.
It repeatedly does so until the jobs are finished. For this reason, RR is sometimes called time-slicing.

Note that the length of a time slice must be a multiple of the timer-interrupt period.

  • the length of the time slice is critical for RR. The shorter it is, the better the performance of RR under the response-time metric.
  • deciding on the length of the time slice presents a trade-off to a system designer, making it long enough to amortize the cost of switching without making it so long that the system is no longer responsive.
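
A back-of-the-envelope amortization example from the book: with a context-switch cost of 1 ms, a 10 ms time slice wastes roughly 10% of the CPU on switching, while a 100 ms slice cuts the overhead to about 1%:

$$
\frac{1\ \text{ms}}{10\ \text{ms}} = 10\% \qquad \frac{1\ \text{ms}}{100\ \text{ms}} = 1\%
$$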

The cost of a context switch goes beyond saving and restoring a few registers: while a program runs, it builds up a great deal of state in CPU caches, TLBs, branch predictors, and other on-chip hardware, which must be flushed or rebuilt after a switch.

RR is indeed one of the worst policies if turnaround time is our metric: because turnaround time only cares about when jobs finish, and RR stretches every job out as long as it can, RR is nearly pessimal, even worse than simple FIFO in many cases.
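
Worked numbers from the book's example (A, B, and C each need 5 seconds, all arrive at $t = 0$, RR uses a 1-second time slice):

$$
T_{\text{response}}^{\text{RR}} = \frac{0 + 1 + 2}{3} = 1, \quad
T_{\text{response}}^{\text{SJF}} = \frac{0 + 5 + 10}{3} = 5; \qquad
T_{\text{turnaround}}^{\text{RR}} = \frac{13 + 14 + 15}{3} = 14, \quad
T_{\text{turnaround}}^{\text{SJF}} = \frac{5 + 10 + 15}{3} = 10
$$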

7.7 Incorporating I/O

When a job initiates an I/O request, it is blocked waiting for I/O completion; the scheduler should run another job on the CPU during that time.
Doing so allows for overlap, with the CPU being used by one process while the I/O of another process completes.

7.8 No More Oracle

  • The OS usually knows very little about the length of each job; there is no a priori knowledge to rely on.

Chapter 8 Scheduling: The Multi-Level Feedback Queue

Chapter summary

  • Uses a job's history to place it among queues (MLFQ), and introduces five rules that set a job's priority in the queues.
  • Consequences of poor queueing: interactive jobs (short-running but needing quick responses) cannot be guaranteed good service; starvation, where some jobs never get to run; and gaming the scheduler, i.e., users manipulating the queues in their favor.
  • Countermeasures: periodically boost all jobs back to the top queue, keep better accounting of each process's CPU usage, vary the time-slice length across queues, reserve the highest priority levels for operating system work, allow some user advice to help set priorities, and so on.

8.0 Intro

THE CRUX:
HOW TO SCHEDULE WITHOUT PERFECT KNOWLEDGE?
How can we design a scheduler that both minimizes response time for interactive jobs while also minimizing turnaround time without a priori knowledge of job length?

8.1 MLFQ: Basic Rules

The MLFQ has a number of distinct queues, each assigned a different priority level. At any given time, a job that is ready to run is on a single queue. MLFQ uses priorities to decide which job should run at a given time: a job with higher priority (i.e., a job on a higher queue) is chosen to run.

MLFQ varies the priority of a job based on its observed behavior. use the history of the job to predict its future behavior.

first two basic rules for MLFQ:

  • Rule 1: If Priority(A) > Priority(B), A runs (B doesn’t).
  • Rule 2: If Priority(A) = Priority(B), A & B run in RR.

8.2 Attempt #1: How to Change Priority

priority-adjustment algorithm:

  • Rule 3: When a job enters the system, it is placed at the highest priority (the topmost queue).
  • Rule 4a: If a job uses up an entire time slice while running, its priority is reduced (i.e., it moves down one queue).
  • Rule 4b: If a job gives up the CPU before the time slice is up, it stays at the same priority level.
    • The intent of this rule is simple: if an interactive job, for example, is doing a lot of I/O (say by waiting for user input from the keyboard or mouse), it will relinquish the CPU before its time slice is complete; in such case, we don’t wish to penalize the job and thus simply keep it at the same level.

Rationale: because the scheduler doesn’t know whether a job will be a short job or a long-running job, it first assumes it might be a short job, thus giving the job high priority. If it actually is a short job, it will run quickly and complete; if it is not a short job, it will slowly move down the queues, and thus soon prove itself to be a long-running, more batch-like process. In this manner, MLFQ approximates SJF.

Problems with our current MLFQ:

    1. Starvation: if there are “too many” interactive jobs in the system, they will combine to consume all CPU time, and thus long-running jobs will never receive any CPU time (they starve).
    2. A smart user could rewrite their program to game the scheduler.
    3. A program may change its behavior over time; what was CPU-bound may transition to a phase of interactivity. With our current approach, such a job would be out of luck and not be treated like the other interactive jobs in the system.

8.3 Attempt #2: The Priority Boost

Rule 5: After some time period $S$, move all the jobs in the system to the topmost queue.

Our new rule solves two problems at once.

  1. processes are guaranteed not to starve.
  2. if a CPU-bound job has become interactive, the scheduler treats it properly once it has received the priority boost.

$S$ is what the book calls a voo-doo constant: it seems to require some form of black magic to set correctly. If it is set too high, long-running jobs could starve; too low, and interactive jobs may not get a proper share of the CPU. Manual tuning may be required.

8.4 Attempt #3: Better Accounting

Instead of forgetting how much of a time slice a process used at a given level, the scheduler should keep track; once a process has used its allotment, it is demoted to the next priority queue.

  • Rule 4: Once a job uses up its time allotment at a given level (regardless of how many times it has given up the CPU), its priority is reduced (i.e., it moves down one queue).

TIP: AVOID VOO-DOO CONSTANTS (OUSTERHOUT’S LAW)
Why avoid them? Because such values are hard to get right; the frequent result is a configuration file filled with default parameter values that a seasoned administrator can tweak when something isn’t quite working correctly.

8.5 Tuning MLFQ And Other Issues

how to parameterize such a scheduler?

most MLFQ variants allow for varying time-slice length across different queues. The high-priority queues are usually given short time slices.

The Solaris MLFQ implementation provides a set of tables that determine exactly how the priority of a process is altered throughout its lifetime, how long each time slice is, and how often to boost the priority of a job.

Other MLFQ schedulers adjust priorities using mathematical formulae:
decay-usage algorithms.

many schedulers have a few other features that you might encounter.

  • some schedulers reserve the highest priority levels for operating system work
  • Some systems also allow some user advice to help set priorities; for example, by using the command-line utility nice you can increase or decrease the priority of a job

TIP: USE ADVICE WHERE POSSIBLE

8.6 MLFQ: Summary

refined set of MLFQ rules:

  • Rule 1: If Priority(A) > Priority(B), A runs (B doesn’t).
  • Rule 2: If Priority(A) = Priority(B), A & B run in RR.
  • Rule 3: When a job enters the system, it is placed at the highest priority (the topmost queue).
  • Rule 4: Once a job uses up its time allotment at a given level (regardless of how many times it has given up the CPU), its priority is reduced (i.e., it moves down one queue).
  • Rule 5: After some time period S, move all the jobs in the system to the topmost queue.
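
A minimal sketch (my own, not the book's) of how Rules 3-5 might translate into per-job bookkeeping; the number of queues, the allotments, and the boost period are illustrative parameters:

```c
// Hedged sketch of MLFQ bookkeeping for one job. NUM_QUEUES, the allotments,
// and the boost period S are illustrative values, not numbers from the book.
#define NUM_QUEUES 3

struct job {
    int queue;          // current priority level; 0 is the topmost (highest) queue
    int used_ms;        // CPU time used at this level (Rule 4's accounting)
};

static const int allotment_ms[NUM_QUEUES] = {10, 20, 40}; // longer allotments lower down

// Rule 3: a new job starts at the highest priority.
void on_job_enter(struct job *j) { j->queue = 0; j->used_ms = 0; }

// Rule 4: once a job uses its allotment at a level, demote it one queue,
// regardless of how many times it gave up the CPU.
void on_cpu_used(struct job *j, int ran_ms) {
    j->used_ms += ran_ms;
    if (j->used_ms >= allotment_ms[j->queue] && j->queue < NUM_QUEUES - 1) {
        j->queue++;
        j->used_ms = 0;
    }
}

// Rule 5: every S milliseconds, boost all jobs back to the topmost queue.
void priority_boost(struct job *jobs, int n) {
    for (int i = 0; i < n; i++) { jobs[i].queue = 0; jobs[i].used_ms = 0; }
}
```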

Chapter 9 Scheduling: Proportional Share

Chapter summary

  • Instead of reasoning about each process's absolute running time, consider its relative share of the CPU.
  • Two design approaches: lottery and stride scheduling. The former relies on randomness; the latter relies on global per-process state (pass values).
  • Shares (tickets) can also circulate between processes; with inflation, a process can temporarily issue itself extra tickets, much like a currency.
  • An unfairness metric quantifies how fairly jobs are treated (and thus how well starvation is avoided).
  • Why proportional-share scheduling has not been widely adopted.

9.0 Intro

Proportional-share (also called fair-share) scheduling is based around a simple concept: instead of optimizing for turnaround or response time, a scheduler might instead try to guarantee that each job obtains a certain percentage of CPU time.

An excellent modern example: lottery scheduling: every so often, hold a lottery to determine which process should get to run next; processes that should run more often should be given more chances to win the lottery.

CRUX: HOW TO SHARE THE CPU PROPORTIONALLY
How can we design a scheduler to share the CPU in a proportional manner?
What are the key mechanisms for doing so?
How effective are they?

9.1 Basic Concept: Tickets Represent Your Share

tickets are used to represent the share of a resource that a process (or user or whatever) should receive. The percent of tickets that a process has represents its share of the system resource in question.

TIP: USE RANDOMNESS

  • First, random often avoids strange corner-case behaviors that a more traditional algorithm may have trouble handling.
  • Second, random also is lightweight, requiring little state to track alternatives.
  • Finally, random can be quite fast.

If you are ever in need of a mechanism to represent a proportion of ownership, this concept just might be the ticket.

9.2 Ticket Mechanisms

ticket currency allows a user with a set of tickets to allocate tickets among their own jobs in whatever currency they would like; the system then automatically converts said currency into the correct global value.

With ticket transfers, a process can temporarily hand off its tickets to another process. This ability is especially useful in a client/server setting.

With ticket inflation, a process can temporarily raise or lower the number of tickets it owns. It can be applied in an environment where a group of processes trust one another.

9.3 Implementation

To make this process most efficient, it might generally be best to organize the list in sorted order, from the highest number of tickets to the lowest.
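
The book implements the lottery decision as a simple walk over the job list; a sketch in that spirit (the list type, field names, and use of rand() are my assumptions):

```c
#include <stdlib.h>

// Hedged sketch of the lottery decision, in the spirit of the book's list-walk
// implementation; the job structure and field names are illustrative.
struct job { int tickets; struct job *next; };

struct job *pick_winner(struct job *head, int total_tickets) {
    int winner  = rand() % total_tickets;   // winning ticket in [0, total), ignoring modulo bias
    int counter = 0;
    for (struct job *cur = head; cur != NULL; cur = cur->next) {
        counter += cur->tickets;            // accumulate tickets as we walk the list
        if (counter > winner)
            return cur;                     // this job holds the winning ticket
    }
    return NULL;                            // unreachable if the totals are consistent
}
```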

9.4 An Example

unfairness metric $U$: which is simply the time the first job completes divided by the time that the second job completes.

when the job length is not very long, average unfairness can be quite severe. Only as the jobs run for a significant number of time slices does the lottery scheduler approach the desired outcome.
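
For instance (the book's setup), with two jobs of length 10 each, if the first finishes at $t = 10$ and the second at $t = 20$:

$$
U = \frac{10}{20} = 0.5
$$

A perfectly fair scheduler would have both jobs finish at nearly the same time, giving $U \approx 1$.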

9.5 How To Assign Tickets?

One approach is to assume that the users know best. This, however, is really a non-solution: it says nothing about what to actually do.

9.6 Why Not Deterministic?

stride scheduling: a deterministic fair-share scheduler. Each job in the system has a stride, which is inverse in proportion to the number of tickets it has. every time a process runs, we will increment a counter for it (called its pass value) by its stride to track its global progress.

At any given time, pick the process to run that has the lowest pass value so far; when you run a process, increment its pass counter by its stride.

```c
current = remove_min(queue);        // pick client with minimum pass
schedule(current);                  // use resource for quantum
current->pass += current->stride;   // compute next pass using stride
insert(queue, current);             // put back into the queue
```

Table: Stride Scheduling: A Trace

$$ \begin{array}{ccc|c} \begin{array}{c} \text { Pass(A) } \\ \text { (stride=100) } \end{array} & \begin{array}{c} \text { Pass(B) } \\ \text { (stride=200) } \end{array} & \begin{array}{c} \text { Pass(C) } \\ \text { (stride=40) } \end{array} & \text { Who Runs? } \\ \hline 0 & 0 & 0 & \text { A } \\ 100 & 0 & 0 & \text { B } \\ 100 & 200 & 0 & \text { C } \\ 100 & 200 & 40 & \text { C } \\ 100 & 200 & 80 & \text { C } \\ 100 & 200 & 120 & \text { A } \\ 200 & 200 & 120 & \text { C } \\ 200 & 200 & 160 & \text { C } \\ 200 & 200 & 200 & \ldots \end{array} $$

Lottery scheduling achieves the proportions probabilistically over time; stride scheduling gets them exactly right at the end of each scheduling cycle.

lottery scheduling has one nice property that stride scheduling does not: no global state. lottery makes it much easier to incorporate new processes in a sensible manner.

9.7 Summary

proportional-share schedulers have not achieved wide-spread adoption as CPU schedulers for a variety of reasons.

  • One is that such approaches do not particularly mesh well with I/O.
  • another is that they leave open the hard problem of ticket assignment.

proportional-share schedulers are more useful in domains where some of these problems (such as assignment of shares) are relatively easy to solve.

  • For example, in a virtualized data center, VMware’s ESX Server uses a proportional-share approach to share memory across virtual machines.

Chapter 10 Multiprocessor Scheduling (Advanced)

Chapter summary

  • Moves from single-CPU scheduling to multiprocessor scheduling, which must deal with cache sharing and synchronization.
  • Cache design rests on the notion of locality, both temporal locality and spatial locality.
  • Multiprocessor scheduling comes in two flavors: single-queue scheduling and multi-queue scheduling.
  • The issues to consider are cache affinity under the single queue and load imbalance under multiple queues, along with their respective solutions.
  • Finally, three Linux multiprocessor scheduler implementations are introduced.

10.0 Intro

CRUX: HOW TO SCHEDULE JOBS ON MULTIPLE CPUS
How should the OS schedule jobs on multiple CPUs?
What new problems arise?
Do the same old techniques work, or are new ideas required?

10.1 Background: Multiprocessor Architecture

The fundamental difference between single-CPU hardware and multi-CPU hardware centers around the use of hardware caches, and exactly how data is shared across multiple processors.

Caches are thus based on the notion of locality, of which there are two kinds: temporal locality and spatial locality.

  • temporal locality: when a piece of data is accessed, it is likely to be accessed again in the near future.
  • spatial locality: if a program accesses a data item at address $x$, it is likely to access data items near $x$ as well.

Because locality of these types exists in many programs, hardware systems can make good guesses about which data to put in a cache and thus work well.
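
A tiny illustration (mine, not the book's) of both kinds of locality: the loop reuses sum on every iteration (temporal locality) and walks the array in address order (spatial locality), which is exactly what caches are built to exploit:

```c
// Temporal locality: 'sum' is touched on every iteration.
// Spatial locality: a[i] marches through consecutive addresses, so each
// cache-line fill serves several subsequent iterations.
long sum_array(const int *a, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```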

Caching with multiple CPUs introduces the problem of cache coherence: each CPU’s cache may hold its own (possibly stale) copy of a data item, so the caches must be kept consistent.

The basic solution is provided by the hardware: by monitoring memory accesses, hardware can ensure that basically the “right thing” happens and that the view of a single shared memory is preserved.

  • One way to do this on a bus-based system is to use an old technique known as bus snooping; each cache pays attention to memory updates by observing the bus that connects them to main memory. When a CPU then sees an update for a data item it holds in its cache, it will notice the change and either invalidate its copy (i.e., remove it from its own cache) or update it (i.e., put the new value into its cache too).

10.2 Don’t Forget Synchronization

When accessing (and especially updating) shared data structures across CPUs, such routines must be made correct via locking.

10.3 One Final Issue: Cache Affinity

a multiprocessor scheduler should consider cache affinity when making its scheduling decisions, perhaps preferring to keep a process on the same CPU if at all possible.

10.4 Single-Queue Scheduling

putting all jobs that need to be scheduled into a single queue; we call this single-queue multiprocessor scheduling or SQMS for short.

  • This approach has the advantage of simplicity
  • obvious shortcomings
    • a lack of scalability: Locks, unfortunately, can greatly reduce performance, particularly as the number of CPUs in the systems grows.
    • cache affinity
      • most SQMS schedulers include some kind of affinity mechanism to try to make it more likely that process will continue to run on the same CPU if possible.
        • The book illustrates example schedules without and with cache affinity.

10.5 Multi-Queue Scheduling

some systems opt for multiple queues, e.g., one per CPU. We call this approach multi-queue multiprocessor scheduling (or MQMS).

Each queue will likely follow a particular scheduling discipline.

When a job enters the system, it is placed on exactly one scheduling queue, according to some heuristic (e.g., random, or picking one with fewer jobs than others).

Then it is scheduled essentially independently, thus avoiding the problems of information sharing and synchronization

advantages:

  • it should be inherently more scalable. As the number of CPUs grows, so too does the number of queues, and thus lock and cache contention should not become a central problem.
  • MQMS intrinsically provides cache affinity.

problem:

  • load imbalance


CRUX: HOW TO DEAL WITH LOAD IMBALANCE
How should a multi-queue multiprocessor scheduler handle load imbalance, so as to better achieve its desired scheduling goals?

  1. By migrating a job from one CPU to another, true load balance can be achieved.
  2. work stealing
  • With a work-stealing approach, a (source) queue that is low on jobs will occasionally peek at another (target) queue, to see how full it is. If the target queue is (notably) more full than the source queue, the source will “steal” one or more jobs from the target to help balance load.
  • There is a natural tension in such an approach. If you look around at other queues too often, you will suffer from high overhead and have trouble scaling. If, on the other hand, you don’t look at other queues very often, you are in danger of suffering from severe load imbalance.

10.6 Linux Multiprocessor Schedulers

three different schedulers arose: the O(1) scheduler, the Completely Fair Scheduler (CFS), and the BF Scheduler (BFS).

  • O(1)

    • multiple queues
    • a priority-based scheduler (similar to the MLFQ discussed before)
    • interactivity is a particular focus.
  • CFS

    • multiple queues
    • a deterministic proportional-share approach more like Stride scheduling.
  • BFS

    • a single queue
    • proportional-share, but based on a more complicated scheme known as Earliest Eligible Virtual Deadline First (EEVDF)

许可协议