Note for 《Operating Systems: Three Easy Pieces》 Part 1-2 Virtualization of Memory

[TOC]

These are study notes on memory virtualization, the second part of the virtualization piece of 《Operating Systems: Three Easy Pieces》 by Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-Dusseau.

Chapter 13 The Abstraction: Address Spaces

Chapter summary

  • Introduces the background that led to the address-space abstraction.
  • Three main principles for virtualizing memory: transparency, efficiency, protection.

13.1 Early Systems

13.2 Multiprogramming and Time Sharing

the era of multiprogramming and the era of time sharing.

One way to implement time sharing would be to run one process for a short while, giving it full access to all memory, then stop it, save all of its state to some kind of disk (including all of physical memory), load some other process’s state, run it for a while, and thus implement some kind of crude sharing of the machine.

it is way too slow, particularly as memory grows.

leave processes in memory while switching between them, allowing the OS to implement time sharing efficiently. Let us just assume those three components: code, stack, and heap.

needs: efficiency, interactivity, protection.

13.3 The Address Space

The address space of a process contains all of the memory state of the running program.

An Example Address Space

THE CRUX: HOW TO VIRTUALIZE MEMORY
How can the OS build this abstraction of a private, potentially large address space for multiple running processes (all sharing memory) on top of a single, physical memory?

TIP: THE PRINCIPLE OF ISOLATION

13.4 Goals

  1. transparency: the OS should implement virtual memory in a way that is invisible to the running program.
  2. efficiency: the OS should strive to make the virtualization as efficient as possible, both in terms of time (i.e., not making programs run much more slowly) and space (i.e., not using too much memory for structures needed to support virtualization).
  3. protection: The OS should make sure to protect processes from one another as well as the OS itself from processes.

any address you can see as a programmer of a user-level program is a virtual address.

Chapter 14 Interlude: Memory API

Chapter summary

  • Introduces the memory APIs: malloc() and free(), as well as mmap(), calloc(), and realloc().
  • Summarizes common memory errors: segmentation faults, buffer overflows, uninitialized reads, memory leaks, dangling pointers, double frees, and so on.
  • The layer beneath malloc() and free(): brk and sbrk.
  • tools: purify and valgrind.

14.0 Intro

CRUX: HOW TO ALLOCATE AND MANAGE MEMORY
In UNIX/C programs, understanding how to allocate and manage memory is critical in building robust and reliable software. What interfaces are commonly used?
What mistakes should be avoided?

14.1 Types of Memory

stack memory: allocations and deallocations of it are managed implicitly by the compiler for you, the programmer; for this reason it is sometimes called automatic memory.

  • if you want some information to live beyond the call invocation, you had better not leave that information on the stack.

heap memory: where all allocations and deallocations are explicitly handled by you, the programmer.

14.2 The malloc() Call

#include <stdlib.h>
...
void *malloc(size_t size);

malloc() returns a pointer to type void. The programmer further helps out by using what is called a cast.
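
A minimal usage sketch of the call and the cast (the request for ten ints is arbitrary):

```c
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    // ask for enough space for 10 ints; the cast tells the compiler
    // (and the reader) what the returned pointer points to
    int *x = (int *) malloc(10 * sizeof(int));
    if (x == NULL) {              // malloc can fail; always worth checking
        fprintf(stderr, "malloc failed\n");
        return 1;
    }
    free(x);
    return 0;
}
```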

14.3 The free() Call

the size of the allocated region is not passed in by the user, and must be tracked by the memory-allocation library itself.
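
For reference, the interface is simply (standard C prototype):

```c
#include <stdlib.h>

void free(void *ptr);   // takes only the pointer returned by malloc();
                        // the library itself tracks how big the region is
```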

14.4 Common Errors

  1. Forgetting To Allocate Memory
char *src = "hello";
char *dst;
// oops! unallocated
strcpy(dst, src); // segfault and die
  • segmentation fault
    YOU DID SOMETHING WRONG WITH MEMORY YOU FOOLISH PROGRAMMER AND I AM ANGRY.

The fix:

char *src = "hello";
char *dst = (char *) malloc(strlen(src) + 1);
strcpy(dst, src); // work properly
  2. Not Allocating Enough Memory
    not allocating enough memory, sometimes called a buffer overflow.

  3. Forgetting to Initialize Allocated Memory
    If you do forget, your program will eventually encounter an uninitialized read.

  4. Forgetting To Free Memory

memory leak

  • as slowly leaking memory eventually leads one to run out of memory, at which point a restart is required.
  • Note that not all memory need be freed, at least, in certain cases. For example, when you write a short-lived program, you might allocate some space using malloc(). The program runs and is about to complete: is there need to call free() a bunch of times just before exiting? While it seems wrong not to, it is in this case quite fine to simply exit.
  5. Freeing Memory Before You Are Done With It
    Sometimes a program will free memory before it is finished using it; such a mistake is called a dangling pointer.
    The subsequent use can crash the program, or overwrite valid memory.

  6. Freeing Memory Repeatedly

double free: The result of doing so is undefined.

  7. Calling free() Incorrectly
    When you pass in some other value, bad things can (and do) happen.

Summary
tools: purify and valgrind.

14.5 Underlying OS Support

malloc() and free() are not system calls, but rather library calls.
malloc library manages space within your virtual address space, but itself is built on top of some system calls. One such system call is called brk, which is used to change the location of the program’s break: the location of the end of the heap. Note that you should never directly call either brk or sbrk.

you can also obtain memory from the operating system via the mmap() call. By passing in the correct arguments, mmap() can create an anonymous memory region within your program – a region which is not associated with any particular file but rather with swap space.
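
A rough sketch of such a call (the 4KB size is arbitrary; on some systems the flag is spelled MAP_ANONYMOUS rather than MAP_ANON, and error handling is omitted):

```c
#include <sys/mman.h>

void *get_anon_region(void) {
    // ask the OS for a 4KB anonymous, private region (backed by swap, not a file)
    return mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                MAP_ANON | MAP_PRIVATE, -1, 0);
}
```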

14.6 Other Calls

calloc() allocates memory and also zeroes it before returning; this prevents some errors where you assume that memory is zeroed and forget to initialize it yourself.
realloc() can also be useful, when you’ve allocated space for something (say, an array), and then need to add something to it: realloc() makes a new larger region of memory, copies the old region into it, and returns the pointer to the new region.
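
A small illustrative sketch of both calls (sizes arbitrary, error handling omitted):

```c
#include <stdlib.h>

void calloc_realloc_demo(void) {
    int *a = (int *) calloc(10, sizeof(int));     // 10 ints, all zeroed
    a = (int *) realloc(a, 20 * sizeof(int));     // grow to 20 ints; old contents are copied
    free(a);
}
```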

Chapter 15 Mechanism: Address Translation

Chapter summary

  • Introduces how the OS builds the address space, i.e., address translation.
  • A pair of base and bounds registers relocates the address space as one contiguous chunk and also provides protection.
  • Relocation comes in two forms: static relocation and dynamic relocation.
  • The OS manages free physical memory with a free list (the MMU performs the per-access translation).
  • Issues the OS must handle: allocating space at process creation, reclaiming it at termination, and saving/restoring state on context switches.

15.0 Intro

THE CRUX: HOW TO EFFICIENTLY AND FLEXIBLY VIRTUALIZE MEMORY
How can we build an efficient virtualization of memory?
How do we provide the flexibility needed by applications?
How do we maintain control over which memory locations an application can access, and thus ensure that application memory accesses are properly restricted?
How do we do all of this efficiently?

hardware-based address translation, or just address translation for short. With address translation, the hardware transforms each memory access (e.g., an instruction fetch, load, or store), changing the virtual address provided by the instruction to a physical address where the desired information is actually located.

15.1 Assumptions

Idealized assumptions:

  1. assume for now that the user’s address space must be placed contiguously in physical memory.
  2. assume, for simplicity, that the size of the address space is not too big; specifically, that it is less than the size of physical memory.
  3. assume that each address space is exactly the same size.

15.2 An Example

Suppose we have the following C code:

void func() {
    int x;
    ...
    x = x + 3; // this is the line of code we are interested in
}

The corresponding assembly instructions:

128: movl 0x0(%ebx), %eax    ; load 0+ebx into eax
132: addl $0x03, %eax        ; add 3 to eax register
135: movl %eax, 0x0(%ebx)    ; store eax back to mem

The corresponding memory layout (see the book's figures):
A Process And Its Address Space

Physical Memory with a Single Relocated Process

15.3 Dynamic (Hardware-based) Relocation

base and bounds: the technique is also referred to as dynamic relocation.
two hardware registers within each CPU: one is called the base register, and the other the bounds (sometimes called a limit register).

OS decides where in physical memory it should be loaded and sets the base register to that value.
physical address = virtual address + base

SOFTWARE-BASED RELOCATION

static relocation, in which a piece of software known as the loader takes an executable that is about to be run and rewrites its addresses to the desired offset in physical memory.
static relocation has numerous problems. First and most importantly, it does not provide protection.

Because this relocation of the address happens at runtime, and because we can move address spaces even after the process has started running, the technique is often referred to as dynamic relocation.

A bounds (or limit) register ensures that such addresses are within the confines of the address space. the processor will first check that the memory reference is within bounds to make sure it is legal;
If a process generates a virtual address that is greater than the bounds, or one that is negative, the CPU will raise an exception, and the process will likely be terminated.
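
As a minimal sketch of the logic (not real hardware, and the fault signalling is simplified), the per-access check and relocation look like this:

```c
#include <stdint.h>

// What the MMU conceptually does on every reference under base-and-bounds.
// Returns the physical address, or -1 to stand in for raising an exception.
int64_t bb_translate(uint32_t virt, uint32_t base, uint32_t bounds) {
    if (virt >= bounds)              // bounds here holds the size of the address space
        return -1;                   // out of bounds: the CPU would raise an exception
    return (int64_t)base + virt;     // relocation: physical = virtual + base
}
```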

Sometimes people call the part of the processor that helps with address translation the memory management unit (MMU); as we develop more sophisticated memory-management techniques, we will be adding more circuitry to the MMU.

A small aside about bounds registers: a bounds register can be defined in one of two ways.

  • it holds the size of the address space, or
  • it holds the physical address of the end of the address space.

free list
The OS must track which parts of free memory are not in use. Many different data structures can of course be used for such a task; the simplest (which we will assume here) is a free list, which simply is a list of the ranges of the physical memory which are not currently in use.

15.4 OS Issues

First, The OS must take action when a process is created, finding space for its address space in memory.

Second, the OS must take action when a process is terminated, reclaiming all of its memory for use in other processes or the OS.

Third, the OS must also take action when a context switch occurs.

  • the OS must save and restore the base-and-bounds pair when it switches between processes.
  • when the OS decides to stop running a process, it must save the values of the base and bounds registers to memory, in some per-process structure such as the process structure or process control block (PCB).
  • access to the base and bounds registers is obviously privileged.

15.5 Summary

Key to the efficiency of this technique is hardware support.

Base-and-bounds also offers protection.

the process stack and heap are not too big, all of the space between the two is simply wasted. This type of waste is usually called internal fragmentation, as the space inside the allocated unit is not all used (i.e., is fragmented) and thus wasted.

Chapter 16 Segmentation

Chapter summary

  • Introduces segmentation, a generalization of base and bounds that places code, stack, and heap in separate segments.
  • How do we know which segment an address refers to? Explicit approach: the top few bits of the virtual address. Implicit approach: the hardware infers it from how the address was formed.
  • Growth direction of a segment: a grows-positive bit.
  • Granularity: fine-grained vs. coarse-grained segmentation.
  • Benefits of segmentation include code sharing, with protection bits providing per-segment protection.
  • Problems with segmentation: external fragmentation; repeated allocations leave physical memory riddled with small holes, making it hard to allocate larger regions later. Free-list management algorithms (and compaction) try to cope with this.

16.0 Intro

THE CRUX: HOW TO SUPPORT A LARGE ADDRESS SPACE
How do we support a large address space with (potentially) a lot of free space between the stack and the heap?

16.1 Segmentation: Generalized Base/Bounds

we have three logically-different segments: code, stack, and heap. What segmentation allows the OS to do is to place each one of those segments in different parts of physical memory, and thus avoid filling physical memory with unused virtual address space.

Placing Segments In Physical Memory (figure in the book)

large address spaces with large amounts of unused address space (which we sometimes call sparse address spaces)

a set of three base and bounds register pairs.

$$ \begin{array}{lcc} \text { Segment } & \text { Base } & \text { Size } \\ \hline \text { Code } & 32 K & 2 K \\ \text { Heap } & 34 K & 2 K \\ \text { Stack } & 28 K & 2 K \end{array} $$

segmentation violation or segmentation fault: What if we tried to refer to an illegal address? the hardware detects that the address is out of bounds, traps into the OS, likely leading to the termination of the offending process. Humorously, the term persists, even on machines with no support for segmentation at all.

16.2 Which Segment Are We Referring To?

explicit approach

  • To chop up the address space into segments based on the top few bits of the virtual address; this technique was used in the VAX/VMS system.

  • the top two bits (01) tell the hardware which segment we are referring to. The bottom 12 bits are the offset into the segment.

// get top 2 bits of 14-bit VA
Segment = (VirtualAddress & SEG_MASK) >> SEG_SHIFT
// now get offset
Offset = VirtualAddress & OFFSET_MASK

if (Offset >= Bounds[Segment])
    RaiseException(PROTECTION_FAULT)
else
    PhysAddr = Base[Segment] + Offset
    Register = AccessMemory(PhysAddr)

some systems put code in the same segment as the heap and thus use only one bit to select which segment to use.

implicit approach

the hardware determines the segment by noticing how the address was formed.

If, for example, the address was generated from the program counter (i.e., it was an instruction fetch), then the address is within the code segment;
If the address is based off of the stack or base pointer, it must be in the stack segment;
Any other address must be in the heap.

16.3 What About The Stack?

Instead of just base and bounds values, the hardware also needs to know which way the segment grows (a bit, for example, that is set to 1 when the segment grows in the positive direction, and 0 for negative). A worked example follows the table below.

Segment Registers (With Negative-Growth Support)

$$ \begin{array}{lccc} \text { Segment } & \text { Base } & \text { Size } & \text { Grows Positive? } \\ \hline \text { Code } & 32 K & 2 K & 1 \\ \text { Heap } & 34 K & 2 K & 1 \\ \text { Stack } & 28 K & 2 K & 0 \end{array} $$
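
A worked example in the spirit of the book's (assuming a 16KB virtual address space split into four 4KB segments): a reference to virtual address 15KB is 11 1100 0000 0000 in binary, so the top two bits (11) select the stack and the remaining bits give an offset of 3KB. Because the stack grows negatively, the hardware subtracts the maximum segment size: 3KB - 4KB = -1KB, and adds that to the base: 28KB + (-1KB) = 27KB. The bounds check compares the absolute value of the negative offset (1KB) against the segment's current size (2KB), so the access is legal.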

16.4 Support for Sharing

Segment Register Values (with Protection)

$$ \begin{array}{lcccc} \text { Segment } & \text { Base } & \text { Size } & \text { Grows Positive? } & \text { Protection } \\ \hline \text { Code } & 32 K & 2 K & 1 & \text { Read-Execute } \\ \text { Heap } & 34 K & 2 K & 1 & \text { Read-Write } \\ \text { Stack } & 28 K & 2 K & 0 & \text { Read-Write } \end{array} $$

code sharing is common and still in use in systems today.
protection bits: Basic support adds a few bits per segment, indicating whether or not a program can read or write a segment, or perhaps execute code that lies within the segment.

16.5 Fine-grained vs. Coarse-grained Segmentation

coarse-grained: it chops up the address space into relatively large, coarse chunks.
fine-grained segmentation: allows address spaces to consist of a large number of smaller segments
Supporting many segments requires even further hardware support, with a segment table of some kind stored in memory.

16.6 OS Support

what should the OS do on a context switch?

  • the segment registers must be saved and restored.

The second issue is managing free space in physical memory.

  • we have a number of segments per process, and each segment might be a different size.

external fragmentation

  • physical memory quickly becomes full of little holes of free space, making it difficult to allocate new segments, or to grow existing ones.

One remedy is to compact physical memory by rearranging the existing segments.

  • For example, the OS could stop whichever processes are running, copy their data to one contiguous region of memory, change their segment register values to point to the new physical locations, and thus have a large free extent of memory with which to work.
  • compaction is expensive

A simpler approach is to use a free-list management algorithm:

  • classic algorithms like best-fit (which keeps a list of free spaces and returns the one closest in size that satisfies the desired allocation to the requester).
  • worst-fit, first-fit, and more complex schemes like buddy algorithm.

TIP: IF 1000 SOLUTIONS EXIST, NO GREAT ONE DOES

The only real solution is to avoid the problem altogether, by never allocating memory in variable-sized chunks.

16.7 Summary

Beyond just dynamic relocation, segmentation can better support sparse address spaces, by avoiding the huge potential waste of memory between logical segments of the address space.

It is also fast, as doing the arithmetic segmentation requires in hardware is easy and well-suited to hardware; the overheads of translation are minimal.

A fringe benefit arises too: code sharing.

Allocating variable-sized segments in memory leads to some problems

  1. external fragmentation: One can try to use smart algorithms or periodically compact memory, but the problem is fundamental and hard to avoid.
  2. segmentation still isn’t flexible enough to support our fully generalized, sparse address space.
  • For example, if we have a large but sparsely-used heap all in one logical segment, the entire heap must still reside in memory in order to be accessed. In other words, if our model of how the address space is being used doesn’t exactly match how the underlying segmentation has been designed to support it, segmentation doesn’t work very well.

Chapter 17 Free-Space Management

Chapter summary

  • Covers how free memory is managed; the basic ideas are paging plus splitting and coalescing.
  • The underlying data structure is a list; the chapter also describes the layout of list nodes and of the list itself.
  • Strategies for allocating free space: best fit, worst fit, first fit, next fit.
  • Strategies built on more sophisticated data structures: segregated lists (multiple lists) and tree-based buddy allocation.

17.0 Intro

paging: it is easy when the space you are managing is divided into fixed-sized units; in such a case, you just keep a list of these fixed-sized units; when a client requests one of them, return the first entry.

CRUX: HOW TO MANAGE FREE SPACE
How should free space be managed, when satisfying variable-sized requests?
What strategies can be used to minimize fragmentation?
What are the time and space overheads of alternate approaches?

17.1 Assumptions

  1. We assume a basic interface such as that provided by malloc() and free().

  2. We further assume that primarily we are concerned with external fragmentation. Allocators could of course also have the problem of internal fragmentation.

    • if an allocator hands out chunks of memory bigger than that requested, any unasked for (and thus unused) space in such a chunk is considered internal fragmentation.
  3. We’ll also assume that once memory is handed out to a client, it cannot be relocated to another location in memory. Thus, no compaction of free space is possible.

  4. We’ll assume that the allocator manages a contiguous region of bytes. The region is a single fixed size throughout its life.

17.2 Low-level Mechanisms

Splitting and Coalescing

heap model

free list

Assume we have a request for just a single byte of memory. splitting: it will find a free chunk of memory that can satisfy the request and split it into two.

after splitting:

coalescing of free space

before:

after:

Tracking The Size Of Allocated Regions

most allocators store a little bit of extra information in a header block which is kept in memory, usually just before the handed-out chunk of memory.

The header minimally contains the size of the allocated region; it may also contain additional pointers to speed up deallocation, a magic number to provide additional integrity checking, and other information.
typedef struct __header_t {
    int size;
    int magic;
} header_t;

Example: specific contents of the header:

void free(void *ptr) {
    header_t *hptr = (void *) ptr - sizeof(header_t);
}

After obtaining such a pointer to the header, the library can easily determine whether the magic number matches the expected value as a sanity check (assert(hptr->magic == 1234567)) and calculate the total size of the newly-freed region via simple math.
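
The allocation side has to account for the header as well. A rough, hypothetical sketch of the idea (FindFreeChunk is an assumed helper standing in for the free-list search, and header_t is the struct shown above; this is not the library's actual code):

```c
#include <stddef.h>

extern void *FindFreeChunk(size_t bytes);   // hypothetical free-list search helper

void *my_malloc(size_t size) {
    // reserve room for the header in addition to what the user asked for
    header_t *hptr = (header_t *) FindFreeChunk(size + sizeof(header_t));
    hptr->size  = size;                     // remembered so free() needs no size argument
    hptr->magic = 1234567;                  // sanity value checked later by free()
    return (void *) (hptr + 1);             // hand back the space just past the header
}
```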

Embedding A Free List

code to build a heap

#include <sys/mman.h>

// the free-list node type assumed here: size of the free chunk plus a next pointer
typedef struct __node_t {
    int size;
    struct __node_t *next;
} node_t;

// mmap() returns a pointer to a chunk of free space
node_t *head = mmap(NULL, 4096, PROT_READ|PROT_WRITE,
                    MAP_ANON|MAP_PRIVATE, -1, 0);
head->size = 4096 - sizeof(node_t);
head->next = NULL;

Free Space With Three Chunks Allocated

Free Space With Two Chunks Allocated
A Non-Coalesced Free List

Growing The Heap

Most traditional allocators make some kind of system call (e.g., sbrk in most UNIX systems) to grow the heap, and then allocate the new chunks from there.

17.3 Basic Strategies

because the stream of allocation and free requests can be arbitrary (after all, they are determined by the programmer), any particular strategy can do quite badly given the wrong set of inputs.

  • Best Fit

    • first, search through the free list and find chunks of free memory that are as big or bigger than the requested size. Then, return the one that is the smallest in that group of candidates;
    • best fit tries to reduce wasted space.
    • there is a cost, namely the price of traversal: naive implementations pay a heavy performance penalty when performing an exhaustive search for the correct free block (a small worked comparison of the fit strategies follows this list).
  • Worst Fit

    • find the largest chunk and return the requested amount; keep the remaining (large) chunk on the free list.
    • Worst fit tries to thus leave big chunks free instead of lots of small chunks that can arise from a best-fit approach.
    • a full search of free space is required: costly.
    • Worse, most studies show that it performs badly.
  • First Fit

    • finds the first block that is big enough and returns the requested amount to the user. As before, the remaining free space is kept free for subsequent requests.
    • advantage of speed
    • sometimes pollutes the beginning of the free list with small objects.
    • how the allocator manages the free list’s order becomes an issue. One approach is to use address-based ordering; by keeping the list ordered by the address of the free space, coalescing becomes easier, and fragmentation tends to be reduced.
  • Next Fit

    • the next fit algorithm keeps an extra pointer to the location within the list where one was looking last time.
    • The idea is to spread the searches for free space throughout the list more uniformly, thus avoiding splintering of the beginning of the list.
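
As a quick illustration of how the strategies differ (numbers chosen just for this example): suppose the free list holds chunks of size 10, 30, and 20, and a request for 15 arrives. Best fit searches the whole list and returns the 20-byte chunk (closest in size, least leftover); worst fit also searches the whole list but returns the 30-byte chunk, leaving the largest remainder free; first fit returns the 30-byte chunk as well, simply because it is the first chunk on the list that is big enough, without scanning the rest.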

17.4 Other Approaches

  • Segregated Lists

    • if a particular application has one (or a few) popular-sized request that it makes, keep a separate list just to manage objects of that size; all other requests are forwarded to a more general memory allocator.

    • how much memory should one dedicate to the pool of memory that serves specialized requests of a given size, as opposed to the general pool?

      • the slab allocator: Specifically, when the kernel boots up, it allocates a number of object caches for kernel objects that are likely to be requested frequently (such as locks, file-system inodes, etc.); the object caches thus are each segregated free lists of a given size and serve memory allocation and free requests quickly. When a given cache is running low on free space, it requests some slabs of memory from a more general memory allocator (the total amount requested being a multiple of the page size and the object in question). Conversely, when the reference counts of the objects within a given slab all go to zero, the general allocator can reclaim them from the specialized allocator, which is often done when the VM system needs more memory.
    • The slab allocator also goes beyond most segregated list approaches by keeping free objects on the lists in a pre-initialized state.

  • Buddy Allocation

    • Because coalescing is critical for an allocator, some approaches have been designed around making coalescing simple. One good example is found in the binary buddy allocator.

    • example of a 64KB free space getting divided in the search for a 7KB block:

    • note that this scheme can suffer from internal fragmentation, as you are only allowed to give out power-of-two-sized blocks.

    • the address of each buddy pair only differs by a single bit; which bit is determined by the level in the buddy tree.

  • Other Ideas

    • One major problem with many of the approaches described above is their lack of scaling.
    • advanced allocators use more complex data structures to address these costs, trading simplicity for performance. Examples include balanced binary trees, splay trees, or partially-ordered trees.

Chapter 18 Paging: Introduction

Chapter summary

  • Introduces the basic idea of paging.
  • Each process has a page table that manages the pages it uses.
  • Page tables can become very large (made of page table entries (PTEs); later chapters split them into multi-level structures), and consulting them is also slow.
  • What a page table contains, and how a PTE lookup turns a virtual address into the corresponding physical address; a complete worked example shows the details of the whole process.

18.0 Intro

THE CRUX: HOW TO VIRTUALIZE MEMORY WITHOUT SEGMENTS
How can we virtualize memory in a way as to avoid the problems of segmentation? What are the basic techniques?
How do we make those techniques work well?
we split up our address space into fixed-sized units we call a page.
With paging, physical memory is also split into some number of pages as well; we sometimes will call each page of physical memory a page frame.

Improvements relative to segmentation:

  • the most important improvement will be flexibility: with a fully-developed paging approach, the system will be able to support the abstraction of an address space effectively, regardless of how a process uses the address space.
  • Another advantage is the simplicity of free-space management that paging affords.

64-Byte Address Space Placed In Physical Memory:

the operating system keeps a per-process data structure known as a page table. The major role of the page table is to store address translations for each of the virtual pages of the address space.
It is important to remember that this page table is a per-process data structure.

To translate this virtual address that the process generated, we have to first split it into two components: the virtual page number (VPN), and the offset within the page.

For this example, because the virtual address space of the process is 64 bytes, we need 6 bits total for our virtual address ($2^6 = 64$); with the 16-byte pages the book assumes, this splits into a 2-bit VPN and a 4-bit offset.

The Address Translation Process

With our virtual page number, we can now index our page table and find which physical page(physical page number (PPN) (a.k.a. physical frame number or PFN)) that virtual page resides within.

18.1 Where Are Page Tables Stored?

Page tables can get awfully large

a typical 32-bit address space, with 4-KB pages. This virtual address splits into a 20-bit VPN and 12-bit offset.
A 20-bit VPN implies that there are $2^{20}$ translations that the OS would have to manage for each process (that’s roughly a million); assuming we need 4 bytes per page table entry (PTE) to hold the physical translation plus any other useful stuff, we get an immense 4MB of memory needed for each page table!

imagine there are 100 processes running: this means the OS would need 400MB of memory just for all those address translations!

we store the page table for each process in memory somewhere. assume for now that the page tables live in physical memory that the OS manages.

Example: Page Table in Kernel Physical Memory:

18.2 What’s Actually In The Page Table?

The simplest form is called a linear page table, which is just an array. The OS indexes the array by the VPN, and looks up the page-table entry (PTE) at that index in order to find the desired PFN.

  • A valid bit is common to indicate whether the particular translation is valid
  • protection bits, indicating whether the page could be read from, written to, or executed from.
  • A present bit indicates whether this page is in physical memory or on disk (swapped out)
  • A dirty bit is also common, indicating whether the page has been modified since it was brought into memory.
  • A reference bit (a.k.a. accessed bit) is sometimes used to track whether a page has been accessed, and is useful in determining which pages are popular and thus should be kept in memory; such knowledge is critical during page replacement.

An x86 Page Table Entry (PTE)

  • It contains a present bit (P); a read/write bit (R/W), a user/supervisor bit (U/S) which determines if user-mode processes can access the page; a few bits (PWT, PCD, PAT, and G) that determine how hardware caching works for these pages; an accessed bit (A) and a dirty bit (D); and finally, the page frame number (PFN) itself. A rough bit-mask sketch follows.
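
A rough sketch of how a few of these bits could be picked out of a 32-bit PTE in C (bit positions follow the classic 32-bit, non-PAE x86 layout; treat this as illustrative rather than authoritative):

```c
#include <stdint.h>

#define PTE_P   (1u << 0)   // present
#define PTE_RW  (1u << 1)   // read/write
#define PTE_US  (1u << 2)   // user/supervisor
#define PTE_A   (1u << 5)   // accessed (reference) bit
#define PTE_D   (1u << 6)   // dirty bit
#define PTE_PFN(pte) ((pte) >> 12)   // page frame number lives in the top 20 bits

int user_page_is_writable(uint32_t pte) {
    return (pte & PTE_P) && (pte & PTE_RW) && (pte & PTE_US);
}
```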

18.3 Paging: Also Too Slow

example: movl 21, %eax

In this example, we will assume the hardware performs the translation for us. To fetch the desired data, the system must first translate the virtual address (21) into the correct physical address (117). Thus, before issuing the load to address 117, the system must first fetch the proper page table entry from the process’s page table, perform the translation, and then finally get the desired data from physical memory.

To do so, the hardware must know where the page table is for the currently-running process. Let’s assume for now that a single page-table base register contains the physical address of the starting location of the page table. To find the location of the desired PTE, the hardware will thus perform the following functions:

VPN = (VirtualAddress & VPN_MASK) >> SHIFT
PTEAddr = PageTableBaseRegister + (VPN * sizeof(PTE))

In our example, VPN_MASK would be set to 0x30 (hex 30, or binary 110000) which picks out the VPN bits from the full virtual address; SHIFT is set to 4 (the number of bits in the offset), such that we move the VPN bits down to form the correct integer virtual page number. For example, with virtual address 21 (010101), masking turns this value into 010000; the shift turns it into 01, or virtual page 1, as desired. We then use this value as an index into the array of PTEs pointed to by the page table base register.

Once this physical address is known, the hardware can fetch the PTE from memory, extract the PFN, and concatenate it with the offset from the virtual address to form the desired physical address.

The complete translation sequence:

// Extract the VPN from the virtual address
VPN = (VirtualAddress & VPN_MASK) >> SHIFT
// Form the address of the page-table entry (PTE)
PTEAddr = PTBR + (VPN * sizeof(PTE))
// Fetch the PTE
PTE = AccessMemory(PTEAddr)
// Check if process can access the page
if (PTE.Valid == False)
    RaiseException(SEGMENTATION_FAULT)
else if (CanAccess(PTE.ProtectBits) == False)
    RaiseException(PROTECTION_FAULT)
else
    // Access is OK: form physical address and fetch it
    offset   = VirtualAddress & OFFSET_MASK
    PhysAddr = (PTE.PFN << PFN_SHIFT) | offset
    Register = AccessMemory(PhysAddr)

In general, a page table stores virtual-to-physical address translations, thus letting the system know where each page of an address space actually resides in physical memory. The exact structure of the page table is either determined by the hardware (older systems) or can be more flexibly managed by the OS (modern systems).

The takeaway: without careful design of both hardware and software, page tables will cause the system to run too slowly, as well as take up too much memory.

18.4 A Memory Trace

C code:

int array[1000];
...
for (i = 0; i < 1000; i++)
    array[i] = 0;

The corresponding assembly:

0x1024 movl $0x0,(%edi,%eax,4)
0x1028 incl %eax
0x102c cmpl $0x03e8,%eax
0x1030 jne 0x1024
  • The first instruction moves the value zero (shown as $0x0) into the virtual memory address of the location of the array; this address is computed by taking the contents of %edi and adding %eax multiplied by four to it.

    • %edi holds the base address of the array, whereas %eax holds the array index (i).

    • We multiply by four because the array is an array of integers, each size four bytes(for simplicity; in actuality, x86 instructions are variable-sized).

  • The second instruction increments the array index held in %eax.

  • The third instruction compares the contents of that register to the hex value 0x03e8, or decimal 1000. If the comparison shows that the two values are not yet equal (which is what the jne instruction tests), the fourth instruction jumps back to the top of the loop.

The example's setup:

We assume a virtual address space of size 64 KB (unrealistically small), a page size of 1 KB, and a linear (array-based) page table located at physical address 1 KB (1024). The code (the instructions above, starting at virtual address 1024) sits on virtual page 1; let's assume this virtual page maps to physical frame 4 (VPN 1 → PFN 4). Next, there is the array itself. Its size is 4000 bytes (1000 integers), and it lives at virtual addresses 40000 through 44000 (not including the last byte). The virtual pages for this decimal range are VPN 39 … VPN 42. Thus, we need mappings for these pages. Let's assume these virtual-to-physical mappings for the example: (VPN 39 → PFN 7), (VPN 40 → PFN 8), (VPN 41 → PFN 9), (VPN 42 → PFN 10).

  • trace the memory references of the program

    • When it runs, each instruction fetch will generate two memory references: one to the page table to find the physical frame that the instruction resides within, and one to the instruction itself to fetch it to the CPU for processing.
    • there is one explicit memory reference in the form of the mov instruction; this adds another page table access first (to translate the array virtual address to the correct physical one) and then the array access itself.
  • The entire process, for the first five loop iterations

    • there are 10 memory accesses per loop,
      which includes four instruction fetches, one explicit update of memory, and five page table accesses to translate those four fetches and one explicit update.

Chapter 19 Paging: Faster Translations (TLBs)

Chapter summary

  • Walking the page table on every access costs too much memory traffic and is slow, so the TLB is introduced as a hardware cache of translations; the chapter first shows the basic TLB algorithm.
  • What happens on a TLB miss, including the software-managed TLB.
  • Caching translations creates a context-switch problem. Two approaches: flush the TLB, or tag entries with an address space identifier (ASID), which works much like a PID.
  • Replacement policies for when a TLB entry must be evicted (the TLB is full): LRU, random.
  • "A MIPS TLB Entry" describes what a single TLB entry contains.
  • A use for large pages: database management systems (DBMS).

19.0 Intro

THE CRUX: HOW TO SPEED UP ADDRESS TRANSLATION
How can we speed up address translation, and generally avoid the extra memory reference that paging seems to require?
What hardware support is required?
What OS involvement is needed?

a translation-lookaside buffer, or TLB is part of the chip’s memory-management unit (MMU), and is simply a hardware cache of popular virtual-to-physical address translations; thus, a better name would be an address-translation cache.

19.1 TLB Basic Algorithm

TLB Control Flow Algorithm

VPN = (VirtualAddress & VPN_MASK) >> SHIFT
(Success, TlbEntry) = TLB_Lookup(VPN)
if (Success == True)    // TLB Hit
    if (CanAccess(TlbEntry.ProtectBits) == True)
        Offset   = VirtualAddress & OFFSET_MASK
        PhysAddr = (TlbEntry.PFN << SHIFT) | Offset
        AccessMemory(PhysAddr)
    else
        RaiseException(PROTECTION_FAULT)
else                    // TLB Miss
    PTEAddr = PTBR + (VPN * sizeof(PTE))
    PTE = AccessMemory(PTEAddr)
    if (PTE.Valid == False)
        RaiseException(SEGMENTATION_FAULT)
    else if (CanAccess(PTE.ProtectBits) == False)
        RaiseException(PROTECTION_FAULT)
    else
        TLB_Insert(VPN, PTE.PFN, PTE.ProtectBits)
        RetryInstruction()

TLB is found near the processing core and is designed to be quite fast. When a miss occurs, the high cost of paging is incurred;

it is our hope to avoid TLB misses as much as we can.

19.2 Example: Accessing An Array

The example's setup:
let’s assume we have an array of 10 4-byte integers in memory, starting at virtual address 100. Assume further that we have a small 8-bit virtual address space, with 16-byte pages; thus, a virtual address breaks down into a 4-bit VPN (there are 16 virtual pages) and a 4-bit offset (there are 16 bytes on each of those pages).

An Array In A Tiny Address Space

C code that accesses each array element:

int sum = 0;
for (i = 0; i < 10; i++) {
    sum += a[i];
}

Walking through the accesses:

  • the first time the program accesses the array (a[0]), the result will be a TLB miss.
  • The next access is to a[1], and there is some good news here: a TLB hit!
  • when the program accesses a[3], we encounter another TLB miss; over the whole loop the ten accesses come to three misses and seven hits, a 70% TLB hit rate.

Note the role that page size plays in this example: if the page size had simply been twice as big (32 bytes, not 16), the array access would suffer even fewer misses.
If the program, soon after this loop completes, accesses the array again, we'd likely see an even better result: the TLB hit rate would be high because of temporal locality.

TIP: USE CACHING WHEN POSSIBLE
The idea behind hardware caches is to take advantage of locality in instruction and data references. There are usually two types of locality: temporal locality and spatial locality.
these properties depend on the exact nature of the program, and thus are not hard-and-fast laws but more like rules of thumb.
If you want a fast cache, it has to be small, as issues like the speed-of-light and other physical constraints become relevant.

19.3 Who Handles The TLB Miss?

In the olden days, hardware had complex instruction sets (sometimes called CISC, for complex-instruction-set computers) and handled TLB misses entirely itself: hardware-managed TLBs that walk the page table via a page-table base register.
More modern architectures (RISC, reduced-instruction-set computers) use a software-managed TLB, where the OS handles the miss.

TLB Control Flow Algorithm (OS Handled)

VPN = (VirtualAddress & VPN_MASK) >> SHIFT
(Success, TlbEntry) = TLB_Lookup(VPN)
if (Success == True)    // TLB Hit
    if (CanAccess(TlbEntry.ProtectBits) == True)
        Offset   = VirtualAddress & OFFSET_MASK
        PhysAddr = (TlbEntry.PFN << SHIFT) | Offset
        Register = AccessMemory(PhysAddr)
    else
        RaiseException(PROTECTION_FAULT)
else                    // TLB Miss
    RaiseException(TLB_MISS)

How the OS handles the miss:

On a TLB miss, the hardware simply raises an exception, which pauses the current instruction stream, raises the privilege level to kernel mode, and jumps to a trap handler. This trap handler is code within the OS that is written with the express purpose of handling TLB misses. When run, the code will lookup the translation in the page table, use special “privileged” instructions to update the TLB, and return from the trap; at this point, the hardware retries the instruction (resulting in a TLB hit).

When running the TLB miss-handling code, the OS needs to be extra careful not to cause an infinite chain of TLB misses to occur.

  • you could keep TLB miss handlers in physical memory (where they are unmapped and not subject to address translation), or reserve some entries in the TLB for permanently-valid translations and use some of those permanent translation slots for the handler code itself; these wired translations always hit in the TLB.

Advantages of the software-managed approach

  • flexibility: the OS can use any data structure it wants to implement the page table, without necessitating hardware change.
  • simplicity: the hardware doesn’t have to do much on a miss; it raises an exception, and the OS TLB miss handler does the rest.

19.4 TLB Contents: What’s In There?

fully associative just means that any given translation can be anywhere in the TLB, and that the hardware will search the entire TLB in parallel to find the desired translation.
A typical TLB entry might look like this: VPN | PFN | other bits

  • valid bit
  • protection bits
  • address-space identifier
  • dirty bit

TLB VALID BIT $\ne$ PAGE TABLE VALID BIT

19.5 TLB Issue: Context Switches

when switching from one process to another, the hardware or OS (or both) must be careful to ensure that the about-to-be-run process does not accidentally use translations from some previously run process.

THE CRUX: HOW TO MANAGE TLB CONTENTS ON A CONTEXT SWITCH
When context-switching between processes, the translations in the TLB for the last process are not meaningful to the about-to-be-run process. What should the hardware or OS do in order to solve this problem?

Solutions:

  1. simply flush the TLB on context switches, thus emptying it before running the next process.

    • On a software-based system, this can be accomplished with an explicit (and privileged) hardware instruction.
    • with a hardware-managed TLB, the flush could be enacted when the page-table base register is changed (note the OS must change the PTBR on a context switch anyhow).
  2. some systems add hardware support to enable sharing of the TLB across context switches.

    • some hardware systems provide an address space identifier (ASID) field in the TLB. You can think of the ASID as a process identifier (PID), but usually it has fewer bits (e.g., 8 bits for the ASID versus 32 bits for a PID).
    • the hardware also needs to know which process is currently running in order to perform translations, and thus the OS must, on a context switch, set some privileged register to the ASID of the current process.
    • two different processes with two different VPNs that point to the same physical page: for example, when two processes share a page (a code page, for example).
$$ \begin{array}{c|c|c|c|c} \text { VPN } & \text { PFN } & \text { valid } & \text { prot } & \text { ASID } \\ \hline 10 & 101 & 1 & \mathrm{r}-\mathrm{x} & 1 \\ - & - & 0 & - & - \\ 50 & 101 & 1 & \mathrm{r}-\mathrm{x} & 2 \\ - & - & 0 & - & - \end{array} $$

19.6 Issue: Replacement Policy

cache replacement

THE CRUX: HOW TO DESIGN TLB REPLACEMENT POLICY
Which TLB entry should be replaced when we add a new TLB entry? The goal, of course, being to minimize the miss rate (or increase hit rate) and thus improve performance.

  1. One common approach is to evict the least-recently-used or LRU entry.
  2. Another typical approach is to use a random policy. Randomness sometimes makes a bad decision but has the nice property that there are not any weird corner case behaviors that can cause pessimal behavior.

19.7 A Real TLB Entry

A MIPS TLB Entry

  • VPN

    • The MIPS R4000 supports a 32-bit address space with 4KB pages. Thus, we would expect a 20-bit VPN and 12-bit offset in our typical virtual address. However, as you can see in the TLB, there are only 19 bits for the VPN; as it turns out, user addresses will only come from half the address space (the rest reserved for the kernel) and hence only 19 bits of VPN are needed.
  • PFN

    • The VPN translates to up to a 24-bit physical frame number (PFN), and hence can support systems with up to 64GB of (physical) main memory ($2^{24}$ 4KB pages).
  • global bit (G)

    • is used for pages that are globally-shared among processes. Thus, if the global bit is set, the ASID is ignored.
  • 8-bit ASID

    • which the OS can use to distinguish between address spaces.
  • 3 Coherence (C) bits

    • which determine how a page is cached by the hardware
  • dirty bit

    • which is marked when the page has been written to
  • valid bit

    • which tells the hardware if there is a valid translation present in the entry.
  • page mask field (not shown)

    • which supports multiple page sizes;
  • some of the 64 bits are unused

instructions to update the TLB.

  • TLBP: probes the TLB to see if a particular translation is in there;
  • TLBR: reads the contents of a TLB entry into registers;
  • TLBWI: replaces a specific TLB entry;
  • TLBWR: replaces a random TLB entry.
  • these instructions are privileged;

TIP: RAM ISN’T ALWAYS RAM (CULLER’S LAW)
Sometimes randomly accessing your address space, particular if the number of pages accessed exceeds the TLB coverage, can lead to severe performance penalties.

19.8 Summary

  • TLB coverage

    • if the number of pages a program accesses in a short period of time exceeds the number of pages that fit into the TLB, the program will generate a large number of TLB misses, and thus run quite a bit more slowly.
    • One solution, is to include support for larger page sizes.
    • Support for large pages is often exploited by programs such as a database management system (a DBMS).
  • TLB access can easily become a bottleneck in the CPU pipeline.

    • physically-indexed cache. With such a cache, address translation has to take place before the cache is accessed, which can slow things down quite a bit.
    • virtually-indexed cache solves some performance problems, but introduces new issues into hardware design as well.

Chapter 20 Paging: Smaller Tables

Chapter summary:

  • How can the memory taken up by page tables be reduced? This chapter presents five approaches and analyzes their pros and cons.
  1. Multiple (bigger) page sizes: larger pages mean fewer entries, but the cost is internal fragmentation.
  2. Hybrid approach, paging and segments: one page table per segment, so unused regions between segments need no entries; this treats the symptom rather than the cause, and a large sparse heap is still a problem.
  3. Multi-level page tables: a page directory manages the pages of the page table. The cost is extra management and extra memory accesses, trading time for space.
  4. Inverted page tables: gather all translations into one big data structure with an entry per physical page, and handle lookups with ordinary data-structure algorithms.
  5. Swapping the Page Tables to Disk

20.0 Intro

  • For a single process:

    • Assume again a 32-bit address space ($2^{32}$ bytes), with 4KB ($2^{12}$ byte)
      pages and a 4-byte page-table entry. An address space thus has roughly one million virtual pages in it ( $\frac{2^{32}}{2^{12}}$ ); multiply by the page-table size and you see that our page table is 4MB in size.
  • With a hundred active processes (not uncommon on a modern system), we will be allocating hundreds of megabytes of memory just for page tables!

CRUX: HOW TO MAKE PAGE TABLES SMALLER?
Simple array-based page tables (usually called linear page tables) are too big, taking up far too much memory on typical systems. How can we make page tables smaller?
What are the key ideas?
What inefficiencies arise as a result of these new data structures?

20.1 Simple Solution: Bigger Pages

  • MULTIPLE PAGE SIZES

    • Usually, a small (4KB or 8KB) page size is used. However, if a “smart” application requests it, a single large page (e.g., of size 4MB) can be used for a specific portion of the address space.
    • The main reason for multiple page sizes is not to save page table space, however; it is to reduce pressure on the TLB, enabling a program to access more of its address space without suffering from too many TLB misses.
  • The major problem with this approach is that big pages lead to waste within each page, a problem known as internal fragmentation.

20.2 Hybrid Approach: Paging and Segments

The idea: one page table per logical segment.

we use the base not to point to the segment itself but rather to hold the physical address of the page table of that segment.

there are thus three base/bounds pairs.

On a context switch, these registers must be changed to reflect the location of the page tables of the newly-running process.

the hardware uses the segment bits (SN) to determine which base and bounds pair to use.

each bounds register holds the value of the maximum valid page in the segment.
For example, if the code segment is using its first three pages (0, 1, and 2), the code segment page table will only have three entries allocated to it and the bounds register will be set to 3; memory accesses beyond the end of the segment will generate an exception and likely lead to the termination of the process.
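
A compilable sketch of how the PTE address could be formed on a TLB miss in this hybrid scheme (the 2-bit segment number, 4KB pages, and the Base/Bounds arrays are illustrative assumptions, not the book's exact figure):

```c
#include <stdint.h>

#define SEG_SHIFT  30           // top two bits of a 32-bit VA pick the segment
#define VPN_SHIFT  12           // 4KB pages
#define VPN_MASK   0x3FFFF000u  // VPN bits between the segment bits and the offset

// Base[sn] holds the physical address of segment sn's page table;
// Bounds[sn] holds the number of valid pages in that segment.
// Returns the PTE's physical address, or 0 to stand in for a segmentation fault.
uint32_t hybrid_pte_addr(uint32_t va, const uint32_t Base[4],
                         const uint32_t Bounds[4], uint32_t pte_size) {
    uint32_t sn  = va >> SEG_SHIFT;                  // which segment (code/heap/stack)
    uint32_t vpn = (va & VPN_MASK) >> VPN_SHIFT;     // page within that segment
    if (vpn >= Bounds[sn])
        return 0;                                    // beyond the segment's page table
    return Base[sn] + vpn * pte_size;                // entry inside that segment's table
}
```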

unallocated pages between the stack and the heap no longer take up space in a page table (just to mark them as not valid).

这种思路的问题:

  1. segmentation is not quite as flexible as we would like, as it assumes a certain usage pattern of the address space; if we have a large but sparsely-used heap, for example, we can still end up with a lot of page table waste.
  2. this hybrid causes external fragmentation to arise again.

20.3 Multi-level Page Tables

chop up the page table into page-sized units; then, if an entire page of page-table entries (PTEs) is invalid, don’t allocate that page of the page table at all.

To track whether a page of the page table is valid (and if valid, where it is in memory), use a new structure, called the page directory.
The page directory thus either can be used to tell you where a page of the page table is, or that the entire page of the page table contains no valid pages.

Linear (Left) And Multi-Level (Right) Page Tables

The page directory, in a simple two-level table, contains one entry per page of the page table. It consists of a number of **page directory entries (PDE)**.

A PDE (minimally) has a valid bit and a page frame number (PFN). if the PDE entry is valid, it means that at least one of the pages of the page table that the entry points to (via the PFN) is valid.

we add a level of indirection through use of the page directory.

Advantages of this approach

  • it is generally compact and supports sparse address spaces.
  • if carefully constructed, each portion of the page table fits neatly within a page, making it easier to manage memory; the OS can simply grab the next free page when it needs to allocate or grow a page table.

Costs of this approach

  • on a TLB miss, two loads from memory will be required to get the right translation information from the page table (one for the page directory, and one for the PTE itself). Thus, the multi-level table is a small example of a time-space trade-off.
  • complexity

A Detailed Multi-Level Example

A Page Directory, And Pieces Of The Page Table (left: page directory; middle: page of PT @PFN:100; right: page of PT @PFN:101):

$$
\begin{array}{cc|ccc|ccc}
\text{PFN} & \text{valid?} & \text{PFN} & \text{valid} & \text{prot} & \text{PFN} & \text{valid} & \text{prot} \\
\hline
100 & 1 & 10 & 1 & \text{r-x} & - & 0 & - \\
-   & 0 & 23 & 1 & \text{r-x} & - & 0 & - \\
-   & 0 & -  & 0 & -          & - & 0 & - \\
-   & 0 & -  & 0 & -          & - & 0 & - \\
-   & 0 & 80 & 1 & \text{rw-} & - & 0 & - \\
-   & 0 & 59 & 1 & \text{rw-} & - & 0 & - \\
-   & 0 & -  & 0 & -          & - & 0 & - \\
-   & 0 & -  & 0 & -          & - & 0 & - \\
-   & 0 & -  & 0 & -          & - & 0 & - \\
-   & 0 & -  & 0 & -          & - & 0 & - \\
-   & 0 & -  & 0 & -          & - & 0 & - \\
-   & 0 & -  & 0 & -          & - & 0 & - \\
-   & 0 & -  & 0 & -          & - & 0 & - \\
-   & 0 & -  & 0 & -          & - & 0 & - \\
-   & 0 & -  & 0 & -          & 55 & 1 & \text{rw-} \\
101 & 1 & -  & 0 & -          & 45 & 1 & \text{rw-}
\end{array}
$$

More Than Two Levels

what if the page directory gets too big?

The Translation Process: Remember The TLB

Multi-level Page Table Control Flow:
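
The book gives this as a control-flow figure; below is a toy, compilable sketch of the same two-step walk under assumed parameters (32-bit addresses, 4KB pages, a 10/10/12 bit split, 4-byte entries; phys_mem and pdbr are hypothetical stand-ins for physical memory and the page-directory base register):

```c
#include <stdint.h>

#define PD_INDEX(va)  (((va) >> 22) & 0x3FFu)   // top 10 bits index the page directory
#define PT_INDEX(va)  (((va) >> 12) & 0x3FFu)   // next 10 bits index the chosen page of the page table
#define OFFSET(va)    ((va) & 0xFFFu)           // low 12 bits are the offset within the page
#define ENTRY_VALID   0x1u
#define ENTRY_PFN(e)  ((e) >> 12)

// phys_mem models word-addressable RAM; pdbr is the page directory's physical (byte) address.
// Returns the physical address, or UINT64_MAX if either level is invalid (a fault, in hardware).
uint64_t two_level_translate(const uint32_t *phys_mem, uint32_t pdbr, uint32_t va) {
    uint32_t pde = phys_mem[pdbr / 4 + PD_INDEX(va)];              // first memory access: the PDE
    if (!(pde & ENTRY_VALID))
        return UINT64_MAX;                                         // no page of the page table here
    uint32_t pte = phys_mem[ENTRY_PFN(pde) * 1024 + PT_INDEX(va)]; // second access: the PTE
    if (!(pte & ENTRY_VALID))
        return UINT64_MAX;                                         // page not mapped
    return ((uint64_t)ENTRY_PFN(pte) << 12) | OFFSET(va);          // concatenate PFN and offset
}
```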

20.4 Inverted Page Tables

instead of having many page tables (one per process of the system), we keep a single page table that has an entry for each physical page of the system.
Finding the correct entry is now a matter of searching through this data structure.

page tables are just data structures.

20.5 Swapping the Page Tables to Disk

some systems place such page tables in kernel virtual memory, thereby allowing the system to swap some of these page tables to disk when memory pressure gets a little tight.

20.6 Summary

  • The trade-offs such tables present are in time and space – the bigger the table, the faster a TLB miss can be serviced, as well as the converse.
  • In a memory-constrained system (like many older systems), small structures make sense.

Chapter 21 Beyond Physical Memory: Mechanisms

Chapter summary

  • The slow disk is incorporated into the memory hierarchy as swap space, addressed by disk addresses.
  • When a page is not present, the OS's page-fault handler fetches it from disk; the process is blocked until the I/O completes, so it runs very slowly.
  • To keep some memory free, the OS runs page replacement in the background, e.g., the high watermark (HW) / low watermark (LW) mechanism.
  • When pages really must be written out to swap, the OS clusters or groups them and writes them in batches.

21.0 Intro

In our memory hierarchy, big and slow hard disk drives sit at the bottom.

THE CRUX: HOW TO GO BEYOND PHYSICAL MEMORY
How can the OS make use of a larger, slower device to transparently provide the illusion of a large virtual address space?

A contrast is found in older systems that used memory overlays, which required programmers to manually move pieces of code or data in and out of memory as they were needed.

The addition of swap space allows the OS to support the illusion of a large virtual memory for multiple concurrently-running processes.

21.1 Swap Space

We will simply assume that the OS can read from and write to the swap space, in page-sized units. To do so, the OS will need to remember the disk address of a given page.

21.2 The Present Bit

If the present bit is set to one, it means the page is present in physical memory and everything proceeds as above; if it is set to zero, the page is not in memory but rather on disk somewhere. The act of accessing a page that is not in physical memory is commonly referred to as a page fault.

21.3 The Page Fault

When a page fault occurs, the OS page-fault handler runs to determine what to do.

When the OS receives a page fault for a page, it looks in the PTE to find the address, and issues the request to disk to fetch the page into memory.

WHY HARDWARE DOESN’T HANDLE PAGE FAULTS

  • the disk operation itself is traditionally so slow that the extra overheads of running software are minimal.
  • to be able to handle a page fault, the hardware would have to understand swap space, how to issue I/Os to the disk, and a lot of other details which it currently doesn’t know much about.

The sequence in detail:
When the disk I/O completes, the OS will then update the page table to mark the page as present, update the PFN field of the page-table entry(PTE) to record the in-memory location of the newly-fetched page, and retry the instruction. This next attempt may generate a TLB miss, which would then be serviced and update the TLB with the translation (one could alternately update the TLB upon when servicing the page fault, to avoid this step). Finally, a last restart would find the translation in the TLB and thus proceed to fetch the desired data or instruction from memory at the translated physical address.

Note that while the I/O is in flight, the process will be in the blocked state.

21.4 What If Memory Is Full?

The process of picking a page to kick out, or replace is known as the page-replacement policy.

Making the wrong decision can cause a program to run at disk-like speeds instead of memory-like speeds.

21.5 Page Fault Control Flow

  • Page-Fault Control Flow Algorithm (Hardware)
VPN = (VirtualAddress & VPN_MASK) >> SHIFT
(Success, TlbEntry) = TLB_Lookup(VPN)
if (Success == True)    // TLB Hit
    if (CanAccess(TlbEntry.ProtectBits) == True)
        Offset   = VirtualAddress & OFFSET_MASK
        PhysAddr = (TlbEntry.PFN << SHIFT) | Offset
        Register = AccessMemory(PhysAddr)
    else
        RaiseException(PROTECTION_FAULT)
else                    // TLB Miss
    PTEAddr = PTBR + (VPN * sizeof(PTE))
    PTE = AccessMemory(PTEAddr)
    if (PTE.Valid == False)
        RaiseException(SEGMENTATION_FAULT)
    else
        if (CanAccess(PTE.ProtectBits) == False)
            RaiseException(PROTECTION_FAULT)
        else if (PTE.Present == True)
            // assuming hardware-managed TLB
            TLB_Insert(VPN, PTE.PFN, PTE.ProtectBits)
            RetryInstruction()
        else if (PTE.Present == False)
            RaiseException(PAGE_FAULT)
  • Page-Fault Control Flow Algorithm (Software)
PFN = FindFreePhysicalPage()
if (PFN == -1)                  // no free page found
    PFN = EvictPage()           // run replacement algorithm
DiskRead(PTE.DiskAddr, PFN)     // sleep (waiting for I/O)
PTE.present = True              // update page table with present
PTE.PFN = PFN                   // bit and translation (PFN)
RetryInstruction()              // retry instruction

21.6 When Replacements Really Occur

To keep a small amount of memory free, most operating systems thus have some kind of high watermark (HW ) and low watermark (LW ) to help decide when to start evicting pages from memory.

when the OS notices that there are fewer than LW pages available, a background thread that is responsible for freeing memory runs. The thread evicts pages until there are HW pages available.

The background thread is sometimes called the swap daemon or page daemon.

many systems will cluster or group a number of pages and write them out at once to the swap partition, thus increasing the efficiency of the disk.

Chapter 22 Beyond Physical Memory: Policies

Chapter summary

  • Introduces the OS's policies for deciding which pages stay in memory. The basic idea is to use history (or other hints) to prioritize some pages and evict others; since the future is unknown, no truly optimal policy can be implemented.
  • Managing memory can be viewed as a form of cache management, with an evaluation metric: AMAT.
  • Policies and related topics covered: FIFO, random, LRU, dirty pages, prefetching, thrashing.

22.0 Intro

This memory pressure forces the OS to start paging out pages to make room for actively-used pages. Deciding which page (or pages) to evict is encapsulated within the replacement policy of the OS.

THE CRUX: HOW TO DECIDE WHICH PAGE TO EVICT
How can the OS decide which page (or pages) to evict from memory?
This decision is made by the replacement policy of the system, which usually follows some general principles (discussed below) but also includes certain tweaks to avoid corner-case behaviors.

22.1 Cache Management

Given that main memory holds some subset of all the pages in the system, it can rightly be viewed as a cache for virtual memory pages in the system.

  • our goal in picking a replacement
    policy for this cache is to minimize the number of cache misses; that is, to minimize the number of times that we have to go to disk to fetch the desired page.

  • average memory access time (AMAT) for a program (a metric computer architects compute for hardware caches):

    $$ \text{AMAT} = (\text{Hit}\% · T_M) + (\text{Miss}\% · T_D) $$
  • where $T_M$ represents the cost of accessing memory, and represents $T_D$ the cost of accessing disk.

  • the cost of disk access is so high in modern systems that even a tiny miss rate will quickly dominate the overall AMAT of running programs.
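
As a quick worked example with round numbers (say $T_M \approx 100$ ns and $T_D \approx 10$ ms), even a 99.9% hit rate leaves the average access time far above memory speed:

$$ \text{AMAT} = 0.999 \cdot 100\,\text{ns} + 0.001 \cdot 10\,\text{ms} \approx 10.1\,\mu s $$

That is roughly 100 times slower than a pure memory access; only a hit rate extremely close to 100% brings AMAT back down toward $T_M$.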

22.2 The Optimal Replacement Policy

Replacing the page that will be accessed furthest in the future is the optimal policy. The reason this is true is simple: you will refer to the other pages before you refer to the one furthest out.

TYPES OF CACHE MISSES

  • Three C’s: compulsory, capacity, and conflict misses.
  • compulsory miss (or cold-start miss) occurs because the cache is empty to begin with and this is the first reference to the item.
  • a capacity miss occurs because the cache ran out of space and had to evict an item to bring a new item into the cache.
  • conflict miss arises in hardware because of limits on where an item can be placed in a hardware cache, due to something known as set-associativity; it does not arise in the OS page cache because such caches are always fully-associative, i.e., there are no restrictions on where in memory a page can be placed.

the future is not generally known; you can’t build the optimal policy for a general-purpose operating system. The optimal policy will thus serve only as a comparison point.

22.3 A Simple Policy: FIFO

  • BELADY’S ANOMALY

    • In general, you would expect the cache hit rate to increase (get better) when the cache gets larger. But in this case, with FIFO, it gets worse!
    • LRU has what is known as a stack property. For algorithms with this property, a cache of size $N + 1$ naturally includes the contents of a cache of size $N$. Thus, when increasing the cache size, hit rate will either stay the same or improve.
    • FIFO and Random (among others) clearly do not obey the stack property, and thus are susceptible to anomalous behavior (a small simulation after this list reproduces the anomaly).
  • FIFO simply can’t determine the importance of blocks: even though page 0 had been accessed a number of times, FIFO still kicks it out, simply because it was the first one brought into memory.
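
Belady's anomaly is easy to reproduce. The self-contained C sketch below simulates FIFO on the classic reference string 1,2,3,4,1,2,5,1,2,3,4,5 with 3 frames and then 4 frames; the larger cache actually gets fewer hits (2 vs. 3).

```c
#include <stdio.h>
#include <stdbool.h>

/* Count FIFO cache hits for a reference string with a given number of frames. */
static int fifo_hits(const int *refs, int nrefs, int nframes)
{
    int frames[16];          /* assumes nframes <= 16 */
    int next = 0;            /* index of the "first-in" slot to replace next */
    int used = 0, hits = 0;

    for (int i = 0; i < nrefs; i++) {
        bool hit = false;
        for (int j = 0; j < used; j++)
            if (frames[j] == refs[i]) { hit = true; break; }
        if (hit) {
            hits++;
        } else if (used < nframes) {
            frames[used++] = refs[i];          /* fill an empty frame       */
        } else {
            frames[next] = refs[i];            /* evict the first-in page   */
            next = (next + 1) % nframes;
        }
    }
    return hits;
}

int main(void)
{
    /* The classic reference string used to demonstrate Belady's anomaly. */
    int refs[] = {1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5};
    int n = (int)(sizeof(refs) / sizeof(refs[0]));

    printf("3 frames: %d hits\n", fifo_hits(refs, n, 3));  /* prints 3 */
    printf("4 frames: %d hits\n", fifo_hits(refs, n, 4));  /* prints 2 */
    return 0;
}
```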

22.4 Another Simple Policy: Random

22.5 Using History: LRU

  • historical information

    • frequency
    • recency of access
  • This family of policies is based on what people refer to as the principle of locality.

  • The Least-Frequently-Used (LFU) policy replaces the least-frequently-used page.

  • the Least-Recently-Used (LRU) policy replaces the least-recently-used page.

  • the opposites of these algorithms exist: Most-Frequently-Used (MFU) and Most-Recently-Used (MRU).

  • the principle of locality

    • while locality is a good thing to keep in mind while designing caches of any kind (hardware or software), it does not guarantee success. Rather, it is a heuristic that often proves useful in the design of computer systems.

22.6 Workload Examples

  • The No-Locality Workload

  • no locality means that each reference is to a random page within the set of accessed pages.

  • “80-20” workload, which exhibits locality: 80% of the references are made to 20% of the pages (the “hot” pages); the remaining 20% of the references are made to the remaining 80% of the pages (the “cold” pages).

  • “looping sequential” workload

    • This workload, common in many applications (including important commercial applications such as databases [CD85]), represents a worst-case for both LRU and FIFO.

22.7 Implementing Historical Algorithms

  • One method that could help speed up a perfect LRU implementation is to add a little bit of hardware support. For example, a machine could update, on each page access, a time field in memory (for example, this could be in the per-process page table, or just in some separate array in memory, with one entry per physical page of the system); when a replacement is needed, the OS would scan all the time fields to find the least-recently-used page.
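
A minimal sketch of that scheme in user-space C: the per-page time field is just an array, the "hardware" stamp is a function call, and replacement is a linear scan, which is exactly what makes perfect LRU too expensive at scale. Names such as `access_page` and `lru_victim` are illustrative.

```c
#include <stdio.h>

#define NPAGES 4                         /* physical pages in this toy system */

static unsigned long now = 0;            /* a global logical clock            */
static unsigned long last_used[NPAGES];  /* per-page time field ("hardware")  */

/* On every access, the hardware would stamp the page's time field. */
static void access_page(int pfn)
{
    last_used[pfn] = ++now;
}

/* Perfect LRU: scan all pages and evict the one with the oldest timestamp.
 * The scan is the problem -- with millions of pages it is far too slow,
 * which is why real systems approximate LRU instead. */
static int lru_victim(void)
{
    int victim = 0;
    for (int pfn = 1; pfn < NPAGES; pfn++)
        if (last_used[pfn] < last_used[victim])
            victim = pfn;
    return victim;
}

int main(void)
{
    access_page(0); access_page(1); access_page(2); access_page(3);
    access_page(1); access_page(0);          /* pages 2 and 3 grow stale          */
    printf("evict page %d\n", lru_victim()); /* prints 2 (least recently used)    */
    return 0;
}
```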

CRUX: HOW TO IMPLEMENT AN LRU REPLACEMENT POLICY
Given that it will be expensive to implement perfect LRU, can we approximate it in some way, and still obtain the desired behavior?

22.8 Approximating LRU

  • The idea requires some hardware support, in the form of a use bit (sometimes called the reference bit).

  • clock algorithm: Imagine all the pages of the system arranged in a circular list. A clock hand points to some particular page to begin with (it doesn't really matter which). When a replacement must occur, the OS checks whether the currently-pointed-to page $P$ has a use bit of 1 or 0. If 1, this implies that page $P$ was recently used and thus is not a good candidate for replacement; the clock hand is advanced to the next page $P + 1$, and the use bit for $P$ is set to 0 (cleared). The algorithm continues until it finds a use bit that is set to 0, implying this page has not been recently used (or, in the worst case, that all pages have been used and that we have now searched through the entire set of pages, clearing all the bits). A minimal sketch in C appears after this list.

  • The 80-20 Workload With Clock
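
Below is a minimal, self-contained C sketch of the clock algorithm. It assumes a per-page `use_bit` that hardware would set on each access; the array, `clock_hand`, and `clock_evict` names are illustrative, and all real bookkeeping (mappings, page contents) is omitted. A fuller implementation would also consult the dirty (modified) bit discussed in the next section, preferring clean victims to avoid a write-back.

```c
#include <stdio.h>

#define NPAGES 8

static int use_bit[NPAGES];     /* set to 1 by "hardware" whenever the page is accessed */
static int clock_hand = 0;      /* position of the clock hand in the circular list      */

/* The clock algorithm: sweep forward, clearing use bits, until a page with
 * use bit 0 is found; that page has not been used recently and becomes the victim. */
static int clock_evict(void)
{
    for (;;) {
        if (use_bit[clock_hand] == 0) {
            int victim = clock_hand;
            clock_hand = (clock_hand + 1) % NPAGES;
            return victim;
        }
        use_bit[clock_hand] = 0;                 /* give the page a second chance */
        clock_hand = (clock_hand + 1) % NPAGES;
    }
}

int main(void)
{
    /* Pretend pages 0-5 and 7 were touched recently, page 6 was not. */
    for (int i = 0; i < NPAGES; i++)
        use_bit[i] = (i != 6);

    printf("victim: %d\n", clock_evict());   /* prints 6 (its use bit was already 0)  */
    printf("victim: %d\n", clock_evict());   /* prints 0: page 7 gets a second chance */
    return 0;
}
```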

22.9 Considering Dirty Pages

  • some VM systems prefer to evict clean pages over dirty pages. If a page has been modified and is thus dirty, it must be written back to disk to evict it, which is expensive. If it has not been modified (and is thus clean), the eviction is free; the physical frame can simply be reused for other purposes without additional I/O.
  • the hardware should include a modified bit (a.k.a. dirty bit) to support this policy.

22.10 Other VM Policies

  • the OS also has to decide when to bring a page into memory; this policy is sometimes called the page selection policy.

    • For most pages, the OS simply uses demand paging, which means the OS brings the page into memory when it is accessed.
    • the OS could guess that a page is about to be used, and thus bring it in ahead of time; this behavior is known as prefetching.
  • Another policy determines how the OS writes pages out to disk.

    • clustering or simply grouping of writes.

22.11 Thrashing

  • what should the OS do when memory is simply oversubscribed, and the memory demands of the set of running processes simply exceed the available physical memory?

  • the system will constantly be paging, a condition sometimes referred to as thrashing.

  • admission control

    • given a set of processes, a system could decide not to run a subset of processes, with the hope that the reduced set of processes' working sets (the pages they are using actively) fit in memory and thus can make progress.
  • Some systems instead run an out-of-memory killer when memory is oversubscribed; this daemon chooses a memory-intensive process and kills it.

Chapter 23 The VAX/VMS Virtual Memory System

Chapter Summary

  • Introduces several techniques used by the VAX/VMS virtual memory system.
  • The kernel virtual address space is mapped into every user address space, under proper protection; as a result, the kernel appears almost as a library to applications.
  • Segmented FIFO replacement policy: when a process exceeds its resident set size, its first-in page is moved onto a global clean-page free list or dirty-page list (second-chance lists), from which the memory may eventually be reclaimed.
  • Protection bits are used to emulate reference bits.
  • Applies the design principle of laziness, e.g., demand zeroing of pages.

23.1 Background

The VAX-11 minicomputer architecture was introduced in the late 1970's by Digital Equipment Corporation (DEC). Because the architecture spanned machines of widely different sizes and costs, the OS had to have mechanisms and policies that worked (and worked well) across this huge range of systems.

THE CRUX: HOW TO AVOID THE CURSE OF GENERALITY

23.2 Memory Management Hardware

  • 32-bit virtual address space per process, divided into 512-byte pages.

  • a virtual address consisted of a 23-bit VPN and a 9-bit offset. Further, the upper two bits of the VPN were used to differentiate which segment the page resided within (see the address-decoding sketch after this list).

  • hybrid of paging and segmentation

  • The VAX/VMS Address Space

  • One major concern of the VMS designers was the incredibly small size of pages in the VAX hardware (512 bytes), which threatened to make linear page tables enormous; VMS reduced this memory pressure in two ways:

    1. by segmenting the user address space into two, the VAX-11 provides a page table for each of these regions (P0 and P1) per process; thus, no page-table space is needed for the unused portion of the address space between the stack and the heap. The base and bounds registers are used; a base register holds the address of the page table for that segment, and the bounds holds its size (i.e., number of page-table entries).
    2. by placing user page tables (for P0 and P1, thus two per process) in kernel virtual memory. Putting page tables in kernel virtual memory means that address translation is even further complicated.
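
The 2-bit segment / 21-bit in-segment page / 9-bit offset split can be made concrete with a few shifts and masks. The small C sketch below decodes some example virtual addresses; the macro and field names are mine, not VAX notation (recall that the full 23-bit VPN includes the two segment-selecting bits).

```c
#include <stdio.h>
#include <stdint.h>

/* VAX-11: 32-bit virtual address, 512-byte pages.
 *   bits 31..30  segment (00 = P0, 01 = P1, 10 = S "system")
 *   bits 29..9   page number within the segment (21 bits)
 *   bits  8..0   offset within the 512-byte page
 * Field extraction below is a sketch; names are illustrative. */
#define OFFSET_BITS 9
#define OFFSET_MASK ((1u << OFFSET_BITS) - 1)     /* 0x1FF */

static void decode(uint32_t va)
{
    unsigned offset  = va & OFFSET_MASK;
    unsigned vpn     = va >> OFFSET_BITS;         /* full 23-bit VPN              */
    unsigned segment = va >> 30;                  /* top two bits of the VPN      */
    unsigned in_seg  = vpn & ((1u << 21) - 1);    /* page number inside segment   */

    printf("va=0x%08x  seg=%u  vpn=0x%06x  page-in-seg=0x%06x  offset=0x%03x\n",
           (unsigned)va, segment, vpn, in_seg, offset);
}

int main(void)
{
    decode(0x00000200);   /* P0 segment (00), page 1, offset 0 */
    decode(0x40000205);   /* P1 segment (01), page 1, offset 5 */
    decode(0x80000000);   /* S  segment (10), page 0, offset 0 */
    return 0;
}
```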

23.3 A Real Address Space

  • the kernel virtual address space (i.e., its data structures and code) is a part of each user address space.

    • On a context switch, the OS changes the P0 and P1 registers to point to the appropriate page tables of the soon-to-be-run process; however, it does not change the S base and bound registers, and as a result the “same” kernel structures are mapped into each user address space.

    • The kernel is mapped into each address space for a number of reasons.

      • This construction makes life easier for the kernel; when, for example, the OS is handed a pointer from a user program (e.g., on a write() system call), it is easy to copy data from that pointer to its own structures. The OS is naturally written and compiled, without worry of where the data it is accessing comes from. If in contrast the kernel were located entirely in physical memory, it would be quite hard to do things like swap pages of the page table to disk; if the kernel were given its own address space, moving data between user applications and the kernel would again be complicated and painful.
    • With this construction (now used widely), the kernel appears almost as a library to applications, albeit a protected one.

  • protection bits in the page table specify what privilege level the CPU must be at in order to access a particular page.

23.4 Page Replacement

  • no reference bit! Thus, the VMS replacement algorithm must make do without hardware support for determining which pages are in active use.

  • memory hogs, programs that use a lot of memory and make it hard for other programs to run.

  • Segmented FIFO

    • segmented FIFO replacement policy. The idea is simple: each process has a maximum number of pages it can keep in memory, known as its resident set size (RSS). Each of these pages is kept on a FIFO list; when a process exceeds its RSS, the “first-in” page is evicted.
    • To improve FIFO’s performance, VMS introduced two second-chance lists where pages are placed before getting evicted from memory, specifically a global clean-page free list and dirty-page list. When a process P exceeds its RSS, a page is removed from its per-process FIFO; if clean (not modified), it is placed on the end of the clean-page list; if dirty (modified), it is placed on the end of the dirty-page list.
  • Page Clustering

  • EMULATING REFERENCE BITS

    • you don't need a hardware reference bit in order to get some notion of which pages are in use in a system. Protection bits on the VAX can be used to emulate reference bits.
    • if you want to gain some understanding of which pages are actively being used in a system, mark all of the pages in the page table as inaccessible (but keep around the information as to which pages are really accessible by the process, perhaps in the “reserved OS field” portion of the page table entry). When a process accesses a page, it will generate a trap into the OS; the OS will then check if the page really should be accessible, and if so, revert the page to its normal protections (e.g., read-only, or read-write). At the time of a replacement, the OS can check which pages remain marked inaccessible, and thus get an idea of which pages have not been recently used.

23.5 Other Neat VM Tricks

  • One form of laziness in VMS (and most modern systems) is demand zeroing of pages.

    • In a naive implementation

      • the OS responds to a request to add a page to your heap by finding a page in physical memory, zeroing it (required for security; otherwise you’d be able to see what was on the page from when some other process used it!), and then mapping it into your address space (i.e., setting up the page table to refer to that physical page as desired).
      • But the naive implementation can be costly.
    • demand zeroing

      • With demand zeroing, the OS instead does very little work up front: it puts an entry in the page table that marks the page inaccessible. If the process then reads or writes the page, a trap into the OS takes place. When handling the trap, the OS notices (usually through some bits marked in the "reserved for OS" portion of the page table entry) that this is actually a demand-zero page; at this point, the OS then does the needed work of finding a physical page, zeroing it, and mapping it into the process's address space.

TIP: BE LAZY

  • Laziness can put off work until later.
  • First, putting off work might reduce the latency of the current operation, thus improving responsiveness;
  • Second, and more importantly, laziness sometimes obviates the need to do the work at all;
  • copy-on-write (COW for short).

    • when the OS needs to copy a page from one address space to another, instead of copying it, it can map it into the target address space and mark it read-only in both address spaces.
    • If, however, one of the address spaces does indeed try to write to the page, it will trap into the OS. The OS will then notice that the page is a COW page, and thus (lazily) allocate a new page, fill it with the data, and map this new page into the address space of the faulting process.
    • In UNIX systems, COW is even more critical, due to the semantics of fork() and exec().
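
The COW write-fault path can be pictured with a toy user-space model: two page-table entries share one "frame" read-only, and the fault handler lazily gives the writer a private copy. Everything here (struct pte, handle_write_fault, the malloc-backed frames) is illustrative, not real VMS or UNIX kernel code.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>

#define PAGE_SIZE 512

struct pte {
    char *frame;        /* "physical" frame backing this page */
    bool  writable;
    bool  cow;          /* shared copy-on-write mapping?      */
};

/* Invoked when a store hits a read-only page: if it is a COW page,
 * lazily give the writer its own private copy and let it retry. */
static bool handle_write_fault(struct pte *pte)
{
    if (!pte->cow)
        return false;                       /* a genuine protection violation */

    char *copy = malloc(PAGE_SIZE);         /* only now do we pay for the copy */
    memcpy(copy, pte->frame, PAGE_SIZE);

    pte->frame    = copy;                   /* remap to the private copy       */
    pte->cow      = false;
    pte->writable = true;
    return true;                            /* retry the faulting store        */
}

int main(void)
{
    char *shared = malloc(PAGE_SIZE);
    strcpy(shared, "original");

    /* "Parent" and "child" share the frame read-only after a fork-like copy. */
    struct pte parent = { shared, false, true };
    struct pte child  = { shared, false, true };

    /* Child writes: the fault handler breaks the sharing. */
    if (handle_write_fault(&child))
        strcpy(child.frame, "child data");

    printf("parent sees: %s\n", parent.frame);  /* still "original"   */
    printf("child  sees: %s\n", child.frame);   /* now "child data"   */
    return 0;
}
```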
