Category Archives: 6.824-2013

Hypervisor-based Fault-tolerance




Protocols to implement a fault-tolerant computing system are described.
These protocols augment the hypervisor of a virtual machine manager and coordinate a primary virtual machine with its backup.
No modification to hardware, operating system, or application programs is required.

1. Introduction

One popular scheme for implementing fault tolerance involves replicating a computation on processors that fail independently.
Replicas are coordinated so that they execute the same sequence of instructions and produce the same results.
Use of a hypervisor to implement replica coordination is attractive.

This paper addresses these issues by describing the protocols and performance of a prototype implementation of hypervisor-based fault-tolerance.

In §2, we describe the protocols. These protocols ensure that the sequences of instructions executed by two virtual machines running on different physical processors are identical.
The protocols also coordinate I/O issued by these virtual machines.

2. Replica-Coordination Protocols

In the primary/backup approach to fault-tolerance, n processors implement a system that can tolerate n-1 faults.
One processor is designated the primary and the others are designated backups. To obtain service, clients make requests of the primary.
The primary responds to each request and informs the backups of its actions so that a backup can take over if the primary fails.

Our implementation of fault-tolerant virtual machines uses the primary/backup approach in the hypervisor.

Our protocols use a single backup and implement a 1-fault-tolerant virtual machine. The protocols cause the backup virtual machine to execute exactly the same
sequence of instructions as the primary virtual machine, where each instruction executed by the backup has the same effect as when it is executed by the primary.
The protocols also ensure that the environment does not see an anomalous sequence of I/O requests if the primary fails and the backup takes over while an I/O
operation is in progress.

One obvious assumption, required so that the backup virtual machine can take over for the primary:

  • I/O Device Accessibility Assumption: I/O devices accessible to the processor executing the primary virtual machine are also accessible to the processor executing the backup virtual machine.
  • Virtual Machine Assumption: System and user software executes correctly under the hypervisor.

2.1 Identical Instruction Streams

In our scheme, a given instruction must have the same effect whether it is executed by the primary virtual machine or the backup.

Define the virtual machine state to include the memory and registers that change only with execution of instructions by that virtual machine.

We partition the instruction set into ordinary instructions, whose behavior is completely determined by virtual-machine state, and environment instructions, whose
behavior is not. Examples of ordinary instructions include those for arithmetic and data movement; examples of environment instructions include those for reading the
time-of-day clock, loading the interval timer, and performing I/O.

  • Ordinary instructions: stay within the system and affect only registers and memory.
  • Environment instructions: interact with the outside world, reading data from it or writing data to it; their behavior is not determined by virtual-machine state alone.

In order that the primary and backup virtual machines execute exactly the same sequence of instructions, both virtual machines are started in the same state.
By definition, the effects of executing ordinary instructions depend only on the virtual-machine state.

  • Ordinary Instruction Assumption: Executing the same ordinary instruction on two processors in the same virtual-machine state has exactly the same effect.

Another assumption ensures that when executing an environment instruction, the hypervisors at the primary and backup virtual machines have an opportunity to
communicate. This allows both hypervisors to change the virtual-machine state in the same way.

  • Environment Instruction Assumption: Environment instructions are simulated by the hypervisor (and not executed directly by the hardware). The simulation
    ensures that a given environment instruction executed on two processors in the same virtual-machine state has exactly the same effect on the virtual-machine state.

To guarantee that the primary and backup virtual machines execute the same sequence of instructions, we must
ensure that identical interrupts are delivered to each and at the same points in their instruction streams. The presence of a
hypervisor helps here. The primary's hypervisor can buffer and forward I/O interrupts it receives to the backup's hypervisor.
And, the primary's hypervisor can send to the backup's hypervisor information about the value of the interval timer at the
processor executing the primary virtual machine. Thus, by communicating with the primary's hypervisor, the backup's hypervisor
learns what interrupts it must deliver to the backup virtual machine.

This is because instruction-execution timing on most modern processors is unpredictable. Yet, interrupts must be
delivered at the same points in the primary and backup virtual machine instruction streams.
We must employ some other mech-
anism for transferring control to the hypervisor when a virtual machine reaches a specified point in its instruction stream.

The recovery register on HP’s PA-RISC processors is a register that is decremented each time an instruction completes;
an interrupt is caused when the recovery register becomes negative.

A hypervisor that uses the recovery register can thus ensure that epochs at the primary and backup virtual machines
each begin and end at exactly the same point in the instruction stream.
Interrupts are delivered only on epoch boundaries.
A recovery register or some similar mechanism is, therefore, assumed.

  • Instruction-Stream Interrupt Assumption: A mechanism is available to invoke the hypervisor when a specified point in the instruction stream is reached.

By virtue of the Instruction-Stream Interrupt Assumption, execution of a virtual machine is partitioned into epochs, and
corresponding epochs at the primary and the backup virtual machines comprise the same sequences of instructions.

We have only to ensure that the same interrupts are delivered at the backup as at the primary
when each epoch ends.
The solution to this is for the primary and backup hypervisors to communicate, and at the end of an
epoch i to have the backup's hypervisor deliver copies of the interrupts that the primary's hypervisor delivered at the end of its
epoch i.
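The end-of-epoch handoff described above can be sketched as a small simulation. This is a toy model under stated assumptions (EPOCH_LEN stands in for the recovery-register count, and interrupt effects are modeled as arbitrary state changes), not the paper's implementation:

```python
# Sketch: delivering interrupts only at epoch boundaries so the backup replays
# exactly what the primary saw. The interrupt log stands in for the messages
# the primary's hypervisor would send to the backup's hypervisor.
EPOCH_LEN = 4  # instructions per epoch (hypothetical; set via recovery register)

def run(instructions, interrupt_log=None):
    """Execute `instructions`; if interrupt_log is given (backup), replay it,
    otherwise (primary) record which interrupts arrived during each epoch."""
    state, log = 0, []
    pending = ["IO_DONE"]  # interrupts that arrive during epoch 0 (primary only)
    for epoch_start in range(0, len(instructions), EPOCH_LEN):
        for op in instructions[epoch_start:epoch_start + EPOCH_LEN]:
            state += op  # "ordinary instruction": depends only on state
        if interrupt_log is None:          # primary: log interrupts at the boundary
            delivered, pending = pending, []
            log.append(delivered)
        else:                              # backup: replay the primary's interrupts
            delivered = interrupt_log[epoch_start // EPOCH_LEN]
        for irq in delivered:
            state += hash(irq) % 7         # interrupt-handler effect (same on both)
    return state, log

prog = [1, 2, 3, 4, 5, 6, 7, 8]
primary_state, primary_log = run(prog)
backup_state, _ = run(prog, interrupt_log=primary_log)
assert primary_state == backup_state  # identical instructions + identical interrupts
```

Because both machines start in the same state and interrupts land only on epoch boundaries, the backup's state tracks the primary's exactly.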

We now summarize the protocol that ensures the primary and backup virtual machines each perform the same sequence
of instructions and receive the same interrupts.

The protocol is presented as a set of routines that are implemented in the hypervisor. These routines may be activated concurrently.

  • P0
  • P1
  • P2
  • P3
  • P4
  • P5
  • P6
  • P7

P0 through P7 ensure that the backup virtual machine executes the same sequence of instructions (each having
the same effect) as the primary virtual machine. P0 through P7 also ensure that if the primary virtual machine fails, then instructions
executed by the backup extend the sequence of instructions executed by the primary.

P0 through P7 do not guarantee that interrupts from I/O devices are not lost.

2.2 Interaction with an Environment

The state of the environment is affected by executing I/O instructions. We must ensure that the sequence of I/O instructions
seen by the environment is consistent with what could be observed were a single processor in use, even though our 1-fault-tolerant
virtual machine is built using two processors. The problem is best divided into two cases.

Our solution is to exploit the reality that I/O devices are themselves subject to transient failures, and device drivers already cope with these failures.

  • P8

The effect of P8 is to cause certain I/O instructions to be repeated.

A Prototype System


Performance of the Prototype

The prototype system was tested with both a CPU-intensive workload and I/O workloads, using epoch length as the varying dimension.




  1. Solve a problem with a new method
  2. Design a model for it
  3. Build a prototype system
  4. Test the prototype and collect measurements
  5. Draw conclusions from the measurements


  1. Define the problem: the real world is complex, so you need a clear problem definition first.
  2. Define your model; a model can only approximate the real world and is a subset of it.
  3. Define the model's assumptions (an axiom system of sorts).
  4. Define the model's state set.
  5. Define the operations and instruction set that can change the state.
  6. Define the fault tolerance you want.
  7. Define the consistency you want (operation ordering; the views seen by different clients issuing different operations).
  8. Implement the assumptions, then implement the model.
  9. Optimize.


  1. Identify the problem (fault tolerance)
  2. Identify the method (hypervisor replicas)
  3. Identify the key points (the five assumptions)
  4. Identify the test data and the factors that affect it (epoch length)
  5. Identify the method's weaknesses
  6. Think about where else the method could be applied


  1. The methods studied here can be applied to distributed storage systems.

Flat Datacenter Storage




Flat Datacenter Storage (FDS) is a high-performance, fault-tolerant, large-scale, locality-oblivious blob store.
Using a novel combination of full bisection bandwidth networks, data and metadata striping, and flow control,
FDS multiplexes an application's large-scale I/O across the available throughput and latency budget of every disk
in a cluster. FDS therefore makes many optimizations around data locality unnecessary. Disks also communicate
with each other at their full bandwidth, making recovery from disk failures extremely fast.

  • Transfers between disks run at full bandwidth, thanks to a new kind of network.

1 Introduction

Counterintuitively, locality constraints can sometimes even hinder efficient resource utilization.

Recently developed CLOS networks (large numbers of small commodity switches with
redundant interconnections) have made it economical to build non-oversubscribed full bisection bandwidth networks at the scale of a datacenter for the first time.

Flat Datacenter Storage (FDS) is a datacenter storage system designed from first principles under the formerly unrealistic assumption that datacenter bandwidth is abundant.

Unconventionally for a system of this scale, FDS returns to the flat storage model: all compute nodes can access
all storage with equal throughput. Even when computation is co-located with storage, all storage is treated
as remote; in FDS, there are no “local” disks.

  • Relying on the new network structure, FDS eliminates locality and provides a simple, flexible storage system. All storage is remote; there are no local disks.

2 Design Overview

FDS’ main goal is to expose all of a cluster’s disk bandwidth to applications. Blobs are divided into tracts
(§2.1), parallelizing both storage of data (§2.2) and handling of metadata (§2.3) across all disks.

FDS provides the best of both worlds: a scale-out system with aggregate
I/O throughput equal to systems that exploit local disks combined with the conceptual simplicity and flexibility of a logically centralized storage array.

  • FDS's main goal is to expose the bandwidth of every disk in the cluster. Blobs are divided into tracts, and both data and metadata handling are parallelized across all disks.

2.1 Blobs and Tracts

In FDS, data is logically stored in blobs. A blob is a byte sequence named with a 128-bit GUID.
Blobs can be any length up to the system’s storage capacity. Reads from and writes to a blob are done in units called tracts.
Tracts are sized such that random and sequential access achieves nearly the same throughput. In our cluster, tracts are 8MB.

  • Each blob is named by a unique 128-bit GUID, and is divided into fixed-length tracts.
  • A blob can be any size (up to the system's storage capacity); reads and writes are done in units of tracts.
  • Tracts are fixed-length, sized so that random and sequential access to a blob achieve about the same throughput.

Every disk is managed by a process called a tractserver that services read and write requests that arrive
over the network from clients.
Tractservers do not use a file system. Instead, they lay out tracts directly to disk by
using the raw disk interface. Since there are only about 10^5 tracts per disk (for a 1TB disk), all tracts' metadata
is cached in memory, eliminating many disk accesses.

  • As described above, each disk is managed by a process called a tractserver; tracts are stored directly on the disk.

In addition to the listed parameters, each function takes a callback function and an associated context pointer. All
calls in FDS are non-blocking; the library invokes the application’s callback when the operation completes.

The application’s callback functions must be reentrant; they are called from the library’s threadpool and
may overlap. Tract reads are not guaranteed to arrive in order of issue. Writes are not guaranteed to be committed
in order of issue. Applications with ordering requirements are responsible for issuing operations after previous
acknowledgments have been received, rather than concurrently. FDS guarantees atomicity: a write is either
committed or failed completely.

The non-blocking API helps applications achieve good performance. The FDS API GetSimultaneousLimit() tells the application how many reads and writes to issue concurrently.

  • Every API call is non-blocking.
  • Every API call takes a callback.
  • Every write is atomic.
  • Reads are not guaranteed to complete in issue order.
  • Writes are not guaranteed to commit in issue order.


  • Getting access to a blob

    • CreateBlob(UINT128 blobGuid)
    • OpenBlob(UINT128 blobGuid)
    • CloseBlob(UINT128 blobGuid)
    • DeleteBlob(UINT128 blobGuid)
  • Interacting with a blob

    • GetBlobSize()
    • ExtendBlobSize(UINT64 numberOfTracts)
    • WriteTract(UINT64 tractNumber, BYTE *buf)
    • ReadTract(UINT64 tractNumber, BYTE *buf)
    • GetSimultaneousLimit()
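The callback-based, non-blocking calling convention described above can be sketched with a toy in-memory stand-in for a tractserver. The real FDS library is a native API; all names here (FakeBlob, on_read) are invented for illustration:

```python
import threading

class FakeBlob:
    """Toy blob with the non-blocking calling convention: each operation runs
    on a library thread and invokes the caller's callback when it completes."""
    def __init__(self):
        self._tracts = {}

    def write_tract(self, tract_number, buf, callback):
        def work():
            self._tracts[tract_number] = bytes(buf)
            callback(tract_number)  # may run concurrently with other callbacks
        threading.Thread(target=work).start()

    def read_tract(self, tract_number, callback):
        def work():
            callback(tract_number, self._tracts[tract_number])
        threading.Thread(target=work).start()

blob = FakeBlob()
done = threading.Event()
results = {}

def on_read(n, data):  # must be reentrant: invoked from the library's threadpool
    results[n] = data
    done.set()

# chain the read off the write's completion callback, as an FDS app would
blob.write_tract(0, b"hello", callback=lambda n: blob.read_tract(n, on_read))
done.wait()
assert results[0] == b"hello"
```

Note how ordering is the application's job: the read is issued only from the write's completion callback, mirroring the paper's "issue after acknowledgment" rule.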

2.2 Deterministic data placement

A key issue in parallel storage systems is data placement and rendezvous, that is: how does a writer know
where to send data? How does a reader find data that has been previously written?

  • In a storage system, a key question is how data is placed.

Many systems solve this problem using a metadata server that stores the location of data blocks.

  • Many systems use a metadata server to store the locations of blocks.

This method has the advantage of allowing maximum flexibility of data placement and visibility into the system’s state.
However, it has drawbacks: the metadata server is a central point of failure, usually implemented as a replicated
state machine, that is on the critical path for all reads and writes.

  • Advantage: flexible data placement and visibility into the system's state.
  • Drawback: the metadata server is on the critical path for all reads and writes, so it easily becomes a bottleneck (and a central point of failure).

In FDS, we took a different approach. FDS uses a metadata server, but its role during normal operations
is simple and limited: collect a list of the system’s active tractservers and distribute it to clients. We call this
list the tract locator table, or TLT. In a single-replicated system, each TLT entry contains the address of a single
tractserver. With k-way replication, each entry has k tractservers.

  • FDS also uses a metadata server, but it reduces the total amount of metadata.
  • To find a block's location, a client first computes, then does one table lookup.

Tract Locator = (Hash(g) + i) mod TLT_Length, where g is the blob's GUID and i is the tract number.
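The locator computation can be illustrated in Python. The paper does not specify the hash function, so SHA-1 and the toy 8-row TLT below are assumptions:

```python
import hashlib

def tract_locator(blob_guid: bytes, tract_number: int, tlt_length: int) -> int:
    """(Hash(g) + i) mod TLT_Length: g is the blob GUID, i the tract number."""
    h = int.from_bytes(hashlib.sha1(blob_guid).digest(), "big")
    return (h + tract_number) % tlt_length

tlt = [f"tractserver-{k}" for k in range(8)]  # toy single-replicated TLT
guid = (12345).to_bytes(16, "big")            # a 128-bit GUID
row = tract_locator(guid, 3, len(tlt))
# a reader with the same TLT the writer had resolves to the same tractserver
assert tract_locator(guid, 3, len(tlt)) == row
```

Consecutive tracts of one blob land on consecutive TLT rows, which is what spreads a sequential read across many tractservers.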

Once clients find the proper tractserver address in the TLT, they send read and write requests containing the
blob GUID, tract number, and (in the case of writes) the data payload. Readers and writers rendezvous because
tractserver lookup is deterministic: as long as a reader has the same TLT the writer had when it wrote a tract, a
reader’s TLT lookup will point to the same tractserver.

In a single-replicated system, the TLT is constructed by concatenating m random permutations of the tract server list.

  • This prevents clients reading blobs from visiting the tract locators in the same order.
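A minimal sketch of that construction (the seeded RNG and m=3 are arbitrary choices, made here only for reproducibility):

```python
import random

def build_tlt(tractservers, m, seed=0):
    """Single-replicated TLT: concatenate m random permutations of the
    tractserver list, so every server appears exactly m times."""
    rng = random.Random(seed)
    tlt = []
    for _ in range(m):
        perm = tractservers[:]
        rng.shuffle(perm)
        tlt.extend(perm)
    return tlt

servers = [f"ts{k}" for k in range(5)]
tlt = build_tlt(servers, m=3)
assert len(tlt) == 15
# uniform appearance keeps load balanced across all tractservers
assert all(tlt.count(s) == 3 for s in servers)
```

Because each permutation orders the servers differently, clients walking consecutive locators do not converge into synchronized convoys.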

In the case of non-uniform disk speeds, the TLT is weighted so that different tractservers appear in proportion to the measured speed of the disk.

  • The TLT has another function: it can encode disk weights, such as disk speed.

To be clear, the TLT does not contain complete information about the location of individual tracts in the system.
It is not modified by tract reads and writes. The only way to determine if a tract exists is to contact the tractserver
that would be responsible for the tract if it does exist. Since the TLT changes only in response to cluster
reconfiguration or failures, it can be cached by clients for a long time. Its size in a single-replicated system is
proportional to the number of tractservers in the system (hundreds, or thousands), not the number of tracts stored
(millions or billions).

  • The TLT changes only on cluster reconfiguration or failures, so clients can cache it for a long time.
  • The TLT's size depends on the number of disks and the replication level, not on the number of tracts.

When the system is initialized, tractservers locally store their position in the TLT. This means the metadata
server does not need to store durable state, simplifying its implementation. In case of a metadata server failure, the
TLT is reconstructed by collecting the table assignments from each tractserver.

  • Each tractserver also stores its own TLT assignments locally, so the metadata server need not keep the TLT durable, which simplifies its implementation.
  • If the metadata server fails, the TLT can be reconstructed by querying every tractserver.

To summarize, our metadata scheme has a number of nice properties:

  • The metadata server is in the critical path only when a client process starts.

  • The TLT can be cached long-term since it changes only on cluster configuration, not each read and
    write, eliminating all traffic to the metadata server in a running system under normal conditions.

  • The metadata server stores metadata only about the hardware configuration, not about blobs. Since
    traffic to it is low, its implementation is simple and lightweight.

  • Since the TLT contains random permutations of the list of tractservers, sequential reads and writes by
    independent clients are highly likely to utilize all tractservers uniformly and are unlikely to organize
    into synchronized convoys.

  • The metadata server is not on the critical path of reads and writes.
  • The TLT can be cached on clients for a long time.
  • The metadata held by the metadata server covers only the TLT and the hardware configuration, not blobs, so its traffic is low and its implementation is simple and lightweight.
  • The tractserver lists stored in the TLT must be sufficiently random.

2.3 Per-Blob Metadata

Each blob has metadata such as its length. FDS stores it in each blob’s special metadata tract (“tract −1”).
Clients find a blob’s metadata on a tractserver using the same TLT used to find regular data.

When a blob is created, the tractserver responsible for its metadata tract creates that tract on disk and initializes
the blob’s size to 0. When a blob is deleted, that tractserver deletes the metadata. A scrubber application scans
each tractserver for orphaned tracts with no associated metadata, making them eligible for garbage collection.

Newly created blobs have a length of 0 tracts. Applications must extend a blob before writing past the end
of it. The extend operation is atomic, is safe to execute concurrently with other clients, and returns the new size
of the blob as a result of the client's call. A separate API tells the client the blob's current size. Extend operations
for a blob are sent to the tractserver that owns that blob’s metadata tract.

  • A blob's metadata is itself stored as a tract on a tractserver ("tract −1"), so handling blob metadata is no different from handling any other tract.

2.4 Dynamic Work Allocation

3 Replication and Failure Recovery

In an n-disk cluster where one disk fails, roughly 1/nth of the replicated data will be found on
all n of the other disks. All remaining disks send the under-replicated data to each other in parallel, restoring
the cluster to full replication very quickly.

  • A good data placement policy ensures that replicas of 1/n of the failed disk's data can be found on all n remaining disks; all disks then transfer replicas in parallel, making recovery very fast.

Such fast failure recovery significantly improves durability because it reduces the window
of vulnerability during which additional failures can cause unrecoverable data loss.

  • Fast failure recovery significantly improves durability because it shrinks the window of vulnerability.

3.1 Replication

When an application writes a tract, the client library finds the appropriate row of the TLT and sends the write to every
tractserver it contains. Reads select a single tractserver at random. Applications are notified that their writes have
completed only after the client library receives write acknowledgments from all replicas.

  • Replication policy: writes go to all replicas simultaneously; reads go to a single replica.
  • The replication policy can be adapted to the consistency requirements.

Replication also requires changes to CreateBlob, ExtendBlobSize, and DeleteBlob. Each mutates the
state of the metadata tract and must guarantee that updates are serialized. Clients send these operations only
to the tractserver acting as the primary replica, marked as such in the TLT. When a tractserver receives one of
these operations, it executes a two-phase commit with the other replicas. The primary replica does not commit the
change until all other replicas have completed successfully. Should the prepare fail, the operation is aborted.

  • For CreateBlob, ExtendBlobSize, and DeleteBlob, the primary replica uses two-phase commit to guarantee consistency.
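The two-phase commit for metadata mutations might look like the following sketch; the Replica class and extend_blob helper are hypothetical stand-ins for tractserver logic:

```python
class Replica:
    """Toy metadata-tract replica: stages a change on prepare, applies on commit."""
    def __init__(self, fail=False):
        self.size, self.staged, self.fail = 0, None, fail
    def prepare(self, new_size):
        if self.fail:           # simulate a replica that cannot prepare
            return False
        self.staged = new_size
        return True
    def commit(self):
        self.size, self.staged = self.staged, None
    def abort(self):
        self.staged = None

def extend_blob(primary, others, new_size):
    """Primary-driven 2PC: commit only if every replica prepared successfully."""
    replicas = [primary] + others
    if all(r.prepare(new_size) for r in replicas):  # phase 1: prepare
        for r in replicas:                          # phase 2: commit
            r.commit()
        return True
    for r in replicas:                              # any prepare failure: abort
        r.abort()
    return False

p, b1, b2 = Replica(), Replica(), Replica()
assert extend_blob(p, [b1, b2], 10) and p.size == b1.size == b2.size == 10
assert not extend_blob(p, [b1, Replica(fail=True)], 20) and p.size == 10
```

This mirrors the rule in the text: the primary does not commit until every other replica has completed the prepare phase, and a failed prepare aborts the operation.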

FDS also supports per-blob variable replication, for example, to single-replicate intermediate computations
for write performance, triple-replicate archival data for durability, and over-replicate popular blobs to increase
read bandwidth. The maximum possible replication level is determined when the cluster is created and drives the
number of tractservers listed in each TLT entry. Each blob’s actual replication level is stored in the blob’smeta-
data tract and retrieved when a blob is opened. For an n-way replicated blob, the client uses only the first n tract-
servers in each TLT entry

  • FDS supports a variable replication level per blob; each blob's actual replication level is stored in the blob's metadata tract. For an n-way replicated blob, the client uses only the first n tractservers in each TLT entry.

3.2 Failure recovery

We begin with the simplest failure recovery case: the failure of a single tractserver.

  • The simplest failure-recovery case: a single tractserver fails.

Tractservers send heartbeat messages to the metadata server. When the metadata server detects a tractserver
timeout, it declares the tractserver dead. Then, it:

  • invalidates the current TLT by incrementing the version number of each row in which the failed tractserver appears;

  • picks random tractservers to fill in the empty spaces in the TLT where the dead tractserver appeared;

  • sends updated TLT assignments to every server affected by the changes; and

  • waits for each tractserver to ack the new TLT assignments, and then begins to give out the new TLT to clients when queried for it.

  • Generate a new TLT.

When a tractserver receives an assignment of a new entry in the TLT, it contacts the other replicas and begins
copying previously written tracts. When a failure occurs, clients must wait only for the TLT to be updated; operations can continue while re-replication is still in progress.

  • Begin copying tracts.

All operations are tagged with the client’s TLT version. If a client attempts an operation using a stale TLT
entry, the tractserver detects the inconsistency and rejects the operation. This prompts the client to retrieve an updated TLT from the metadata server.

  • Every TLT entry has a version number and every operation carries one; when a tractserver detects a version mismatch, it rejects the operation.

Table versioning prevents a tractserver that failed and then returned, e.g., due to a transient network outage,
from continuing to interact with applications as soon as the client receives a new TLT. Further, any attempt by a
failed tractserver to contact the metadata server will result in the metadata server ordering its termination.

After a tractserver failure, the TLT immediately converges to provide applications the current location to read or write data.

  • When a tractserver fails, the TLT converges immediately.

3.2.1 Additional failure scenarios

We now extend our description to concurrent and cascading tractserver failures as well as metadata server failures.

  • Now we widen the failure scenarios.

When multiple tractservers fail, the metadata server’s only new task is to fill more slots in the TLT. Similarly, if
failure recovery is underway and additional tractservers fail, the metadata server executes another round of the
protocol by filling in empty table entries and incrementing their version. Data loss occurs when all the tractservers within a row fail within the recovery window.

  • Multiple tractservers fail at once.
  • Additional tractservers fail while failure recovery is underway.

A simple TLT might have n rows with each row listing disks i and i+1. While
data will be double-replicated, the cluster will not meet our goal of fast failure recovery: when a disk fails, its
backup data is stored on only two other disks (i+1 and i−1). Recovery time will be limited by the bandwidth of
just two disks. A cluster with 1TB disks with 100MB/s read performance would require 90 minutes for recovery.
A second failure within that time would have roughly a 2/n chance of losing data permanently.

A better TLT has O(n^2) entries. Each possible pair of disks (ignoring failure domains; §3.3.1) appears in an
entry of the TLT. Since the generation of tract locators is pseudo-random (§2.2), any data written to a disk will
have high diversity in the location of its replicas. When a disk fails, replicas of 1/nth of its data reside on the
other n disks in the cluster, and all n disks can exchange data in parallel over FDS' full bisection
bandwidth network. Since all disks recover in parallel, larger clusters recover from disk loss more quickly.

While such a TLT recovers from single-disk failure quickly, a second failure while recovery is in progress is
guaranteed to lose data. Since all pairs of disks appear as TLT entries, any pair of failures will lose the tracts whose
TLT entry contained the pair of failed disks. Replicated FDS clusters therefore have a minimum replication level
of 3. Perhaps counterintuitively, no level of replication ever needs a TLT larger than O(n^2). For any replication
level k > 2, FDS starts with the "all-pairs" TLT, then expands each entry with k-2 additional random disks
(subject to failure domain constraints).
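The "all pairs plus k-2 random extra disks" construction can be sketched as follows (failure domains are ignored here for brevity, and all names are hypothetical):

```python
import itertools
import random

def build_replicated_tlt(disks, k, seed=0):
    """O(n^2) TLT: one row per pair of disks, padded to k replicas with
    k-2 random extra disks (failure domains ignored in this sketch)."""
    rng = random.Random(seed)
    tlt = []
    for a, b in itertools.combinations(disks, 2):
        extras = rng.sample([d for d in disks if d not in (a, b)], k - 2)
        tlt.append([a, b] + extras)
    return tlt

disks = [f"disk{k}" for k in range(6)]
tlt = build_replicated_tlt(disks, k=3)
assert len(tlt) == 15                          # C(6,2) rows
assert all(len(set(row)) == 3 for row in tlt)  # k distinct replicas per row
# when disk0 fails, its replication partners span every other disk,
# so recovery proceeds in parallel across the whole cluster
partners = {d for row in tlt if "disk0" in row for d in row} - {"disk0"}
assert partners == set(disks) - {"disk0"}
```

The final assertion is the fast-recovery property from the text: every surviving disk holds some of the failed disk's data, so all of them can send in parallel.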

  • The best TLT construction: when one disk fails, all disks recover in parallel.

Constructing the replicated TLT this way has several important properties. First, performance during recovery
still involves every disk in the cluster, since every pair of disks is still represented in the TLT.
Second, a triple disk failure within the recovery window now has only about a 2/n chance of causing permanent data loss.

To understand why, imagine two disks fail. Find the entries in the TLT that contain those two
disks; we expect to find 2 such entries. There is a 1/n chance that a third disk failure will match the random
third disk in that TLT entry. Finally, adding more replicas decreases the probability of data loss. Consider now a 4-way replicated
cluster. Each entry in the O(n^2)-length TLT has two random disks added instead of one. 3 or fewer simultaneous
failures are safe; 4 simultaneous failures have a 1/n^2 chance of losing data. Similarly, 5-way replication means that
4 or fewer failures are safe and 5 simultaneous failures have a 1/n^3 chance of loss.

  • A good TLT also improves durability.

One possible disadvantage to a TLT with O(n^2) entries is its size. In our 1,000-disk cluster, the in-memory TLT
is about 13MB. However, on larger clusters, quadratic growth is cumbersome: 10,000 disks would require a
600MB TLT. We have two (unimplemented) strategies to mitigate TLT size. First, a tractserver can manage multiple disks;
this reduces n by a factor of 5-10. Second, we can limit the number of disks that participate in failure
recovery. An O(n^2) TLT uses every disk for recovery, but 3,000 disks are expected to recover 1TB in less than
20s (§5.3). The marginal utility of involving more disks may be small. To build an n-disk cluster where m disks
are involved in recovery, the TLT only needs O(nm) entries. For 10,000 to 100,000 disks, this also reduces
table size by a factor of 5-10. Using both optimizations, a 100,000-disk cluster's TLT would be a few dozen MB.

  • TLT size: the TLT can be cached in memory; its size grows with the number of disks, but it can be optimized.

3.3.1 Failure domains

A failure domain is a set of machines that have a high probability of experiencing a correlated failure. Common
failure domains include machines within a rack, since they will often share a single power source, or machines
within a container, as they may share common cooling or power infrastructure.
FDS leaves it up to the administrator to define a failure domain policy for a cluster. Once that policy is defined,
FDS adheres to it when constructing the tract locator table. FDS guarantees that none of the disks
in a single row of the TLT share the same failure domain. This policy is also followed during failure recovery:
when a disk is replaced, the new disk must be in a different failure domain than the other tractservers in that
particular row.

  • Failure domains: disks in the same chassis or the same rack share a failure domain; no TLT entry may contain disks from the same failure domain.

3.4 Cluster growth

FDS supports the dynamic growth of a cluster through the addition of new disks and machines. For simplicity,
we first consider cluster growth in the absence of failures. Cluster growth adds both storage capacity and
throughput to the system. The FDS metadata server rebalances the assignment of table entries so that both
existing data and new workloads are uniformly distributed. When a tractserver is added to the cluster, TLT entries are
taken away from existing tractservers and given to the new server. These assignments happen in two phases.
First, the new tractserver is given the assignments, but they are marked as "pending" and the TLT version for
each entry is incremented. The new tractserver then begins copying data from other replicas. During this phase,
clients write data to both the existing servers and the new server so that the new tractserver is kept up to date. Once
the tractserver has completed copying old data, the metadata server "commits" the TLT entry by incrementing its
version and changing its assignment to the new tractserver.

It also notifies the now replaced tractserver that it can safely garbage collect all tracts associated with that TLT entry. If a new tractserver fails while its TLT entries are
pending, the metadata server increments the TLT entry version and expunges it from the list of new tractservers.

If an existing server fails, the failure recovery protocol executes. However, tractservers with pending TLT
entries are not eligible to take over for failed servers, as they are already busy copying data.

  • How cluster growth is handled.

3.5 Consistency guarantees

The current protocol for replication depends upon the client to issue all writes to all replicas. This decision
means that FDS provides weak consistency guarantees to clients. For example, if a client writes a tract to 1
of 3 replicas and then crashes, other clients reading different replicas of that tract will observe differing state.

Weak consistency guarantees are not uncommon; for example, clients of the Google File System must handle
garbage entries in files. However, if strong consistency guarantees are desired, FDS could be modified to use
chain replication to provide strong consistency guarantees for all updates to individual tracts.

Tractservers may also be inconsistent during failure recovery. A tractserver recently assigned to a TLT entry
will not have the same state as the entry's other replicas until data copying is complete. While in this state,
tractservers reject read requests; clients use other replicas instead.

  • Currently clients issue writes to all replicas themselves, which means FDS provides weak consistency.
  • Chain replication could be used to provide strong consistency.

4 Networking

5 Microbenchmarks

Disk count            100        100      1,000     1,000     1,000
Disks failed            1          1          1         1         7
Total (TB)            4.7        9.2         47        92        92
GB/disk                47         92         47        92        92
GB recovered           47         92         47        92       655
Recovery time (s)  19±0.7  50.5±16.0    3.3±0.6   6.2±0.4  33.7±1.5



  1. Analyze the problem with today's datacenter networks: there is no flat network, and bandwidth/speed between nodes is non-uniform.
  2. Go from exploiting locality to abandoning it: nothing is local, everything is remote.
  3. Design a new storage system based on this new model.



data view
data operator
data flow
data store
data unit
  1. Design the storage system around the network model
  2. Design overview
    1. Blobs and tracts: data view, data operator
    2. Deterministic data placement: data flow
      • A mapping is needed from the data view to the data store
      • There are two ways to obtain this mapping: computation and table lookup
      • Computation (e.g. consistent hashing) keeps the operation path off the metadata server, but pure computation is not flexible enough
      • Table lookup is more flexible, but the metadata server can become a bottleneck; the remedy is to reduce the amount of metadata, and since the metadata rarely changes it can be cached on clients
      • The best mapping approach combines both: computation + table lookup
      • FDS uses the TLT; its construction is critical. The TLT shapes the data flow: a good data flow lets all disks work in parallel (performance) and spreads data across all disks (fast recovery)
    3. Per-blob metadata: how the data view's metadata is handled
  3. Replication and failure recovery
    1. Replication
      • The basic replication policy
      • Replicated operations
      • A per-blob variable replication level (stored in the blob's metadata)
    2. Failure recovery
      • First define the simple failure scenario and its recovery policy
      • Then define the more complex failure scenarios
      • The TLT's construction affects recovery speed, and recovery speed affects durability
    3. Failure domains
      • Keep disks from the same chassis or rack out of the same TLT entry
    4. Cluster growth
      • How to grow the cluster
    5. Consistency guarantees
      • The consistency policy should be adjustable to meet users' needs


  1. The TLT's design and construction are worth borrowing.
  2. Any functional policy should be replaceable.

Lab1: Lock Server



Your task is to build a lock service that keeps working when one server fails. The lock service maintains the state of a set of locks.
Your lock service should run (replicated) on two servers, one as the primary and the other as the backup.


The only failure this system needs to tolerate is "a single fail-stop server failure".



  1. Primary and backup both healthy: the client sends its request to the primary, and the primary forwards the request to the backup.
  2. Primary down, backup healthy: the client's request to the primary fails, so the client resends the request to the backup.
  3. The primary crashes after forwarding a request to the backup: the client resends the request to the backup.


  1. Because the backup must keep the same state as the primary, a replication (synchronization) mechanism is needed.
  2. How should a request resent by a client be handled?
  3. What determines the state of the locks? Is it the order of the requests issued by the various clients?


  1. A container is needed that maintains the state of the locks.
  2. When the primary receives a request, it forwards it to the backup.
  3. A container is needed holding each client's last request and reply. When the primary or backup receives a duplicate request, it replies with the earlier reply.
  4. The state of the locks is determined by the order in which requests are received, that is, by the requests seen from outside; treat the lock service as a black box.
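The points above can be sketched in Go: a map for the lock state plus a per-client table of the last request and reply, so a resent request is answered with the saved reply instead of being re-executed. This is a single-server sketch with hypothetical names, not the lab's solution; forwarding to the backup is omitted.

```go
package main

import "fmt"

type Reply struct {
	OK bool
}

// LockServer keeps lock state plus each client's last request and
// reply, so a duplicate request replays the original reply.
type LockServer struct {
	locks   map[string]bool  // lock name -> held?
	lastSeq map[string]int   // client ID -> last sequence number seen
	lastRep map[string]Reply // client ID -> reply to that request
}

func NewLockServer() *LockServer {
	return &LockServer{
		locks:   make(map[string]bool),
		lastSeq: make(map[string]int),
		lastRep: make(map[string]Reply),
	}
}

// Lock handles a request identified by (client, seq).
func (ls *LockServer) Lock(client string, seq int, name string) Reply {
	if seq == ls.lastSeq[client] {
		return ls.lastRep[client] // duplicate: replay the old reply
	}
	rep := Reply{OK: !ls.locks[name]}
	if rep.OK {
		ls.locks[name] = true
	}
	ls.lastSeq[client] = seq
	ls.lastRep[client] = rep
	return rep
}

func main() {
	ls := NewLockServer()
	fmt.Println(ls.Lock("c1", 1, "a")) // acquired
	fmt.Println(ls.Lock("c1", 1, "a")) // duplicate: same reply as before
	fmt.Println(ls.Lock("c2", 1, "a")) // fails: lock already held
}
```

Note that without the duplicate table, c1's resent request would be treated as a second Lock and wrongly fail, which is exactly the failure case 3 above.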



  1. Define the functionality, design the architecture, define the state (which keys and values), and define the actions that can change the state.
  2. Define the failures to be tolerated.
  3. Enumerate the failure cases.
  4. Work out a solution for each failure case.

Lecture 4: FDS Case Study

Lecture 4: FDS Case Study

Flat Datacenter Storage
Nightingale, Elson, Fan, Hofmann, Howell, Suzue
OSDI 2012

what is FDS?

a cluster storage system
stores giant blobs -- 128-bit ID, multi-megabyte content
for big-data processing (like MapReduce)
cluster of 1000s of computers processing data in parallel

High-level design — a common pattern

lots of clients
lots of storage servers ("tractservers")
partition the data
master ("metadata server") controls partitioning
replica groups for reliability

why is this high-level design useful?

1000s of disks of space
store giant blobs, or many big blobs
1000s of servers/disks/arms of parallel throughput
can expand over time -- reconfiguration
large pool of storage servers for instance replacement after failure

what should we want to know from the paper?

  • API?
  • layout?
  • finding data?
  • add a server?
  • replication?
  • failure handling?
  • failure model?
  • consistent reads/writes? (i.e. does a read see latest write?)
  • config mgr failure handling?
  • good performance?
  • useful for apps?



  • does the abstract’s 2 GByte/sec per client seem impressive?

    how fast can you read a file from Athena AFS? (abt 10 MB/sec)
    how fast can you read a typical hard drive?
    how fast can typical networks move data?

  • A disk moves roughly 100 MB/s; a 1 Gbps NIC delivers about 100 MB/s and a 10 Gbps NIC about 1000 MB/s.
  • The number is possible because FDS parallelizes the disks to the maximum.
  • abstract claims recover from lost disk (92 GB) in 6.2 seconds

    that’s 15 GByte / sec
    how is that even possible? that’s 30x the speed of a disk!
    who might care about this metric?

  • Every tract on the lost disk has its replicas spread over different disks, so recovery runs on many disks in parallel: 150 disks reading and 150 disks writing reach 15 GB/s.
  • The faster the recovery, the more durable the data.
  • why are 128-bit blob IDs a nice interface? why not file names?
  • A 128-bit blob ID hashes to a well-spread value, and the uniform interface lets other systems be layered on top.
  • why do 8 MB tracts make sense?
  • With the paper's 8 MB tracts, random and sequential operations on a blob have almost identical response times.
  • what kinds of client applications is the API aimed at? and not aimed at?
  • Distributed, big-data applications; not applications that expect a POSIX interface.


  • why have tracts at all? why not store each blob on just one server? what kinds of apps will benefit from striping? what kinds of apps won’t?
  • Striping.
  • Putting each blob on a single server would create a performance bottleneck and give up parallelism.
  • Striping benefits applications that read large amounts of data.
  • how fast will a client be able to read a single tract?
  • Each tract is stored on one disk, so reading a single tract is bounded by the disk's speed, about 100 MB/s at best.
  • where does the abstract’s single-client 2 GB number come from?
  • From reading many tracts at once, so that many disks operate in parallel.
  • why not the UNIX i-node approach?

    store an array per blob, indexed by tract #, yielding tractserver
    so you could make per-tract placement decisions
    e.g. write new tract to most lightly loaded server

  • It would produce too much metadata.
  • why not hash(b + t)?
  • This effectively selected a TLT entry independently at random for each tract, producing a binomial rather than a uniform distribution of tracts on tractservers.
  • how many TLT entries should there be?
    how about n = number of tractservers?
    why do they claim this works badly? Section 2.2
  • See Section 3.3.
  • Consider not only whether tracts are spread evenly over all TLT entries, but also whether tractservers are spread evenly over all TLT entries.
  • The best TLT has O(n^2) entries, so that every pair of disks appears in some entry.
  • Each tractserver should appear in many TLT entries; that is what makes recovery fast.
  • why is the paper’s n^2 scheme better?

    TLT with n^2 entries, with every server pair occurring once
    How long will repair take?
    What are the risks if two servers fail?

  • There are n disks.
  • In the n^2 scheme the TLT has n^2 entries, and each disk appears in 2n of them.
  • If V is the total size of all tracts, each disk holds 2V/n of data.
  • When one disk fails, the system must re-replicate 2V/n of data to return to its original state.
  • Each of the remaining disks holds about 1/n of that 2V/n of data.
  • The 2n orphaned TLT entries are regenerated, so each remaining disk ends up in about 2n+2 entries and transfers the data of about 4 entries (2 written in, 2 read out):
    roughly 4V/n^2 of data, taking roughly 4V/(m*n^2) time (m is the per-disk transfer rate).
  • why do they actually use a minimum replication level of 3?

    same n^2 table as before, third server is randomly chosen
    What effect on repair time?
    What effect on two servers failing?
    What if three disks fail?

  • Suppose the TLT uses the n^2 scheme with a replication level of 2: if any two disks fail, some TLT entry is guaranteed to lose its data (think about why).
  • Raising the replication level to 3 does not change the recovery time.
  • With 3 replicas, two disk failures lose no TLT entry's data.
  • If, after two disks fail, a third disk fails within the recovery window, data is lost with probability about 2/n (think about why it is 2/n).

Adding a tractserver

To increase the amount of disk space / parallel throughput
Metadata server picks some random TLT entries
Substitutes new server for an existing server in those TLT entries
  • how long will adding a tractserver take?
  • Assuming the n^2 scheme, each disk holds 2V/n of data, so adding a tractserver takes about 2V/(n*m) (where V is the total size of all tracts, n the number of disks, and m the per-disk transfer rate).
  • what about client writes while tracts are being transferred?

    receiving tractserver may have copies from client(s) and from old srvr
    how does it know which is newest?

  • The client's data is the newest.
  • what if a client reads/writes but has an old tract table?
  • Each TLT entry has a version number; if a tractserver sees a request with the wrong version, it rejects the request.
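A minimal sketch of that version check, with hypothetical names: the tractserver compares the version carried in the request against its own view of the TLT entry and rejects stale clients, forcing them to refresh their cached TLT.

```go
package main

import (
	"errors"
	"fmt"
)

var errStaleVersion = errors.New("stale TLT version")

// A tractserver remembers the version of each TLT entry it serves. A
// request carrying a different version is rejected, so a client with a
// stale cached TLT must fetch a fresh table before retrying.
type Tractserver struct {
	entryVersion map[int]int // TLT entry index -> current version
}

func (ts *Tractserver) Write(entry, version int, data []byte) error {
	if version != ts.entryVersion[entry] {
		return errStaleVersion
	}
	// ... write data to the tract on disk ...
	return nil
}

func main() {
	ts := &Tractserver{entryVersion: map[int]int{7: 3}}
	fmt.Println(ts.Write(7, 3, []byte("x"))) // version matches: accepted
	fmt.Println(ts.Write(7, 2, []byte("x"))) // stale version: rejected
}
```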


A writing client sends a copy to each tractserver in the TLT.
A reading client asks one tractserver.
  • why don’t they send writes through a primary?
  • They sacrifice consistency for performance.
  • what problems are they likely to have because of lack of primary?
    why weren’t these problems show-stoppers?
  • Blob metadata operations do need a primary, because those operations require strong consistency.
  • why not just pick one replacement server?
  • If a single replacement disk were chosen, all of the data would be written to that one disk, and its transfer speed would be the bottleneck.
  • The ideal recovery has all disks transferring data in parallel.
  • how long will it take to copy all the tracts?
  • About 4V/(m*n^2); see the derivation above.
  • if a tractserver’s net breaks and is then repaired, might srvr serve old data?
  • I don't understand this question…
  • if a server crashes and reboots with disk intact, can contents be used?

    e.g. if it only missed a few writes?
    3.2.1’s “partial failure recovery”
    but won’t it have already been replaced?
    how to know what writes it missed?

  • Tract version numbers are recorded in memory.
  • when is it better to use 3.2.1’s partial failure recovery?
  • Transient failures, where the tractserver later returns to service.
  • What happens when the metadata server crashes?
  • Wait for the metadata server to restart and reconstruct the TLT.
  • while metadata server is down, can the system proceed?
  • Yes: clients have a cached copy of the TLT.
  • is there a backup metadata server?
  • Network partitions complicate recovery from metadata server failures. A simple primary/backup scheme is not safe because two active metadata servers
    will cause cluster corruption. Our current solution is simple: only one metadata server is allowed to execute at a time. If it fails, an operator must
    ensure the old one is decommissioned and start a new one. We are experimenting with using Paxos leader election to safely make this process
    automatic and reduce the impact to availability. Once a new metadata server starts, tractservers contact it with
    their TLT assignments. After a timeout period, the new metadata server reconstructs the old TLT.
  • how does rebooted metadata server get a copy of the TLT?
  • It walks all the tractservers and collects their TLT entry contents.
  • does their scheme seem correct?

Random issues

  • is the metadata server likely to be a bottleneck?
  • It is not on the critical path of operations, so it is not a bottleneck.
  • why do they need the scrubber application mentioned in 2.3? why don’t they delete the tracts when the blob is deleted?can a blob be written after it is deleted?
  • Because the per-blob metadata records whether each tract of the blob exists; deleting a blob would require checking every one of its tracts, which would take a long time.


  • how do we know we’re seeing “good” performance? what’s the best you can expect?

  • limiting resource for 2 GB / second single-client?

  • Each computer in our cluster has a dual-port 10Gbps NIC.
  • Figure 4a: why starts low? why goes up? why levels off? why does it level off at that particular performance?
  • Total throughput depends on the number of disks.
  • Figure 4b shows random r/w as fast as sequential (Figure 4a). is this what you’d expect?
  • Each tract is 8 MB, so random and sequential operations no longer differ (single replica).
  • why are writes slower than reads with replication in Figure 4c?
  • With three replicas, each write must write three copies of the data.
  • where does the 92 GB in 6.2 seconds come from?

    Table 1, 4th column
    that’s 15 GB / second, both read and written
    1000 disks, triple replicated, 128 servers?
    what’s the limiting resource? disk? cpu? net?

  • The theoretical time is 4V/(m*n^2) = 364TB/(100MB * 1000^2) ≈ 3.64 seconds.
  • Each disk reads and writes about 364 MB; assuming 10 disks per server, each server's total in-plus-out traffic is about 3640 MB.
  • The bottleneck is the network.

Lecture 3: Primary/Backup Replication

Lecture 3: Primary/Backup Replication

Today’s goals

  • availability
  • correctness
  • handle network failures and repair

Tool: replication of service’s state

How to ensure two replicas (servers) remain identical?

  1. copy the whole state, and/or
  2. apply same operations in same order to both replicas

what wider classes of failure would we like to handle?

  • temporary or permanent loss of connectivity
  • network partitions
  • can’t know if a server is crashed or just not reachable

Lab2 goals

  • tolerate network problems, including partition
    • either keep going, correctly
    • or suspend operations until network is repaired
  • replacement of failed servers

Lab2 overview

  • agreement

    • “view server” decides who p and b are
    • clients and servers ask view server
    • they don’t make independent decisions
    • only one vs, avoids multiple machines independently deciding who is p
  • repair

    • view server can co-opt “idle” server as b after old b becomes p
    • primary initializes new backup’s state
  • the tricky part

    • only one primary!
    • primary must have state!

View server

  • maintains a sequence of “views”
  • monitors server liveness

how to ensure new primary has up-to-date replica of state?

  • only promote previous backup

how to avoid promoting a state-less backup?

lab 2 answer:
* primary in each view must acknowledge that view to viewserver
* viewserver must stay with current view until acknowledged
* even if the primary seems to have failed
* no point in proceeding since not acked == backup may not be initialized

how to ensure only one server acts as primary?

the rules:
1. non-backup must reject forwarded requests
2. primary in view i must have been primary or backup in view i-1
3. non-primary must reject direct client requests
4. primary must wait for backup to accept each request
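The four rules above can be sketched as checks against the current view; the names here are hypothetical, not the lab's actual interfaces.

```go
package main

import "fmt"

// A view numbers an epoch and names its primary and backup.
type View struct {
	Num             int
	Primary, Backup string
}

// okNewPrimary enforces rule 2: the primary of view i must have been
// primary or backup in view i-1, so it is guaranteed to hold state.
func okNewPrimary(prev View, candidate string) bool {
	return candidate == prev.Primary || candidate == prev.Backup
}

// handleDirect enforces rule 3: only the current primary serves clients.
func handleDirect(cur View, me string) error {
	if me != cur.Primary {
		return fmt.Errorf("%s: not primary in view %d", me, cur.Num)
	}
	return nil
}

// handleForwarded enforces rule 1: only the current backup accepts
// requests forwarded by the primary.
func handleForwarded(cur View, me string) error {
	if me != cur.Backup {
		return fmt.Errorf("%s: not backup in view %d", me, cur.Num)
	}
	return nil
}

func main() {
	v1 := View{Num: 1, Primary: "S1", Backup: "S2"}
	fmt.Println(okNewPrimary(v1, "S2")) // true: the backup may be promoted
	fmt.Println(okNewPrimary(v1, "S3")) // false: an idle server has no state
	fmt.Println(handleDirect(v1, "S2")) // rejected: S2 is not primary
}
```

Rule 4 (primary waits for the backup to accept each request) is a protocol-ordering rule rather than a predicate, so it is not shown here.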

how can new backup get state?

if S2 is backup in view i, but was not in view i-1, S2 should ask primary to transfer the complete state

rule for state transfer:

every operation (Lock,Unlock) must be either before or after state xfer
if before, xferred state must reflect operation
if after, primary must forward operation after xfer finishes
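The state-transfer rule above can be sketched in Go (hypothetical names, not the lab's code): holding one mutex both while applying operations and while snapshotting guarantees every operation is ordered entirely before or entirely after the transfer.

```go
package main

import (
	"fmt"
	"sync"
)

// Primary holds the replicated lock state. The backup is stood in for
// by a log of forwarded operations.
type Primary struct {
	mu      sync.Mutex
	locks   map[string]bool // replicated lock state
	backup  []string        // ops forwarded after the transfer
	hasBkup bool
}

func (p *Primary) Lock(name string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	ok := !p.locks[name]
	if ok {
		p.locks[name] = true
	}
	if p.hasBkup {
		// After the transfer, operations are forwarded to the backup.
		p.backup = append(p.backup, "Lock "+name)
	}
	return ok
}

// Transfer snapshots the complete state for a new backup. Operations
// applied before this point are reflected in the snapshot; operations
// after it are forwarded instead.
func (p *Primary) Transfer() map[string]bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	snap := make(map[string]bool, len(p.locks))
	for k, v := range p.locks {
		snap[k] = v
	}
	p.hasBkup = true
	return snap
}

func main() {
	p := &Primary{locks: make(map[string]bool)}
	p.Lock("a") // before the transfer: lands in the snapshot
	snap := p.Transfer()
	p.Lock("b") // after the transfer: forwarded instead
	fmt.Println(snap["a"], snap["b"], p.backup)
}
```

Because the mutex serializes Lock against Transfer, no operation can be half-reflected in the snapshot and also forwarded, which is exactly the "either before or after" requirement.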

Lecture 2: Infrastructure: RPC and threads

Lecture 2: Infrastructure: RPC and threads

Remote Procedure Call (RPC)

  • a key piece of distrib sys machinery; you’ll see in lab 1
  • goal: easy-to-program network communication
    • hides most details of client/server communication
    • client call is much like ordinary procedure call
    • server handlers are much like ordinary procedures

RPC is widely used!
RPC ideally makes net communication look just like fn call
RPC aims for this level of transparency

RPC message diagram:

    Client  Server

Software structure

    client app         handlers
    stubs           dispatcher
    RPC lib           RPC lib
    net  ------------ net

RPC problem: what to do about failures?

  • e.g. lost packets, broken network, crashed servers

Simplest scheme: “at least once” behavior

Better RPC behavior: “at most once”

What about “exactly once”?

  • at-most-once plus unbounded retries

Go RPC is “at-most-once”

Lecture 1: Introduction and lab overview

Lecture 1: Introduction and lab overview

What is a distributed system?

multiple networked cooperating computers

Why distribute?

  • to connect physically separate entities
  • to achieve security via physical isolation
  • to tolerate faults via replication at separate sites
  • to increase performance via parallel CPUs/mem/disk/net


  • complex, hard to debug
  • advice: don’t distribute if a central system will work


Topic: architecture

  • Choice of interfaces
  • Single machine room or unified wide area system? (scope)
  • Client/server or peer-to-peer? (structure)
  • Interact w/ performance, security, fault behavior. (other concerns)

Topic: implementation

  • Most systems organize distribution with some structuring framework(s)
  • RPC, RMI, DSM, MapReduce, &c

Topic: performance

  • Distribution can hurt: network b/w and latency bottlenecks.
    Lots of tricks, e.g. caching, threaded servers
  • Distribution can help: parallelism, pick server near client
  • Need a way to divide the load by N

Topic: fault tolerance

Topic: consistency


focus: fault tolerance and consistency — central to distrib sys

what fault-tolerance properties might we want?

  • available
  • durable
  • consistent

what kinds of faults might we want to tolerate?

  • network:

    • lost packets
    • duplicated packets
    • temporary network failure
      • server disconnected
      • network partitioned
  • server

    • server crash+restart
    • server fails permanently
    • all servers fail simultaneously — power/earthquake
    • bad case: crash mid-way through complex operation
    • bugs — but not in this course
    • malice — but not in this course
  • client fails

tools for dealing with faults?

  • retry — e.g. if pkt is lost, or server crash+restart
  • replicate — e.g. if one server or part of net has failed
  • replace — for long-term health


“failure model”: single fail-stop failure

  • ONLY failure: one server halts
  • NO network failures
  • NO re-start of servers
  • thus: no response means that the server has halted
  • thus: at least one of primary/backup guaranteed to be alive
  • fail-stop reasonable for tightly coupled primary+backup
  • fail-stop not a good model for biggish internet-based systems
    • due to network failures

fault-tolerance scheme: replication via “primary/backup”

  • replicate the service state
    • for each lock, locked or not
  • one copy on primary server, one on backup server
  • clients send operations to primary
  • primary forwards to backup so backup can update its state
  • primary replies to client after backup replies