Yesterday at EMC World 2013, EMC unveiled ViPR, its software-defined storage platform; here are some quick notes on it.
The material below comes from The MIT Challenge.
Ceph is a unified distributed storage system with excellent performance, reliability, and scalability. At the bottom of Ceph sits RADOS (Reliable, Autonomic, Distributed Object Store); the RADOS object store can be accessed directly through LIBRADOS. RBD (the block device interface), RADOS Gateway (the object storage interface), and Ceph File System (the POSIX interface) are all built on top of RADOS.
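As a concrete illustration (not from the notes above), here is a minimal sketch of talking to RADOS directly through LIBRADOS using the python-rados binding; the config path and the pool name 'data' are assumptions about the local setup.

```python
import rados

# Connect to the cluster described by ceph.conf (path and pool name are assumed).
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('data')                     # I/O context for the 'data' pool
    try:
        ioctx.write_full('hello-object', b'hello rados')   # store one object
        print(ioctx.read('hello-object'))                  # read it back
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```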
It has a data plane and a control plane.
Protocols to implement a fault-tolerant computing system are described.
These protocols augment the hypervisor of a virtual machine manager and coordinate a primary virtual machine with its backup.
No modification to hardware, operating system, or application programs is required.
One popular scheme for implementing fault tolerance involves replicating a computation on processors that fail independently.
Replicas are coordinated so that they execute the same sequence of instructions and produce the same results.
Use of a hypervisor to implement replica coordination is attractive.
This paper addresses these issues by describing the protocols and performance of a prototype implementation of hypervisor-based fault tolerance.
In §2, we describe the protocols. These protocols ensure that the sequences of instructions executed by two virtual machines running on different physical processors are identical.
The protocols also coordinate I/O issued by these virtual machines.
In the primary/backup approach to fault-tolerance, n processors implement a system that can tolerate n-1 faults.
One processor is designated the primary and the others are designated backups. To obtain service, clients make requests of the primary.
The primary responds to each request and informs the backups of its actions so that a backup can take over if the primary fails.
Our implementation of fault-tolerant virtual machines uses the primary/backup approach in the hypervisor.
Our protocols use a single backup and implement a 1-fault-tolerant virtual machine. The protocols cause the backup virtual machine to execute exactly the same
sequence of instructions as the primary virtual machine, where each instruction executed by the backup has the same effect as when it is executed by the primary.
The protocols also ensure that the environment does not see an anomalous sequence of I/O requests if the primary fails and the backup takes over while an I/O
operation is in progress.
One obvious assumption, required so that the backup virtual machine can take over for the primary:
In our scheme, a given instruction must have the same effect whether it is executed by the primary virtual machine or the backup.
Define the virtual machine state to include the memory and registers that change only with execution of instructions by that virtual machine.
We partition the instruction set into ordinary instructions, whose behavior is completely determined by virtual-machine state, and environment instructions, whose
behavior is not. Examples of ordinary instructions include those for arithmetic and data movement; examples of environment instructions include those for reading the
time-of-day clock, loading the interval timer, and performing I/O.
In order that the primary and backup virtual machines execute exactly the same sequence of instructions, both virtual machines are started in the same state.
By definition, the effects of executing ordinary instructions depend only on the virtual-machine state.
Another assumption ensures that when executing an environment instruction, the hypervisors at the primary and backup virtual machines have an opportunity to
communicate. This allows both hypervisors to change the virtual-machine state in the same way.
To guarantee that the primary and backup virtual machines execute the same sequence of instructions, we must
ensure that identical interrupts are delivered to each and at the same points in their instruction streams. The presence of a
hypervisor helps here. The primary’s hypervisor can buffer and forward I/O interrupts it receives to the backup’s hypervisor.
And, the primary’s hypervisor can send to the backup’s hypervisor information about the value of the interval timer at the pro-
cessor executing the primary virtual machine. Thus, by communicating with the primary’s hypervisor, the backup’s hypervisor
learns what interrupts it must deliver to the backup virtual machine, but not at which points in the instruction stream to deliver them.
This is because instruction-execution timing on most modern processors is unpredictable. Yet, interrupts must be
delivered at the same points in the primary and backup virtual machine instruction streams. We must employ some other
mechanism for transferring control to the hypervisor when a virtual machine reaches a specified point in its instruction stream.
The recovery register on HP’s PA-RISC processors is a register that is decremented each time an instruction completes;
an interrupt is caused when the recovery register becomes negative.
A hypervisor that uses the recovery register can thus ensure that epochs at the primary and backup virtual machines
each begin and end at exactly the same point in the instruction stream. Interrupts are delivered only on epoch boundaries.
A recovery register or some similar mechanism is, therefore, assumed.
By virtue of the Instruction-Stream Interrupt Assumption, execution of a virtual machine is partitioned into epochs, and
corresponding epochs at the primary and the backup virtual machines comprise the same sequences of instructions.
We have only to ensure that the same interrupts are delivered at the backup as at the primary
when each epoch ends. The solution to this is for the primary and backup hypervisors to communicate, and at the end of an
epoch i to have the backup's hypervisor deliver copies of the interrupts that the primary's hypervisor delivered at the end of its epoch i.
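A rough sketch (an assumed structure, not the paper's implementation) of this epoch protocol: the primary's hypervisor buffers interrupts during an epoch, delivers and forwards them at the epoch boundary, and the backup's hypervisor delivers the same interrupts at its own boundary for that epoch. The channel, vm.deliver, and the class layout are illustrative only.

```python
from queue import Queue

class PrimaryHypervisor:
    def __init__(self, channel: Queue):
        self.channel = channel      # link to the backup's hypervisor
        self.pending = []           # interrupts buffered during the current epoch

    def on_interrupt(self, interrupt):
        self.pending.append(interrupt)                  # buffer; never deliver mid-epoch

    def end_epoch(self, epoch, vm):
        for intr in self.pending:
            vm.deliver(intr)                            # deliver only at the epoch boundary
        self.channel.put((epoch, list(self.pending)))   # forward the same set to the backup
        self.pending.clear()

class BackupHypervisor:
    def __init__(self, channel: Queue):
        self.channel = channel

    def end_epoch(self, epoch, vm):
        sent_epoch, interrupts = self.channel.get()     # wait for the primary's epoch report
        assert sent_epoch == epoch                      # same epoch, same boundary
        for intr in interrupts:
            vm.deliver(intr)                            # identical interrupts at the identical point
```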
We now summarize the protocol that ensures the primary and backup virtual machines each performs the same sequence
of instructions and receives the same interrupts.
The protocol is presented as a set of routines that are implemented in the hypervisor. These routines may be activated concurrently.
P0 through P7 ensure that the backup virtual machine executes the same sequence of instructions (each having
the same effect) as the primary virtual machine. P0 through P7 also ensure that if the primary virtual machine fails, then instructions
executed by the backup extend the sequence of instructions executed by the primary.
P0 through P7 do not guarantee that interrupts from I/O devices are not lost.
The state of the environment is affected by executing I/O instructions. We must ensure that the sequence of I/O instruc-
tions seen by the environment is consistent with what could be observed were a single processor in use, even though our l-fault-
tolerant virtual machine is built using two processors. The problem is best divided into two cases.
Our solution is to exploit the reality that I/O devices are themselves subject to transient failures, and device drivers already cope with these failures.
The effect of P8 is to cause certain I/O instructions to be repeated.
The prototype system was evaluated on CPU-intensive workloads and I/O workloads, with epoch length as the varied dimension.
Flat Datacenter Storage (FDS) is a high-performance, fault-tolerant, large-scale, locality-oblivious blob store.
Using a novel combination of full bisection bandwidth networks, data and metadata striping, and ﬂow control,
FDS multiplexes an application’s large-scale I/O across the available throughput and latency budget of every disk
in a cluster. FDS therefore makes many optimizations around data locality unnecessary. Disks also commu-
nicate with each other at their full bandwidth, making recovery from disk failures extremely fast.
Counterintuitively, locality constraints can sometimes even hinder efﬁcient resource utilization.
Recently developed Clos networks (large numbers of small commodity switches with
redundant interconnections) have made it economical to build non-oversubscribed full bisection bandwidth networks at the scale of a datacenter for the first time.
Flat Datacenter Storage (FDS) is a datacenter storage system designed from ﬁrst principles under the formerly unrealistic assumption that datacenter bandwidth is abundant.
Unconventionally for a system of this scale, FDS returns to the ﬂat storage model: all compute nodes can access
all storage with equal throughput. Even when computation is co-located with storage, all storage is treated
as remote; in FDS, there are no “local” disks.
FDS’ main goal is to expose all of a cluster’s disk bandwidth to applications. Blobs are divided into tracts
(§2.1), parallelizing both storage of data (§2.2) and handling of metadata (§2.3) across all disks.
FDS provides the best of both worlds: a scale-out system with aggregate
I/O throughput equal to systems that exploit local disks combined with the conceptual simplicity and ﬂexibility of a logically centralized storage array.
- FDS's main goal is to expose the bandwidth of every disk in the cluster to applications. Blobs are divided into tracts, and both the storage of data and the handling of metadata are parallelized across all disks.
In FDS, data is logically stored in blobs. A blob is a byte sequence named with a 128-bit GUID.
Blobs can be any length up to the system’s storage capacity. Reads from and writes to a blob are done in units called tracts.
Tracts are sized such that random and sequential access achieves nearly the same throughput. In our cluster, tracts are 8MB.
- Each blob has a unique 128-bit GUID. A blob is divided into fixed-size tracts.
Every disk is managed by a process called a tractserver that services read and write requests that arrive
over the network from clients. Tractservers do not use a ﬁle system. Instead, they lay out tracts directly to disk by
using the raw disk interface. Since there are only about 10^6 tracts per disk (for a 1TB disk), all tracts’ metadata
is cached in memory, eliminating many disk accesses.
In addition to the listed parameters, each function takes a callback function and an associated context pointer. All
calls in FDS are non-blocking; the library invokes the application’s callback when the operation completes.
The application’s callback functions must be reentrant; they are called from the library’s threadpool and
may overlap. Tract reads are not guaranteed to arrive in order of issue. Writes are not guaranteed to be commit-
ted in order of issue. Applications with ordering requirements are responsible for issuing operations after pre-
vious acknowledgments have been received, rather than concurrently. FDS guarantees atomicity: a write is either
committed or failed completely.
The non-blocking API helps applications achieve good performance. The FDS API GetSimultaneousLimit() tells the application how many reads and writes to issue concurrently.
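A hedged sketch of how an application might drive such a non-blocking, callback-based API. Only the idea of issuing no more than the advertised number of concurrent operations comes from the text; the client object and its method names (get_simultaneous_limit, read_tract) are hypothetical stand-ins for the FDS library calls.

```python
import threading

def read_tracts(client, blob_guid, num_tracts):
    """Issue concurrent non-blocking reads; completions arrive on the library's threadpool."""
    limit = client.get_simultaneous_limit()    # hypothetical wrapper around GetSimultaneousLimit()
    results = [None] * num_tracts
    lock = threading.Lock()                    # callbacks may overlap, so guard shared state
    done = threading.Semaphore(0)
    in_flight = threading.Semaphore(limit)

    def on_complete(tract_no, data):
        with lock:
            results[tract_no] = data
        in_flight.release()
        done.release()

    for i in range(num_tracts):
        in_flight.acquire()                    # never exceed the advertised concurrency
        client.read_tract(blob_guid, i,
                          callback=lambda data, i=i: on_complete(i, data))

    for _ in range(num_tracts):
        done.acquire()                         # wait for every completion
    return results
```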
The FDS API groups its calls into two categories: getting access to a blob, and interacting with a blob.
A key issue in parallel storage systems is data placement and rendezvous, that is: how does a writer know
where to send data? How does a reader ﬁnd data that has been previously written?
Many systems solve this problem using a metadata server that stores the location of data blocks.
This method has the advantage of allowing maximum ﬂexibility of data placement and visibility into the system’s state.
However, it has drawbacks: the metadata server is a central point of failure, usually implemented as a replicated
state machine, that is on the critical path for all reads and writes.
- The advantage is maximum flexibility of data placement and visibility into the system's state.
In FDS, we took a different approach. FDS uses a metadata server, but its role during normal operations
is simple and limited: collect a list of the system’s active tractservers and distribute it to clients. We call this
list the tract locator table, or TLT. In a single-replicated system, each TLT entry contains the address of a sin-
gle tractserver. With k-way replication, each entry has k tractservers;
Tract_Locator = (Hash(g) + i) mod TLT_Length
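A minimal sketch of that locator computation; SHA-256 is an arbitrary stand-in for whatever hash FDS actually uses, and the GUID is passed as raw bytes.

```python
import hashlib

def tract_locator(blob_guid: bytes, tract_number: int, tlt_length: int) -> int:
    """(Hash(g) + i) mod TLT_Length: the TLT row for tract i of blob g."""
    digest = hashlib.sha256(blob_guid).digest()
    hashed_guid = int.from_bytes(digest[:8], 'big')
    return (hashed_guid + tract_number) % tlt_length

# A reader and a writer holding the same TLT compute the same row,
# so they rendezvous at the same tractserver(s) with no metadata lookup.
row = tract_locator(b'\x12' * 16, 7, 4096)
```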
Once clients ﬁnd the proper tractserver address in the TLT, they send read and write requests containing the
blob GUID, tract number, and (in the case of writes) the data payload. Readers and writers rendezvous because
tractserver lookup is deterministic: as long as a reader has the same TLT the writer had when it wrote a tract, a
reader’s TLT lookup will point to the same tractserver.
In a single-replicated system, the TLT is constructed by concatenating m random permutations of the tract server list.
- This prevents clients reading blobs from visiting the tract locators in the same order (avoiding synchronized convoys).
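A sketch of that construction under the stated assumption of single replication (each TLT row names exactly one tractserver):

```python
import random

def build_single_replicated_tlt(tractservers, m):
    """Concatenate m independent random permutations of the tractserver list."""
    tlt = []
    for _ in range(m):
        perm = list(tractservers)
        random.shuffle(perm)       # each permutation spreads sequential access differently
        tlt.extend(perm)
    return tlt
```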
In the case of non-uniform disk speeds, the TLT is weighted so that different tractservers appear in proportion to the measured speed of the disk.
To be clear, the TLT does not contain complete information about the location of individual tracts in the sys-
tem. It is not modiﬁed by tract reads and writes. The only way to determine if a tract exists is to contact the tract-
server that would be responsible for the tract if it does exist. Since the TLT changes only in response to clus-
ter reconﬁguration or failures it can be cached by clients for a long time. Its size in a single-replicated system is
proportional to the number of tractservers in the system (hundreds, or thousands), not the number of tracts stored
(millions or billions).
When the system is initialized, tractservers locally store their position in the TLT. This means the metadata
server does not need to store durable state, simplifying its implementation. In case of a metadata server failure, the
TLT is reconstructed by collecting the table assignments from each tractserver.
To summarize, our metadata scheme has a number of nice properties:
- The metadata server is in the critical path only when a client process starts.
- The TLT can be cached long-term since it changes only on cluster configuration, not on each read and write, eliminating all traffic to the metadata server in a running system under normal conditions.
- The metadata server stores metadata only about the hardware configuration, not about blobs. Since traffic to it is low, its implementation is simple and lightweight.
- Since the TLT contains random permutations of the list of tractservers, sequential reads and writes by independent clients are highly likely to utilize all tractservers uniformly and are unlikely to organize into synchronized convoys.
Each blob has metadata such as its length. FDS stores it in each blob’s special metadata tract (“tract −1”).
Clients ﬁnd a blob’s metadata on a tractserver using the same TLT used to ﬁnd regular data.
When a blob is created, the tractserver responsible for its metadata tract creates that tract on disk and initializes
the blob’s size to 0. When a blob is deleted, that tractserver deletes the metadata. A scrubber application scans
each tractserver for orphaned tracts with no associated metadata, making them eligible for garbage collection.
Newly created blobs have a length of 0 tracts. Applications must extend a blob before writing past the end
of it. The extend operation is atomic, is safe to execute concurrently with other clients, and returns the new size
of the blob as a result of the client's call. A separate API tells the client the blob's current size. Extend operations
for a blob are sent to the tractserver that owns that blob's metadata tract.
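A hedged sketch of the append pattern this enables: atomically extend the blob to reserve space, then write the reserved tracts. The client methods (extend_blob, write_tract) are hypothetical stand-ins for the FDS library calls.

```python
def append_tracts(client, blob_guid, tracts):
    """Reserve space with one atomic extend, then fill the reserved range.
    Concurrent appenders get disjoint ranges because extend is atomic."""
    new_size = client.extend_blob(blob_guid, len(tracts))  # the blob's new length in tracts
    first = new_size - len(tracts)                          # start of the range we just reserved
    for offset, data in enumerate(tracts):
        client.write_tract(blob_guid, first + offset, data)
    return first
```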
In an n-disk cluster where one disk fails, roughly 1/nth of the replicated data will be found on
all n of the other disks. All remaining disks send the under-replicated data to each other in parallel, restoring
the cluster to full replication very quickly.
- A good data-placement policy guarantees that replicas of 1/n of the failed disk's data can be found across all n remaining disks; every disk then transfers replicas in parallel, so recovery is very fast.
Such fast failure recovery significantly improves durability because it reduces the win-
dow of vulnerability during which additional failures can cause unrecoverable data loss.
- Fast failure recovery significantly improves data durability because it shrinks the window of vulnerability.
When an application writes a tract, the client library ﬁnds the appropriate row of the TLT and sends the write to every tract-
server it contains. Reads select a single tractserver at random. Applications are notiﬁed that their writes have
completed only after the client library receives write acknowledgments from all replicas.
Replication also requires changes to CreateBlob, ExtendBlobSize, and DeleteBlob. Each mutates the
state of the metadata tract and must guarantee that updates are serialized. Clients send these operations only
to the tractserver acting as the primary replica, marked as such in the TLT. When a tractserver receives one of
these operations, it executes a two-phase commit with the other replicas. The primary replica does not commit the
change until all other replicas have completed successfully. Should the prepare fail, the operation is aborted.
- For CreateBlob, ExtendBlobSize, and DeleteBlob, the primary replica uses a two-phase commit to keep the metadata replicas consistent.
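A minimal sketch of that two-phase commit, with hypothetical primary/replica objects; the real tractserver protocol also deals with logging, timeouts, and recovery.

```python
def metadata_update(primary, replicas, op):
    """Primary-coordinated two-phase commit for CreateBlob / ExtendBlobSize / DeleteBlob."""
    # Phase 1: ask every other replica to prepare the mutation.
    if not all(replica.prepare(op) for replica in replicas):
        for replica in replicas:
            replica.abort(op)        # any refusal aborts the whole operation
        return False
    # Phase 2: all replicas prepared, so commit everywhere, primary last.
    for replica in replicas:
        replica.commit(op)
    primary.commit(op)
    return True
```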
FDS also supports per-blob variable replication, for example, to single-replicate intermediate computations
for write performance, triple-replicate archival data for durability, and over-replicate popular blobs to increase
read bandwidth. The maximum possible replication level is determined when the cluster is created and drives the
number of tractservers listed in each TLT entry. Each blob's actual replication level is stored in the blob's
metadata tract and retrieved when a blob is opened. For an n-way replicated blob, the client uses only the first n
tractservers in each TLT entry.
- FDS supports a per-blob replication level. Each blob's actual replication level is stored in its metadata tract; for an n-way replicated blob, the client uses only the first n tractservers in each TLT entry.
We begin with the simplest failure recovery case: the failure of a single tractserver.
Tractservers send heartbeat messages to the metadata server. When the metadata server detects a tractserver
timeout, it declares the tractserver dead. Then, it:
1. invalidates the current TLT by incrementing the version number of each row in which the failed tractserver appears;
2. picks random tractservers to fill in the empty spaces in the TLT where the dead tractserver appeared;
3. sends updated TLT assignments to every server affected by the changes; and
4. waits for each tractserver to ack the new TLT assignments, and then begins to give out the new TLT to clients when queried for it.
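The same steps written as a hypothetical sketch of the metadata server's failure handler; the TLT row layout and helper callbacks are illustrative, and failure-domain constraints are ignored.

```python
import random

def handle_tractserver_failure(tlt, dead, live_servers, notify, wait_for_acks):
    """tlt: list of rows shaped like {'version': int, 'servers': [server ids]}."""
    affected = []
    for row in tlt:
        if dead in row['servers']:
            row['version'] += 1                                  # invalidate the stale row
            slot = row['servers'].index(dead)
            row['servers'][slot] = random.choice(live_servers)   # fill the hole (ignoring failure domains)
            affected.append(row)
    notify(affected)          # push new assignments to the affected tractservers
    wait_for_acks(affected)   # only then hand the new TLT to clients
    return tlt
```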
When a tractserver receives an assignment of a new entry in the TLT, it contacts the other replicas and begins
copying previously written tracts. When a failure occurs, clients must wait only for the TLT to be updated; operations can continue while re-replication is still in progress.
All operations are tagged with the client’s TLT version. If a client attempts an operation using a stale TLT
entry, the tractserver detects the inconsistency and rejects the operation. This prompts the client to retrieve an updated TLT from the metadata server.
- Each TLT entry has a version number and every operation is tagged with the client's TLT version; when a tractserver detects a version mismatch, it rejects the operation.
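A sketch of that version check, with hypothetical request and tractserver structures:

```python
def execute_if_current(tractserver, request):
    """Reject operations that carry a stale TLT row version."""
    current = tractserver.row_versions[request.tlt_row]
    if request.tlt_version != current:
        # The client must fetch a fresh TLT from the metadata server and retry.
        return {'status': 'stale_tlt', 'row': request.tlt_row, 'current_version': current}
    return tractserver.apply(request)
```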
Table versioning prevents a tractserver that failed and then returned, e.g., due to a transient network outage,
from continuing to interact with applications as soon as the client receives a new TLT. Further, any attempt by a
failed tractserver to contact the metadata server will result in the metadata server ordering its termination.
After a tractserver failure, the TLT immediately converges to provide applications the current location to read or write data.
- When a tractserver fails, the TLT converges immediately.
We now extend our description to concurrent and cascading tractserver failures as well as metadata server failures.
When multiple tractservers fail, the metadata server’s only new task is to ﬁll more slots in the TLT. Similarly, if
failure recovery is underway and additional tractservers fail, the metadata server executes another round of the
protocol by ﬁlling in empty table entries and incrementing their version. Data loss occurs when all the tractservers within a row fail within the recovery window.
- Tractservers can also fail while failure recovery is already underway (cascading failures).
A simple TLT might have n rows with each row listing disks i and i+1. While
data will be double-replicated, the cluster will not meet our goal of fast failure recovery: when a disk fails, its
backup data is stored on only two other disks (i+1 and i−1). Recovery time will be limited by the bandwidth of
just two disks. A cluster with 1TB disks with 100MB/s read performance would require 90 minutes for recovery.
A second failure within that time would have roughly a 2/n chance of losing data permanently.
A better TLT has O(n^2) entries. Each possible pair of disks (ignoring failure domains; §3.3.1) appears in an
entry of the TLT. Since the generation of tract locators is pseudo-random (§2.2), any data written to a disk will
have high diversity in the location of its replicas. When a disk fails, replicas of 1/nth of its data reside on the
other n disks in the cluster. When a disk fails, all n disks can exchange data in parallel over FDS’ full bisection
bandwidth network. Since all disks recover in parallel, larger clusters recover from disk loss more quickly.
While such a TLT recovers from single-disk failure quickly, a second failure while recovery is in progress is
guaranteed to lose data. Since all pairs of disks appear as TLT entries, any pair of failures will lose the tracts whose
TLT entry contained the pair of failed disks. Replicated FDS clusters therefore have a minimum replication level
of 3. Perhaps counterintuitively, no level of replication ever needs a TLT larger than O(n^2). For any replica-
tion level k > 2, FDS starts with the “all-pairs” TLT, then expands each entry with k−2 additional random disks
(subject to failure domain constraints).
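A sketch of constructing such a table for k-way replication: start from every pair of disks, then pad each entry with k-2 additional random disks (failure domains are ignored here for brevity).

```python
import itertools, random

def build_replicated_tlt(disks, k):
    """All-pairs TLT padded to k replicas per entry."""
    assert k >= 2 and len(disks) > k
    tlt = []
    for a, b in itertools.permutations(disks, 2):   # every ordered pair once -> ~n^2 rows
        entry = [a, b]
        while len(entry) < k:
            extra = random.choice(disks)
            if extra not in entry:                  # keep replicas on distinct disks
                entry.append(extra)
        tlt.append(entry)
    return tlt
```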
Constructing the replicated TLT this way has several important properties. First, performance during recovery
still involves every disk in the cluster since every pair of disks is still represented in the TLT.
Second, a triple disk failure within the recovery window now has only about a 2/n chance of causing permanent data loss.
To understand why, imagine two disks fail. Find the entries in the TLT that contain those two
disks. We expect to find 2 such entries. There is a 1/n chance that a third disk failure will match the random
third disk in that TLT entry. Finally, adding more replicas decreases the probability of data loss. Consider now a 4-way replicated clus-
ter. Each entry in the O(n^2)-length TLT has two random disks added instead of one. 3 or fewer simultaneous fail-
ures are safe; 4 simultaneous failures have a 1/n^2 chance of losing data. Similarly, 5-way replication means that
4 or fewer failures are safe and 5 simultaneous failures have a 1/n^3 chance of loss.
One possible disadvantage to a TLT with O(n^2) entries is its size. In our 1,000-disk cluster, the in-memory TLT
is about 13MB. However, on larger clusters, quadratic growth is cumbersome: 10,000 disks would require a
600MB TLT. We have two (unimplemented) strategies to mitigate TLT size. First, a tractserver can manage multiple disks;
this reduces n by a factor of 5-10. Second, we can limit the number of disks that participate in failure re-
covery. An O(n^2) TLT uses every disk for recovery, but 3,000 disks are expected to recover 1TB in less than
20s (§5.3). The marginal utility of involving more disks may be small. To build an n-disk cluster where m disks
are involved in recovery, the TLT only needs O(nm) entries. For 10,000 to 100,000 disks, this also reduces
table size by a factor of 5-10. Using both optimizations, a 100,000 disk cluster's TLT would be a few dozen MB.
A failure domain is a set of machines that have a high probability of experiencing a correlated failure. Com-
mon failure domains include machines within a rack, since they will often share a single power source, or ma-
chines within a container, as they may share common cooling or power infrastructure.
FDS leaves it up to the administrator to deﬁne a failure domain policy for a cluster. Once that policy is de-
ﬁned, FDS adheres to that policy when constructing the tract locator table. FDS guarantees that none of the disks
in a single row of the TLT share the same failure domain. This policy is also followed during failure recov-
ery: when a disk is replaced, the new disk must be in a different failure domain than the other tractservers in that TLT entry.
FDS supports the dynamic growth of a cluster through the addition of new disks and machines. For simplicity,
we ﬁrst consider cluster growth in the absence of failures. Cluster growth adds both storage capacity and
throughput to the system. The FDS metadata server re-balances the assignment of table entries so that both ex-
isting data and new workloads are uniformly distributed. When a tractserver is added to the cluster, TLT entries are
taken away from existing tractservers and given to the new server. These assignments happen in two phases.
First, the new tractserver is given the assignments but they are marked as "pending" and the TLT version for
each entry is incremented. The new tractserver then begins copying data from other replicas. During this phase,
clients write data to both the existing servers and the new server so that the new tractserver is kept up-to-date. Once
the tractserver has completed copying old data, the metadata server 'commits' the TLT entry by incrementing its
version and changing its assignment to the new tractserver.
It also notiﬁes the now replaced tractserver that it can safely garbage collect all tracts associated with that TLT entry. If a new tractserver fails while its TLT entries are
pending, the metadata server increments the TLT entry version and expunges it from the list of new tractservers.
If an existing server fails, the failure recovery protocol executes. However, tractservers with pending TLT en-
tries are not eligible to take over for failed servers as they are already busy copying data.
The current protocol for replication depends upon the client to issue all writes to all replicas. This decision
means that FDS provides weak consistency guarantees to clients. For example, if a client writes a tract to 1
of 3 replicas and then crashes, other clients reading different replicas of that tract will observe differing state.
Weak consistency guarantees are not uncommon; for example, clients of the Google File System must handle
garbage entries in files. However, if strong consistency guarantees are desired, FDS could be modified to use
chain replication to provide strong consistency guarantees for all updates to individual tracts.
Tractservers may also be inconsistent during failure recovery. A tractserver recently assigned to a TLT entry
will not have the same state as the entry's other replicas until data copying is complete. While in this state,
tractservers reject read requests; clients use other replicas.
- Chain replication could be used to provide strong consistency.
Your task is to build a lock service that keeps working even when one server fails. The lock service maintains the state of a set of locks.
Your lock service should be replicated on two servers: one acting as the primary and the other as the backup.
The only failure the system needs to tolerate is a single fail-stop server failure.
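A very rough sketch of the primary/backup shape of such a service (not the lab's actual RPC framework, and it ignores duplicate-request filtering and failover details): the primary applies each operation locally and mirrors it to the backup before replying, so the backup can take over after a single fail-stop failure.

```python
class LockServer:
    def __init__(self, backup=None):
        self.locks = {}         # lock name -> holder, or None when free
        self.backup = backup    # the primary holds a handle to the backup; the backup holds None

    def acquire(self, name, client_id):
        if self.locks.get(name) is None:
            self.locks[name] = client_id
            if self.backup is not None:
                self.backup.acquire(name, client_id)   # mirror before replying to the client
            return True
        return False

    def release(self, name, client_id):
        if self.locks.get(name) == client_id:
            self.locks[name] = None
            if self.backup is not None:
                self.backup.release(name, client_id)
            return True
        return False
```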
Flat Datacenter Storage, by Nightingale, Elson, Fan, Hofmann, Howell, and Suzue (OSDI 2012)
a cluster storage system
stores giant blobs -- 128-bit ID, multi-megabyte content
for big-data processing (like MapReduce)
cluster of 1000s of computers processing data in parallel
lots of clients
lots of storage servers ("tractservers")
partition the data
master ("metadata server") controls partitioning
replica groups for reliability
1000s of disks of space
store giant blobs, or many big blobs
1000s of servers/disks/arms of parallel throughput
can expand over time -- reconfiguration
large pool of storage servers for instance replacement after failure
does the abstract’s 2 GByte/sec per client seem impressive?
how fast can you read a file from Athena AFS? (abt 10 MB/sec)
how fast can you read a typical hard drive?
how fast can typical networks move data?
- A single disk delivers roughly 100 MB/s; a 1 Gb/s NIC can reach about 100 MB/s, and a 10 Gb/s NIC about 1000 MB/s.
abstract claims recover from lost disk (92 GB) in 6.2 seconds
that’s 15 GByte / sec
how is that even possible? that’s 30x the speed of a disk!
who might care about this metric?
- Every piece of data on the lost disk has a replica on some other disk, so recovery runs on many disks at once: roughly 150 disks reading and 150 disks writing add up to 15 GB/s.
- A 128-bit value is used as the blob ID; it hashes with good dispersion and gives a uniform interface that other systems can plug into.
why not the UNIX i-node approach?
store an array per blob, indexed by tract #, yielding tractserver
so you could make per-tract placement decisions
e.g. write new tract to most lightly loaded server
- This effectively selected a TLT entry independently at random for each tract, producing a binomial rather than a uniform distribution of tracts on tractservers.
- See Section 3.3.
- We must consider not only whether tracts are spread uniformly across the TLT entries, but also whether tractservers are spread uniformly across the TLT entries.
- Each tractserver should appear in as many TLT entries as possible, so that recovery can go faster.
why is the paper’s n^2 scheme better?
TLT with n^2 entries, with every server pair occurring once
How long will repair take?
What are the risks if two servers fail?
- In the n^2 scheme the TLT has n^2 entries, and each disk appears in about 2n of them.
- Let V be the total volume of all tracts; then each disk holds about 2V/n of data.
- When one disk fails, the system must re-replicate about 2V/n of data to get back to full replication.
- Each of the other disks holds roughly 1/n of that 2V/n of data.
- After the affected 2n TLT entries are regenerated, each remaining disk ends up in about 2n+2 entries and has to move the data of roughly 4 entries (2 written in, 2 written out),
which is about 4V/n^2 of data and takes about 4V/(m*n^2) time; see the sketch below.
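The back-of-envelope model from the notes above as a tiny calculation (V is the total volume of tract data, n the number of disks, m a single disk's transfer rate):

```python
def recovery_time_estimate(total_bytes_v, num_disks_n, disk_rate_m):
    """Each surviving disk moves roughly 4V/n^2 bytes, and all disks work in parallel."""
    per_disk_bytes = 4 * total_bytes_v / num_disks_n ** 2
    return per_disk_bytes / disk_rate_m   # seconds

# e.g. 1 PB of tract data, 1000 disks, 100 MB/s per disk -> about 40 seconds
t = recovery_time_estimate(1e15, 1000, 100e6)
```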
why do they actually use a minimum replication level of 3?
same n^2 table as before, third server is randomly chosen
What effect on repair time?
What effect on two servers failing?
What if three disks fail?
- Assume the TLT uses the n^2 scheme with a replication level of 2; then any two disk failures are guaranteed to lose the data of two TLT entries (think about why).
- With a replication level of 3, two disk failures lose no TLT entry's data.
- If a third disk fails within the recovery window after two failures, there is roughly a 2/n chance of permanent data loss (think about why it is 2/n).
To increase the amount of disk space / parallel throughput:
the metadata server picks some random TLT entries
and substitutes the new server for an existing server in those TLT entries
- Assuming the n^2 TLT scheme, the new disk has to receive about 2V/n of data, which takes about 2V/(n*m) time (V is the total volume of all tracts, n the number of disks, m a disk's transfer rate).
what about client writes while tracts are being transferred?
receiving tractserver may have copies from client(s) and from old srvr
how does it know which is newest?
- Each TLT entry has a version number; if a tractserver sees the wrong version, it rejects the request.
A writing client sends a copy to each tractserver in the TLT. A reading client asks one tractserver.
- If a single replacement disk were chosen, all of the data would be written to that one disk, and that disk's transfer rate would be the bottleneck.
- It takes about 4V/(m*n^2) time; see the detailed explanation above.
if a server crashes and reboots with disk intact, can contents be used?
e.g. if it only missed a few writes?
3.2.1’s “partial failure recovery”
but won’t it have already been replaced?
how to know what writes it missed?
- Transient failures: the tractserver later returns to service.
- Wait for the metadata server to restart and reconstruct the TLT.
- Yes: clients keep a cached copy of the TLT.
- Network partitions complicate recovery from metadata server failures. A simple primary/backup scheme is not safe because two active metadata servers
will cause cluster corruption. Our current solution is simple: only one metadata server is allowed to execute at a time. If it fails, an operator must
ensure the old one is decommissioned and start a new one. We are experimenting with using Paxos leader election to safely make this process
automatic and reduce the impact to availability. Once a new metadata server starts, tractservers contact it with
their TLT assignments. After a timeout period, the new metadata server reconstructs the old TLT.
- Walk over all the tractservers and collect their TLT entry assignments.
- Because the metadata records whether each tract of a blob exists, deleting a blob would require checking every one of the blob's tracts, which would take a long time.
how do we know we’re seeing “good” performance? what’s the best you can expect?
limiting resource for 2 GB / second single-client?
- Each computer in our cluster has a dual-port 10Gbps NIC.
where does the 92 GB in 6.2 seconds come from?
Table 1, 4th column
that’s 15 GB / second, both read and written
1000 disks, triple replicated, 128 servers?
what’s the limiting resource? disk? cpu? net?
- The theoretical time is 4V/(m*n^2) = 292TB/(100MB * 1000^2) = 3.64 seconds.
- Each disk reads and writes about 364 MB; assuming each server has 10 disks, each server's total inbound plus outbound traffic is about 3640 MB.