Chubby provides an interface much like a distributed file system with advisory locks, but the design emphasis is on availability and reliability, as opposed to high performance.
Many instances of the service have been used for over a year, with several of them each handling a few tens of thousands of clients concurrently. The paper describes the initial design and expected use, compares it with actual use, and explains how the design had to be modified to accommodate the differences. It is intended for use within a loosely-coupled distributed system consisting of moderately large numbers of small machines connected by a high-speed network.
Most Chubby cells are confined to a single data centre or machine room, though we do run at least one Chubby cell whose replicas are separated by thousands of kilometres. The purpose of the lock service is to allow its clients to synchronize their activities and to agree on basic information about their environment. The primary goals included reliability, availability to a moderately large set of clients, and easy-to-understand semantics; throughput and storage capacity were considered secondary.
We expected Chubby to help developers deal with coarse-grained synchronization within their systems, and in particular to deal with the problem of electing a leader from among a set of otherwise equivalent servers.
In addition, both GFS and Bigtable use Chubby as a well-known and available location to store a small amount of meta-data; in effect they use Chubby as the root of their distributed data structures.
Some services use locks to partition work at a coarse grain between several servers. Before Chubby was deployed, most distributed systems at Google used ad hoc methods for primary election (when work could be duplicated without harm), or required operator intervention (when correctness was essential).
In the former case, Chubby allowed a small saving in computing effort. In the latter case, it achieved a significant improvement in availability in systems that no longer required human intervention on failure. Readers familiar with distributed computing will recognize the election of a primary among peers as an instance of the distributed consensus problem, and realize we require a solution using asynchronous communication; this term describes the behaviour of the vast majority of real networks, such as Ethernet or the Internet, which allow packets to be lost, delayed, and reordered.
Practitioners should normally beware of protocols based on models that make stronger assumptions on the environment. Indeed, all working protocols for asynchronous consensus we have so far encountered have Paxos at their core. Paxos maintains safety without timing assumptions, but clocks must be introduced to ensure liveness; this overcomes the impossibility result of Fischer et al. Building Chubby was an engineering effort required to fill the needs mentioned above; it was not research. We claim no new algorithms or techniques.
The purpose of this paper is to describe what we did and why, rather than to advocate it. We describe unexpected ways in which Chubby has been used, and features that proved to be mistakes. We omit details that are covered elsewhere in the literature, such as the details of a consensus protocol or an RPC system. One might ask why we did not instead build a library embodying Paxos. A client Paxos library would depend on no other servers (besides the name service), and would provide a standard framework for programmers, assuming their services can be implemented as state machines.
Indeed, we provide such a client library that is independent of Chubby. Nevertheless, a lock service has some advantages over a client library. First, our developers sometimes do not plan for high availability in the way one would wish.
Often their systems start as prototypes with little load and loose availability guarantees; invariably the code has not been specially structured for use with a consensus protocol. As the service matures and gains clients, availability becomes more important; replication and primary election are then added to an existing design.
While this could be done with a library that provides distributed consensus, a lock server makes it easier to maintain existing program structure and communication patterns. For example, to elect a master which then writes to an existing file server requires adding just two statements and one RPC parameter to an existing system: one would acquire a lock to become master, pass an additional integer (the lock acquisition count) with the write RPC, and add an if-statement to the file server to reject the write if the acquisition count is lower than the current value (to guard against delayed packets).
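The guard described above can be sketched in a few lines. This is an illustrative model, not Chubby's API or Google's file server; the class and method names are invented for the example.

```python
# Sketch: a file server rejecting writes from a deposed master by
# comparing lock acquisition counts (guards against delayed packets).
class FileServer:
    def __init__(self):
        self.current_count = 0   # highest acquisition count seen so far
        self.data = {}

    def write(self, path, payload, acquisition_count):
        # A write carrying a count lower than the newest one seen must
        # come from an old master or a delayed packet; reject it.
        if acquisition_count < self.current_count:
            return False
        self.current_count = acquisition_count
        self.data[path] = payload
        return True

server = FileServer()
assert server.write("/d", "v1", acquisition_count=1)       # master #1
assert server.write("/d", "v2", acquisition_count=2)       # new master #2
assert not server.write("/d", "old", acquisition_count=1)  # delayed packet
assert server.data["/d"] == "v2"
```

The two statements the text mentions are the lock acquisition on the client and the if-statement on the server; the one new RPC parameter is the acquisition count.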
We have found this technique easier than making existing servers participate in a consensus protocol, and especially so if compatibility must be maintained during a transition period. Second, many of our services that elect a primary or that partition data between their components need a mechanism for advertising the results. This suggests that we should allow clients to store and fetch small quantities of data--that is, to read and write small files. This could be done with a name service, but our experience has been that the lock service itself is well-suited for this task, both because this reduces the number of servers on which a client depends, and because the consistency features of the protocol are shared.
In particular, we found that developers greatly appreciated not having to choose a cache timeout such as the DNS time-to-live value, which if chosen poorly can lead to high DNS load, or long client fail-over times. Third, a lock-based interface is more familiar to our programmers. Both the replicated state machine of Paxos and the critical sections associated with exclusive locks can provide the programmer with the illusion of sequential programming.
However, many programmers have come across locks before, and think they know how to use them. Ironically, such programmers are usually wrong, especially when they use locks in a distributed system; few consider the effects of independent machine failures on locks in a system with asynchronous communications.
Nevertheless, the apparent familiarity of locks overcomes a hurdle in persuading programmers to use a reliable mechanism for distributed decision making.
Last, distributed-consensus algorithms use quorums to make decisions, so they use several replicas to achieve high availability. For example, Chubby itself usually has five replicas in each cell, of which three must be running for the cell to be up. In contrast, if a client system uses a lock service, even a single client can obtain a lock and make progress safely. Thus, a lock service reduces the number of servers needed for a reliable client system to make progress.
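The quorum arithmetic behind the "five replicas, three must be running" figure is simple majority rule, sketched below for concreteness:

```python
def majority(n):
    # Smallest number of replicas that forms a quorum out of n.
    return n // 2 + 1

# A typical 5-replica Chubby cell needs 3 replicas up to make progress,
# so it tolerates the failure of any 2 replicas.
assert majority(5) == 3
assert 5 - majority(5) == 2
```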
In a loose sense, one can view the lock service as a way of providing a generic electorate that allows a client system to make decisions correctly when less than a majority of its own members are up. One might imagine solving this last problem in a different way: by providing a "consensus service", using a number of servers to provide the "acceptors" in the Paxos protocol.
However, assuming a consensus service is not used exclusively to provide locks (which reduces it to a lock service), this approach solves none of the other problems described above. These arguments suggest two key design decisions: we chose a lock service, as opposed to a library or service for consensus, and we chose to serve small files to permit elected primaries to advertise themselves and their parameters, rather than build and maintain a second service.
Some decisions follow from our expected use and from our environment: A service advertising its primary via a Chubby file may have thousands of clients.
Therefore, we must allow thousands of clients to observe this file, preferably without needing many servers. This suggests that an event notification mechanism would be useful to avoid polling. Even if clients need not poll files periodically, many will; this is a consequence of supporting many developers. Thus, caching of files is desirable.
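The event notification mechanism suggested above can be modelled as a simple watch/callback scheme: clients register interest in a file once and are pushed the new contents on change, rather than polling. The class and method names here are invented for illustration and are not Chubby's client API.

```python
# Minimal sketch of event notification replacing polling.
class NotifyingFile:
    def __init__(self, content=b""):
        self.content = content
        self.watchers = []

    def watch(self, callback):
        # Register once; no periodic polling needed afterwards.
        self.watchers.append(callback)

    def write(self, content):
        self.content = content
        for cb in self.watchers:
            cb(content)   # push the change to every registered client

events = []
f = NotifyingFile()
f.watch(events.append)
f.write(b"primary=host42")
assert events == [b"primary=host42"]
```

With thousands of clients watching one file, pushing one notification per change is far cheaper than having each client poll on its own timer.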
Our developers are confused by non-intuitive caching semantics, so we prefer consistent caching. To avoid both financial loss and jail time, we provide security mechanisms, including access control.
A choice that may surprise some readers is that we do not expect lock use to be fine-grained, in which locks might be held only for a short duration (seconds or less); instead, we expect coarse-grained use.
For example, an application might use a lock to elect a primary, which would then handle all access to that data for a considerable time, perhaps hours or days. These two styles of use suggest different requirements from a lock server. Coarse-grained locks impose far less load on the lock server.
In particular, the lock-acquisition rate is usually only weakly related to the transaction rate of the client applications. Coarse-grained locks are acquired only rarely, so temporary lock server unavailability delays clients less.
On the other hand, the transfer of a lock from client to client may require costly recovery procedures, so one would not wish a fail-over of a lock server to cause locks to be lost. Thus, it is good for coarse-grained locks to survive lock server failures, there is little concern about the overhead of doing so, and such locks allow many clients to be adequately served by a modest number of lock servers with somewhat lower availability. Fine-grained locks lead to different conclusions.
Even brief unavailability of the lock server may cause many clients to stall. Performance and the ability to add new servers at will are of great concern because the transaction rate at the lock service grows with the combined transaction rate of clients.
It can be advantageous to reduce the overhead of locking by not maintaining locks across lock server failure, and the time penalty for dropping locks every so often is not severe because locks are held for short periods.
Clients must be prepared to lose locks during network partitions, so the loss of locks on lock server fail-over introduces no new recovery paths. Chubby is intended to provide only coarse-grained locking. Fortunately, it is straightforward for clients to implement their own fine-grained locks tailored to their application. Little state is needed to maintain these fine-grain locks; the servers need only keep a non-volatile, monotonically-increasing acquisition counter that is rarely updated.
Clients can learn of lost locks at unlock time, and if a simple fixed-length lease is used, the protocol can be simple and efficient.
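The scheme in the last two paragraphs — a fine-grained lock server that keeps only a monotonically increasing acquisition counter and grants fixed-length leases, with clients learning of lost locks at unlock time — can be sketched as follows. All names are illustrative; this is a model of the idea, not an implementation from the paper.

```python
import time

LEASE_SECONDS = 1.0  # assumed fixed lease length for the sketch

class FineLockServer:
    def __init__(self):
        self.counter = {}  # lock name -> acquisition count (non-volatile in practice)
        self.expiry = {}   # lock name -> lease expiry time

    def acquire(self, name, now=None):
        now = time.monotonic() if now is None else now
        if self.expiry.get(name, 0) > now:
            return None                          # lease still held by someone
        self.counter[name] = self.counter.get(name, 0) + 1
        self.expiry[name] = now + LEASE_SECONDS
        return self.counter[name]                # count identifies the newest holder

    def release(self, name, count, now=None):
        now = time.monotonic() if now is None else now
        # The client learns here whether it had already lost the lock
        # (its lease expired and another client acquired it since).
        still_held = (self.counter.get(name) == count
                      and self.expiry.get(name, 0) > now)
        if still_held:
            self.expiry[name] = 0
        return still_held

s = FineLockServer()
c1 = s.acquire("row:17", now=0.0)
assert c1 == 1
assert s.acquire("row:17", now=0.5) is None   # lease still active
c2 = s.acquire("row:17", now=2.0)             # lease expired; new holder
assert c2 == 2
assert not s.release("row:17", c1, now=2.1)   # first client learns of the loss
assert s.release("row:17", c2, now=2.1)
```

Only the counter needs to survive server failure; dropped leases merely force re-acquisition, which the text notes is acceptable for short-held locks.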
The most important benefits of this scheme are that our client developers become responsible for the provisioning of the servers needed to support their load, yet are relieved of the complexity of implementing consensus themselves. All communication between Chubby clients and the servers is mediated by the client library.
Figure 1: System structure

A Chubby cell consists of a small set of servers (typically five) known as replicas, placed so as to reduce the likelihood of correlated failure (for example, in different racks). The replicas use a distributed consensus protocol to elect a master; the master must obtain votes from a majority of the replicas, plus promises that those replicas will not elect a different master for an interval of a few seconds known as the master lease.
The master lease is periodically renewed by the replicas provided the master continues to win a majority of the vote. The replicas maintain copies of a simple database, but only the master initiates reads and writes of this database. All other replicas simply copy updates from the master, sent using the consensus protocol. Clients find the master by sending master location requests to the replicas listed in the DNS.
Non-master replicas respond to such requests by returning the identity of the master. Once a client has located the master, the client directs all requests to it either until it ceases to respond, or until it indicates that it is no longer the master.
Write requests are propagated via the consensus protocol to all replicas; such requests are acknowledged when the write has reached a majority of the replicas in the cell. Read requests are satisfied by the master alone; this is safe provided the master lease has not expired, as no other master can possibly exist. If a master fails, the other replicas run the election protocol when their master leases expire; a new master will typically be elected in a few seconds.
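The read and write rules in this paragraph can be condensed into a small model: writes are acknowledged only after a majority of replicas have them, and the master may answer reads locally only while its lease is unexpired (since no other master can exist before then). This is an illustrative sketch, not the actual Chubby implementation.

```python
class Cell:
    def __init__(self, n_replicas=5):
        self.n = n_replicas
        self.lease_expiry = 0.0
        self.db = {}

    def write(self, key, value, acks):
        # Acknowledge only once a majority of replicas hold the write.
        if acks < self.n // 2 + 1:
            return False
        self.db[key] = value
        return True

    def read(self, key, now):
        # Serving reads from the master alone is safe only while the
        # master lease holds: no rival master can exist before expiry.
        if now >= self.lease_expiry:
            raise RuntimeError("lease expired; re-elect before serving reads")
        return self.db.get(key)

cell = Cell()
cell.lease_expiry = 10.0
assert not cell.write("k", "v", acks=2)  # 2 of 5 is not a majority
assert cell.write("k", "v", acks=3)
assert cell.read("k", now=5.0) == "v"
```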
If a replica fails and does not recover for a few hours, a simple replacement system selects a fresh machine from a free pool and starts the lock server binary on it. It then updates the DNS tables, replacing the IP address of the failed replica with that of the new one. The current master polls the DNS periodically and eventually notices the change.
In the meantime, the new replica obtains a recent copy of the database from a combination of backups stored on file servers and updates from active replicas.
Once the new replica has processed a request that the current master is waiting to commit, the replica is permitted to vote in the elections for new master.

Chubby exports a file system interface similar to, but simpler than, that of UNIX. It consists of a strict tree of files and directories in the usual way, with name components separated by slashes. A typical name is /ls/foo/wombat/pouch; the ls prefix is common to all Chubby names and stands for lock service. The second component (foo) is the name of a Chubby cell; it is resolved to one or more Chubby servers via DNS lookup. Again following UNIX, each directory contains a list of child files and directories, while each file contains a sequence of uninterpreted bytes.
This significantly reduced the effort needed to write basic browsing and name space manipulation tools, and reduced the need to educate casual Chubby users. The design differs from UNIX in a few ways that ease distribution.
To allow the files in different directories to be served from different Chubby masters, we do not expose operations that can move files from one directory to another, we do not maintain directory modified times, and we avoid path-dependent permission semantics (that is, access to a file is controlled by the permissions on the file itself rather than by those on directories on the path leading to the file).
To make it easier to cache file meta-data, the system does not reveal last-access times. The name space contains only files and directories, collectively called nodes. Every such node has only one name within its cell; there are no symbolic or hard links.
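A name in this scheme might be resolved as sketched below: split off the common prefix, treat the second component as the cell (resolved via DNS), and interpret the remainder inside that cell. The function and its behaviour are illustrative, not Chubby's actual resolver.

```python
def parse_chubby_name(name):
    # Split "/ls/<cell>/<path...>" into the cell name and the in-cell path.
    parts = name.strip("/").split("/")
    if len(parts) < 3 or parts[0] != "ls":
        raise ValueError("not a Chubby name: " + name)
    cell = parts[1]                    # resolved to Chubby servers via DNS
    path = "/" + "/".join(parts[2:])   # node name within the cell
    return cell, path

assert parse_chubby_name("/ls/foo/wombat/pouch") == ("foo", "/wombat/pouch")
```

Because the cell component alone determines which servers are contacted, everything after it can be served entirely within one cell — consistent with the restrictions above that keep names from depending on other directories.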
The Chubby lock service for loosely-coupled distributed systems