6 June 2009 - 14:36RADOS snapshots

Some interesting issues came up when we started considering how to expose the RADOS snapshot functionality to librados users.  The object store exposes a pretty low-level interface to control when objects are cloned (i.e. when an object snapshot is taken via the btrfs copy-on-write ioctls).  The basic design in Ceph is that the client provides a “SnapContext” with each write operation that indicates which snapshots logically exist for the given object; if the version already stored by the OSD is older than the newest snapshot in the SnapContext, a clone is created before the write is applied.  It is the Ceph MDS’s responsibility to keep track of which snapshots apply to which objects (remember, Ceph lets you snapshot any subdirectory) and to do all the synchronization that ensures mounted clients have up to date SnapContexts.

In creating a raw object storage interface, how is that underlying functionality best exposed?  One option is to expose some functions that allow users to create, manipulate, and possibly store SnapContexts, and manually specify a context for each write (or a snapshot id to read).  This exposes the same functionality Ceph makes use of, but essentially drops all of the issues with synchronization and storage in librados user’s lap.  How should one go about keeping multiple processes accessing the RADOS store in sync (i.e. agreeing on which snapshots exist) to get the semantics people want?

Our solution is to introduce some basic snapshot accounting to RADOS.  We allow per-pool snapshots to be created via RADOS itself, and include that snap information in the OSDMap (the global data structure used to synchronize the activities of OSDs and clients).  If a client performs a write and does not manually specify a SnapContext (as Ceph does), an appropriate context will be generated from the pool snapshot information in the OSDMap.

Snapshot creation is done via the monitor, either via a librados API call or an administrator command like ‘ceph osd pool mksnap poolname snapname’.  This updates the OSDMap to include the new snap for that pool, and that map propagates across the cluster.

int rados_snap_create(rados_pool_t pool, const char *snapname);
int rados_snap_remove(rados_pool_t pool, const char *snapname);
int rados_snap_list(rados_pool_t pool, rados_snapid_t *snaps, int maxlen);
int rados_snap_get_name(raods_pool_t pool, rados_snapid_t id, char *name, int maxlen);

To read an existing snapshot, a new RADOS pool context is opened and a specific snapshot id is selected (the id can be obtained via rados_snap_list above):

rados_pool_t snapped_pool;
rados_open_pool(”data”, &snapped_pool);
rados_set_snap(snapped_pool, 2);

Subsequent reads via the snapped_pool handle will return data from snapid 2, and any attempts to write will return -EROFS (Read-only file system).  Reading and writing via other rados_pool_t handles will be unaffected.  By default any newly opened pool handle will be “positioned” at the “head”–the current, writeable version of the object pool.

Map propagation is fast, but not synchronous: it is possible for one client to create a snapshot and for another client to then perform a write that does not preserve some data in the new snap.  So we do not completely solve the synchronization problem for you to create a global, ‘instantaneous’ point-in-time snapshot.  Doing so in a large distributed environment with many clients and many servers, operating in parallel, is a challenge in any system.

From the perspective of the client creating the snapshot, however, the snapshot is ordered with respect to IO performed before and after rados_snap_create.   RADOS already does some synchronization with respect to OSDMap updates to ensure that readers, writers and OSDs all agree on the current state of a placement group when performing IO.  Any IO initiated after the snapshot is created will be tagged with the new OSDMap version, and any OSD will make sure it has either the same or a newer version of the map before performing that IO.  Other clients will not see a clear ordering unless the librados user takes steps to coordinate clients such that they all obtain the updated OSDMap (describing the new snapshot) before performing new IO.

If there is demand, we may still expose an API to manipulate raw SnapContexts for advanced users wanting different snapshot schedules for different objects.  It will be their responsibility to manage all client synchronization in that case, as that snapshot information won’t be propagated via the OSDMap.

For anybody wanting perfect cluster-wide point-in-time snapshots without any client coordination… well, sorry.  Experience with file system snapshots has shown that proper synchronization is never something that the storage system alone can get right due to caching at all layers of the system.  NFS client write-back caches make server-based snapshots (e.g., NetApp filers) imperfect.  Snapshots in local file systems utilize some kernel machinery to momentarily quiesce all IO while the snapshot is created, but even applications may not have the on-disk files (as seen by the OS) in a consistent state.  Coordination with applications is always necessary for any fully ‘correct’ solution, so we won’t try to solve the whole problem based on some false sense of what ‘correct’ is.

posted by sage | No Comments | Tags: Dev notes, RADOS

19 May 2009 - 15:04The RADOS distributed object store

The Ceph architecture can be pretty neatly broken into two key layers.  The first is RADOS, a reliable autonomic distributed object store, which provides an extremely scalable storage service for variably sized objects.  The Ceph file system is built on top of that underlying abstraction: file data is striped over objects, and the MDS (metadata server) cluster provides distributed access to a POSIX file system namespace (directory hierarchy) that’s ultimately backed by more objects.

Until now, RADOS’ only user has been Ceph.  But if the success of Amazon’s S3 (simple storage service) has shown nothing else, it’s that there is ample use (and demand) for a reliable and scalable object-based storage interface.

The underlying storage abstraction provided by RADOS is relatively simple:

  • The unit of storage is an object.  Each object has a name (currently a fixed-size 20 byte identifier, though that may change), some number of named attributes (i.e., xattrs), and a variable-sized data payload (like a file).
  • Objects are stored in object pools.  Each pool has a name (e.g. “foo”) and forms a distinct object namespace.  Each pool also has a few parameters that define how the object is stored, namely a replication level (2x, 3x, etc.) and a mapping rule describing how replicas should be distributed across the storage cluster (e.g., each replica in a separate rack).
  • The storage cluster consists of some (potentially large) number of storage servers, or OSDs (object storage daemons/devices), and the combined cluster can store any number of pools.

A key design feature of RADOS is that the OSDs are able to operate with a relative autonomy when it comes to recovering from failures or migrating data in response to cluster expansion.  By minimizing the role of the central cluster coordinator (actually a small Paxos cluster managing key cluster state), the overall system is extremely scalable.  A small system of a few nodes can seamlessly grow to hundreds or thousands of nodes (or contract again) as needed.

The API provided by librados will be quite simple.  Something along the lines of:

/* initialization */
int rados_initialize(int argc, const char **argv);
void rados_deinitialize();

int rados_open_pool(const char *name, rados_pool_t *pool);
void rados_close_pool(rados_pool_t pool);

int rados_write(rados_pool_t pool, struct ceph_object *oid, const char *buf, off_t off, size_t len);
int rados_read(rados_pool_t pool, struct ceph_object *oid, char *buf, off_t off, size_t len);

An asynchronous I/O interface will also be exposed, as well as a buffering/caching facility (currently in use by the Ceph fuse client) with the ability to selectively flush/invalidate sets of objects (e.g., the set of objects a file is striped over).

What are the benefits of using this sort of interface?  Clearly, anything you can do with objects you can also do with files in a distributed fs (like Ceph): just create a file at /foo/$poolname/$objectname.

  • Simplicity — many applications don’t need all of the complexities provided by a POSIX file system.  That, in turn, means an object store can optimize for a much simpler interface and workload
  • Scalability — most of the problems with making distributed file systems scale over large numbers of storage nodes are related to the rules imposed by POSIX (case in point: Ceph’s MDS is quite complex).  A simple object abstraction is much more scalable.
  • Stability — simple systems are much easier to validate.

One goal is to make applications that currently use the S3 client library trivially portable to librados, allowing users to maintain control of the full storage stack.

posted by sage | No Comments | Tags: Dev notes, RADOS

19 May 2009 - 13:28v0.8 released

Ceph v0.8 has been released.  Debian packages for amd64 and i386 have been built and there is a tarball, or you can pull the ‘master’ branch from Git.  This update has a lot of important protocol changes and corresponding performance improvements:

  • Client / MDS protocol simplification — faster, less fragile
  • Online adjustment of data and/or metadata replication
  • O_DIRECT support
  • Debug hooks moved from /proc to /debug (debugfs)
  • Faster xattrs
  • Faster readdir (client can cache the result)
  • Support for upcoming 2.6.30 kernel
  • Better error reporting on mount errors (permission, protocol version mismatches) or disk format mismatches
  • Lots and lots of bug fixes

Things have sped up significantly (single threaded dbench, for example, is almost twice as fast), and overall things are much less vulnerable to obscure race conditions.  MDS clustering is somewhat more stable (although still not stable enough to be recommended :).  The most bug fixes, though, are in the distributed object storage layer’s failure recovery and data migration code. The next release is mostly going to focus on object storage.  We are cleaning up the interfaces and building a ‘librados’ (RADOS is the name for the object storage cluster) that provides a simple storage interface similar to S3.  More on that soon!

posted by sage | No Comments | Tags: Releases

12 March 2009 - 14:32More configuration improvements

We’ve updated the configuration framework (again) so that only a single configuration file is needed for the entire cluster.

The ceph.conf file consists of a global section, a section for each daemon type (e.g., mon, mds, osd), and a section for each daemon instance (e.g., mon0, mds.foo, osd12).  This allows you to specify options in a generic fashion where possible, using a few simple variable substitions, or in the section specific to the daemon type or daemon.  For example,

[global]
        pid file = /var/run/ceph/$name.pid
[osd]
        osd data = /data/osd$id
[osd0]
        host = node0
        debug osd = 10   ; just for this osd
[osd1]
        host = node1

and so forth. You can then distribute the file unmodified to all nodes, and on each machine the startup script will only pay attention to the daemons assignd to that host.

See the wiki for details.

posted by sage | 1 Comment | Tags: Updates

11 March 2009 - 14:32dbench performance

Yehuda and I did some performance tuning with dbench a couple weeks back and made some significant improvements.  Here are the rough numbers, before I forget.  We were testing on a simple client/server setup to make a reasonable comparison with NFS: single server on a single SATA disk, and a single client. Since we were mainly interested in metadata latency, we were using just a single thread (’dbench 1′).

  • sync NFS  ~2.5 MB/sec
  • Ceph ~7 MB/sec
  • local disk on server ~11 MB/sec
  • async NFS ~13 MB/sec

The async NFS was presumably faster than the local disk because the fsync() (or close()) wasn’t really waiting for anything to be flushed to disk on the server.  Considering Ceph started out around 2 MB/sec two days earlier, we were pretty happy, and there’s still room for improvement.

posted by sage | 2 Comments | Tags: Dev notes

9 March 2009 - 20:20v0.7 release

I’ve tagged a v0.7 release.  Probably the biggest change in this release (aside from the usual bug fixes and performance improvements) is the new start/stop and configuration framework.  Notably, the entire cluster configuration can be described by a single cluster.conf file that is shared by all nodes (distributed via scp or NFS or whatever) and used for mkfs, startup, and shutdown.

New in v0.7:

  • smart osd ’sync’ behavior
  • osd bug fixes
  • fast truncate strategy
  • improved start/stop scripts
  • new cluster configuration framework

Source tarballs are at http://ceph.newdream.net/download, debian packages are at http://ceph.newdream.net/debian, and the source repository can be cloned via git.

posted by sage | No Comments | Tags: Releases

6 March 2009 - 9:54New configuration and startup framework

Yehuda and I spent last week polishing his configuration framework and reworking the way everything is configured and started up.  I think the end result is pretty slick:

There are now two configuration files.  The first, cluster.conf, defines which hosts participate in the cluster, which daemons run on which hosts, and what paths are used to store data.  It is used by the /etc/init.d/ceph init script (src/init-ceph in the git tree) and mkcephfs.  The trick is that the cluster.conf defines daemon startup parameters for the entire cluster, but by default the init script only pays attention to those assigned to the local host, allowing you do distribute the same file across the cluster without adjusting it for each host.  Alternatively, the -a switch (e.g. /etc/init.d/ceph -a start) will start (or stop) daemons on all hosts via ssh.

The ceph.conf file defines runtime parameters, like debugging levels, log locations, and thread pool sizes, and so forth.  By default everything looks at /etc/ceph/ceph.conf, or you can specify a separate configuration file on a per-daemon basis via the cluster.conf.

The new framework is described in detail in the wiki.

UPDATE: Okay, we’ve since revised this scheme to use a single ceph.conf file.  Even better.

posted by sage | 1 Comment | Tags: Dev notes

24 February 2009 - 14:16Debian packages

I’ve built some debian packages for both the userspace daemons and the kernel module source.  Trying things out is now as simple as adding a few lines to your apt sources file and doing an apt-get install!  More info in the wiki.

posted by sage | 4 Comments | Tags: Uncategorized

30 January 2009 - 12:46Some performance comparisons

I did a few basic tests comparing Ceph to NFS on a simple benchmark, a Linux kernel untar.  I tried to get as close as possible to an “apples to apples” comparison.  The same client machine is used for NFS and Ceph; another machine is either the NFS server or the Ceph MDS.  The same disk type is used for both tests.  The underlying file system for the NFS server was ext2. In the Ceph case, additional machines were used for the OSDs (each using btrfs).  Ceph came in somewhere in between NFS sync and async:

  • NFS async - ~60s
  • Ceph - ~90s
  • NFS sync - ~120s

The comparison isn’t really ideal for a number of reasons.  Most obviously, an NFS server is a single point of failure, while Ceph is going to great lengths to replicate all data on multiple nodes and to seamlessly tolerate the failure of any one of them (in this case, everything was replicated 2x).  Also, the NFS async case throws out all data safetly from the client’s perspective: an application fsync() is meaningless.  In contrast, although Ceph is operating somewhat asynchronously (for both metadata and data operations), an fsync() on a file or directory means what it is supposed to mean.

I can’t say that I’m all that pleased with these results (I was expecting things to be faster), but we’re not done yet.  For each file, Ceph is still expending two round trips to the MDS (to create and then to close the file) and one to the OSD (to write the data).  Although OSD op and the second MDS op are asynchronous, they still take time (the second MDS op in particular takes time on the MDS).  The eventual goal is to do file creation asynchronously by preallocating unused inode numbers to the client; that will allow the client to create and close the (already written) file with a single MDS op.  But this is a decent start for now.

I should mention that the OSD write latency has minimal impact on these numbers; both the MDS and client file data writeback do not typically block forward progress while waiting for IOs to complete.  Using expensive hardware (NVRAM) for the storage will improve other aspects of performance (particularly when multiple clients are accessing the same files, and the MDS does wait for changes to hit the journal), but it won’t have much effect on single client workloads like this one.

posted by sage | No Comments | Tags: Updates

20 January 2009 - 10:41POSIX file system test suite

The unstable client (with all of the async metadata changes) is passing the full POSIX file system test suite again (modulo the question of whether chmod -1,-1 should be a no-op or update ctime).  We’re also surviving long dbench runs.  Progress!  I hope to push this all into the master branch after a bit more testing, do some benchmarking, and then do a new release.

I was happy to discover that the test suite has a real home now:

http://www.ntfs-3g.org/pjd-fstest.html

posted by sage | No Comments | Tags: Updates