22 September 2009 - 10:08
Ceph talk at LCA2010

I’ll be giving a talk on Ceph at linux.conf.au 2010!  (Oddly enough, it’s in New Zealand this year, but I’m not complaining.)  I’ve heard great things about LCA, and am looking forward to being there.

The talk will cover two general areas: Ceph’s RADOS object storage architecture, including some of its data processing features, and the distributed file system that’s built on top of it.  The goal is to make it useful for administrators interested in a scalable file system, and for developers working on cloud computing applications in need of a scalable storage and computing platform.

posted by sage | No Comments | Tags: Updates

12 March 2009 - 14:32
More configuration improvements

We’ve updated the configuration framework (again) so that only a single configuration file is needed for the entire cluster.

The ceph.conf file consists of a global section, a section for each daemon type (e.g., mon, mds, osd), and a section for each daemon instance (e.g., mon0, mds.foo, osd12).  This allows you to specify options in a generic fashion where possible, using a few simple variable substitutions, or in the section specific to the daemon type or daemon.  For example,

[global]
        pid file = /var/run/ceph/$name.pid
[osd]
        osd data = /data/osd$id
[osd0]
        host = node0
        debug osd = 10   ; just for this osd
[osd1]
        host = node1

and so forth. You can then distribute the file unmodified to all nodes, and on each machine the startup script will only pay attention to the daemons assigned to that host.
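To make the section layering concrete, here is a minimal Python sketch of how options for one daemon could be resolved from the example above. It is an illustration only, not Ceph's actual parser: the function names (parse_conf, resolve, daemons_on_host) are made up, and it assumes that more specific sections override more general ones and that $name and $id expand to the daemon's name and its id.

```python
import re

CONF = """
[global]
        pid file = /var/run/ceph/$name.pid
[osd]
        osd data = /data/osd$id
[osd0]
        host = node0
        debug osd = 10   ; just for this osd
[osd1]
        host = node1
"""

def parse_conf(text):
    """Parse the simple 'key = value' sections; ';' starts a comment."""
    sections, current = {}, None
    for raw in text.splitlines():
        line = raw.split(";", 1)[0].strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            current = sections.setdefault(line[1:-1].strip(), {})
        elif "=" in line and current is not None:
            key, value = line.split("=", 1)
            current[key.strip()] = value.strip()
    return sections

def resolve(sections, name):
    """Merge global, type, and instance sections (assumed precedence order)."""
    match = re.fullmatch(r"([a-z]+)\.?(.*)", name)
    daemon_type, daemon_id = match.group(1), match.group(2)
    merged = {}
    for section in ("global", daemon_type, name):
        merged.update(sections.get(section, {}))
    # Expand the two variables used in the example (assumed semantics).
    values = {"name": name, "id": daemon_id}
    return {
        key: re.sub(r"\$(name|id)", lambda m: values[m.group(1)], value)
        for key, value in merged.items()
    }

def daemons_on_host(sections, host):
    """List the daemon sections assigned to a given host."""
    return [name for name, opts in sections.items() if opts.get("host") == host]
```

Under these assumptions, resolve(parse_conf(CONF), "osd0") picks up the pid file from [global], the data path from [osd], and the host and debug level from [osd0], while daemons_on_host selects the sections a given machine would start.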

See the wiki for details.

posted by sage | 1 Comment | Tags: Updates

30 January 2009 - 12:46
Some performance comparisons

I did a few basic tests comparing Ceph to NFS on a simple benchmark, a Linux kernel untar.  I tried to get as close as possible to an “apples to apples” comparison.  The same client machine is used for NFS and Ceph; another machine is either the NFS server or the Ceph MDS.  The same disk type is used for both tests.  The underlying file system for the NFS server was ext2. In the Ceph case, additional machines were used for the OSDs (each using btrfs).  Ceph came in somewhere in between NFS sync and async:

  • NFS async – ~60s
  • Ceph – ~90s
  • NFS sync – ~120s

The comparison isn’t really ideal for a number of reasons.  Most obviously, an NFS server is a single point of failure, while Ceph is going to great lengths to replicate all data on multiple nodes and to seamlessly tolerate the failure of any one of them (in this case, everything was replicated 2x).  Also, the NFS async case throws out all data safety from the client’s perspective: an application fsync() is meaningless.  In contrast, although Ceph is operating somewhat asynchronously (for both metadata and data operations), an fsync() on a file or directory means what it is supposed to mean.

I can’t say that I’m all that pleased with these results (I was expecting things to be faster), but we’re not done yet.  For each file, Ceph is still expending two round trips to the MDS (to create and then to close the file) and one to the OSD (to write the data).  Although the OSD op and the second MDS op are asynchronous, they still take time (the second MDS op in particular takes time on the MDS).  The eventual goal is to do file creation asynchronously by preallocating unused inode numbers to the client; that will allow the client to create and close the (already written) file with a single MDS op.  But this is a decent start for now.
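The preallocation idea can be pictured with a small Python sketch. Everything here is hypothetical (the class, its fields, and the op counting are not Ceph code); it only assumes that the MDS hands the client a range of unused inode numbers up front, so that each file costs one combined create-and-close message instead of two.

```python
class PreallocatingClient:
    """Toy model of a client holding inode numbers granted in advance."""

    def __init__(self, first_ino, count):
        # Hypothetical range of unused inode numbers handed out by the MDS.
        self.free_inos = list(range(first_ino, first_ino + count))
        self.mds_ops = 0

    def create_and_close(self, name):
        """Pick an inode locally, then report the finished file in one MDS op."""
        if not self.free_inos:
            raise RuntimeError("no preallocated inodes left; would ask the MDS for more")
        ino = self.free_inos.pop(0)
        self.mds_ops += 1  # single combined create+close message
        return ino

def mds_ops_without_preallocation(num_files):
    """Scheme described above: one create op plus one close op per file."""
    return 2 * num_files
```

For three files, the toy client issues three MDS ops where the scheme described above issues six; the sketch says nothing about how long each op takes.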

I should mention that the OSD write latency has minimal impact on these numbers; both the MDS and client file data writeback do not typically block forward progress while waiting for IOs to complete.  Using expensive hardware (NVRAM) for the storage will improve other aspects of performance (particularly when multiple clients are accessing the same files, and the MDS does wait for changes to hit the journal), but it won’t have much effect on single client workloads like this one.

posted by sage | No Comments | Tags: Updates

20 January 2009 - 10:41
POSIX file system test suite

The unstable client (with all of the async metadata changes) is passing the full POSIX file system test suite again (modulo the question of whether chown -1,-1 should be a no-op or update ctime).  We’re also surviving long dbench runs.  Progress!  I hope to push this all into the master branch after a bit more testing, do some benchmarking, and then do a new release.

I was happy to discover that the test suite has a real home now:

http://www.ntfs-3g.org/pjd-fstest.html

posted by sage | No Comments | Tags: Updates

18 August 2008 - 12:27
Snapshots are now in the ‘unstable’ branch

I’ve just merged the snapshot support into the unstable branch.  It’s not
completely finished yet (garbage collection and handling for a number of
corner cases are still missing), but provided you don’t actually
create/destroy any snapshots, things will behave as before.

I’m merging this now because there was relatively extensive surgery on the
MDS to include this support, and I’d like to shake out the resulting bugs
sooner rather than later.  This also precipitated a lot of cleanups and
bug fixes.

If you’d like to try out the snapshots, it’s pretty fun.  The main caveat
is that snapshots on the fs root directory don’t work right… it has to be
a subdirectory.  Something like so:

$ mount -t ceph 1.2.3.4:/ /ceph
$ mkdir /ceph/foo
$ touch /ceph/foo/asdf
$ mkdir /ceph/foo/.snap/my_first_snapshot
$ rm /ceph/foo/asdf
$ ls -al /ceph/foo/.snap/* /ceph/foo

‘rmdir’ will remove a snapshot (although disk space isn’t being reclaimed
yet).
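The same mkdir/rmdir convention can be driven from a script. This is a minimal Python sketch with hypothetical helper names; it assumes a Ceph mount that exposes the .snap directory as in the session above, and it will simply fail on any other file system.

```python
import os

def snapshot_path(directory, name):
    """Path of a named snapshot under the directory's .snap entry."""
    return os.path.join(directory, ".snap", name)

def create_snapshot(directory, name):
    """Equivalent of 'mkdir <directory>/.snap/<name>'."""
    path = snapshot_path(directory, name)
    os.mkdir(path)
    return path

def remove_snapshot(directory, name):
    """Equivalent of 'rmdir <directory>/.snap/<name>'."""
    os.rmdir(snapshot_path(directory, name))
```

For example, create_snapshot("/ceph/foo", "my_first_snapshot") mirrors the mkdir line in the session above.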

posted by sage | No Comments | Tags: Updates

16 June 2008 - 14:52
Recursive accounting

This is somewhat old news, but the recursive accounting changes have been merged into both the ‘unstable’ and ‘master’ branches, and the feature is documented in the wiki.

I’m extremely curious what people think of this feature (useful? confusing?).  It takes liberties with two common behaviors of directories: first, with the “rbytes” mount option, the directory size is suddenly related to the directory’s recursive contents, and may appear very large.  Second, doing “cat dir” will dump the directory’s full stats instead of returning -EISDIR (Is a directory).  I’m hoping the latter behavior change is harmless, given that until relatively recently reading a directory dumped the encoded directory contents to your terminal…
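For a sense of what a recursive directory size means, here is a small Python sketch that computes the same kind of number the slow way, by walking the tree from a client. It is only an approximation of the idea (the sum of the sizes of all files beneath a directory); the function names are made up, and the exact value Ceph reports may be defined differently.

```python
import os
import tempfile

def recursive_bytes(root):
    """Sum the sizes of all regular files and symlinks beneath root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            total += os.lstat(os.path.join(dirpath, filename)).st_size
    return total

def example():
    """Build a tiny tree (3 bytes + 5 bytes) and return its recursive size."""
    with tempfile.TemporaryDirectory() as root:
        os.mkdir(os.path.join(root, "sub"))
        with open(os.path.join(root, "a"), "wb") as f:
            f.write(b"abc")
        with open(os.path.join(root, "sub", "b"), "wb") as f:
            f.write(b"hello")
        return recursive_bytes(root)
```

The point of the feature is that the file system maintains this total itself, so a single stat of the directory can replace the full walk.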

posted by sage | No Comments | Tags: Updates

23 May 2008 - 14:49
Stable (‘master’) branch updated

I’ve just merged a bunch of recent changes into the ‘master’ branch in git.  The big items are

  • lots of kernel client fixes
  • improved stability of NFS re-export of a ceph client mount
  • xattrs
  • various OSD failure recovery fixes, and a corruption bug fix in EBOFS
  • a big cleanup of the userspace client code, to bring it in line with the kernel client implementation
  • endian and wordsize safety (freely mix x86 and x86_64, etc)

Big thanks go to Brent Nelson at UFL for his tireless testing and countless bug reports.

posted by sage | No Comments | Tags: Releases, Updates

7 May 2008 - 14:05
v0.2 Released

The kernel client is holding up well under the latest round of testing, so I’ve cut a v0.2 release and am sending an announcement to LKML and linux-fsdevel. Notable in this release:

  • fully functional and reasonably stable kernel client
  • NFS re-export of a ceph client mount
  • client metadata leases for strict client cache coherency and improved performance
  • crushtool for managing storage cluster topology
  • improved support for storage cluster expansion
  • new tools for mkfs and management
  • lots and lots of bug fixes

So far, the v0.3 todo list includes:

  • hardening OSD distributed failure recovery code
  • xattrs
  • large directory support (in client)
  • recursive mtime and file size (i.e., directory size is sum of sizes of all files and subdirectories)

Please grab the source and take a look.

posted by sage | No Comments | Tags: Releases, Updates

21 April 2008 - 8:30
Kernel client update

The Linux kernel client has stabilized to the point where you can untar and build a kernel source tree (and unmount it cleanly) without any problems.  Yay!

posted by sage | No Comments | Tags: Updates

7 April 2008 - 18:56
Well-behaved writeback

We reached an exciting milestone for the Ceph kernel client this week: file data writeback is working, and well-behaved.  In particular, the speed of a tar file extraction is limited primarily by MDS latency for file creation.  File data is written asynchronously to OSDs in nice big I/Os (based on the striping parameters… 4 MB objects by default).  File capabilities are released to the MDS only after all dirty data is written, and intervening operations (e.g. a file stat by another client) will properly pause other clients with the file open in order to return a correct result.
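To illustrate how file data lands in objects, here is a minimal Python sketch that maps a byte range of a file onto 4 MB objects. It assumes the simplest possible layout, where the file is cut into consecutive object-sized pieces; the real striping parameters allow more elaborate layouts, and the function names are made up.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # default object size mentioned above

def locate(offset, object_size=OBJECT_SIZE):
    """Return (object index, offset within that object) for a file offset."""
    return divmod(offset, object_size)

def split_write(offset, length, object_size=OBJECT_SIZE):
    """Split a file write into per-object (index, start, length) extents."""
    extents = []
    while length > 0:
        index, start = locate(offset, object_size)
        chunk = min(length, object_size - start)
        extents.append((index, start, chunk))
        offset += chunk
        length -= chunk
    return extents
```

Under this assumption, a 6 MB write starting at offset 3 MB touches three objects: the last 1 MB of object 0, all of object 1, and the first 1 MB of object 2.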

One of the nice side effects here is that some write operations can be performed safely without MDS interaction.  In the case of a tar extraction, the utimes() call that sets file mtime and atime simply updates the client’s local values since it still holds an exclusive capability for those inode fields; once data is flushed out to OSDs, the capabilities are released to the MDS along with the correct size, mtime, and atime values.

In any case, the kernel client is pretty stable and usable now!

posted by sage | No Comments | Tags: Updates