17 February 2010 - 14:30
v0.19 released

The v0.19 release is finally here.  The focus this past cycle was on stability and the disk format, and things have improved greatly in that area.  Our plan is to make any future disk format changes roll forward, so that users won’t need to rebuild their file systems.  The protocol has also grown feature bits so that it is at least possible to make protocol changes transparent; whether we do so or not will depend on the severity of the change and the cost of maintaining compatibility.
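
As a rough illustration of how feature bits can make a protocol change transparent, the sketch below has each peer advertise a bitmask of what it supports and what it requires, and the connection then uses the common subset. The feature names, bit positions, and the `negotiate` helper are made up for this example and are not the ones used in the Ceph code.

```python
# Hypothetical feature bits, for illustration only.
FEATURE_BASE = 1 << 0
FEATURE_AUTH_TICKETS = 1 << 1
FEATURE_NEW_ENCODING = 1 << 2


def negotiate(local_supported, local_required, peer_supported, peer_required):
    """Return the shared feature mask, or raise if a required bit is missing."""
    missing_on_peer = local_required & ~peer_supported
    missing_locally = peer_required & ~local_supported
    if missing_on_peer or missing_locally:
        raise ConnectionError(
            f"feature mismatch: peer lacks {missing_on_peer:#x}, "
            f"we lack {missing_locally:#x}"
        )
    return local_supported & peer_supported


old_client = FEATURE_BASE
new_server = FEATURE_BASE | FEATURE_AUTH_TICKETS | FEATURE_NEW_ENCODING

# The new server only requires BASE, so the old client can still connect
# and both sides fall back to the features they have in common.
common = negotiate(new_server, FEATURE_BASE, old_client, FEATURE_BASE)
```

Whether a new bit goes into the "required" mask is exactly the trade-off mentioned above: requiring it breaks old peers, while leaving it optional means carrying the old code path.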

Overall, things are looking good.  If you’ve been standing on the sidelines waiting for something more stable to test, now is a good time to try things out.  There are some lingering OSD performance problems (see below), and we are still a long way off from something we would recommend for use in a production environment, but otherwise this release is looking pretty good for evaluation purposes.

Changes since v0.18 include:

  • Stabilized disk format, with feature bits
  • Wire protocol feature bits
  • structure encoding versioning throughout
  • msgr: code simplification, cleanup, bug fixes
  • truncation fixes
  • debian: packaging improvements
  • rados: pool deletion, misc fixes
  • osd: recovery fixes, journaling fixes
  • lots of bug fixes (osd, mds, client)

On the kernel client side of things,

  • Support for Kerberos-like ‘cephx’ authentication
  • sync/directio read/write bug fixes (multiple client access to a single file)
  • writeback congestion control
  • mds ops interruptible (with control-c)
  • Lots of code cleanup
  • Lots of bug fixes

Notably, there are major revisions underway to the way the storage daemon cosd interacts with btrfs, and these are sufficiently intrusive and untested that they did not make it into this release.  They should be in v0.20.  That means that OSD performance is still not great in v0.19.  (So far performance with the new code is much, much better.)

The primary focus areas for v0.20 will be

  • OSD performance and btrfs interface changes
  • Clustered MDS

Enjoy!

posted by sage | No Comments | Tags: Releases

4 December 2009 - 10:52
v0.18 released

There’s a v0.18 release to match the latest posting of the kernel client code on the Linux email lists.  If there are no final issues there, that will be what I send to Linus for 2.6.33.

Most of the changes since v0.17 are bug fixes in the MDS and kclient.  The main other item is an authentication framework to restrict access to the cluster and its services to authorized clients.  Two protocols/schemes are implemented: an AUTH_NONE framework that does no real authentication (and is essentially equivalent to what we’ve had until now) and an AUTH_CEPHX scheme that uses Kerberos-like tickets to mutually authenticate clients and services.
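
To give a feel for what "Kerberos-like tickets" with mutual authentication means, here is a heavily simplified sketch. It is not the cephx protocol: the function names, the ticket layout, and the use of HMAC-SHA256 to derive a session key are all assumptions made for this example, and the step where the auth server hands the session key to the client (which would need to be protected by the client's own secret) is left out.

```python
import hashlib
import hmac
import os
import time


def _mac(key, *parts):
    """HMAC-SHA256 over length-prefixed parts."""
    h = hmac.new(key, digestmod=hashlib.sha256)
    for p in parts:
        h.update(len(p).to_bytes(4, "big") + p)
    return h.digest()


def issue_ticket(service_secret, client, ttl=3600.0):
    """Auth server side: return (ticket, session_key) for a client."""
    expires = str(int(time.time() + ttl)).encode()
    ticket = (client.encode(), expires, os.urandom(16))
    return ticket, _mac(service_secret, *ticket)


def session_key_from_ticket(service_secret, ticket):
    """Service side: recover the session key, rejecting expired tickets."""
    _client, expires, _nonce = ticket
    if int(expires) < time.time():
        raise PermissionError("ticket expired")
    return _mac(service_secret, *ticket)


def prove(session_key, role, challenge):
    """Answer a challenge to show knowledge of the session key."""
    return _mac(session_key, role, challenge)


service_secret = os.urandom(32)  # shared by auth server and service
ticket, client_key = issue_ticket(service_secret, "client.admin")
service_key = session_key_from_ticket(service_secret, ticket)

# Each side challenges the other, so authentication is mutual.
service_challenge, client_challenge = os.urandom(16), os.urandom(16)
client_proof = prove(client_key, b"client", service_challenge)
service_proof = prove(service_key, b"service", client_challenge)

client_ok = hmac.compare_digest(
    client_proof, prove(service_key, b"client", service_challenge))
service_ok = hmac.compare_digest(
    service_proof, prove(client_key, b"service", client_challenge))
```

The point of the ticket is that the service never has to contact the auth server: anything it needs to check the client is derivable from the ticket plus its own secret.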

Changes since v0.17 include:

  • osd: basic ENOSPC handling
  • big endian fixes
  • osd: improved object -> pg hash function; selectable
  • crush: selectable hash functions
  • mds restart bug fixes
  • kclient: mds reconnect bug fixes
  • fixed mds log trimming bug
  • fixed mds cap vs snap deadlock
  • filestore: faster flushing
  • uclient,kclient: snapshot fixes
  • mds: fixed recursive accounting bug
  • uclient: fixes for 32bit clients
  • auth: ‘none’ security framework
  • mon: safely bail on write errors (e.g. ENOSPC)
  • mds: fix replay/reconnect race (causing a fast client reconnect to fail)
  • mds: misc journal replay, session fixes

There is a known memory leak in the MDS in this release.  It should be fixed in the unstable git shortly.

Looking forward, the main items are:

  • stability
  • fixing a few pressing MDS performance issues
  • improving OSD interaction with btrfs (we may switch to using btrfs snapshots in place of the user transaction ioctls)
  • stability

posted by sage | No Comments | Tags: Releases

19 October 2009 - 15:12
v0.17 released

We’ve released v0.17.  This is mainly bug fixes and some monitor improvements.  Changes since v0.16 include:

  • kclient: fix >1 mds mdsmap decoding
  • kclient: fix mon subscription renewal
  • osdmap: fix encoding bug (and resulting kclient crash)
  • msgr: simplified policy, failure model, code
  • mon: less push, more pull
  • mon: clients maintain single monitor session, requests and replies are routed by the cluster
  • mon cluster expansion works (see Monitor cluster expansion)
  • osd: fix pgid parsing bug (broke restarts on clusters with 10 osds or more)

The other change with this release is that the kernel code is no longer bundled with the server code; it lives in a separate git tree.

posted by sage | No Comments | Tags: Releases

5 October 2009 - 15:12
v0.16 released

We’ve released v0.16.  The release primarily incorporates feedback on the Linux kernel client from LKML.  Changes since v0.15 include:

  • kclient: corrected inline abuse, use of __init, sockaddr_storage (IPv6 groundwork), and other feedback
  • kclient: xattr cleanups
  • kclient: fix invalidate lockup bug
  • kclient: fix msgr queue accounting lockup bug

Andrew Morton was nice enough to take some time to look at v0.15 and, “unless others emit convincing squeaks,” suggested I ask Stephen to include it in linux-next and send Linus a pull request for 2.6.33.  Yay!  With luck this will be the last version spammed to LKML in its entirety.

Meanwhile, Yehuda is continuing work on the security infrastructure to provide mutual trust between monitors, MDSs, OSDs, and clients, and Greg is working on some odds and ends (monitor cluster expansion, libceph/fuse/Hadoop client improvements).

P.S. I’d like to start building up-to-date RPMs as well.  If anyone wants to help get ceph.spec in sync with the debian packages, that would be great.

posted by sage | No Comments | Tags: Releases

22 September 2009 - 9:49
v0.15 released

We’ve released v0.15.  This is mostly cleanups for the kernel client and some work on the monitor interface.  Changes since v0.14 include:

  • kclient: message api fixups (simpler, more robust)
  • kclient: more message pools (avoiding ENOMEM)
  • kclient: new ioctl to extract object name and location/address, given a file handle and offset
  • kclient: fix with osd restart handling
  • msgr: internal interface improvements (session tracking)
  • monitor: interface/protocol cleanup, better session tracking
  • monclient: lots of fixes, improvement
  • debian: fixed permissions on headers in -dev packages; new radosgw package (S3 compatible REST interface to object store)

So nothing too groundbreaking feature-wise, mostly just bug fixes and internal code cleanups.  And the radosgw package, which lets you point existing applications using the S3 storage service at a Ceph object store.

posted by sage | No Comments | Tags: Releases

8 September 2009 - 14:39
v0.14 released

We’ve released v0.14.  Changes since v0.13 include:

  • Messenger library changes (client now initiates all tcp connections)
  • Improved client/monitor protocol
  • Working Hadoop and Hypertable file system modules (many associated libceph, uclient fixes)
  • man page fixes
  • Debian packages fixed (now libcrush, libcrush-dev, librados, librados-dev, libceph, libceph-dev all work)
  • Streamlined client startup (fewer messages, faster client id assignment)

The messaging changes are the big item here.  They greatly simplify the implementation for the kernel client.  The monitor interface is also improved: clients maintain an open session and ‘subscribe’ to map updates they want (generally, all MDS maps, and the next OSD map only when I/O stalls).  This also simplifies things on the monitor, and interestingly brings the monitor design somewhat closer to Zookeeper and CLD.
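
The subscription model can be pictured with a toy monitor like the one below. This is only a sketch of the idea: the `Monitor` and `Session` classes, the `onetime` flag, and the map names as dictionary keys are assumptions for the example, not the actual monitor interface.

```python
class Session:
    """A client's open session with the monitor."""

    def __init__(self):
        self.received = []  # (map name, epoch) pairs pushed by the monitor


class Monitor:
    """Toy monitor that pushes new map epochs to subscribed sessions."""

    def __init__(self):
        self.epochs = {"mdsmap": 0, "osdmap": 0}
        self.subs = {}  # (session, map name) -> (first wanted epoch, onetime)

    def subscribe(self, session, name, start, onetime=False):
        self.subs[(session, name)] = (start, onetime)
        self._deliver(session, name)

    def publish(self, name):
        """A new epoch of the named map has been committed."""
        self.epochs[name] += 1
        for session, sub_name in list(self.subs):
            if sub_name == name:
                self._deliver(session, name)

    def _deliver(self, session, name):
        start, onetime = self.subs[(session, name)]
        current = self.epochs[name]
        if current < start:
            return  # nothing new yet; the subscription stays pending
        session.received.append((name, current))
        if onetime:
            del self.subs[(session, name)]
        else:
            self.subs[(session, name)] = (current + 1, False)


mon = Monitor()
client = Session()
mon.subscribe(client, "mdsmap", start=1)                # every MDS map
mon.subscribe(client, "osdmap", start=1, onetime=True)  # only the next one
mon.publish("mdsmap")
mon.publish("osdmap")
mon.publish("osdmap")  # not delivered: the one-time subscription is gone
mon.publish("mdsmap")
```

After this sequence the client has seen both MDS map epochs but only the first new OSD map; it would re-subscribe the next time I/O stalls.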

We’re currently working on the security infrastructure (mutual authentication of clients, MDSs, OSDs, monitors), the Hadoop and Hypertable file system modules, and getting the kernel client in shape for a merge upstream.

posted by sage | No Comments | Tags: Releases

24 August 2009 - 9:55
v0.13 released

We’ve made a v0.13 release.  This mostly fixes bugs with v0.12 that have come up over the past couple weeks:

  • [ku]client: fix sync read vs eof, lseek(…, SEEK_END)
  • mds: misc bug fixes for multiclient file access

But also a few other big things:

  • osd: stay active during backlog generation
  • osdmap: override mappings (pg_temp)
  • kclient: some improvements in kmalloc, memory preallocation

The OSD changes mean that the storage cluster can temporarily delegate authority for a placement group to the node that has the complete data while an index is being generated for recovery (that can take a while). Once that’s ready, control will fall back to the new/correct node and the usual recovery will kick in.
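
The temporary delegation described above can be thought of as an override table that is consulted before the normally computed mapping. The sketch below is a minimal model of that idea; the `OSDMap` class, the placement group id "1.0", and the OSD numbers are hypothetical.

```python
class OSDMap:
    """Toy map: computed placement plus temporary per-PG overrides."""

    def __init__(self, computed):
        self.computed = dict(computed)  # pg id -> OSDs from normal placement
        self.pg_temp = {}               # pg id -> temporary override

    def acting(self, pgid):
        """OSDs currently serving the PG; an override wins if present."""
        return self.pg_temp.get(pgid, self.computed[pgid])


osdmap = OSDMap({"1.0": [3, 1]})  # osd3 is the new/correct primary

# osd1 still has the complete data, so it keeps serving the PG while the
# recovery index is being generated.
osdmap.pg_temp["1.0"] = [1, 3]
during = osdmap.acting("1.0")

# Once that is ready, drop the override; control falls back to the
# computed mapping and normal recovery proceeds.
del osdmap.pg_temp["1.0"]
after = osdmap.acting("1.0")
```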

The disk format and wire protocols have changed with this version.

We’re continuing to work on the security infrastructure… hopefully will be ready for v0.14.

posted by sage | No Comments | Tags: Releases

5 August 2009 - 14:39
v0.12 released

I’ve just tagged a v0.12 release, and sent the kernel client patchset off to the Linux kernel and fsdevel lists again.  There was a v0.11 a week ago as well that incorporated some earlier feedback from the kernel lists.

Changes since v0.11:

  • mapping_set_error on failed writepage
  • document correct debugfs mount point
  • simplify layout/striping ioctls
  • removed bad kmalloc in writepages
  • use mempools for writeback allocations where appropriate (*)
  • fixed a problem with capability, snap metadata writeback
  • cleaned up f(data)sync wrt metadata writeback
  • fixed a messenger bug causing random EBADF
  • some mds clustering fixes

And since v0.10:

  • server-specified max file size
  • kclient: simplified pr_debug macro
  • kclient: respond to control-c on mount
  • kclient: misc cleanups, fixes (LKML review)
  • mount updates /etc/mtab

Testing on our 100TB cluster is going well.  Planned items for v0.13 include:

  • improved availability of OSDs when cluster membership changes
  • client authentication
  • S3 compatible REST gateway for RADOS object store
  • Ceph file system module for Hadoop

* There are still some potential OOM situations during writeback from the messaging layer, but the fixes for that are planned for a bit later when it’s clear the messaging protocol isn’t going to change further.

posted by sage | No Comments | Tags: Releases

16 July 2009 - 9:40
v0.10 released

We’ve released v0.10.  The big items this time around:

  • kernel client: some cleanup, unaligned memory access fixes
  • much debugging of MDS recovery: the kernel client will now correctly untar and compile a kernel while the MDS server runs in a 60 second restart loop.
  • a few misc mds fixes
  • osd recovery fixes
  • userspace client: many bug fixes, now quite stable
  • librados improvements

Also,

  • libceph: a thin wrapper around the POSIXy ceph interface

which is being used to write a file system ‘Broker’ for the Hypertable distributed database project.  We’re also planning on (finally) getting the Hadoop ceph client in working order.

We’re also continuing to work on the librados object storage layer, including a standalone fastcgi-based gateway exposing an S3-compatible restful interface, the goal being a drop-in replacement for apps using S3. (It won’t let you use the rados snapshots or object classes, though, and won’t scale as efficiently.)

As far as testing goes, we’re filling up a 100TB cluster locally and will start failure testing on that shortly.  And this past week we’ve been thoroughly testing (single-node) MDS recovery.  Next up is looping OSD restarts and power cycling.

Major todo items coming up next:

  • client authentication
  • additional metadata to facilitate catastrophic rebuild of fs hierarchy
  • stabilize clustered mds

We’ve also sent the Linux kernel client code off to LKML and -fsdevel again, and are continuing to work toward a merge into the mainline kernel.

posted by sage | 1 Comment | Tags: Releases

19 May 2009 - 13:28
v0.8 released

Ceph v0.8 has been released.  Debian packages for amd64 and i386 have been built and there is a tarball, or you can pull the ‘master’ branch from Git.  This update has a lot of important protocol changes and corresponding performance improvements:

  • Client / MDS protocol simplification — faster, less fragile
  • Online adjustment of data and/or metadata replication
  • O_DIRECT support
  • Debug hooks moved from /proc to /debug (debugfs)
  • Faster xattrs
  • Faster readdir (client can cache the result)
  • Support for upcoming 2.6.30 kernel
  • Better error reporting on mount errors (permission, protocol version mismatches) or disk format mismatches
  • Lots and lots of bug fixes

Things have sped up significantly (single threaded dbench, for example, is almost twice as fast), and overall things are much less vulnerable to obscure race conditions.  MDS clustering is somewhat more stable (although still not stable enough to be recommended :).  Most of the bug fixes, though, are in the distributed object storage layer’s failure recovery and data migration code.  The next release is mostly going to focus on object storage.  We are cleaning up the interfaces and building a ‘librados’ (RADOS is the name for the object storage cluster) that provides a simple storage interface similar to S3.  More on that soon!

posted by sage | No Comments | Tags: Releases