19 October 2009 - 15:12
v0.17 released

We’ve released v0.17.  This is mainly bug fixes and some monitor improvements.  Changes since v0.16 include:

  • kclient: fix >1 mds mdsmap decoding
  • kclient: fix mon subscription renewal
  • osdmap: fix encoding bug (and resulting kclient crash)
  • msgr: simplified policy, failure model, code
  • mon: less push, more pull
  • mon: clients maintain single monitor session, requests and replies are routed by the cluster
  • mon cluster expansion works (see Monitor cluster expansion)
  • osd: fix pgid parsing bug (broke restarts on clusters with 10 osds or more)

The other change with this release is that the kernel code is no longer bundled with the server code; it lives in a separate git tree.

posted by sage | No Comments | Tags: Releases

16 October 2009 - 12:55
Kernel client git trees have moved

The kernel client git trees have moved to kernel.org.  The main line of development is in a kernel tree that contains the Ceph client:

 git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git

Generally speaking, the master branch will contain stable code that is ready to be pushed upstream, while the unstable branch has the bleeding edge (and may be rebased).

There is also a git tree containing just the Ceph module source.  It mirrors commits from the main tree (for fs/ceph/* only), so there is a useful history, and it also contains ‘backport’ branches that will build on older kernels.

git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client-standalone.git

The userspace server side code (ceph.git) hasn’t moved; it’s still at

git://ceph.newdream.net/ceph.git

Enjoy!

posted by sage | No Comments | Tags: Uncategorized

5 October 2009 - 15:12
v0.16 released

We’ve released v0.16.  The release primarily incorporates feedback on the Linux kernel client from LKML.  Changes since v0.15 include:

  • kclient: corrected inline abuse, use of __init, sockaddr_storage (IPv6 groundwork), and other feedback
  • kclient: xattr cleanups
  • kclient: fix invalidate lockup bug
  • kclient: fix msgr queue accounting lockup bug

Andrew Morton was nice enough to take some time to look at v0.15 and, “unless others emit convincing squeaks,” suggested I ask Stephen to include it in linux-next and send Linus a pull request for 2.6.33.  Yay!  With luck this will be the last version spammed to LKML in its entirety.

Meanwhile, Yehuda is continuing work on the security infrastructure to provide mutual trust between monitors, MDSs, OSDs, and clients, and Greg is working on some odds and ends (monitor cluster expansion, libceph/fuse/Hadoop client improvements).

P.S. I’d like to start building up-to-date RPMs as well.  If anyone wants to help get ceph.spec in sync with the Debian packages, that would be great.

posted by sage | No Comments | Tags: Releases

22 September 2009 - 10:08
Ceph talk at LCA2010

I’ll be giving a talk on Ceph at linux.conf.au 2010!  (Oddly enough, it’s in New Zealand this year, but I’m not complaining.)  I’ve heard great things about LCA, and am looking forward to being there.

The talk will cover two general areas: Ceph’s RADOS object storage architecture, including some of its data processing features, and the distributed file system that’s built on top of it.  The goal is to make it useful for administrators interested in a scalable file system, and developers working on cloud computing applications in need of a scalable storage and computing platform.

posted by sage | No Comments | Tags: Updates

22 September 2009 - 9:49
v0.15 released

We’ve released v0.15.  This is mostly cleanups for the kernel client and some work on the monitor interface.  Changes since v0.14 include:

  • kclient: message api fixups (simpler, more robust)
  • kclient: more message pools (avoiding ENOMEM)
  • kclient: new ioctl to extract object name and location/address, given a file handle and offset
  • kclient: fix for osd restart handling
  • msgr: internal interface improvements (session tracking)
  • monitor: interface/protocol cleanup, better session tracking
  • monclient: lots of fixes, improvements
  • debian: fixed permissions on headers in -dev packages; new radosgw package (S3 compatible REST interface to object store)

So nothing too groundbreaking feature-wise, mostly just bug fixes and internal code cleanups.  The exception is the radosgw package, which lets you point existing applications using the S3 storage service at a Ceph object store.

posted by sage | No Comments | Tags: Releases

8 September 2009 - 14:39
v0.14 released

We’ve released v0.14.  Changes since v0.13 include:

  • Messenger library changes (client now initiates all tcp connections)
  • Improved client/monitor protocol
  • Working Hadoop and Hypertable file system modules (many associated libceph, uclient fixes)
  • man page fixes
  • Debian packages fixed (now libcrush, libcrush-dev, librados, librados-dev, libceph, libceph-dev all work)
  • Streamlined client startup (fewer messages, faster client id assignment)

The messaging changes are the big item here.  They greatly simplify the implementation for the kernel client.  The monitor interface is also improved: clients maintain an open session and ‘subscribe’ to map updates they want (generally, all MDS maps, and the next OSD map only when I/O stalls).  This also simplifies things on the monitor, and interestingly brings the monitor design somewhat closer to Zookeeper and CLD.

We’re currently working on the security infrastructure (mutual authentication of clients, MDSs, OSDs, monitors), the Hadoop and Hypertable file system modules, and getting the kernel client in shape for a merge upstream.

posted by sage | No Comments | Tags: Releases

24 August 2009 - 9:55
v0.13 released

We’ve made a v0.13 release.  This mostly fixes bugs with v0.12 that have come up over the past couple weeks:

  • [ku]client: fix sync read vs eof, lseek(…, SEEK_END)
  • mds: misc bug fixes for multiclient file access

But also a few other big things:

  • osd: stay active during backlog generation
  • osdmap: override mappings (pg_temp)
  • kclient: some improvements in kmalloc, memory preallocation

The OSD changes mean that the storage cluster can temporarily delegate authority for a placement group to the node that has the complete data while an index is being generated for recovery (that can take a while). Once that’s ready, control will fall back to the new/correct node and the usual recovery will kick in.
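The override idea can be sketched as a small lookup that is consulted before the normal placement calculation. This is only a toy model under stated assumptions: the struct and function names are hypothetical, the override holds a single OSD rather than a full mapping, and the modulo in computed_primary is a stand-in for the real placement computation.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OVERRIDES 8

/* One temporary override: a placement group and the OSD that
 * currently holds its complete data (hypothetical layout). */
struct pg_override {
    unsigned pgid;
    int osd;
};

/* Toy map: cluster size plus a small table of overrides. */
struct toy_osdmap {
    int num_osds;
    struct pg_override pg_temp[MAX_OVERRIDES];
    size_t num_overrides;
};

/* Stand-in for the normal computed placement; a plain modulo,
 * NOT the real placement algorithm. */
int computed_primary(const struct toy_osdmap *m, unsigned pgid)
{
    assert(m->num_osds > 0);
    return (int)(pgid % (unsigned)m->num_osds);
}

/* An override, if present, wins over the computed placement. */
int acting_primary(const struct toy_osdmap *m, unsigned pgid)
{
    for (size_t i = 0; i < m->num_overrides; i++)
        if (m->pg_temp[i].pgid == pgid)
            return m->pg_temp[i].osd;
    return computed_primary(m, pgid);
}

/* Install or update an override; returns 0, or -1 if the table is full. */
int set_pg_temp(struct toy_osdmap *m, unsigned pgid, int osd)
{
    for (size_t i = 0; i < m->num_overrides; i++) {
        if (m->pg_temp[i].pgid == pgid) {
            m->pg_temp[i].osd = osd;
            return 0;
        }
    }
    if (m->num_overrides == MAX_OVERRIDES)
        return -1;
    m->pg_temp[m->num_overrides].pgid = pgid;
    m->pg_temp[m->num_overrides].osd = osd;
    m->num_overrides++;
    return 0;
}

/* Drop the override so control falls back to the computed node. */
void clear_pg_temp(struct toy_osdmap *m, unsigned pgid)
{
    for (size_t i = 0; i < m->num_overrides; i++) {
        if (m->pg_temp[i].pgid == pgid) {
            m->pg_temp[i] = m->pg_temp[--m->num_overrides];
            return;
        }
    }
}
```

In this model, setting an override for a placement group redirects acting_primary to the node with the complete data, and clearing it hands control back to whatever the computed placement says.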

The disk format and wire protocols have changed with this version.

We’re continuing to work on the security infrastructure… hopefully will be ready for v0.14.

posted by sage | No Comments | Tags: Releases

5 August 2009 - 14:39
v0.12 released

I’ve just tagged a v0.12 release, and sent the kernel client patchset off to the Linux kernel and fsdevel lists again.  There was a v0.11 a week ago as well that incorporated some earlier feedback from the kernel lists.

Changes since v0.11:

  • mapping_set_error on failed writepage
  • document correct debugfs mount point
  • simplify layout/striping ioctls
  • removed bad kmalloc in writepages
  • use mempools for writeback allocations where appropriate (*)
  • fixed a problem with capability, snap metadata writeback
  • cleaned up f(data)sync wrt metadata writeback
  • fixed a messenger bug causing random EBADF
  • some mds clustering fixes

And since v0.10:

  • server-specified max file size
  • kclient: simplified pr_debug macro
  • kclient: respond to control-c on mount
  • kclient: misc cleanups, fixes (LKML review)
  • mount updates /etc/mtab

Testing on our 100TB cluster is going well.  Planned items for v0.13 include:

  • improved availability of OSDs when cluster membership changes
  • client authentication
  • S3 compatible REST gateway for RADOS object store
  • Ceph file system module for Hadoop

* There are still some potential OOM situations during writeback from the messaging layer, but the fixes for that are planned for a bit later when it’s clear the messaging protocol isn’t going to change further.

posted by sage | No Comments | Tags: Releases

16 July 2009 - 9:40
v0.10 released

We’ve released v0.10.  The big items this time around:

  • kernel client: some cleanup, unaligned memory access fixes
  • much debugging of MDS recovery: kernel client will now correctly untar and compile a kernel with the MDS server running in a 60-second restart loop.
  • a few misc mds fixes
  • osd recovery fixes
  • userspace client: many bug fixes, now quite stable
  • librados improvements

Also,

  • libceph: a thin wrapper around the POSIXy ceph interface

which is being used to write a file system ‘Broker’ for the Hypertable distributed database project.  We’re also planning on (finally) getting the Hadoop ceph client in working order.

We’re also continuing to work on the librados object storage layer, including a standalone fastcgi-based gateway exposing an S3-compatible RESTful interface, the goal being a drop-in replacement for apps using S3. (It won’t let you use the rados snapshots or object classes, though, and won’t scale as efficiently.)

As far as testing goes, we’re filling up a 100TB cluster locally and will start failure testing on that shortly.  And this past week we’ve been thoroughly testing (single-node) MDS recovery.  Next up is looping OSD restarts and power cycling.

Major todo items coming up next:

  • client authentication
  • additional metadata to facilitate catastrophic rebuild of fs hierarchy
  • stabilize clustered mds

We’ve also sent the Linux kernel client code off to LKML and -fsdevel again, and are continuing to work toward a merge into the mainline kernel.

posted by sage | 1 Comment | Tags: Releases

6 June 2009 - 14:36
RADOS snapshots

Some interesting issues came up when we started considering how to expose the RADOS snapshot functionality to librados users.  The object store exposes a pretty low-level interface to control when objects are cloned (i.e. when an object snapshot is taken via the btrfs copy-on-write ioctls).  The basic design in Ceph is that the client provides a “SnapContext” with each write operation that indicates which snapshots logically exist for the given object; if the version already stored by the OSD is older than the newest snapshot in the SnapContext, a clone is created before the write is applied.  It is the Ceph MDS’s responsibility to keep track of which snapshots apply to which objects (remember, Ceph lets you snapshot any subdirectory) and to do all the synchronization that ensures mounted clients have up-to-date SnapContexts.
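The clone-before-write rule can be sketched in a few lines of C. This is a toy model, not the OSD implementation: the struct layouts, the field names, and the use of a plain int as object data are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CLONES 8

typedef uint64_t snapid_t;

/* Snapshots that logically exist for an object, as supplied by the
 * writer with each write (hypothetical layout). */
struct snap_context {
    const snapid_t *snaps;   /* existing snapshot ids, newest first */
    size_t num_snaps;
};

/* Toy stored object: current data plus any preserved clones. */
struct toy_object {
    int data;                  /* current contents */
    snapid_t written_at_snap;  /* newest snapshot known at the last write */
    int clones[MAX_CLONES];    /* preserved older contents */
    size_t num_clones;
};

/* Newest snapshot in the context, or 0 if no snapshots exist. */
snapid_t newest_snap(const struct snap_context *snapc)
{
    return snapc->num_snaps > 0 ? snapc->snaps[0] : 0;
}

/* The stored version is older than the newest snapshot, so its
 * contents belong to that snapshot and must be preserved. */
int needs_clone(const struct toy_object *obj, const struct snap_context *snapc)
{
    return obj->written_at_snap < newest_snap(snapc);
}

/* Apply a write, cloning first when required.
 * Returns 0 on success, -1 if no room is left for another clone. */
int toy_write(struct toy_object *obj, const struct snap_context *snapc,
              int new_data)
{
    assert(obj != NULL && snapc != NULL);
    if (needs_clone(obj, snapc)) {
        if (obj->num_clones == MAX_CLONES)
            return -1;
        obj->clones[obj->num_clones++] = obj->data;
        obj->written_at_snap = newest_snap(snapc);
    }
    obj->data = new_data;
    return 0;
}
```

Writing twice with the same context produces only one clone: the first write after a snapshot preserves the old contents, and later writes simply overwrite the current version.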

In creating a raw object storage interface, how is that underlying functionality best exposed?  One option is to expose some functions that allow users to create, manipulate, and possibly store SnapContexts, and manually specify a context for each write (or a snapshot id to read).  This exposes the same functionality Ceph makes use of, but essentially drops all of the issues with synchronization and storage in the librados user’s lap.  How should one go about keeping multiple processes accessing the RADOS store in sync (i.e. agreeing on which snapshots exist) to get the semantics people want?

Our solution is to introduce some basic snapshot accounting to RADOS.  We allow per-pool snapshots to be created via RADOS itself, and include that snap information in the OSDMap (the global data structure used to synchronize the activities of OSDs and clients).  If a client performs a write and does not manually specify a SnapContext (as Ceph does), an appropriate context will be generated from the pool snapshot information in the OSDMap.
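As a rough sketch of that fallback, the choice of context for a write might look like the following. The types and the context_for_write helper are hypothetical and exist only to illustrate the rule; they are not part of librados.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SNAPS 16

typedef uint64_t snapid_t;

/* Context attached to a write (hypothetical layout). */
struct snap_context {
    snapid_t snaps[MAX_SNAPS];   /* newest first */
    size_t num_snaps;
};

/* Per-pool snapshot info as it might be carried in the map. */
struct toy_pool_info {
    snapid_t snaps[MAX_SNAPS];   /* creation order, oldest first */
    size_t num_snaps;
};

/* Use the caller's context if one was given (as the Ceph file system
 * does); otherwise derive one from the pool's snapshot list. */
struct snap_context context_for_write(const struct toy_pool_info *pool,
                                      const struct snap_context *explicit_ctx)
{
    struct snap_context out = { .num_snaps = 0 };

    if (explicit_ctx != NULL)
        return *explicit_ctx;

    assert(pool->num_snaps <= MAX_SNAPS);
    /* Reverse the creation-ordered list so the newest comes first. */
    for (size_t i = 0; i < pool->num_snaps; i++)
        out.snaps[i] = pool->snaps[pool->num_snaps - 1 - i];
    out.num_snaps = pool->num_snaps;
    return out;
}
```

A writer that passes NULL gets whatever snapshots the pool currently lists, so every such client agrees on the context as soon as it has the same map.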

Snapshot creation is done via the monitor, either via a librados API call or an administrator command like ‘ceph osd pool mksnap poolname snapname’.  This updates the OSDMap to include the new snap for that pool, and that map propagates across the cluster.

int rados_snap_create(rados_pool_t pool, const char *snapname);
int rados_snap_remove(rados_pool_t pool, const char *snapname);
int rados_snap_list(rados_pool_t pool, rados_snapid_t *snaps, int maxlen);
int rados_snap_get_name(rados_pool_t pool, rados_snapid_t id, char *name, int maxlen);

To read an existing snapshot, a new RADOS pool context is opened and a specific snapshot id is selected (the id can be obtained via rados_snap_list above):

rados_pool_t snapped_pool;
rados_open_pool("data", &snapped_pool);   /* open a separate handle on the "data" pool */
rados_set_snap(snapped_pool, 2);          /* position this handle at snapid 2 */

Subsequent reads via the snapped_pool handle will return data from snapid 2, and any attempts to write will return -EROFS (Read-only file system).  Reading and writing via other rados_pool_t handles will be unaffected.  By default any newly opened pool handle will be “positioned” at the “head” (the current, writeable version of the object pool).

Map propagation is fast, but not synchronous: it is possible for one client to create a snapshot and for another client to then perform a write that does not preserve some data in the new snap.  So we do not completely solve the synchronization problem for you to create a global, ‘instantaneous’ point-in-time snapshot.  Doing so in a large distributed environment with many clients and many servers, operating in parallel, is a challenge in any system.

From the perspective of the client creating the snapshot, however, the snapshot is ordered with respect to IO performed before and after rados_snap_create.   RADOS already does some synchronization with respect to OSDMap updates to ensure that readers, writers and OSDs all agree on the current state of a placement group when performing IO.  Any IO initiated after the snapshot is created will be tagged with the new OSDMap version, and any OSD will make sure it has either the same or a newer version of the map before performing that IO.  Other clients will not see a clear ordering unless the librados user takes steps to coordinate clients such that they all obtain the updated OSDMap (describing the new snapshot) before performing new IO.
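The map-version check described here can be modelled very simply. The sketch below assumes map versions are plain increasing integers and uses hypothetical names throughout; it only illustrates the "same or newer map before performing IO" rule, not the actual OSD code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t epoch_t;

/* Toy OSD: only tracks the newest map version it has seen. */
struct toy_osd {
    epoch_t map_epoch;
};

/* Toy request: tagged with the map version the client held
 * when it issued the IO. */
struct toy_request {
    epoch_t client_epoch;
};

enum io_decision { IO_PROCEED, IO_WAIT_FOR_MAP };

/* The OSD performs the IO only if its map is at least as new as the
 * one the request was tagged with; otherwise it must catch up first. */
enum io_decision osd_check_epoch(const struct toy_osd *osd,
                                 const struct toy_request *req)
{
    assert(osd != NULL && req != NULL);
    return osd->map_epoch >= req->client_epoch ? IO_PROCEED
                                               : IO_WAIT_FOR_MAP;
}

/* Maps only move forward; an older map is ignored. */
void osd_receive_map(struct toy_osd *osd, epoch_t epoch)
{
    if (epoch > osd->map_epoch)
        osd->map_epoch = epoch;
}
```

Because the client that created the snapshot tags its later IO with the new map version, any OSD handling that IO is forced to learn about the snapshot first. A different client still holding the old version gets no such guarantee, which is the gap described above.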

If there is demand, we may still expose an API to manipulate raw SnapContexts for advanced users wanting different snapshot schedules for different objects.  It will be their responsibility to manage all client synchronization in that case, as that snapshot information won’t be propagated via the OSDMap.

For anybody wanting perfect cluster-wide point-in-time snapshots without any client coordination… well, sorry.  Experience with file system snapshots has shown that proper synchronization is never something that the storage system alone can get right due to caching at all layers of the system.  NFS client write-back caches make server-based snapshots (e.g., NetApp filers) imperfect.  Snapshots in local file systems utilize some kernel machinery to momentarily quiesce all IO while the snapshot is created, but even applications may not have the on-disk files (as seen by the OS) in a consistent state.  Coordination with applications is always necessary for any fully ‘correct’ solution, so we won’t try to solve the whole problem based on some false sense of what ‘correct’ is.

posted by sage | No Comments | Tags: Dev notes, RADOS