16 October 2009 - 12:55
Kernel client git trees have moved

The kernel client git trees have moved to kernel.org.  The main line of development is in a kernel tree that contains the Ceph client:

git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git

Generally speaking, the master branch will contain stable code that is ready to be pushed upstream, while the unstable branch has the bleeding edge (and may be rebased).

There is also a git tree containing just the Ceph module source.  It mirrors commits from the main tree (for fs/ceph/* only), so there is a useful history, and it also contains ‘backport’ branches that will build on older kernels.

git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client-standalone.git

The userspace server side code (ceph.git) hasn’t moved; it’s still at

git://ceph.newdream.net/ceph.git

Enjoy!

posted by sage | No Comments | Tags: Uncategorized

24 February 2009 - 14:16
Debian packages

I’ve built some Debian packages for both the userspace daemons and the kernel module source.  Trying things out is now as simple as adding a few lines to your apt sources file and doing an apt-get install!  More info in the wiki.

posted by sage | 4 Comments | Tags: Uncategorized

16 December 2008 - 10:52
Scrubbing

The last month has seen a lot of work on the storage cluster: fixing recovery-related bugs, improving threading, and working out a mechanism for online scrubbing.  In this case, scrubbing is basically a low-level fsck of the object storage layer.  For each placement group (PG) being scrubbed, the primary and any replica nodes generate a catalog of all objects in the PG and compare them to ensure that no objects are missing or mismatched (currently we check size and attributes; soon, we’ll pull the checksums out of btrfs to ensure the object contents match too).  Assuming the replicas all match up, one OSD does a final semantic sweep to ensure that all of the snapshot-related object metadata is consistent.  Errors are reported to a (new) central system log.
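The catalog comparison described above can be sketched roughly as follows. This is an illustrative model only, not the OSD code: the `ObjectInfo` structure, the `compare_catalogs` function, and the object names are all hypothetical, and a catalog is assumed to be a plain mapping from object name to its size and attributes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectInfo:
    """Per-object facts a scrub compares (hypothetical structure)."""
    size: int
    attrs: tuple  # sorted (key, value) pairs


def compare_catalogs(primary, replica):
    """Return a list of inconsistencies between two catalogs of one PG.

    Each catalog maps object name -> ObjectInfo for one copy of the PG.
    """
    errors = []
    # Walk the union of names so objects missing on either side are seen.
    for name in sorted(primary.keys() | replica.keys()):
        if name not in replica:
            errors.append(f"{name}: missing on replica")
        elif name not in primary:
            errors.append(f"{name}: missing on primary")
        elif primary[name] != replica[name]:
            errors.append(f"{name}: size/attr mismatch")
    return errors


# Made-up example data: one object differs in size, one is missing.
primary = {
    "obj.a": ObjectInfo(4096, (("owner", "sage"),)),
    "obj.b": ObjectInfo(100, ()),
    "obj.c": ObjectInfo(7, ()),
}
replica = {
    "obj.a": ObjectInfo(4096, (("owner", "sage"),)),
    "obj.b": ObjectInfo(99, ()),
}

errors = compare_catalogs(primary, replica)
for line in errors:
    print(line)
```

With more than one replica, the same comparison would simply be repeated for each replica's catalog against the primary's.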

An administrator can tell the system to scrub the entire storage cluster, a single OSD, or a single placement group.   Eventually, we’ll probably want to have the system automatically schedule a slow background scrub when the system is idle.

This is only one piece of the overall ‘fsck’ problem: the file system metadata is more complicated and also needs to be verified.

posted by sage | No Comments | Tags: Uncategorized

24 July 2008 - 9:19
Snapshot progress

If things seem a bit slow lately, it’s because I’ve been primarily working
on implementing the snapshot mechanism for the last few weeks.  This is
coming along pretty well: I can take snapshots and access snapshotted
content.  The interaction with recursive accounting has been tricky
because delayed propagation means changes may propagate into a recent
snapshot as they work their way up the hierarchy, but I think I have
that one nailed.

Here’s how it works:

$ tar jxf ~/src/linux-2.6.24.tar.bz2 &
[1] 18715
$ mkdir linux-2.6.24/.snap/1   # create a few snapshots
$ mkdir linux-2.6.24/.snap/2
$ mkdir linux-2.6.24/.snap/3
$ kill %1
$ ls -al linux-2.6.24/.snap    # see that dir sizes increased over time
total 3
drwxr-xr-x 1 sage sage 1205808 Jul 24 10:23 ./
drwxr-xr-x 1 sage sage 1205808 Jul 24 10:23 ../   # live copy
drwxr-xr-x 1 sage sage 1028511 Jul 24 10:23 1/
drwxr-xr-x 1 sage sage 1144455 Jul 24 10:23 2/
drwxr-xr-x 1 sage sage 1177913 Jul 24 10:23 3/
[1]+  Terminated              tar jxf ~/src/linux-2.6.24.tar.bz2
$ ls linux-2.6.24/.snap/1/Documentation/ | wc
23      24     472
$ ls linux-2.6.24/.snap/3/Documentation/ | wc
32      33     680

Etc.  The ‘.snap’ hidden dir is accessible from anywhere (like .snapshot
on a Netapp).  Unlike on a Netapp, however, snapshots can be created for
any directory at any time, and they apply recursively to all nested content.

Still left to do:

  • properly handle directory renames (which interact in interesting ways with the snapshot realm tree).
  • snapshot deletion
  • garbage collection (metadata and data)
  • update kernel client (I’m currently working just with the fuse client for faster prototyping)

posted by sage | No Comments | Tags: Uncategorized

10 July 2008 - 16:42
Next up: snapshots!

One of the last intrusive additions I have planned is a flexible snapshot mechanism.  I haven’t been able to figure out how to map writeable snapshots onto the current object and metadata storage model, unfortunately, so it’ll be read-only snapshots for now.  Ceph snapshots will be significantly more flexible than what you find with WAFL or ZFS, though.  The goal is to get behavior like:

$ cd any/random/directory
$ ls .snapshot
$ mkdir .snapshot/foo      # create a snapshot
$ ls .snapshot
foo
$ cd a/deeper/dir
$ ls .snapshot
foo
$ mkdir .snapshot/bar      # create another one
$ ls .snapshot
foo    bar
$

That is, users can create snapshots, from a standard shell, for any subtree of the directory hierarchy.  (In contrast, most proprietary vendors’ snapshots are for entire volumes only, while ZFS can only snapshot predefined subvolumes.)  And snapshots will be visible via a hidden .snapshot (or similar) directory from any directory.  Something similarly convenient (rmdir?) will be used to delete snapshots from the command line. The naming will be a bit more complicated than in the above example to avoid name collisions, but that is the basic idea.
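The visibility rule in the session above can be modeled in a few lines. This is a sketch under one assumption: a snapshot is visible from the directory it was taken on and from everything nested beneath it, but not from directories above it. The function name `visible_snapshots` and the list-of-pairs representation are made up for illustration.

```python
from pathlib import PurePosixPath


def visible_snapshots(directory, snapshots):
    """Names of the snapshots visible from `directory`.

    `snapshots` is a list of (directory_taken_on, name) pairs in creation
    order. A snapshot is assumed visible from the directory it was taken
    on and from any directory nested beneath it.
    """
    here = PurePosixPath(directory)
    visible = []
    for taken_on, name in snapshots:
        root = PurePosixPath(taken_on)
        if here == root or root in here.parents:
            visible.append(name)
    return visible


# The two snapshots created in the shell session above.
snaps = [
    ("any/random/directory", "foo"),
    ("any/random/directory/a/deeper/dir", "bar"),
]

print(visible_snapshots("any/random/directory/a/deeper/dir", snaps))
print(visible_snapshots("any/random/directory", snaps))
```

Under this model the deeper directory sees both foo and bar, matching the final `ls .snapshot` in the session, while the directory where foo was taken sees only foo.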

posted by sage | No Comments | Tags: Uncategorized

23 May 2008 - 14:39
Recursive accounting for size, ctime, etc.

Watch the size on the directories:

$ mkdir foo ; cd foo
$ mkdir -p dir1/subdir
$ mkdir -p dir2
$ echo 123456789 > dir1/file.10
$ echo 12345678901234 > dir1/file.15
$ echo 1234 > dir1/subdir/file.5
$ echo 12345678901234567890123456789 > dir2/file.30
$ find . -ls
1 drwxr-xr-x   1 sage     sage           60 May 23 15:23 .
1 drwxr-xr-x   1 sage     sage           30 May 23 15:23 ./dir1
1 -rw-r--r--   1 sage     sage           10 May 23 15:23 ./dir1/file.10
1 -rw-r--r--   1 sage     sage           15 May 23 15:23 ./dir1/file.15
1 drwxr-xr-x   1 sage     sage            5 May 23 15:24 ./dir1/subdir
1 -rw-r--r--   1 sage     sage            5 May 23 15:24 ./dir1/subdir/file.5
1 drwxr-xr-x   1 sage     sage           30 May 23 15:24 ./dir2
1 -rw-r--r--   1 sage     sage           30 May 23 15:24 ./dir2/file.30

The client is configured to give the “nested” size when we stat() the directory, telling you the sum of all file sizes within and nested beneath a directory. This is basically arbitrary granularity directory-based quota accounting.
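To make the arithmetic in the listing concrete, here is a small sketch that recomputes the nested sizes from the file sizes alone. It is a toy model, not how the MDS maintains the totals: the directory tree is a nested dict invented for illustration, and directories themselves are assumed to contribute zero bytes.

```python
def nested_size(tree):
    """Sum of all file sizes within and nested beneath a directory.

    `tree` maps each entry name to either an int (a file's size in
    bytes) or another dict (a subdirectory). Directories themselves
    add nothing to the total.
    """
    total = 0
    for entry in tree.values():
        if isinstance(entry, dict):
            total += nested_size(entry)  # recurse into the subdirectory
        else:
            total += entry
    return total


# File sizes taken from the `find . -ls` output above.
foo = {
    "dir1": {
        "file.10": 10,
        "file.15": 15,
        "subdir": {"file.5": 5},
    },
    "dir2": {"file.30": 30},
}

print(nested_size(foo))                      # size shown for .
print(nested_size(foo["dir1"]))              # size shown for ./dir1
print(nested_size(foo["dir1"]["subdir"]))    # size shown for ./dir1/subdir
print(nested_size(foo["dir2"]))              # size shown for ./dir2
```

The results line up with the directory sizes in the listing: 60 for the top level, 30 for dir1, 5 for dir1/subdir, and 30 for dir2.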

Internally, the MDS is actually doing recursive accounting for:

  • file size (bytes)
  • ctime (consider backup software scanning the FS for changes)
  • file count
  • directory count

A few notes and caveats (of course):

  • There may be some delay before the recursive stats propagate up the hierarchy, particularly when the hierarchy spans multiple metadata servers. The plan is to ensure that times are pushed up at least once a minute or so. In general, though, stats are pushed up as long as there are no conflicting locks.
  • Directories don’t currently contribute any “bytes” to the total size.  It should probably be some estimate of the amount of disk space used storing the directory’s metadata.
  • The directories’ i_blocks are not recursively defined, so ‘du’ will still work (although you probably won’t want to use it).
  • This is a summation of raw file i_sizes, not blocks used. That means sparse files “appear” larger than they are.
  • Ceph internally distinguishes between multiple links to the same file. Only the first (“primary”) link to each file is counted recursively.

posted by sage | No Comments | Tags: Uncategorized