24 March 2008 - 15:42
Client cache coherency

I’ve been shying away from the question of how to keep the client metadata cache consistent for ages now, on the assumption that it would significantly complicate both the client/MDS protocol and the MDS itself. Zach’s progress on CRFS got me thinking about it again, though, and I realized the other night that most of the complex parts have already been dealt with in the course of making replication across the MDS cluster and client capabilities on file data work:

  • Most MDS code is already written in terms of a generic lock framework that is allowed to block
  • Client session timeout infrastructure (for coping with dead or unresponsive clients) is already there for dealing with file I/O capabilities

The client was already maintaining and checking per-object TTL values based on a lame fixed-timeout caching scheme. All that was really needed was some additional flags in MDS reply messages to grant leases, an additional object in the MDS to track client replicas of metadata objects, and a simple lease revocation/release message handler. I was pleasantly surprised to have things basically working after only a few hours of coding.
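To make the moving parts concrete, here is a rough Python sketch of the idea, not the actual Ceph code. The class names, method names, and the 30-second lease length are all made up for illustration: the MDS grants a lease with each reply, remembers which clients hold a replica, and sends a revocation before the object changes; the client trusts a cached entry only until its lease expires.

```python
class Client:
    """Caches metadata only while it holds an unexpired lease (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.cache = {}  # object id -> (value, lease expiry time)

    def store(self, obj, value, expires):
        # Called when an MDS reply carries a lease grant.
        self.cache[obj] = (value, expires)

    def lookup(self, obj, now):
        # Return the cached value, or None if there is no valid lease.
        entry = self.cache.get(obj)
        if entry is None or now >= entry[1]:
            self.cache.pop(obj, None)
            return None
        return entry[0]

    def handle_revoke(self, obj):
        # Lease revocation handler: drop the cached copy.
        self.cache.pop(obj, None)


class MDS:
    """Tracks which clients hold leased replicas of each object (hypothetical)."""

    def __init__(self, lease_seconds=30.0):
        self.lease_seconds = lease_seconds  # assumed value, for illustration
        self.metadata = {}                  # object id -> value
        self.replicas = {}                  # object id -> {client name: (client, expiry)}

    def read(self, client, obj, now):
        # Reply to a read and grant a lease along with it.
        expires = now + self.lease_seconds
        self.replicas.setdefault(obj, {})[client.name] = (client, expires)
        value = self.metadata[obj]
        client.store(obj, value, expires)
        return value

    def write(self, obj, value, now):
        # Revoke every outstanding lease before changing the object.
        for client, expires in self.replicas.pop(obj, {}).values():
            if now < expires:
                client.handle_revoke(obj)
        self.metadata[obj] = value
```

A real MDS would block the write until each client acknowledges the revocation or its session times out; the sketch skips that and revokes synchronously.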

The protocol and data structures will initially only support relatively simple combinations of leases. For example, different inode fields are protected by different locks (e.g. uid/gid/mode versus file size), and leases on fields can be granted or revoked independently, but they will share a single lease interval. Really, though, this will capture the bulk of the performance benefit for most workloads, and it keeps things (relatively) simple on the MDS.
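One way to picture "independent fields, shared interval" is a bitmask plus a single expiry time per inode. This is only a sketch under that assumption; the mask names and the `InodeLease` class are invented here and are not the real data structures.

```python
# Hypothetical field groups, one bit per lock.
LEASE_AUTH = 0x1  # uid/gid/mode
LEASE_SIZE = 0x2  # file size


class InodeLease:
    """Per-inode lease state: a mask of leased fields and one shared expiry."""

    def __init__(self):
        self.mask = 0
        self.expires = 0.0

    def grant(self, fields, now, duration):
        # Fields are added independently, but the interval is shared,
        # so a new grant resets the expiry for everything held.
        self.mask |= fields
        self.expires = now + duration

    def revoke(self, fields):
        # Fields can be taken away without touching the others.
        self.mask &= ~fields

    def covers(self, fields, now):
        # True only if every requested field is leased and unexpired.
        return now < self.expires and (self.mask & fields) == fields
```

For example, after granting both groups and then revoking `LEASE_SIZE`, `covers(LEASE_AUTH, now)` still holds until the shared expiry, while `covers(LEASE_SIZE, now)` does not.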

Eventually, I’d like to apply the file capabilities to directories to allow clients to create files and write to them asynchronously from the MDS (by essentially preallocating unused inode numbers). I haven’t thought it all through yet, but I think this will provide most of the missing infrastructure to make something like that work…

posted by sage | No Comments | Tags: Dev notes

18 March 2008 - 20:34
File system creation and scaling

I’ve spent the last week or so revamping the whole “mkfs” process and some of the machinery needed to adjust data distribution when the size of the cluster changes by an order of magnitude or more. The basic problem is that data is distributed in a two-step process: objects are first statically mapped into one of many “placement groups” (PGs), and PGs then move somewhat dynamically between storage nodes as storage is added to or removed from the system, disks fail, and so forth.
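The two-step mapping can be sketched roughly as below. This is an illustration only: the function names are made up, and the PG-to-node step uses plain rendezvous (highest-random-weight) hashing as a stand-in, which is not the placement algorithm the system actually uses.

```python
import hashlib


def _hash(text):
    # Stable 64-bit hash so results do not vary between runs.
    return int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")


def object_to_pg(object_name, pg_count):
    # Step 1: static mapping. It depends only on the object name and the
    # number of PGs, never on which storage nodes currently exist.
    return _hash(object_name) % pg_count


def pg_to_nodes(pg, nodes, replicas=2):
    # Step 2: dynamic mapping. Rank the current nodes by a per-(PG, node)
    # hash and take the top few, so a PG only moves when a node it uses
    # goes away or a newly added node outranks one of its current nodes.
    ranked = sorted(nodes, key=lambda node: _hash(f"{pg}:{node}"), reverse=True)
    return ranked[:replicas]
```

The catch hinted at above is visible in step 1: because `pg_count` is baked into the object-to-PG hash, growing the cluster by an order of magnitude means either living with too few PGs or changing `pg_count` and remapping objects.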

posted by sage | No Comments | Tags: Dev notes

13 March 2008 - 10:58
Google Summer of Code

We’re applying for the Google Summer of Code this year. Some project ideas are posted in the wiki, although of course it is by no means an exhaustive list. From what I’ve read, some of the most successful GSoC projects are proposed by the students. That said, there are lots of interesting projects on the list that could use some love.

Update: We weren’t accepted. I had no idea it was so competitive to be selected as a GSoC org! If we apply again next year, I’ll definitely spend more time on the all-important “ideas” page (more detailed descriptions of suggested project areas, references to additional information, skills required, level of difficulty, etc.).

posted by sage | 3 Comments | Tags: Updates

1 March 2008 - 22:08
Blog!

I’m setting up this blog as a way to let people keep tabs on development progress. I’ve been periodically sending updates to the ceph-devel list, but I don’t think that’s the ideal venue for announcements, and it seems unlikely they’ll be dug out of the mailman archives by many people. So, here goes!

posted by sage | 4 Comments | Tags: Updates