From Fedora Project Wiki


Project Description

Btrfs is a new, actively developed file system with various advanced features. I propose to implement a content-based storage mode for the Btrfs file system. This project is also listed in the TODO list on the Btrfs ideas page.

In some applications, such as Internet content caches, the data is read-only more often than not. For such cases, lookup time is the most important metric, and storing data in a conventional file-path-based manner is inefficient. In content-based storage mode, data is stored on disk solely on the basis of the hash of its content. Lookup is also hash-based, and therefore fast. Another advantage of hash-based storage is that identical content is stored only once, so duplicate data is avoided.
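
To make the idea concrete, here is a minimal user-space sketch of hash-keyed storage. It is not Btrfs code. The names (cas_store, cas_put, cas_get), the fixed slot count, and the use of 64-bit FNV-1a in place of a cryptographic hash are all my own assumptions for illustration. It shows that storing the same content twice yields the same key and only one stored copy.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CAS_SLOTS 64 /* hypothetical fixed capacity, for illustration only */

struct cas_entry {
    uint64_t hash; /* content hash, used as the lookup key */
    void *data;    /* private copy of the content */
    size_t len;
    int used;
};

struct cas_store {
    struct cas_entry slots[CAS_SLOTS];
    size_t count; /* number of distinct blobs stored */
};

/* 64-bit FNV-1a: a stand-in for a real cryptographic hash. */
static uint64_t cas_hash(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint64_t h = 0xcbf29ce484222325ULL;

    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

/* Linear probing: return the slot holding `hash`, or the first free slot
 * on its probe path. Returns NULL if the table is full. */
static struct cas_entry *cas_find(struct cas_store *s, uint64_t hash)
{
    for (size_t i = 0; i < CAS_SLOTS; i++) {
        struct cas_entry *e = &s->slots[(hash + i) % CAS_SLOTS];

        if (!e->used || e->hash == hash)
            return e;
    }
    return NULL;
}

/* Store content under its hash. Content that is already present is not
 * copied again. Simplification: equal hashes are treated as equal content. */
static int cas_put(struct cas_store *s, const void *buf, size_t len,
                   uint64_t *out_hash)
{
    assert(buf != NULL && out_hash != NULL);

    uint64_t h = cas_hash(buf, len);
    struct cas_entry *e = cas_find(s, h);

    if (!e)
        return -1; /* table full */
    if (!e->used) {
        e->data = malloc(len ? len : 1);
        if (!e->data)
            return -1;
        memcpy(e->data, buf, len);
        e->len = len;
        e->hash = h;
        e->used = 1;
        s->count++;
    }
    *out_hash = h;
    return 0;
}

/* Look content up by hash alone; no path is involved. */
static const void *cas_get(struct cas_store *s, uint64_t hash, size_t *len)
{
    struct cas_entry *e = cas_find(s, hash);

    if (!e || !e->used)
        return NULL;
    *len = e->len;
    return e->data;
}
```

A caller would zero-initialise a struct cas_store, call cas_put() for each blob, and later fetch a blob with cas_get() using only the returned hash. A real implementation would need a collision-resistant hash and on-disk trees rather than an in-memory table.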

My research at CMU aims at building content caches for routers (https://github.com/harshadjs/xia-content-cache). This work requires a file system that supports such a storage mode. I think it would serve the interests of both the BTRFS community and the research at CMU if I could work on this project over the summer.

Biography and Technical Background

I am a Computer Science Graduate student at Carnegie Mellon University with research interest primarily in Computer Networks. I use Linux daily and am passionate about Open source software development.

In my undergraduate years, I worked on an open-source Linux kernel project, "Snapshots for Ext4 filesystem". Patches were sent to the Ext4 community for review. My contribution to the project was mentioned at http://lwn.net/Articles/442078/ .

We were interested in extending the Ext4 snapshots project, and so I participated in Google Summer of Code 2011. My proposal for "Snapshot revert feature for Ext4" was accepted by The Fedora Project, and I successfully completed the project. I look forward to continuing my association with the Fedora Project by submitting the proposal "Content-storage mode for BTRFS" for the year 2015.

I worked for a Wi-Fi technology startup, "AirTight Networks", for 3 years (2011-2014) in the Linux device drivers team.

I then joined Carnegie Mellon University in May 2014, where my main area of studies is Computer Networks.

You can expect a very high level of fluency with C and Kernel programming from me. This is something that I love to do.

Goals

  • 75% Goal
    • Create a new "Content" tree. This tree should store hashes of all the extents in the file system.
    • Create a "File Hash" tree. This tree should store the mapping from the hash of a file to its inode.
    • Provide an option to enable / disable content-storage mode at mount time or mkfs time (TBD).
    • Implement all the reference counting mechanisms for extents in this content-tree.
  • 100% Goal
    • Intercept writes and check if the data that is being written is already in the content tree.
    • Intercept reads
      • Given the hash of a file, look up its inode in the "File Hash" tree.
    • Enhance debugging methods available in btrfs (I am not sure which ones are available) to support debugging content-trees.
  • 125% Goal
    • Provide various mount-time configuration options, such as:
      • Remove or don't remove extents if the reference count becomes 0. (Especially useful for our routing application.)
      • Verify or trust the checksum of extents.
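
The reference-counting goal and the two mount-time options above could interact roughly as in the following user-space sketch. Everything here is hypothetical: cs_policy, cs_extent, and the function names are illustrative and do not correspond to existing Btrfs structures or mount options. It only shows the intended behaviour: an extent is released when its count reaches 0 unless the "keep" policy is set, and a read either verifies the hash or trusts it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mount-time policy; field names are illustrative only. */
struct cs_policy {
    bool keep_unreferenced; /* keep extents whose reference count hits 0 */
    bool verify_checksum;   /* verify the hash on read instead of trusting it */
};

/* Hypothetical content-tree entry for one extent. */
struct cs_extent {
    uint64_t hash; /* hash of the extent's content */
    uint32_t refs; /* number of files referencing this extent */
    bool live;     /* false once the extent's space has been released */
};

/* A write whose data already exists in the content tree takes a reference
 * on the existing extent instead of allocating a new one. */
static void cs_extent_get(struct cs_extent *e)
{
    e->refs++;
}

/* Drop one reference. Returns true if the extent was released. */
static bool cs_extent_put(struct cs_extent *e, const struct cs_policy *p)
{
    assert(e->refs > 0);
    e->refs--;
    if (e->refs == 0 && !p->keep_unreferenced) {
        e->live = false;
        return true;
    }
    return false;
}

/* On read: accept the data if we trust stored hashes, or if the hash
 * computed from the data read matches the stored one. */
static bool cs_extent_read_ok(const struct cs_extent *e,
                              uint64_t computed_hash,
                              const struct cs_policy *p)
{
    if (!p->verify_checksum)
        return true;
    return e->hash == computed_hash;
}
```

With keep_unreferenced set, an extent with no remaining references stays available for later hash lookups, which is the behaviour wanted for the content-cache use case.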

Milestones of the Project

  • M1: Understand the design and code of Btrfs, focusing on how the current extent trees, subvolume trees, and snapshot trees are initially set up. Study the on-disk data structures; we will most likely need to add some bits to the superblock, for example a "content-storage-mode on/off" flag.
  • M2: Understand and identify the code areas wherein the hooks are to be applied. Need to find hooks for:
    • Intercepting writes
    • Reading extents
    • Debugging interfaces
  • M3: Write a detailed design draft covering the overall goal, the required on-disk changes, and the functions to be modified. Share the draft with the BTRFS community and get their feedback.
  • M4: Implementation and testing of the code: 75%
  • M5: Implementation and testing of the code: 100%
  • M6: Implementation and testing of the code: 125% (If time permits)
  • M7: Write documentation of the final product

Plan of action

  • By the end of week 1: M1, M2
  • By the end of week 2: M3
  • (Midterm) By the end of week 5: M4
  • By the end of week 7: M5
  • By the end of week 9: M6
  • (End) By the end of week 10: M7

Why choose me?

  • Past successful GSoC student (2011).
  • Past experience of working with the open source community.
  • Strong understanding of file systems, C programming language, the UNIX philosophy, Linux.
  • Passionate about contributing to Linux.

Time commitment

Apart from this project, I have research commitments at CMU, so I expect to spend at least 30 hours per week on this project. My final exams end on 13 May 2015, and I hope to start right after that. I will be visiting my hometown (Pune, India) around the end of May or the first week of June; that is the only period when my availability may be reduced. For the rest of the summer, I will be fully focused on the project.