Showing posts with label RaleighFS. Show all posts

Saturday, July 30, 2011

RaleighFS to enter the in-memory key-value store market

A couple of guys asked me about RaleighFS and why it is called a File-System instead of a Database. The answer is that the project started back in 2005 as a simple Linux Kernel File-System, and has since evolved into something different.

Abstract Storage Layer
I like to say that RaleighFS is an Abstract Storage Layer, because its main components are designed to be pluggable. For example, the namespace can be flat or hierarchical, and the other objects don't notice the difference.
  • Stores multiple Objects with different types (HashTable, SkipList, Tree, Extents, Bitmap, ...)
  • Each Object has its own on-disk format (Log, B*Tree, ...).
  • Observable Objects - get notified when something changes.
  • Flexible Namespace & Semantics for Objects.
  • Support for various plain-text & binary protocols (Memcache, ...)

A New Beginning...
A few weeks ago I decided to rewrite and refactor some of the code, stabilize the API and, this time, try to bring the file-system and the network layer to a stable release.
First Steps are:
  • Release a functional network layer as soon as I can.
  • Provide a pluggable protocol interface.
  • Implement memcache-compatible and other protocols.
So, these first steps are all about networking, and unfortunately this means dropping the sync part and keeping just the in-memory code (the file-system flushes on memory pressure).

Current Status:
Starting from today, some code is available on github under raleighfs project.
  • src/zcl contains the abstraction classes and some tools used by every piece of code.
  • src/raleighfs-core contains the file-system core module.
  • src/raleighfs-plugins contains all the file-system's pluggable objects and semantic layers.
  • src/raleigh-server currently contains the entry point to run a memcache-compatible server (memcapable, text protocol) and a redis get/set interface server. The in-memory storage is confined to engine.{h,c} and is currently based on a chained HashTable, a Skip List or a Binary Tree.

How it Works
As I said before, the entry point is the ioloop, which allows clients to interact with the file-system's objects through a specified protocol. Each "protocol handler" parses its own format, converts it to the file-system's one, and enqueues the request in a RequestQ that dispatches it to the file-system. When the file-system has the answer, it pushes the response into the RequestQ and the client is notified. The inverse process applies on the way back: the file-system's response is converted into the client's protocol.


Having the RequestQ has a couple of advantages: you can easily wrap a new protocol to communicate with the file-system, and the RequestQ can dispatch requests to different servers. Another advantage is that the RequestQ can operate as a Read-Write Lock for each object, allowing the file-system to use fewer locks...


For more information ask me at theo.bertozzi (at) gmail.com.

Friday, August 13, 2010

Self Contained C Data Structures and Utils Repository

I've rewritten a couple of data structures and utils (C lang) that I use for my RaleighFS Project, and I've released them at github under the BSD License.

At the moment there are some data structures and helper functions:
  • Memory Pool - Simple Memory Allocator.
  • Memory Utils - memcpy(), memset(), memcmp(), memswap() with some variants.
  • Memory Block - copy-on-write block.
  • Hashed Containers: Hashtable, Set, Bag.
  • Queue - FIFO/Circular Queue.
  • Cache - LRU/MRU Cache.
  • WorkQueue - Lightweight pthread workqueue.
  • Sort/Merge functions.

Every "object" is self-contained: you can grab just the "hashtable.c/hashtable.h" files and be ready to use the hashtable, no other dependency is required. And if a data structure requires a malloc(), you can specify your own allocator during object initialization.

Saturday, July 10, 2010

RaleighFS is moving to Flash Disks

No, this is not going to happen for a couple of years; I love spinning disks, because there the file-system can really make a difference in terms of performance.

But… this week I've talked a bit with my colleague batt, who is working on some new features of his BattFS, available on BeRTOS. And I've discovered that flash disks are challenging too, especially on embedded devices where you cannot use large caches due to RAM limitations.

What changes, on flash, compared to a traditional file-system? First of all, due to the limited rewrite cycles of a block, you cannot modify a block in place (an Append-Only data structure is needed). An Append-Only index is not a problem, but traditional file-systems rely heavily on a super-block at a fixed location that contains the root pointer, and an append-only tree's root changes every time, so the super-block needs to be rewritten.

The first solution that RaleighFS can quickly implement is to write a new super-block on each flush. In this way each flush (or transaction) can be easily identified. But how can I reach the last super-block, and which one is the last?



The worst case is to scan the whole disk looking for the super-block "magic" number in the block header.

But what we can do is add some information to the super-block:
  • Super-Block Id
  • Next Super-Block Id
  • Next Super-Block Location

In this way, on each flush we "predict" the location of the next super-block, and we give it a unique id just to be sure to identify it. This way we can avoid scanning the whole disk: we scan just a few MB (the flush size) to find a super-block, then easily jump from super-block to super-block until the next-location field no longer points to a valid one.

Block allocation is just "round-robin" allocation; in this way we're able to keep "every" block's erase cycles aligned, and this solves the super-block location problem.



What about the tree? The tree is the usual B*Tree of RaleighFS with a few changes:
  • A modified leaf node is allocated/written in a new place.
  • Root and twig nodes are rewritten only if keys change.

Wait, wait: how can we reach the new leaf node? To avoid large twig rewrites we keep a small BTree (high fanout) that maps old leaf block numbers to the new ones. Free blocks can be stored on disk, or better, thanks to the 0 seek time we can just walk all the tree pointers to see what's allocated.

I'm not really convinced about the map BTree, but it seems to solve the "large update" problem of value modification in the key/value storage case… This is just a quick "port" of the current RaleighFS to SSDs, but maybe something really different could be used...

Saturday, January 30, 2010

Directories and File-System, RaleighFS Way!

When you talk about a new file-system, one of the first questions you will hear is "Will it suffer from directory unbalancing?" or "How does it perform with large directories?".

Ext3 seems to have spread the fear of large directories, but what is a directory? And what do large directories mean?

When you design a file-system you have to decide who your target/end-user is. Don't try to design your file-system for everyone, but just for one.

So, what's your focus? Being the fastest file-system at continuous 'ls' (directory traversal)?

My idea, and RaleighFS's idea, is that the file-system is just a "big Hashtable" with some useful information attached: you ask for an item 'x' and you receive its data and metadata. Following this idea, what is a directory? It's just a name-container that allows the user to keep track of his items. At this point, what is the best way to represent a directory?

A directory is just another item in the FS, like your mp3s, C source files, and all the other stuff.


When you're a high-level developer, you sometimes forget how the low level works, in this case the file-system. The FS works in blocks: when you remove something from the beginning/middle of a file you shift back all the rest and rewrite the entire file. To avoid this you can intelligently pack your data based on the "block data size" of the file-system. (The block data size is not the block size: the file-system may add its own block header to your data.)

The above picture shows the RaleighFS Directory Block data structure. There's a small block header that says how much space is free in this block, and then there are the directory items, which are simply 2-field structs: 1 byte for the name size and N chars for the name.
block_size = 8192 bytes
block_head = 12 bytes
dir_head = 2 bytes
avg_dir_item_size = 1 byte + 15 bytes (avg name size) = 16 bytes
(block_size - block_head - dir_head) / avg_dir_item_size = 8178 / 16 ≈ 511 entries in one block

In a few blocks you can handle a large number of files, and remove operations are really quick: just find your item, move back only its "block" neighbours and increase the Directory Header's free space. This way you don't have to rewrite the whole item, just one or a few of its blocks.

Remember: in RaleighFS a directory is just a name-container that allows the user to keep track of his items (ls /home/), and on disk a directory is just a file, so there's no unbalancing. Every item is looked up in O(1) time, like in traditional Hashtables.