
Saturday, January 30, 2010

Directories and File-System, RaleighFS Way!

When you talk about a new file-system, one of the first questions you will hear is "Will it suffer from directory unbalancing?" or "How does it perform with large directories?"

Ext3 seems to have spread the fear of large directories. But what is a directory? And what do "large directories" mean?

When you design a file-system you have to decide who your target end-user is. Don't try to design your file-system for everyone; design it for just one.

So, what's your focus? Being the fastest file-system at continuous 'ls' (directory traversal)?

My idea, and the RaleighFS idea, is that a file-system is just a "big hashtable" with some useful extra information: you ask for an item 'x' and you receive its data and metadata. Following this idea, what is a directory? It's just a name-container that allows the user to keep track of his items. At this point, what is the best way to represent a directory?
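
To make this concrete, here's a minimal sketch of what a lookup-oriented interface might look like (the names and signatures are illustrative assumptions, not the actual RaleighFS API):

/* Illustrative sketch only: fs_t, fs_item_t and fs_lookup() are
 * hypothetical names, not the real RaleighFS interface. The point is
 * that a lookup takes a key (the item name) and returns data plus
 * metadata, exactly like a hashtable get.
 */
typedef struct fs fs_t;

typedef struct {
    const void *data;       /* Item data */
    unsigned int size;      /* Data size in bytes */
    unsigned int mtime;     /* Example metadata: last modification time */
} fs_item_t;

int fs_lookup (fs_t *fs, const char *name, fs_item_t *item);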

A directory is just another item in the FS, like your mp3s, C source files and all the other stuff.


When you're a high-level developer you sometimes forget how the low level works; in this case, the file-system. The FS is made of blocks: when you remove something from the beginning or middle of a file, you have to shift back all the rest and rewrite the entire file. To avoid this, you can pack your data intelligently based on the "block data size" of the file-system. (The block data size is not the block size: the file-system may add its own block header to your data.)

The picture above shows the RaleighFS directory block data structure. There's a small block head that says how much space is free in the block, followed by directory items, each a simple 2-field struct: 1 byte for the name size and N chars for the name.
block_size        = 8192 bytes
block_head        =   12 bytes
dir_head          =    2 bytes
avg_dir_item_size = 1 byte + 15 bytes (avg name size) = 16 bytes

(block_size - block_head - dir_head) / avg_dir_item_size
    = (8192 - 12 - 2) / 16
    = 511 entries in one block
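
In C, a sketch of that on-disk layout could look like this (the field names are my guesses from the description above, not the actual RaleighFS source):

#include <stdint.h>

/* Directory block head: just the free-space counter described above
 * (2 bytes, matching dir_head in the calculation).
 */
struct dir_head {
    uint16_t free;              /* Free bytes left in this block */
};

/* Directory item: 1-byte name size followed by the name chars.
 * Items are packed one after another inside the block.
 */
struct dir_item {
    uint8_t name_size;          /* Length of the name */
    char    name[];             /* N chars, not NUL-terminated */
};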

In a few blocks you can handle a large number of files, and remove operations are really quick: just find your item, shift back only its neighbors within the same block, and update the free-space count in the directory head. This way you don't have to rewrite the whole item, just one or a few of its blocks.
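
For example, removing an entry is little more than a memmove() inside a single block. Here's a sketch under the layout assumed above (my code, not the RaleighFS source):

#include <string.h>
#include <stdint.h>

#define BLOCK_DATA_SIZE     (8192 - 12)     /* block_size - block_head */

struct dir_head { uint16_t free; };         /* as sketched above */

/* Remove one packed entry from a directory block: shift back only the
 * entries that follow it, then record the regained space in the head.
 * 'data' points to the first entry (just after the dir head), 'item'
 * points to the entry to remove, 'item_size' is its total size.
 */
static void dir_block_remove (struct dir_head *head, uint8_t *data,
                              uint8_t *item, uint16_t item_size)
{
    uint8_t *end = data + (BLOCK_DATA_SIZE - sizeof(struct dir_head) - head->free);

    memmove(item, item + item_size, (size_t)(end - (item + item_size)));
    head->free += item_size;
}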

Remember: in RaleighFS a directory is just a name-container that allows the user to keep track of his items (ls /home/), and on disk a directory is just a file, so there's no unbalancing. Every item is looked up in O(1) time, like in a traditional hashtable.

Saturday, January 9, 2010

Base Raleigh Library is out

Over the last few days I've rewritten parts of my Raleigh Library, which I use for my work-in-progress file-system (RaleighFS). The library provides the basic abstraction mechanisms for reading/writing data from devices, client-server communication, threading, and various data structures like hashtables, sets, in-memory streams (chunks), red-black trees, B*-trees and other things.

I've published the base source code, under the BSD license, at Google Code (http://code.google.com/p/raleigh). You can download it using Mercurial:

hg clone https://raleigh.googlecode.com/hg/ raleigh


Usage is very simple, and the source code contains test cases and examples that show how to use all the "classes". For more information feel free to send me a mail.

Saturday, September 5, 2009

File-System and Data Block Back Reference

While I'm thinking about how to improve my file-system block cache algorithm (and waiting for suggestions), I've decided to apply some changes to the Raleigh File-System format (the source code is not published yet).

Following the ideas of Valerie Aurora's "Repair-Driven File System Design", I've decided to add to each block (B*Tree and data blocks) a head that contains a magic number and a CRC sum of the block. In this way you can easily identify what kind of block you've picked up without scanning all the metadata. Another step is to add a back reference (or back pointer) to each data block, so you can easily jump back to its extent block (and obviously to its OID): you can easily tell which object owns the block, and you can swap two blocks by reading at most 4 blocks (2 data and 2 extents).
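
A sketch of what such a block head and back reference might look like (the layout and field names are my assumptions; the real on-disk format isn't published):

#include <stdint.h>

/* Hypothetical block head: lets a repair tool identify a block on its
 * own, without scanning all the metadata.
 */
struct block_head {
    uint32_t magic;             /* Block type (B*Tree node, data, ...) */
    uint32_t crc;               /* CRC sum of the block contents */
};

/* Hypothetical back reference stored in a data block: points back to
 * the owning extent block (and through it to the object's OID).
 */
struct block_backref {
    uint64_t extent_blocknr;    /* Block number of the owning extent */
    uint16_t extent_index;      /* Position of this block in the extent */
};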

Another idea stolen from Valerie is to double the metadata blocks with a COW-like approach, as explained in the paper "Double the Metadata, Double the Fun: A COW-like Approach to File System Consistency"; really useful for personal file-systems, but maybe less so for a distributed file-system. I'm working on adding it as an mkfs-only option.

When will the source code be online? I don't know... I have little time to work on it. Maybe at the end of this year I'll publish the file-system and the distributed system (explained a few posts ago).

Tuesday, August 25, 2009

Block Cache Algorithm

I need to replace my old file-system block cache algorithm with something newer and more efficient. The old one is based on an LRU/LFU algorithm: there's a queue of cached blocks and a hashtable to speed up block lookup.


struct blkcache_buf {
    struct blkcache_buf *next;          /* Next Queue Item */
    struct blkcache_buf *prev;          /* Prev Queue Item */
    struct blkcache_buf *hash;          /* Next Item with the same hash */

    xuint16_t count;                    /* Retain count */
    xxx_block_t block;                  /* Cached Block */
};

typedef struct {
    struct blkcache_buf **buf_hash;     /* Bufs Hashtable */
    xuint16_t buf_hash_size;            /* Bufs Hashtable Size */
    xuint16_t buf_used;                 /* Bufs in use */

    struct blkcache_buf *head;          /* Head of the Bufs Queue */
    struct blkcache_buf *tail;          /* Tail of the Bufs Queue */

    xxx_device_t *device;               /* Block Device used for I/O */
} xxx_blkcache_t;


Above you can see the cache data structure, and below the core of the cache algorithm.


#define BUFHASH(cache, blocknr)     ((blocknr) % (cache)->buf_hash_size)

xxx_block_t *xxx_blkcache_read (xxx_blkcache_t *cache,
                                xxx_blkptr_t blocknr)
{
    struct blkcache_buf *buf;
    xuint16_t hash_index;

    /* Scan the hash chain for block */
    hash_index = BUFHASH(cache, blocknr);
    if ((buf = __blkcache_find(cache, blocknr, hash_index)) != NULL) {
        buf->count++;

        /* Move Buf far from head */
        __blkcache_buf_shift(cache, buf);

        return(&(buf->block));
    }

    /* Cache is Full, Remove one Item */
    if ((cache->buf_used + 1) > cache->buf_hash_size) {
        /* All buffers are in use */
        if (cache->head->count > 0)
            return(NULL);

        /* Remove Least-Frequently Used */
        buf = __blkcache_remove_lfu(cache);
        cache->buf_used--;
    }

    /* Desired block is not on the available chain, read it! */
    if ((buf = __blkcache_buf_alloc(cache, buf, blocknr)) == NULL)
        return(NULL);

    /* Add one use, block cannot be removed */
    buf->count++;

    /* Go get the requested block unless searching or prefetching */
    __blkcache_readwrite(cache, buf, RFALSE);

    /* Update Cache Hash */
    cache->buf_used++;
    buf->hash = cache->buf_hash[hash_index];
    cache->buf_hash[hash_index] = buf;

    /* Update Cache Queue */
    if (cache->head == NULL) {
        cache->head = cache->tail = buf;
    } else {
        buf->prev = cache->tail;
        cache->tail->next = buf;
        cache->tail = buf;
    }

    return(&(buf->block));
}
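
A usage sketch (my example: the release call is hypothetical, since the post only shows the read path, but the retain count implies one exists):

xxx_block_t *block;

/* Read block 42 through the cache; the read retains the buffer */
if ((block = xxx_blkcache_read(cache, 42)) != NULL) {
    /* ... use the cached block ... */

    /* A matching release (hypothetical, not shown above) would
     * decrement buf->count so the buffer can be evicted again.
     */
    /* xxx_blkcache_release(cache, block); */
}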


You can download a demo implementation here: lru-block-cache.c. I'm waiting for ideas or suggestions to improve (or radically change) the cache algorithm. Thanks in advance!