
Friday, April 20, 2012

Embedded Store, under the hood...

This week I've found an interesting bug that can be summarized this way: the user has no idea of what happens under the hood, and his usage always goes against your design.

To give you more context, the bug was related to embedded storage systems: something like bsddb, sqlite, or just a simple B+Tree or an on-disk HashTable.

So, how is an embedded storage system designed?
As you can see from the simplified schema on the right (a minimal sketch of this layering follows the list):
  • The lowest level is raw access to the on-disk data structure (e.g. B+Tree, HashTable, ...), so each request goes directly to disk.
  • On top of that, to speed things up, you add a cache to avoid fetching data from disk all the time.
  • Everything is packed into a nice API that provides some sort of get and put functionality, at maximum speed thanks to the cache.
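In Python and with made-up names, that layering could look like this (it's not the real library, only an illustration; a dict stands in for the on-disk structure):

class DiskStore:
    """Lowest level: every read/write goes straight to the on-disk structure."""
    def __init__(self, path):
        self._data = {}          # stand-in for a real B+Tree/HashTable on disk
    def read(self, key):
        return self._data.get(key)
    def write(self, key, value):
        self._data[key] = value

class EmbeddedStore:
    """Public API: get/put served from a cache, falling back to disk."""
    def __init__(self, path):
        self._disk = DiskStore(path)
        self._cache = {}
    def get(self, key):
        if key not in self._cache:          # cache miss: hit the disk
            self._cache[key] = self._disk.read(key)
        return self._cache[key]
    def put(self, key, value):
        self._cache[key] = value            # update the cache...
        self._disk.write(key, value)        # ...and the on-disk structure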

Everything seems fine. You can build an application on top of your library, capable of handling tons of requests without touching the disk thanks to the cache layer, and you can even think of building a service to query your storage from a remote host. And here the problems begin.


Your first super happy user arrives and decides to build his super scalable infrastructure with your embedded storage.
...and what is the easy way to get a super scalable service? Obviously, adding some threads. But threads are not a problem, because the user has learned the lesson and knows that he should not use shared variables. So the brilliant solution is that each thread has its own instance of the "storage object"; to be clear, each thread does something like db.open("super-storage.db").

Everything seems fine... but after a couple of days the user starts crying: sometimes data is missing, the logs contain strange "page not found" messages, parts of the store are corrupted, and so on...

Can you spot the problem? Yes, it's the cache...
No instance is aware of the changes made to the file by the others: every thread uses its own cache, and the first request for a block that is not in cache ends up creating some bad state.

So the solution for the user is to use the library as it was designed: just one thread/process/whatever accessing the store file.

But if you want to slow down your super cool library and make the user happy, you can always add an ID to the super-block; every time the user requests something, you fetch the super-block from disk, compare it with the one in cache, and if they differ you just invalidate all the caches...
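A minimal sketch of that check, again in Python and with made-up field names (a real super-block obviously contains much more than a generation id):

import os, struct

SB_GEN = struct.Struct(">Q")     # hypothetical: 8-byte generation id at offset 0

class EmbeddedStore:
    def __init__(self, path):
        self._fd = os.open(path, os.O_RDWR)
        self._cache = {}
        self._cached_gen = self._read_generation()

    def _read_generation(self):
        return SB_GEN.unpack(os.pread(self._fd, SB_GEN.size, 0))[0]

    def get(self, key):
        # Re-read the super-block id on every request: if another
        # thread/process bumped it, our cache may be stale, so drop it.
        gen = self._read_generation()
        if gen != self._cached_gen:
            self._cache.clear()
            self._cached_gen = gen
        ...   # normal cache lookup / disk fetch goes here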

Sunday, February 26, 2012

Thoughts on bucket flushing, block size and compression...

This post is just a collection of thoughts on how to store data. I hope to get some feedback and new ideas from you guys, thanks!

Since I started working on B*Trees, I've always assumed a fixed block size. With this "restriction" & "simplification", you can easily come up with a block in this format:
Keys are packed together at the beginning of the block and values are packed together at the end of the block, growing toward the center. In this way you don't have to move data when a new key is added, or move keys & values when a value grows.

Assuming that your inserts in the block are already sorted (e.g. flushing a "memstore" to disk), you can even compress keys & values with different algorithms.
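A minimal sketch of that layout (fixed-size block, keys growing from the front, values from the back; compression and the per-entry offset table are left out):

class FixedBlock:
    def __init__(self, size=8192):
        self._buf = bytearray(size)
        self._key_end = 0         # first free byte from the front
        self._val_start = size    # first used byte from the back

    def free_space(self):
        return self._val_start - self._key_end

    def add(self, key, value):
        """Append a key at the front and its value at the back."""
        if len(key) + len(value) > self.free_space():
            return False          # block is full: flush it and start a new one
        self._buf[self._key_end:self._key_end + len(key)] = key
        self._key_end += len(key)
        self._val_start -= len(value)
        self._buf[self._val_start:self._val_start + len(value)] = value
        return True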

...but with the fixed block size and values starting from the end, you need to allocate a full block.

In contrast, you can "stream" your data and flush it just a few KB at a time. In this case you don't have to allocate a full block, but you lose the ability to use different compression for keys & values, and the ability to keep only the keys in memory without doing memcpys.

Another possibility is to traverse the data twice, writing the keys the first time and the values the second time. In this case you gain the per-type compression feature, but if you're using compression you're not able to stop at a specified block-size threshold, because you don't know how much space each key & value takes.

Any comments, thoughts, ideas?

Sunday, February 5, 2012

AesFS - Encrypted File-System layer

Last week I've spent a couple of hours playing with OpenSSL and CommonCrypto, and the result is a tiny layer on top of the file-system to encrypt/decrypt files using AES.

The source code is available on my github at misc-common/aesfs. To build it, just run 'python build.py'; the results are placed in ./build/aesfs/ and ./build/aespack/. AesPack is a command-line utility to pack and unpack single files, while aesfs is a FUSE file-system.

You can run aesfs with:
aesfs -o root=/my/plaindir /mnt/encrypted
Since AesFS sits on top of your file-system, you have to specify (with -o root) the directory where you want to store the files, while /mnt/encrypted is the mount point where you can read/write your files in the clear.

Files are written in blocks of 4K (8 bytes of header, 16 extra bytes for AES, and 4072 bytes of data). Check block.h and block.c for more information.
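Just to make the arithmetic explicit (these are only the sizes mentioned above, not the actual block.h layout):

BLOCK_SIZE   = 4096      # one on-disk block
HEADER_SIZE  = 8         # per-block header
AES_OVERHEAD = 16        # extra bytes taken by AES per block
DATA_SIZE    = BLOCK_SIZE - HEADER_SIZE - AES_OVERHEAD    # = 4072 payload bytes

def blocks_needed(plain_size):
    """How many 4K blocks a plain file of plain_size bytes needs on disk."""
    return (plain_size + DATA_SIZE - 1) // DATA_SIZE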

Sunday, January 16, 2011

Hadoop I/O: Sequence, Map, Set, Array, BloomMap Files

Hadoop's SequenceFile provides a persistent data structure for binary key-value pairs. In contrast with other persistent key-value data structures like B-Trees, you can't seek to a specified key to edit, add or remove it. This file is append-only.

SequenceFile has 3 available formats: an "Uncompressed" format, a "Record Compressed" format and a "Block-Compressed" format. All of them share a header that contains the information that allows the reader to recognize the format: the key and value class names (which allow the Reader to instantiate those classes, via reflection, for reading), the version number and the format flags (Is Compressed, Is Block Compressed). If compression is enabled, the Compression Codec class name field is added to the header.


The sequence file can also contain a "secondary" key-value list that can be used as file metadata. This key-value list is just a set of Text/Text pairs, and it is written to the file during the initialization that happens in the SequenceFile.Writer constructor, so you can't edit your metadata later.

As mentioned, SequenceFile has 3 available formats; the "Uncompressed" and the "Record Compressed" ones are really similar. Each call to the append() method adds a record to the sequence file; the record contains the length of the whole record (key length + value length), the length of the key, and the raw data of key and value. The only difference between the compressed and the uncompressed version is whether the value's raw data is compressed with the specified codec or not.
In contrast, the "Block-Compressed" format is more compression-aggressive. Data is not written until it reaches a threshold; when the threshold is reached, all the keys are compressed together, and the same happens for the values and for the auxiliary lists of key and value lengths.
As you can see in the figure on the left, a block record contains a VInt with the number of buffered records and 4 compressed blocks that contain the list of key lengths, the list of keys, the list of value lengths and finally the list of values. Before each block a sync marker is written.
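A rough sketch of how a reader could walk the "Uncompressed" record stream described above, assuming 4-byte big-endian lengths and skipping the file header, the sync markers and the codec handling (see python-hadoop, mentioned below, for a real implementation):

import struct

def read_records(stream):
    """Yield (key, value) byte pairs from a stream of uncompressed records."""
    while True:
        head = stream.read(8)
        if len(head) < 8:
            return                                  # end of stream
        record_len, key_len = struct.unpack(">ii", head)
        key = stream.read(key_len)
        value = stream.read(record_len - key_len)   # record_len = key + value
        yield key, value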

Hadoop SequenceFile is the base data structure for the other types of files, like MapFile, SetFile, ArrayFile and BloomMapFile.

The MapFile is a directory that contains two SequenceFiles: the data file ("/data") and the index file ("/index"). The data file contains all the key/value records, but key N+1 must be greater than or equal to key N. This condition is checked during the append() operation; if checkKey fails, an IOException "Key out of order" is thrown.
The index file is populated with the key and a LongWritable that contains the starting byte position of the record. The index doesn't contain all the keys but just a fraction of them; you can specify the indexInterval by calling the setIndexInterval() method. The index is read entirely into memory, so if you have a large map you can set an index skip value that allows you to keep just a fraction of the index keys in memory.
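The lookup logic behind the sparse index looks roughly like this (not the MapFile.Reader code, just the idea): binary-search the in-memory index for the closest smaller key, then scan the data file forward from that position.

import bisect

def mapfile_get(index_keys, index_positions, read_records_from, target):
    """index_keys/index_positions: the sparse, sorted in-memory index.
    read_records_from(pos): yields (key, value) records starting at pos."""
    i = bisect.bisect_right(index_keys, target) - 1
    if i < 0:
        return None                  # target is smaller than every indexed key
    for key, value in read_records_from(index_positions[i]):
        if key == target:
            return value
        if key > target:
            return None              # keys are sorted: we went past the target
    return None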

SetFile and ArrayFile are based on MapFile, and their implementations are just a few lines of code. The SetFile, instead of append(key, value), has just the key field, append(key), and the value is always the NullWritable instance. The ArrayFile has just the value field, append(value), and the key is a LongWritable that contains the record number, count + 1. The BloomMapFile extends the MapFile by adding another file, the bloom file "/bloom"; this file contains a serialization of the DynamicBloomFilter filled with the added keys, and it is written entirely during the close operation.

If you want to play with SequenceFile, MapFile, SetFile and ArrayFile without using Java, I've written a naive implementation in Python. You can find it in my github repository python-hadoop.

Wednesday, December 29, 2010

Disk Failure, Raid Corruption and CRCs

This week started with a RAID corruption due to a disk failure. The disk continued to work without notifying anyone about the corruption, and this is somehow acceptable. The unacceptable thing is that the file-system is not aware of the state of its data. The "old file-systems" are not interested in user data: even with journaling, user data is not a priority.

As a Spare-Time File-System Developer, it's really fun to say "With my File-System, this failure would not have happened!"

Simple Solution: B-Tree and CRCs
A really simple way to solve this problem is adding a CRC to user data, or better (from the file-system's point of view) adding a CRC to every block. If the file-system is entirely based on a B-Tree (when I say B-Tree, I mean B*Tree or B+Tree), you can simply add a CRC for the block in the node header, and a CRC for each child in the block pointers.

Leaving aside for a moment the super-block and other things... You start reading the root of the tree (the 1st block); if the node CRC check fails, there's something wrong with your disk read (or maybe with your memory, but this is less probable). When you start traversing the tree the check gets even better and more "secure".

Checking the CRC on the block you just read is good, but having the CRC stored in a different location, and read at a different time, is even better.

Storing the CRC of the child node within the pointer gives you the ability to do a double integrity check. When data becomes too large to be stored in a tree node, you can do the same thing with the extents (pointers and CRCs).
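A minimal sketch of that double check (a CRC in the node header plus the CRC stored next to the child pointer); field names and sizes are made up for illustration:

import struct, zlib

NODE_HEAD = struct.Struct(">I")      # hypothetical header: CRC of the node body

def read_node(dev, offset, size, pointer_crc=None):
    """Read a node, verify its own header CRC and, when given, the CRC that
    the parent stored alongside the child pointer."""
    dev.seek(offset)
    raw = dev.read(size)
    stored_crc, = NODE_HEAD.unpack_from(raw)
    body = raw[NODE_HEAD.size:]
    if zlib.crc32(body) != stored_crc:
        raise IOError("bad node CRC: corrupted block or bad read")
    if pointer_crc is not None and pointer_crc != stored_crc:
        raise IOError("node CRC does not match the CRC in the parent's pointer")
    return body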

Another, maybe more "intrusive", solution is storing a CRC for every N bytes of data, but I think this is more suitable for a user-space "crc-fs" implementation (this approach is implemented in Hadoop's ChecksumFileSystem class).

Last night I implemented a quick and dirty B+Tree; it's not tuned for speed but written with a didactic view in mind. It has a couple of nice features like pointers with CRCs and variable-length nodes, to be able to compress nodes on disk. You can find the source code in my github repository (B+Tree Code).

Saturday, July 10, 2010

RaleighFS is moving to Flash Disks

No, this is not going to happen for a couple of years; I love spinning disks, because there a file-system can really make the difference in terms of performance.

But… this week I've talked a bit with my colleague batt, who is working on some new features of his BattFS, available in BeRTOS. And I've discovered that flash disks are challenging too, especially on embedded devices where you cannot use a large cache due to RAM limitations.

What changes, on flash, compared to a traditional file-system? First of all, due to the limited rewrite cycles of a block, you cannot modify a block in place (an append-only data structure is needed). An append-only index is not a problem, but traditional file-systems rely heavily on a fixed-location super-block that contains the root pointer; with an append-only tree the root changes every time, so the super-block needs to be rewritten.

The first solution that RaleighFS can quickly implement is to write a new super-block on each flush. In this way each flush (or transaction) can be easily identified. But how can I reach the last super-block, and which one is the last super-block?



The worst case is to scan the whole disk looking for the super-block "magic" number in the block header.

But what we can do is add some information to the super-block:
  • Super-Block Id
  • Next Super-Block Id
  • Next Super-Block Location

In this way, on each flush we "predict" the location of the next super-block, and we give it a unique id just to be sure we can identify it. So we avoid scanning the whole disk: we scan just a few MB (the flush size) to find the first super-block, and then we can easily jump from super-block to super-block until the next location is no longer valid.
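A sketch of that super-block chase, with made-up field names and sizes, just to show the jump-until-invalid idea:

import struct

SB_MAGIC = 0x52414C46                    # hypothetical magic number
SB_FMT = struct.Struct(">IQQQ")          # magic, id, next_id, next_location

def find_last_superblock(dev, location):
    """Follow the chain from a known super-block and return (offset, id)."""
    last = None
    expected_id = None
    while True:
        dev.seek(location)
        raw = dev.read(SB_FMT.size)
        if len(raw) < SB_FMT.size:
            break
        magic, sb_id, next_id, next_location = SB_FMT.unpack(raw)
        if magic != SB_MAGIC:
            break                        # the predicted block was never written
        if expected_id is not None and sb_id != expected_id:
            break                        # not the super-block we were promised
        last = (location, sb_id)
        expected_id = next_id
        location = next_location
    return last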

Block allocation is just a "round-robin" allocation; in this way we're able to keep "every" block's erase cycle aligned, and this solves the super-block location problem.



What about the tree? The tree is the usual RaleighFS B*Tree, with a few changes:
  • A modified leaf node is allocated/written in a new place
  • Root and twig nodes are rewritten only if their keys change.

Wait, wait: how can we reach the new leaf node? To avoid rewriting large twigs, we keep a small BTree (with a high fanout) that maps the old leaf block numbers to the new ones. Free blocks can be stored on disk or, better, thanks to the zero seek time we can just check all the tree pointers to see what's allocated.

I'm not really convinced about the map BTree, but it seems to solve the "large update" case of value modification in key/value storage… This is just a quick "port" of the current RaleighFS to SSD, but maybe something really different could be used...

Saturday, January 30, 2010

Directories and File-System, RaleighFS Way!

When you talk about a new file-system with someone, one of the first questions you will hear is "Will it suffer from directory unbalancing?" or "How does it perform with large directories?".

Ext3 seems to have spread the fear of large directories, but what is a directory? And what do large directories mean?

When you design a file-system you have to decide who your target/end-user is. Don't try to design your file-system for everyone, design it just for one.

So, what's your focus? Being the fastest file-system at continuous 'ls' (directory traversal)?

My idea, and the RaleighFS idea, is that a file-system is just a "big Hashtable" with some useful information, where you ask for an item 'x' and you receive its data and metadata. Following this idea, what is a directory? It's just a name container that allows the user to keep track of his items. At this point, what is the best way to represent a directory?

A directory is just another item in the FS, like your mp3s, C source files, and all the other stuff.


When you're a high-level developer, sometimes you forget how the low level works, in this case the file-system. The FS has blocks: when you remove something from the beginning/middle of a file you shift back all the rest and rewrite the entire file. To avoid this you can intelligently pack your data based on the "block-data size" of the file-system. (The block-data size is not the block size: the file-system may add its own block header to your data.)

The above picture shows the RaleighFS directory block data structure. There's a small block header that says how much space is free in this block, and then there are the directory items, which are simply a 2-field struct: 1 byte for the name size and N chars for the name.
block_size = 8192 bytes
block_head = 12 bytes
dir_head = 2 bytes
avg_dir_item_size = 1 byte + 15 bytes (avg name size)
(block_size - block_head - dir_head) / avg_dir_item_size = 511 entries in one block

In a few blocks you can handle a large number of files, and the remove operation is really quick: just find your item, shift back only its "block" friends and update the free-space counter in the directory block header. In this way you don't have to rewrite the whole directory, just one or a few of its blocks.
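A sketch of that block, using the sizes from the computation above (the real on-disk struct lives in RaleighFS; here a Python list stands in for the packed entries):

BLOCK_SIZE = 8192
BLOCK_HEAD = 12          # file-system block header
DIR_HEAD   = 2           # free-space counter of the directory block

class DirBlock:
    """One directory block: [dir header][1-byte name size + name]...[free space]"""
    def __init__(self):
        self.names = []
        self.free = BLOCK_SIZE - BLOCK_HEAD - DIR_HEAD

    def add(self, name):
        need = 1 + len(name)             # 1 byte of name size + N chars
        if need > self.free:
            return False                 # block full: the caller uses the next one
        self.names.append(name)
        self.free -= need
        return True

    def remove(self, name):
        # Only this block is touched: on disk the entries after `name` are
        # shifted back and the free-space counter in the header is updated.
        self.names.remove(name)
        self.free += 1 + len(name)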

Remember: in RaleighFS a directory is just a name container that allows the user to keep track of his items (ls /home/), and on disk a directory is just a file, so there's no unbalancing. Every item is looked up in O(1) time, like in traditional Hashtables.

Saturday, December 12, 2009

Backups and Files Deduplication

Yesterday I heard about a "No Space Left" on the backup machine; obviously the first solutions proposed were like "Buy a new disk, they don't cost that much" or "Try this ultra branded backup product"... and then there's my custom and easy solution, which was rejected. :)

So, if you're a little bit brave and able to write a couple of lines of Python, you can have a very flexible backup system with file deduplication.

The main idea is...

  • For each home directory that you have to back up, store a key/value list of file_path/shasum (or whatever hash function you trust).

  • For each file in the home directory, check if the file is already on the backup server (files are stored on the backup server with their shasum as name). If the file is not present, upload it.


In this way, on the backup server there are "two folders": one that contains all the backed-up files (storing by shasum means that each unique file is on the server just once), and one that contains home-th30z-10-Nov-2009.list, home-th30z-11-Nov-2009.list, and so on.
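A minimal sketch of the idea (local copies instead of network uploads, made-up paths and list naming, no error handling):

import datetime, hashlib, os, shutil

def sha1_of(path):
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def backup(home_dir, store_dir, list_dir):
    """Copy unseen files into store_dir (named by their shasum) and write
    the file_path/shasum list for today's snapshot."""
    today = datetime.date.today().strftime("%d-%b-%Y")
    listing = os.path.join(list_dir, "home-%s.list" % today)
    with open(listing, "w") as out:
        for root, _, files in os.walk(home_dir):
            for name in files:
                path = os.path.join(root, name)
                digest = sha1_of(path)
                target = os.path.join(store_dir, digest)
                if not os.path.exists(target):    # dedup: each file stored once
                    shutil.copy2(path, target)
                out.write("%s %s\n" % (path, digest))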

In this way you can say... hey, fully restore my home as it was at a specified date, or just grab a single file from another date... and add whatever other feature you want.

This is an easy and flexible way to store your backups, without wasting space and with a great level of restore granularity...

If you've really grokked the idea, you can start your own start-up company focused on backups. :D

Saturday, September 5, 2009

File-System and Data Block Back Reference

While thinking about, and waiting for suggestions on, how to improve my file-system block cache algorithm, I've decided to apply some changes to the Raleigh File-System format (the source code is not published yet).

Following Valerie Aurora's ideas on repair-driven file-system design, I've decided to add to each block (B*Tree and data blocks) a header that contains a magic number and a CRC of the block. In this way you can easily identify what kind of block you've picked up without scanning all the metadata. Another step is to add a back reference (or back pointer) to the data block; this way you can easily jump back to its extent block (and obviously to its OID), so you can easily tell which object owns the block, and you can swap two blocks by reading at most 4 blocks (2 data and 2 extent blocks).
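A sketch of such a block header (field names and sizes are mine, not the actual RaleighFS format):

import struct, zlib

# magic, CRC of the payload, back reference: owner OID and extent block number
BLOCK_HEAD = struct.Struct(">IIQQ")

def pack_block(magic, owner_oid, extent_blocknr, payload):
    crc = zlib.crc32(payload)
    return BLOCK_HEAD.pack(magic, crc, owner_oid, extent_blocknr) + payload

def unpack_block(raw):
    magic, crc, owner_oid, extent_blocknr = BLOCK_HEAD.unpack_from(raw)
    payload = raw[BLOCK_HEAD.size:]
    if zlib.crc32(payload) != crc:
        raise IOError("block CRC mismatch")
    # The back reference lets you jump from a data block straight to its
    # extent block and OID, without scanning the metadata.
    return magic, owner_oid, extent_blocknr, payload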

Another idea stolen from Valerie is to double the metadata blocks with a COW-like approach, as explained in the paper "Double the Metadata, Double the Fun: A COW-like Approach to File System Consistency"; really useful for personal file-systems, but maybe less so in a distributed file-system. I'm working on adding it, but only as an mkfs option.

When will the source code be online? I don't know... I have less time to work on it. Maybe at the end of this year I'll publish the file-system and the distributed system (explained a few posts ago).

Tuesday, August 25, 2009

Block Cache Algorithm

I need to replace my old file-system cache algorithm with something newer and more efficient. The old one is based on an LRU/LFU algorithm: there's a queue of cached blocks and a Hashtable to speed up block lookup.


struct blkcache_buf {
    struct blkcache_buf * next;         /* Next Queue Item */
    struct blkcache_buf * prev;         /* Prev Queue Item */
    struct blkcache_buf * hash;         /* Next Item with the same hash */

    xuint16_t count;                    /* Retain count */
    xxx_block_t block;                  /* Cached Block */
};

typedef struct {
    struct blkcache_buf ** buf_hash;    /* Bufs Hashtable */
    xuint16_t buf_hash_size;            /* Bufs Hashtable Size */
    xuint16_t buf_used;                 /* Bufs in use */

    struct blkcache_buf * head;         /* Head of the Bufs Queue */
    struct blkcache_buf * tail;         /* Tail of the Bufs Queue */

    xxx_device_t * device;              /* Block Device used for I/O */
} xxx_blkcache_t;


Above you can see the cache data structures, and below the core of the cache algorithm.


#define BUFHASH(cache, blocknr)     ((blocknr) % (cache)->buf_hash_size)

xxx_block_t *xxx_blkcache_read (xxx_blkcache_t *cache,
                                xxx_blkptr_t blocknr)
{
    struct blkcache_buf *buf = NULL;    /* NULL unless we reuse an evicted buf */
    xuint16_t hash_index;

    /* Scan the hash chain for the block */
    hash_index = BUFHASH(cache, blocknr);
    if ((buf = __blkcache_find(cache, blocknr, hash_index)) != NULL) {
        buf->count++;

        /* Move buf away from the head */
        __blkcache_buf_shift(cache, buf);

        return(&(buf->block));
    }

    /* Cache is full, remove one item */
    if ((cache->buf_used + 1) > cache->buf_hash_size) {
        /* All buffers are in use */
        if (cache->head->count > 0)
            return(NULL);

        /* Remove the Least-Frequently Used buffer and reuse it */
        buf = __blkcache_remove_lfu(cache);
        cache->buf_used--;
    }

    /* Desired block is not on the available chain, read it! */
    if ((buf = __blkcache_buf_alloc(cache, buf, blocknr)) == NULL)
        return(NULL);

    /* Add one use, block cannot be removed */
    buf->count++;

    /* Go get the requested block unless searching or prefetching. */
    __blkcache_readwrite(cache, buf, RFALSE);

    /* Update the cache hashtable */
    cache->buf_used++;
    buf->hash = cache->buf_hash[hash_index];
    cache->buf_hash[hash_index] = buf;

    /* Update the cache queue */
    if (cache->head == NULL) {
        cache->head = cache->tail = buf;
    } else {
        buf->prev = cache->tail;
        cache->tail->next = buf;
        cache->tail = buf;
    }

    return(&(buf->block));
}


You can download a demo implementation here: lru-block-cache.c, but I'm waiting for ideas or suggestions to improve (or radically change) the cache algorithm. Thanks in advance!