But… This week I've talked a bit with my colleague batt, that is working on some new features of his BattFS available on BeRTOS. And I've discovered that flash disk are challenging too, especially on embedded devices where you cannot use large cache due to ram limitation.
What changes, on flash, from a traditional File-System? First of all due to limited rewrite cycle of a block, you cannot modify a block in place (Append-Only Data Structure needed). Append-Only Index is not a problem, but Traditional file-system heavily rely on a fixed location super-block that contains the root pointer, and append-only tree root changes every time, so super-block need to be re-writed.
The first solution that RaleighFS can quickly implement is to write a new super-block each flush. In this way each flush (or transaction) can be easily identified. But how can I reach the last super-block and what is the last super-block?
The Worst case is to scan the whole disk to find the super-block "magic" number in block header.
But what we can do is to add some information to the super-block
- Super-Block Id
- Next Super-Block Id
- Next Super-Block Location
In this way on each flush we "predict" what is the location of the next super-block, and we give it an unique id just to be sure to identify the next one. In this way we can avoid to scan the whole disk but just a few Mb (flush size) and then we can easily jump to the next super-block until the super-block next location isn't valid.
Block allocation is just a "round-robin" allocation in this way we're able to keep aligned "every" block erase cycle, and this solve the super-block location problem.
What about the tree? Tree is the common B*Tree of RaleighFS with a few changes:
- Modified leaf node is allocated/writed in a new place
- Root and Twigs Nodes are rewritten only if keys changes.
Wait wait, how we can reach the new leaf node? To avoid large Twigs rewrite we keep a small BTree (high fanout) that maps old leafs block numbers with the new ones. Free blocks can be stored on disk or better due to 0-seek time we can just check all the tree pointers to see what's allocated.
I'm not really convinced about the map BTree, but seems to solve the "large update" of value modification in key/value storage case… This is just a fast "port" of current RaleighFS to SSD but maybe something really different can be used...