
For example read from disk, write to a block cache, read a piece of data and put it in another other data-structure for processing.
One way to avoid all this memcpy and data duplication is to use a copy-on-write data-structure. For strings or "single-block-data" is pretty easy. As you can see on picture above, you've a chunk of memory that is referenced from different objects. Each object keeps a pointer to the data and if needed an offset and a length to reference just a part of the data.
If someone call write() on one of those objects, internally data-blob is duplicated and modified. In this way the modified objects points to a new block and the others points to the old block.

Extent-tree that i use for the Flow Objects in RaleighFS plus copy-on-write on blocks has the behavior that I need to avoid all this memcpy and data duplication.
Extent-Tree allows you to read from a list of blocks as a single block. Each block as an offset and a length, and blocks are sorted by offset, in the tree. This allows you to reach your specified offset fast and insert or remove at any specified offset.
Using this extent-tree has the advantage that you can avoid to memcpy large blocks from a data-structure to another, you can merge two extent tree or do other fancy thing without allocate new memory and copy data to it. But insert and split require a malloc, so you've to tune your node and extent allocation with an object pool, and this is pretty simple due to the fixed size of those object.
I've made a draft implementation that you can find on my github repo.
No comments:
Post a Comment