Wednesday, February 9, 2011

HBase I/O: HFile

In the beginning HBase used the MapFile class to store data persistently on disk; then, from version 0.20, a new file format was introduced. HFile is a specific implementation of MapFile with HBase-related features.

HFile doesn't know anything about the key and value structure or type (row key, qualifier, family, timestamp, …). As in Hadoop's SequenceFile (block-compressed), keys and values are grouped in blocks, and blocks contain records. Each record has two int values holding the key length and the value length, followed by the key and value byte arrays.
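
To make the record layout concrete, here is a minimal sketch in Python (not the HBase code). It assumes the two lengths are 4-byte big-endian ints, as Java's DataOutputStream would write them; the function names encode_record and decode_records are made up for this illustration.

```python
import struct

def encode_record(key: bytes, value: bytes) -> bytes:
    # Assumed layout: key length, value length (4-byte big-endian ints),
    # then the raw key bytes and the raw value bytes.
    return struct.pack(">ii", len(key), len(value)) + key + value

def decode_records(block: bytes):
    # Walk the block record by record, using the two lengths to find
    # where each key and value ends.
    records = []
    pos = 0
    while pos < len(block):
        key_len, value_len = struct.unpack_from(">ii", block, pos)
        pos += 8
        key = block[pos:pos + key_len]
        pos += key_len
        value = block[pos:pos + value_len]
        pos += value_len
        records.append((key, value))
    return records
```

Note that nothing in the record says what the key or the value means: the layer above has to interpret the bytes.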

HFile.Writer has only a couple of append overloads, one for the KeyValue class and the other for byte arrays. As with SequenceFile, each key added must be greater than the previous one. If this condition is not satisfied, an IOException is raised.
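
The ordering rule can be sketched like this. It is an illustration only: the class name SortedAppender is hypothetical, and plain byte-wise comparison stands in for whatever key comparator the real writer uses.

```python
class SortedAppender:
    """Toy writer that accepts records only in increasing key order."""

    def __init__(self):
        self.last_key = None
        self.records = []

    def append(self, key: bytes, value: bytes) -> None:
        # Assumption: keys compare byte-wise. Reject any key that is not
        # greater than the previously appended one.
        if self.last_key is not None and key <= self.last_key:
            raise IOError("key %r is not greater than previous key %r"
                          % (key, self.last_key))
        self.records.append((key, value))
        self.last_key = key
```

Because keys arrive already sorted, the writer never has to reorder anything and can stream records straight to disk.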

By default, every 64 KB of record data (key + value) is squeezed together into a block, and the block is written to the HFile OutputStream using the compression algorithm, if one is specified. The compression algorithm and the block size are both arguments of the (long) constructor.
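
A rough sketch of the buffering described above, under stated assumptions: records are buffered until the block size is reached, then the block is (optionally) compressed and written out. The class BlockWriter and its parameters are hypothetical, and zlib is used only as a stand-in codec.

```python
import io
import struct
import zlib

class BlockWriter:
    """Toy block writer: buffers records, flushes one block at a time."""

    def __init__(self, out, block_size=64 * 1024, compress=False):
        self.out = out
        self.block_size = block_size
        self.compress = compress
        self.buffer = bytearray()
        self.blocks_written = 0

    def append(self, key: bytes, value: bytes) -> None:
        # Same assumed record layout: two 4-byte lengths, then key and value.
        self.buffer += struct.pack(">ii", len(key), len(value)) + key + value
        if len(self.buffer) >= self.block_size:
            self.flush_block()

    def flush_block(self) -> None:
        # Write the buffered records as one block, compressing the block
        # as a whole if compression was requested.
        if not self.buffer:
            return
        data = bytes(self.buffer)
        if self.compress:
            data = zlib.compress(data)
        self.out.write(data)
        self.blocks_written += 1
        self.buffer.clear()
```

For example, `BlockWriter(io.BytesIO(), block_size=64)` flushes a block every time the buffered records reach 64 bytes; a final `flush_block()` call writes whatever is left.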

One thing that SequenceFile is not good at is adding metadata. Metadata can be added to a SequenceFile only from the constructor, so you need to prepare all your metadata before creating the Writer.

HFile adds two "metadata" types: one called Meta-Block and the other called FileInfo. Both metadata types are kept in memory until close() is called.

Meta-Block is designed to hold large amounts of data and its key is a String, while FileInfo is a simple Map, preferred for small pieces of information, whose keys and values are both byte arrays. The region server's StoreFile uses Meta-Blocks to store a Bloom filter, and FileInfo for the max SequenceId, the major compaction key and the time range info.

On close(), Meta-Blocks and FileInfo are written to the OutputStream. To speed up lookups, an index is written for Data-Blocks and another for Meta-Blocks. Those indices contain n records (where n is the number of blocks) with block information (block offset, size and first key).
At the end a Fixed File Trailer is written; this block contains offsets and counts for all the HFile indices, the HFile version, the compression codec and a few other pieces of information.
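
The idea of "index plus fixed-size trailer at the end" can be sketched as follows. This is not the real HFile byte layout: the trailer fields (index offset, entry count, version), the helper names write_file and read_index, and the field widths are all assumptions chosen to keep the example small.

```python
import io
import struct

# Hypothetical fixed-size trailer: index offset (8 bytes),
# index entry count (4 bytes), format version (4 bytes).
TRAILER = struct.Struct(">qii")

def write_file(blocks):
    """blocks: list of (first_key, block_bytes). Returns the file as bytes."""
    out = io.BytesIO()
    index = []
    for first_key, data in blocks:
        # Remember where each block starts, how big it is and its first key.
        index.append((out.tell(), len(data), first_key))
        out.write(data)
    index_offset = out.tell()
    for offset, size, first_key in index:
        out.write(struct.pack(">qii", offset, size, len(first_key)) + first_key)
    # The trailer goes last, so a reader can find it from the end of the file.
    out.write(TRAILER.pack(index_offset, len(index), 1))
    return out.getvalue()

def read_index(file_bytes):
    """Read the trailer from the end of the file, then load the index."""
    index_offset, count, version = TRAILER.unpack_from(
        file_bytes, len(file_bytes) - TRAILER.size)
    index = []
    pos = index_offset
    for _ in range(count):
        offset, size, key_len = struct.unpack_from(">qii", file_bytes, pos)
        pos += 16
        index.append((offset, size, file_bytes[pos:pos + key_len]))
        pos += key_len
    return index, version
```

Because the trailer has a fixed size, a reader needs no prior knowledge of the file: it reads the last few bytes, learns where the index lives, and loads it.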

Once the file is written, the next step is reading it. You have to start by loading FileInfo: the loadFileInfo() method of HFile.Reader loads the trailer block and all the indices into memory, which makes it easy to query keys. Through the HFileScanner you can seek to a specified key and iterate from there.
The picture above describes the internal format of HFile...
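
Seeking to a key through the in-memory index can be sketched as a binary search over the first key of each block, followed by decompressing only the block that was found. Everything here is illustrative: the helper names, the toy "key=value;key=value" block encoding and the use of zlib are assumptions, not the HFileScanner implementation.

```python
import bisect
import zlib

def find_block(first_keys, key):
    """Return the position of the only block that can contain key, or -1."""
    # The right block is the last one whose first key is <= the wanted key.
    return bisect.bisect_right(first_keys, key) - 1

# Two toy blocks, each compressed on its own, plus the index of first keys.
blocks = [zlib.compress(b"apple=1;banana=2"), zlib.compress(b"mango=3;peach=4")]
first_keys = [b"apple", b"mango"]

def lookup(key):
    i = find_block(first_keys, key)
    if i < 0:
        return None  # key sorts before the first block
    # Only the selected block is decompressed; the others are never touched.
    items = zlib.decompress(blocks[i]).split(b";")
    records = dict(item.split(b"=") for item in items)
    return records.get(key)
```

Calling `lookup(b"peach")` decompresses the second block only, which is the main benefit of compressing per block rather than per file.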


  1. Hi Matteo,

    Great post! One small note: the meta blocks are optional, and you could have an HFile that does not have any of them.


  2. I am having a hard time understanding the value of a data block. What would HBase lose if the concept of a block were removed?
    You could compress the whole file (in fact compression would be better), and you could build an index for the whole file. You can make the file available by replicating it.

    1. Blocks are useful for a couple of reasons. First, you have an easy way to split the file and distribute the slices (e.g. for MapReduce).
      If you compressed the whole file, how could you seek into the middle of the file and look up your keys without having to read other pieces to apply decompression? With blocks you can jump to the block that contains your key and decompress just that block, so you have a sequential read of just one block (64 KB by default).

      (Also, from HBase 0.92 a new file format, HFile v2, is introduced. It gives the possibility to add inline blocks, which means indexes and Bloom filters at the end of data blocks... but this is another post that is coming soon.)

  3. This comment has been removed by the author.

  4. Those indices contains n records (where n is the number of blocks) with block information (block offset, size and first key).
    Is the first key that is stored the rowkey, or the rowkey+cf+column+timestamp key?

    The reason I ask is that there could be cases where a row spans multiple data blocks. Having the second type of key as the 'first key' could speed up queries (if we query for a row+column value), right?

    It looks like the 'start key' that is stored is the rowkey, as that could influence the flat-wide vs. tall-narrow table design choice.