

The sequence file can also contain a "secondary" key-value list that can be used as file metadata. This key-value list is just a set of Text/Text pairs, and it is written to the file during the initialization that happens in the SequenceFile.Writer constructor, so you can't edit your metadata once the file has been created.
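As a minimal sketch, assuming the usual org.apache.hadoop imports and with a path and field names I've made up for the example, attaching metadata at creation time and reading it back looks roughly like this:

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path path = new Path("with-meta.seq");                        // example path

SequenceFile.Metadata metadata = new SequenceFile.Metadata();
metadata.set(new Text("created-by"), new Text("example"));    // Text/Text pairs only

// The metadata is written once, by the writer's initialization.
SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, path,
    Text.class, Text.class, SequenceFile.CompressionType.NONE,
    new DefaultCodec(), null, metadata);
writer.close();

// Read it back from the file header.
SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
System.out.println(reader.getMetadata().getMetadata());       // TreeMap<Text, Text>
reader.close();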

As seen above, the SequenceFile has three available formats, and the "Uncompressed" and the "Record Compressed" ones are really similar. Each call to the append() method adds a record to the sequence file; the record contains the length of the whole record (key length + value length), the length of the key, and the raw data of the key and value. The only difference between the compressed and the uncompressed version is whether the value's raw data is compressed with the specified codec or not.
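A short sketch of the append() call, again assuming the usual imports and example paths and record types of my own choosing (Text keys, IntWritable values):

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path path = new Path("example.seq");                          // example path

// Uncompressed format; pass CompressionType.RECORD plus a codec for "Record Compressed".
SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, path,
    Text.class, IntWritable.class, SequenceFile.CompressionType.NONE);
writer.append(new Text("key-0"), new IntWritable(0));         // one record per append()
writer.append(new Text("key-1"), new IntWritable(1));
writer.close();

// Read the records back in the order they were appended.
SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
Text key = new Text();
IntWritable value = new IntWritable();
while (reader.next(key, value))
  System.out.println(key + " = " + value);
reader.close();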

As you can see in the figure on the left, a block record contains a VInt with the number of buffered records and four compressed blocks that contain a list with the lengths of the keys, the list of keys, another list with the lengths of the values and finally the list of values. Before each block a sync marker is written.
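To get the block format you just ask for CompressionType.BLOCK when creating the writer; the record buffering and the sync markers are handled internally. A hedged sketch, with path and types again being just placeholders:

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

// "Block Compressed" format: records are buffered and flushed as compressed blocks.
SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
    new Path("block-compressed.seq"),
    Text.class, IntWritable.class,
    SequenceFile.CompressionType.BLOCK, new DefaultCodec());
for (int i = 0; i < 1000; i++)
  writer.append(new Text("key-" + i), new IntWritable(i));
writer.close();                                               // flushes the last buffered block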
Hadoop SequenceFile is the base data structure for the other types of files, like MapFile, SetFile, ArrayFile and BloomMapFile.
The MapFile is a directory that contains two SequenceFiles: the data file ("/data") and the index file ("/index"). The data file contains all the key/value records, but key N + 1 must be greater than or equal to key N. This condition is checked during the append() operation: if checkKey fails, it throws an IOException "key out of order".
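A quick sketch of the ordering requirement, with Text keys and a directory name that are just assumptions of mine:

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

MapFile.Writer writer = new MapFile.Writer(conf, fs, "example.map",
    Text.class, IntWritable.class);
writer.append(new Text("a"), new IntWritable(1));
writer.append(new Text("b"), new IntWritable(2));
// writer.append(new Text("a"), new IntWritable(3));          // would throw IOException: key out of order
writer.close();

// The "/index" file lets the reader seek near a key instead of scanning everything.
MapFile.Reader reader = new MapFile.Reader(fs, "example.map", conf);
IntWritable value = new IntWritable();
reader.get(new Text("b"), value);
reader.close();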

SetFile and ArrayFile are based on MapFile, and their implementations are just a few lines of code. The SetFile, instead of append(key, value), has just the key field, append(key), and the value is always the NullWritable instance. The ArrayFile has just the value field, append(value), and the key is a LongWritable that contains the record number, count + 1. The BloomMapFile extends the MapFile by adding another file, the bloom file ("/bloom"), which contains a serialization of the DynamicBloomFilter filled with the added keys. The bloom file is written entirely during the close operation.
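A hedged sketch of the three writers (directory names and key/value types are only examples):

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

// SetFile: keys only, the value is always NullWritable.
SetFile.Writer set = new SetFile.Writer(conf, fs, "example.set",
    Text.class, SequenceFile.CompressionType.NONE);
set.append(new Text("a"));
set.append(new Text("b"));
set.close();

// ArrayFile: values only, the key is the LongWritable record number.
ArrayFile.Writer array = new ArrayFile.Writer(conf, fs, "example.array", Text.class);
array.append(new Text("first"));
array.append(new Text("second"));
array.close();

// BloomMapFile: a MapFile plus a "/bloom" file written on close().
BloomMapFile.Writer bloom = new BloomMapFile.Writer(conf, fs, "example.bloom",
    Text.class, IntWritable.class);
bloom.append(new Text("a"), new IntWritable(1));
bloom.close();

BloomMapFile.Reader reader = new BloomMapFile.Reader(fs, "example.bloom", conf);
System.out.println(reader.probablyHasKey(new Text("a")));     // bloom filter: possible false positives, no false negatives
reader.close();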
If you want to play with SequenceFile, MapFile, SetFile and ArrayFile without using Java, I've written a naive implementation in Python. You can find it in my GitHub repository, python-hadoop.
Hi, is there any code snippet on how to set metadata in a sequence file?
SequenceFile.Metadata metadata = new SequenceFile.Metadata();
metadata.set(new Text("field0"), new Text("value0"));
metadata.set(new Text("field1"), new Text("value1"));
SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, file,
    Datum.class, Datum.class, compressionType, codec, null, metadata);
...
Thanks! That helps a lot!
Is there a way to do this in Python also? Could you point me to some useful links about SequenceFile usage with Python and Hive?
I've written a simple Python example of SequenceFile with Metadata:
https://github.com/matteobertozzi/Hadoop/blob/master/python-hadoop/examples/SequenceFileMeta.py
If you give me more information about what you need to know about SequenceFile, Python and Hive, I can write an example for you.