Th30z (Matteo Bertozzi Code): 2011

Saturday, November 19, 2011

Drawing Charts with Python & Qt4

Back again with some graphics stuff that can be useful in monitoring applications.
As you can imagine, these charts are 100% QPainter. The source code is available on blog-code@github.

Pie Chart

table = DataTable()
table.addColumn('Lang')
table.addColumn('Rating')
table.addRow(['Java', 17.874])
table.addRow(['C', 17.322])
table.addRow(['C++', 8.084])
table.addRow(['C#', 7.319])
table.addRow(['PHP', 6.096])

chart = PieChart(table)
chart.save('pie.png', QSize(240, 240), legend_width=100)

Usage is really simple: create a table with columns and data, create a chart object from that table, then call the draw() or save() method to show the chart or save it somewhere.
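To give an idea of what a pie chart has to compute before painting, here is a minimal sketch in plain Python. It is not taken from chart.py: pie_spans() is a hypothetical helper, and it assumes that each slice is simply proportional to its value. Angles are expressed in 1/16 of a degree, which is the unit QPainter.drawPie() expects.

```python
def pie_spans(values):
    """Return (start, span) pairs in 1/16 of a degree, one per value."""
    total = float(sum(values))
    full_circle = 360 * 16
    spans = []
    start = 0
    for i, value in enumerate(values):
        if i == len(values) - 1:
            # The last slice takes what is left, so rounding never leaves a gap.
            span = full_circle - start
        else:
            span = int(round(value / total * full_circle))
        spans.append((start, span))
        start += span
    return spans


# Ratings from the table above (hypothetical usage).
ratings = [17.874, 17.322, 8.084, 7.319, 6.096]
slices = pie_spans(ratings)
```

Each (start, span) pair could then be passed to a drawPie() call together with the bounding rectangle of the chart.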

Scatter Chart

chart = ScatterChart(table)
chart.haxis_title = 'Proc input'
chart.haxis_vmin = 0
chart.haxis_vmax = 16
chart.haxis_step = 2
chart.vaxis_title = 'Quality'
chart.vaxis_vmin = 90
chart.vaxis_vmax = 104
chart.vaxis_step = 1

You can customize the min/max values and the step of the horizontal and vertical axes, or you can use the defaults calculated from your data. You can also set the reference column with setHorizontalAxisColumn() or setVerticalAxisColumn().
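As a rough illustration of what vmin, vmax and step mean for an axis, here is a small sketch. The two helpers are hypothetical (they are not the functions used in chart.py) and assume a plain linear scale.

```python
def axis_ticks(vmin, vmax, step):
    """Return the tick values from vmin to vmax (inclusive), spaced by step."""
    if step <= 0:
        raise ValueError("step must be positive")
    # The small epsilon protects against float rounding on exact multiples.
    count = int((vmax - vmin) / step + 1e-9)
    return [vmin + i * step for i in range(count + 1)]


def tick_position(value, vmin, vmax, length):
    """Map a data value to an offset along an axis that is `length` pixels long."""
    return (value - vmin) / float(vmax - vmin) * length


# Horizontal axis of the scatter chart above: 0..16 with step 2.
haxis = axis_ticks(0, 16, 2)
```

With a 400 pixel wide plot area, tick_position(8, 0, 16, 400) lands in the middle of the axis.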

Area Chart

table = DataTable()
table.addColumn('Time')
table.addColumn('Site 1')
table.addColumn('Site 2')
table.addRow([ 4.00, 120,   500])
table.addRow([ 6.00, 270,   460])
table.addRow([ 8.30, 1260, 1120])
table.addRow([10.15, 2030,  540])
table.addRow([12.00,  520,  890])
table.addRow([18.20, 1862, 1500])

chart = AreaChart(table)
chart.setHorizontalAxisColumn(0)
chart.haxis_title = 'Time'
chart.haxis_vmin = 0.0
chart.haxis_vmax = 20.0
chart.haxis_step = 5

chart.save('area.png', QSize(400, 240), legend_width=100)

Line Chart

chart = LineChart(table)
chart.setHorizontalAxisColumn(0)
chart.haxis_title = 'Time'
chart.haxis_vmin = 0.0
chart.haxis_vmax = 20.0
chart.haxis_step = 2

Once again, the code is available on github at blog-code/qt4-charts/chart.py.

Don't miss the Florence Qt Day 2012

The conference will take place on 27/28 January 2012, AC Hotel Firenze Porta al Prato (Florence, Italy). And it is Free!

The Qt Project
Qt 5.0
Qt Quick
Qt WebKit
Performance & Profiling
Qt in Use
...And many more

Take a look at http://www.qtday.it for more information.

Saturday, July 30, 2011

RaleighFS to enter in the in-memory key-value store market

A couple of guys asked me about RaleighFS and why it is called a File-System instead of a Database. The answer is that the project started back in 2005 as a simple Linux kernel file-system, and then evolved into something different.

Abstract Storage Layer
I like to say that RaleighFS is an Abstract Storage Layer, because its main components are designed to be pluggable. For example, the namespace can be flat or hierarchical, and the other objects don't notice the difference.

Store Multiple Objects with different Types (HashTable, SkipList, Tree, Extents, Bitmap, ...)
Each Object has its own on-disk format (Log, B*Tree, ...).
Observable Objects - Get notified when something changes.
Flexible Namespace & Semantic for Objects.
Various Plain-Text & Binary Protocol Support (Memcache, ...)

A New Beginning...
A few weeks ago I decided to rewrite and refactor a bit of code, stabilize the API and, this time, try to bring the file-system and the network layer close to a stable release.
The first steps are:

Release a functional network layer as soon as I can.
Provide a pluggable protocol interface.
Implement a memcache-capable protocol, and other protocols.

So, these first steps are all about networking and, unfortunately, this means dropping the sync part and keeping just the in-memory code (the file-system flush on memory pressure).

Current Status:
Starting from today, some code is available on github under the raleighfs project.

src/zcl contains the abstraction classes and some tools that are used by every piece of code.
src/raleighfs-core contains the file-system core module.
src/raleighfs-plugins contains all the file-system's pluggable objects and semantic layers.
src/raleigh-server currently contains the entry point to run a memcache compatible server (memcapable text protocol) and a redis get/set interface server. The in-memory storage is confined to engine.{h,c} and is currently based on a Chained HashTable, a Skip List or a Binary Tree.

How it Works
As I said before, the entry point is the ioloop, which allows clients to interact with the file-system's objects through a specified protocol. Each "protocol handler" parses its own format, converts it to the file-system one, and enqueues the request to a RequestQ that dispatches the request to the file-system. When the file-system has the answer, it pushes the response into the RequestQ and the client is notified. The inverse process is then applied: the file-system protocol is parsed and converted into the client one.

Having the RequestQ has a couple of advantages. The first one is that you can easily wrap a protocol to communicate with the file-system; the other is that the RequestQ can dispatch requests to different servers. Another advantage is that the RequestQ can operate as a Read-Write Lock for each object, allowing the file-system to have fewer locks...
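The flow described above can be sketched in a few lines of Python. This is only a toy model and not the RaleighFS code: the class and function names are made up, the "file-system" is a plain dict, and the text commands are a simplified one-line syntax rather than the real memcache protocol.

```python
from collections import deque


class RequestQ:
    """Toy request queue: protocol handlers enqueue, the storage side answers."""

    def __init__(self, storage):
        self.storage = storage
        self.pending = deque()

    def enqueue(self, request, on_response):
        self.pending.append((request, on_response))

    def dispatch(self):
        # Run every queued request against the storage and notify the client.
        while self.pending:
            (op, key, value), on_response = self.pending.popleft()
            if op == 'set':
                self.storage[key] = value
                on_response(('ok', None))
            elif op == 'get':
                on_response(('value', self.storage.get(key)))
            else:
                on_response(('error', None))


def parse_text_command(line):
    """Protocol handler, inbound side: client format -> internal request."""
    parts = line.split()
    if len(parts) == 2 and parts[0] == 'get':
        return ('get', parts[1], None)
    if len(parts) == 3 and parts[0] == 'set':
        return ('set', parts[1], parts[2])
    return ('unknown', None, None)


def to_text(response):
    """Protocol handler, outbound side: internal response -> client format."""
    kind, payload = response
    if kind == 'ok':
        return 'STORED'
    if kind == 'value':
        return 'END' if payload is None else 'VALUE %s' % payload
    return 'ERROR'


replies = []
queue = RequestQ({})
for line in ['set lang python', 'get lang', 'get missing', 'bogus']:
    queue.enqueue(parse_text_command(line), lambda r: replies.append(to_text(r)))
queue.dispatch()
```

Because only the two handler functions know about the client format, adding another protocol means writing another pair of functions and leaving the queue and the storage untouched.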

For more information ask me at theo.bertozzi (at) gmail.com.

Friday, February 18, 2011

Getting started with Avro

Apache Avro is a language-neutral data serialization system. Its data format can be processed by many languages (currently C, C++, Python, Java, Ruby and PHP).

Avro provides:

Rich data structures.
A compact, fast, binary data format.
A container file, to store persistent data.
Remote procedure call (RPC).

Data Schema
Avro relies on schemas that specify which fields and types an object is made of. In this way, each datum is written with no per-value overhead. When Avro data is stored in a file, its schema is stored with it, so that files may be processed later by any program.
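To make "no per-value overhead" a bit more concrete, here is a small hand-written sketch of the Avro binary encoding for two primitive types: ints and longs use zig-zag, variable-length coding, and strings are a length followed by UTF-8 bytes. The two-field record (just name and age) is an assumption to keep the example short, and real code should of course use the Avro library instead of these helpers.

```python
def encode_long(n):
    """Encode an int/long with zig-zag plus variable-length coding."""
    # Zig-zag: values of small magnitude, positive or negative, become small
    # unsigned numbers (assumes n fits in a signed 64 bit integer).
    n = (n << 1) ^ (n >> 63)
    out = bytearray()
    while n & ~0x7F:
        out.append((n & 0x7F) | 0x80)  # 7 bits of payload, high bit = "more"
        n >>= 7
    out.append(n)
    return bytes(out)


def encode_string(text):
    """Encode a string as its byte length followed by the UTF-8 bytes."""
    data = text.encode('utf-8')
    return encode_long(len(data)) + data


def encode_person(name, age):
    # Fields are written one after the other, in schema order.
    # No field names and no type tags end up in the output.
    return encode_string(name) + encode_long(age)


encoded = encode_person('Ann', 30)
```

The five bytes in `encoded` are the length of "Ann", the three letters and the age; everything else lives in the schema.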

Avro has traditional Primitive Types like int, float, string, ... and Complex Types like enum, record, array, ... You can use these types to create your own complex types, as in the example below:

{
  "type": "record", 
  "name": "Person", 
  "fields": [
      {"name": "name", "type": "string"},
      {"name": "age", "type": "int"},
      {"name": "emails", "type": {"type": "array", "items": "string"}},

      {"name": "father", "type": "Person"},
      {"name": "mother", "type": "Person"}
  ]
}

You can create schemas in code using the Schema class methods, or just parse a JSON file using the Schema.parse() method.

Reading & Writing
Once you've written the schema you can start to serialize your objects, generating the right data structure for your types.

Serialization for the schema written above can look something like this:

public GenericData.Record serialize() {
  GenericData.Record record = new GenericData.Record(schema);

  record.put("name", this.name);
  record.put("age", this.age);

  // mails is assumed to be a String[]
  int nemails = this.mails.length;
  GenericData.Array<String> emails = new GenericData.Array<String>(nemails, emails_schema);
  for (int i = 0; i < nemails; ++i)
     emails.add(this.mails[i]);   // fill the array, not the record
  record.put("emails", emails);

  record.put("father", this.father.serialize());
  record.put("mother", this.mother.serialize());
  return record;
}

The same code written in python looks like this:

def serialize(self):
  return { 'name': self.name,
           'age': self.age,
           'emails': self.mails,
           'father': self.father.serialize(),
           'mother': self.mother.serialize()
         }

Now that you have the Avro object that reflects the schema, you just have to write it. To do that, use a DatumWriter, which uses an Encoder to write the datum to an OutputStream.

...
// On newer Avro releases, obtain the encoder from EncoderFactory instead.
Encoder encoder = new BinaryEncoder(outputStream);
GenericDatumWriter datumWriter = new GenericDatumWriter(schema);
datumWriter.write(person.serialize(), encoder);
...

As you can imagine, reading data is quite similar. Pick up a DatumReader and a Decoder that can read from an InputStream and start reading your Avro Objects.

...
// On newer Avro releases, obtain the decoder from DecoderFactory instead.
Decoder decoder = new BinaryDecoder(inputStream);
GenericDatumReader datumReader = new GenericDatumReader(schema);

GenericData.Record record = new GenericData.Record(schema);
while (...) {
  datumReader.read(record, decoder);
  // record.get("name")
  // record.get("...");
}
...

Object Container Files
Avro includes an object container file format. A file has a schema, and all objects stored in the file must be written according to that schema.

Since Avro is designed to be used with Hadoop and Map-Reduce, the file format is similar to Hadoop's SequenceFile. Objects are grouped in blocks and each block may be compressed. Between blocks a sync marker is added to allow efficient splitting of files.

void testWrite (File file, Schema schema) throws IOException {
   GenericDatumWriter datum = new GenericDatumWriter(schema);
   DataFileWriter writer = new DataFileWriter(datum);

   writer.setMeta("Meta-Key0", "Meta-Value0");
   writer.setMeta("Meta-Key1", "Meta-Value1");

   writer.create(schema, file);
   for (Person p : people)
      writer.append(p.serialize());

   writer.close();
}

Since the file contains the schema, you don't need to specify a schema for the reader. You can extract the schema used by calling the getSchema() method of the reader.

void testRead (File file) throws IOException {
   GenericDatumReader datum = new GenericDatumReader();
   DataFileReader reader = new DataFileReader(file, datum);

   GenericData.Record record = new GenericData.Record(reader.getSchema());
   while (reader.hasNext()) {
     reader.next(record);
     System.out.println("Name " + record.get("name") + 
                        " Age " + record.get("age"));
   }

   reader.close();
}

For a more "advanced" reading operation (see the Data Evolution section), you can specify the expected file schema.

Data Evolution
The first problem that you'll encounter when working with a custom binary format, or even with an XML/JSON based one, is dealing with data evolution.

During your application development you will surely have to add, remove or rename some fields in the various data structures. To solve the compatibility problem you have to introduce a "versioning step" that transforms your "Version X" document into the "Version (X + 1)" format.

Avro solves this problem by applying some Schema Resolution rules. In brief...

If a field is added, old documents don't contain the field and new readers use the default value specified in the schema.
If a field is removed, new readers don't read it and no one cares whether it's present or not.
If a field is renamed, the new schema must contain the old name of the field in its aliases list.

Note that these rules cover one direction only: new readers processing data written with an older schema.
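The three rules can be sketched on plain Python dicts. This is only an illustration of the idea: the real Avro library resolves the writer's schema against the reader's schema while decoding the binary data, and resolve() below is a hypothetical helper, not an Avro API.

```python
def resolve(datum, reader_fields):
    """Project an old record (a dict) onto the reader's list of fields.

    reader_fields is a list of dicts with a 'name' key and optional
    'aliases' and 'default' keys, like the fields of a record schema.
    """
    result = {}
    for field in reader_fields:
        # A renamed field is found through one of its old names (aliases).
        for name in [field['name']] + field.get('aliases', []):
            if name in datum:
                result[field['name']] = datum[name]
                break
        else:
            # An added field is missing in old data: fall back to the default.
            if 'default' not in field:
                raise ValueError('no value or default for %r' % field['name'])
            result[field['name']] = field['default']
    # Removed fields are never looked up, so they are silently dropped.
    return result


old_record = {'name': 'Ann', 'age': 30, 'nickname': 'A'}
new_fields = [
    {'name': 'full_name', 'aliases': ['name']},  # renamed
    {'name': 'age'},                             # unchanged
    {'name': 'emails', 'default': []},           # added
]                                                # 'nickname' was removed
resolved = resolve(old_record, new_fields)
```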

Inter-Process Calls
Avro makes RPC really easy: with a few lines of code you can write a full client/server.

First of all, you need to write your protocol schema, specifying each message with its request and response objects.

{
  "namespace": "test.proto",
  "protocol": "Test",

  "messages": {
    "xyz": {
        "request": [{"name": "key", "type": "string"}],
        "response": ["bytes", "null"]
    }
  }
}

A simple java server can be written in this way. You define the respond method, handling each message, and for each message you return the related response object.

static class Responder extends GenericResponder {
  public Responder (Protocol protocol) {
    super(protocol);
  }

  public Object respond (Protocol.Message message, Object request) {
    String msgName = message.getName();

    // Compare strings with equals(): == only compares references.
    if ("xyz".equals(msgName)) {
      // Make a response for 'xyz' get data from request 
      // GenericData.Record record = (GenericData.Record)request;
      // e.g. record.get("key")
      return(response_obj);
    }

    throw new AvroRuntimeException("unexpected message: " + msgName);
  }
}

public static void main (String[] args) 
  throws InterruptedException, IOException
{
  Protocol protocol = Protocol.parse(new File("proto-schema.avpr"));
  HttpServer server = new HttpServer(new Responder(protocol), 8080);
  server.start();
  server.join();
}

The client connects to the server and sends the message with the specified schema format.

from avro import ipc, protocol

def main():
  proto = protocol.parse(open('proto-schema.avpr').read())

  client = ipc.HTTPTransceiver('localhost', 8080)
  requestor = ipc.Requestor(proto, client)
  result = requestor.request('xyz', {'key': 'Test Key'})
  print result

  client.close()

I've made a couple of examples (written in C, Java and Python) that show how to use Avro Serialization and Avro IPC. Sources are available on my github at avro-examples.

Saturday, February 12, 2011

Linux cgroups: Memory Threshold Notifier

Through the cgroups Notification API you can be notified when the status of a cgroup changes. The memory cgroup implements memory thresholds using the cgroups notification API. It allows you to register multiple memory and memsw thresholds and get a notification when usage crosses one of them.

This can be very useful if you want to maintain a cache but you don't want to exceed a certain amount of memory.

To register a threshold, an application needs to:

create an eventfd using eventfd(2);
open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
write a string like "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to cgroup.event_control.
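The same three steps can be sketched in Python. This is a hedged sketch and not the demo code: the function names are mine, and it assumes Linux, a cgroup (v1) memory controller directory passed in as cgroup_dir, and Python 3.10 or later for os.eventfd().

```python
import os


def control_string(event_fd, target_fd, threshold=None):
    """Build the line that gets written to cgroup.event_control."""
    if threshold is None:
        # e.g. memory.oom_control takes no extra argument
        return '%d %d' % (event_fd, target_fd)
    return '%d %d %d' % (event_fd, target_fd, threshold)


def register_threshold(cgroup_dir, threshold):
    """Register a memory threshold and return (event_fd, usage_fd)."""
    # 1. create an eventfd
    event_fd = os.eventfd(0)
    # 2. open the file we want to be notified about
    usage_path = os.path.join(cgroup_dir, 'memory.usage_in_bytes')
    usage_fd = os.open(usage_path, os.O_RDONLY)
    # 3. write "<event_fd> <usage_fd> <threshold>" to cgroup.event_control
    ctrl_fd = os.open(os.path.join(cgroup_dir, 'cgroup.event_control'),
                      os.O_WRONLY)
    try:
        line = control_string(event_fd, usage_fd, threshold)
        os.write(ctrl_fd, line.encode('ascii'))
    finally:
        os.close(ctrl_fd)
    return event_fd, usage_fd
```

The returned event_fd can then be added to a select()/epoll() loop, and os.eventfd_read(event_fd) consumes the notification counter.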

...
// Open cgroup file (e.g. "memory.oom_control" or "memory.usage_in_bytes")
snprintf(path, PATH_MAX, "%s/%s", cgroup, file);
cgroup_ctrl->fd = open(path, O_RDONLY);
cgroup_ctrl->efd = eventfd(0, 0);

// Prepare ctrl_string e.g.
// <event_fd> <fd of memory.oom_control>
// <event_fd> <fd of memory.usage_in_bytes> <threshold>
snprintf(ctrl_string, CTRL_STRING_MAX,
         "%d %d", cgroup_ctrl->efd, cgroup_ctrl->fd);

// Write ctrl_string to cgroup event_control
snprintf(path, PATH_MAX, "%s/cgroup.event_control", cgroup);
fd = open(path, O_WRONLY);
write(fd, ctrl_string, strlen(ctrl_string));
close(fd);
...

Now you can add the eventfd() descriptor to your epoll()/select() event loop and wait for your notification. Here, you can handle your cache release.

...
nfds = epoll_wait(epfd, events, NEVENTS, TIMEOUT);
for (i = 0; i < nfds; ++i) {
    if (events[i].data.fd == cgroup_ctrl->efd) {
        /* Memory Threshold Notification */
        read(cgroup_ctrl->efd, &result, sizeof(uint64_t));

        /* free some memory */
    }
}
...

The full demo source code is available on github at cgroup-mem-threshold demo.

Wednesday, February 9, 2011

HBase I/O: HFile

In the beginning HBase used the MapFile class to store data persistently to disk, and then (from version 0.20) a new file format was introduced. HFile is a specific implementation of MapFile with HBase related features.

HFile doesn't know anything about key and value struct/type (row key, qualifier, family, timestamp, …). As in Hadoop's SequenceFile (Block-Compressed), keys and values are grouped in blocks, and blocks contain records. Each record has two Int values that contain the Key Length and the Value Length, followed by the key and value byte-arrays.
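The record layout described above (key length, value length, key bytes, value bytes) can be sketched with the struct module. I'm assuming 4-byte big-endian integers, which is what Java's DataOutput.writeInt() produces; the helper names are made up and this is not a reader for real HFile blocks, which carry more than just the records.

```python
import struct


def pack_record(key, value):
    """Serialize one record: two ints (lengths), then key and value bytes."""
    return struct.pack('>ii', len(key), len(value)) + key + value


def unpack_records(block):
    """Walk a buffer of consecutive records and return (key, value) pairs."""
    records = []
    pos = 0
    while pos < len(block):
        key_len, value_len = struct.unpack_from('>ii', block, pos)
        pos += 8
        key = block[pos:pos + key_len]
        pos += key_len
        value = block[pos:pos + value_len]
        pos += value_len
        records.append((key, value))
    return records


block = pack_record(b'row1', b'hello') + pack_record(b'row2', b'')
```

Since the format never looks inside key and value, the same code works whatever the key structure is.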

HFile.Writer has only a couple of append overloads, one for the KeyValue class and the other for the byte-array type. As for SequenceFile, each key added must be greater than the previous one. If this condition is not satisfied an IOException is raised.

By default, every 64k of data (key + value), records are squeezed together in a block and the block is written to the HFile OutputStream with the specified compression, if any. Compression algorithm and block size are both arguments of the (long) constructor.

One thing that SequenceFile is not good at is adding metadata. Metadata can be added to a SequenceFile only from the constructor, so you need to prepare all your metadata before creating the Writer.

HFile adds two "metadata" types, one called Meta-Block and the other called FileInfo. Both metadata types are kept in memory until close() is called.

Meta-Block is designed to keep a large amount of data and its key is a String, while FileInfo is a simple Map, preferred for small pieces of information, whose keys and values are both byte-arrays. The region server's StoreFile uses Meta-Blocks to store a BloomFilter, and FileInfo for Max SequenceId, Major compaction key and Timerange info.

On close(), Meta-Blocks and FileInfo are written to the OutputStream. To speed up lookups an index is written for Data-Blocks and Meta-Blocks. Those indices contain n records (where n is the number of blocks) with block information (block offset, size and first key).
At the end a Fixed File Trailer is written. This block contains offsets and counts for all the HFile indices, the HFile version, the compression codec and a few other pieces of information.
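To show why an index of (first key, offset, size) entries is enough to speed up lookups, here is a small sketch. BlockIndex is a hypothetical class, not the HBase one: since blocks are sorted, a binary search on the first keys tells you the only block that can contain a given key.

```python
import bisect


class BlockIndex:
    """One entry per block: first key, plus the block offset and size."""

    def __init__(self):
        self.first_keys = []
        self.blocks = []  # (offset, size), parallel to first_keys

    def add(self, first_key, offset, size):
        if self.first_keys and first_key <= self.first_keys[-1]:
            raise ValueError('blocks must be added in key order')
        self.first_keys.append(first_key)
        self.blocks.append((offset, size))

    def lookup(self, key):
        """Return (offset, size) of the block that may hold key, or None."""
        # Last block whose first key is <= key.
        i = bisect.bisect_right(self.first_keys, key) - 1
        if i < 0:
            return None  # key sorts before the first key of the file
        return self.blocks[i]


index = BlockIndex()
index.add(b'apple', 0, 65536)
index.add(b'mango', 65536, 65536)
index.add(b'peach', 131072, 40000)
```

A lookup for b'banana' returns the first block, so only that block has to be read and decompressed to know whether the key is there.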

Once the file is written, the next step is reading it. You have to start by loading FileInfo: the loadFileInfo() method of HFile.Reader loads the trailer block and all the indices into memory, which makes it easy to query keys. Through the HFileScanner you can seek to a specified key and iterate from there.

The picture above describes the internal format of HFile...

Sunday, January 30, 2011

Zero-Copy Memory

Looking at my code paths, I've seen many methods that read from a data structure into a buffer, then write it back to another data structure, and so on...

For example: read from disk, write to a block cache, read a piece of data and put it in another data structure for processing.

One way to avoid all this memcpy and data duplication is to use a copy-on-write data structure. For strings or "single-block data" this is pretty easy. As you can see in the picture above, you have a chunk of memory that is referenced from different objects. Each object keeps a pointer to the data and, if needed, an offset and a length to reference just a part of the data.

If someone calls write() on one of those objects, the data blob is internally duplicated and modified. In this way the modified object points to a new block and the others still point to the old block.
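Here is a minimal sketch of this single-block copy-on-write scheme. The class names are made up, and I'm using a simple reference counter to decide when a write has to duplicate the data; this is an assumption of the sketch, not necessarily how my C code does it.

```python
class _Buffer:
    """The shared chunk of memory, with a count of objects referencing it."""

    def __init__(self, data):
        self.data = bytearray(data)
        self.refs = 1


class Blob:
    """A (pointer, offset, length) view over a possibly shared buffer."""

    def __init__(self, data=b'', _buffer=None, _offset=0, _length=None):
        self._buffer = _buffer if _buffer is not None else _Buffer(data)
        self._offset = _offset
        if _length is None:
            _length = len(self._buffer.data) - _offset
        self._length = _length

    def slice(self, offset, length):
        """Reference a part of the data without copying it."""
        if offset < 0 or length < 0 or offset + length > self._length:
            raise ValueError('slice out of range')
        self._buffer.refs += 1
        return Blob(_buffer=self._buffer, _offset=self._offset + offset,
                    _length=length)

    def read(self):
        start = self._offset
        return bytes(self._buffer.data[start:start + self._length])

    def write(self, pos, data):
        if pos < 0 or pos + len(data) > self._length:
            raise ValueError('write out of range')
        if self._buffer.refs > 1:
            # Someone else references the block: duplicate our part first.
            self._buffer.refs -= 1
            self._buffer = _Buffer(self.read())
            self._offset = 0
        start = self._offset + pos
        self._buffer.data[start:start + len(data)] = data


base = Blob(b'hello world')
view = base.slice(6, 5)   # no copy here
view.write(0, b'W')       # the copy happens only now
```

After the write, view reads b'World' while base still reads b'hello world'.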

This is a really simple copy-on-write technique and the string class is really suitable for it. But looking at my code, this case doesn't happen often... In the common case, I have objects that reference two or more blocks, and sometimes I need to remove or inject data in the middle of the buffer.

The extent-tree that I use for the Flow Objects in RaleighFS, plus copy-on-write on blocks, has the behavior that I need to avoid all this memcpy and data duplication.

The extent-tree allows you to read from a list of blocks as if it were a single block. Each block has an offset and a length, and blocks are sorted by offset in the tree. This allows you to reach a specified offset fast and to insert or remove data at any specified offset.
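A rough sketch of the idea, with a sorted list and bisect standing in for the tree. This is not my draft implementation: ExtentList is a hypothetical class, and recomputing all the offsets after an insert is a shortcut that a real tree avoids.

```python
import bisect


class ExtentList:
    """Several blocks read as one buffer (a sorted list stands in for the tree)."""

    def __init__(self):
        self._offsets = []  # logical start offset of each extent, sorted
        self._extents = []  # (block, start, length): a slice of a shared block

    def length(self):
        if not self._extents:
            return 0
        return self._offsets[-1] + self._extents[-1][2]

    def insert(self, offset, block):
        """Insert block at a logical offset without copying existing data."""
        if offset < 0 or offset > self.length():
            raise ValueError('offset out of range')
        new = (block, 0, len(block))
        i = bisect.bisect_right(self._offsets, offset) - 1
        if i < 0:
            self._extents.append(new)
        else:
            # Split the extent that covers the offset into head and tail.
            blk, start, length = self._extents[i]
            head = offset - self._offsets[i]
            pieces = []
            if head > 0:
                pieces.append((blk, start, head))
            pieces.append(new)
            if head < length:
                pieces.append((blk, start + head, length - head))
            self._extents[i:i + 1] = pieces
        # Recompute logical offsets (a tree would only touch one path).
        self._offsets = []
        pos = 0
        for _, _, length in self._extents:
            self._offsets.append(pos)
            pos += length

    def append(self, block):
        self.insert(self.length(), block)

    def read(self, offset, size):
        """Copy out up to size bytes starting at a logical offset."""
        if offset < 0 or offset >= self.length():
            return b''
        out = bytearray()
        i = bisect.bisect_right(self._offsets, offset) - 1
        while size > 0 and i < len(self._extents):
            block, start, length = self._extents[i]
            skip = offset - self._offsets[i]
            chunk = min(length - skip, size)
            out += block[start + skip:start + skip + chunk]
            offset += chunk
            size -= chunk
            i += 1
        return bytes(out)


ext = ExtentList()
ext.append(b'hello ')
ext.append(b'world')
ext.insert(6, b'big ')  # lands between the two blocks
ext.insert(2, b'XX')    # splits the first block, no bytes are moved
```

The only copy happens in read(), when the caller asks for a contiguous view; inserting just rearranges (block, start, length) entries.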

Using this extent-tree has the advantage that you avoid memcpy of large blocks from one data structure to another, and you can merge two extent-trees or do other fancy things without allocating new memory and copying data into it. But insert and split require a malloc, so you have to tune your node and extent allocation with an object pool, which is pretty simple thanks to the fixed size of those objects.

I've made a draft implementation that you can find on my github repo.

Sunday, January 16, 2011

Hadoop I/O: Sequence, Map, Set, Array, BloomMap Files

Hadoop's SequenceFile provides a persistent data structure for binary key-value pairs. In contrast with other persistent key-value data structures like B-Trees, you can't seek to a specified key to edit, add or remove it. This file is append-only.

SequenceFile has 3 available formats: an "Uncompressed" format, a "Record Compressed" format and a "Block-Compressed" format. All of them share a header that contains a few pieces of information that allow the reader to recognize the format. There are the Key and Value class names, which allow the Reader to instantiate those classes, via reflection, for reading. There are also the version number and the format flags (Is Compressed, Is Block Compressed). If compression is enabled, the Compression Codec class name field is added to the header.

The sequence file can also contain a "secondary" key-value list that can be used as file metadata. This key-value list can only hold Text/Text pairs, and it is written to the file during the initialization that happens in the SequenceFile.Writer constructor, so you can't edit your metadata afterwards.

As seen, SequenceFile has 3 available formats; the "Uncompressed" and the "Record Compressed" ones are really similar. Each call to the append() method adds a record to the sequence file. The record contains the length of the whole record (key length + value length), the length of the key, and the raw data of key and value. The difference between the compressed and the uncompressed version is whether or not the raw value data is compressed with the specified codec.

In contrast, the "Block-Compressed" format is more aggressive about compression. Data is not written until it reaches a threshold, and when the threshold is reached all keys are compressed together; the same happens for the values and for the auxiliary lists of key and value lengths.
As you can see in the figure on the left, a block record contains a VInt with the number of buffered records and 4 compressed blocks that contain a list with the lengths of the keys, the list of keys, another list with the lengths of the values and finally the list of values. Before each block a sync marker is written.
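The "four compressed buffers" idea can be sketched like this. It is not byte-compatible with a real SequenceFile: I'm assuming fixed 4-byte big-endian integers instead of VInts, zlib instead of a configurable codec, and no sync marker. The function names are hypothetical.

```python
import struct
import zlib


def pack_block(records):
    """Pack (key, value) byte pairs as a count plus four compressed buffers."""
    key_lengths = b''.join(struct.pack('>i', len(k)) for k, _ in records)
    keys = b''.join(k for k, _ in records)
    value_lengths = b''.join(struct.pack('>i', len(v)) for _, v in records)
    values = b''.join(v for _, v in records)

    out = struct.pack('>i', len(records))
    for buf in (key_lengths, keys, value_lengths, values):
        compressed = zlib.compress(buf)
        out += struct.pack('>i', len(compressed)) + compressed
    return out


def unpack_block(data):
    """Inverse of pack_block(): return the list of (key, value) pairs."""
    count, = struct.unpack_from('>i', data, 0)
    pos = 4
    buffers = []
    for _ in range(4):
        size, = struct.unpack_from('>i', data, pos)
        pos += 4
        buffers.append(zlib.decompress(data[pos:pos + size]))
        pos += size
    key_lengths, keys, value_lengths, values = buffers

    records = []
    kpos = vpos = 0
    for i in range(count):
        klen, = struct.unpack_from('>i', key_lengths, i * 4)
        vlen, = struct.unpack_from('>i', value_lengths, i * 4)
        records.append((keys[kpos:kpos + klen], values[vpos:vpos + vlen]))
        kpos += klen
        vpos += vlen
    return records


sample = [(b'key-%03d' % i, b'value ' * 10) for i in range(100)]
packed = pack_block(sample)
```

Compressing all the keys together (and all the values together) gives the codec much more repetition to work with than compressing one value at a time.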

Hadoop SequenceFile is the base data structure for the other types of files, like MapFile, SetFile, ArrayFile and BloomMapFile.

The MapFile is a directory that contains two SequenceFiles: the data file ("/data") and the index file ("/index"). The data file contains all the key-value records, but key N + 1 must be greater than or equal to key N. This condition is checked during the append() operation; if checkKey fails, it throws an IOException "Key out of order".

The index file is populated with the key and a LongWritable that contains the starting byte position of the record. The index doesn't contain all the keys but just a fraction of them; you can specify the indexInterval by calling the setIndexInterval() method. The index is read entirely into memory, so if you have a large map you can set an index skip value that allows you to keep in memory just a fraction of the index keys.
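Here is an in-memory sketch of the data + sparse index pair. SparseIndexMap is a hypothetical class, not the python-hadoop code: it keeps record positions in a list instead of byte positions in a file, but the key order check and the "index one key every N" logic are the same idea.

```python
import bisect


class SparseIndexMap:
    """Sorted records plus an index entry every index_interval records."""

    def __init__(self, index_interval=128):
        self.index_interval = index_interval
        self.data = []        # (key, value) records, sorted by key
        self.index_keys = []  # every index_interval-th key
        self.index_pos = []   # position of that key in self.data

    def append(self, key, value):
        # Same check as MapFile: key N + 1 must be >= key N.
        if self.data and key < self.data[-1][0]:
            raise IOError('key out of order: %r' % (key,))
        if len(self.data) % self.index_interval == 0:
            self.index_keys.append(key)
            self.index_pos.append(len(self.data))
        self.data.append((key, value))

    def get(self, key):
        # Binary search on the sparse index, then a short sequential scan.
        i = bisect.bisect_right(self.index_keys, key) - 1
        if i < 0:
            return None
        pos = self.index_pos[i]
        while pos < len(self.data) and self.data[pos][0] <= key:
            if self.data[pos][0] == key:
                return self.data[pos][1]
            pos += 1
        return None


table = SparseIndexMap(index_interval=4)
for n in range(10):
    table.append('k%02d' % n, n)
```

With an interval of 4, only 3 of the 10 keys are kept in the index, and a get() scans at most 4 records after the binary search.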

SetFile and ArrayFile are based on MapFile, and their implementations are just a few lines of code. Instead of append(key, value), the SetFile has just the key field, append(key), and the value is always the NullWritable instance. The ArrayFile has just the value field, append(value), and the key is a LongWritable that contains the record number, count + 1.

The BloomMapFile extends the MapFile by adding another file, the bloom file "/bloom", which contains a serialization of the DynamicBloomFilter filled with the added keys. The bloom file is written entirely during the close operation.
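If you have never used a bloom filter, this is roughly what the bloom file buys you: a quick "definitely not there" answer before touching the data file. The sketch below is a plain fixed-size filter built on hashlib, not Hadoop's DynamicBloomFilter (which can grow), and all the names and sizes are my own choices.

```python
import hashlib


class BloomFilter:
    """Fixed-size bloom filter over byte-string keys."""

    def __init__(self, nbits=1024, nhashes=3):
        self.nbits = nbits
        self.nhashes = nhashes
        self.bits = bytearray((nbits + 7) // 8)

    def _positions(self, key):
        # Derive nhashes bit positions by salting the key with the hash index.
        for i in range(self.nhashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], 'big') % self.nbits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        """False means the key was never added; True means "maybe"."""
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


bloom = BloomFilter()
for added_key in (b'alpha', b'beta', b'gamma'):
    bloom.add(added_key)
```

A key that was added always answers True; a key that was not added usually answers False, with a small chance of a false positive that depends on the filter size.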

If you want to play with SequenceFile, MapFile, SetFile and ArrayFile without using Java, I've written a naive implementation in Python. You can find it in my github repository python-hadoop.

Saturday, January 8, 2011

[TIP] Daily Repository Diff via Mail

If your 200 mails every morning are still not enough, you can add this script to your daily cron.

The following scripts, which you can find here, allow you to send diffs of the repositories that you follow. The bash script below updates your git/svn/hg repositories and keeps a diff in plain and HTML format (using ansi2html).

git_diff() {
    cd $repo_url/$1

    git_repo_url=`git remote show origin | grep "Fetch URL" | cut -d ' ' -f 5-`
    echo "GIT Diff $1 ($2) - $git_repo_url"

    git fetch
    git diff --color HEAD origin/HEAD | $ansi2html > $diff_dir/$2.html
    git diff HEAD origin/HEAD > $diff_dir/$2.diff
    git merge origin/HEAD
}

hg_diff() {
    cd $repo_url/$1

    hg_repo_url=`hg showconfig | grep paths\.default | cut -d '=' -f 2-`
    echo "HG Diff $1 ($2) - $hg_repo_url"

    hg incoming --patch --git | $ansi2html > $diff_dir/$2.html
    hg incoming --patch --git  > $diff_dir/$2.diff
    hg pull -u
}

svn_diff() {
    cd $repo_url/$1

    svn_repo_url=`svn info | grep URL | cut -d ' ' -f 2-`
    svn_repo_rev=`svn info | grep "Last Changed Rev" | cut -d ' ' -f 4-`
    echo "SVN Diff $1 ($2) - $svn_repo_url"

    svn di $svn_repo_url -r$svn_repo_rev | $ansi2html > $diff_dir/$2.html
    svn di $svn_repo_url -r$svn_repo_rev > $diff_dir/$2.diff
    svn up
}

# Fetch my repos (xxx_diff repo_path diff_name)
git_diff "linux/linux-2.6" "linux-2.6"
svn_diff "apache/lucene" "lucene"
hg_diff "java/jdk" "hotspot-jdk7"

After running the repo-diff script, which updates your favorite repositories and saves the diff files, you can send them using the send-mail script.

# "~" is not expanded inside double quotes, so use $HOME
diff_dir="$HOME/.repo-diffs"
mail_address="th30z@localhost"

for html_file in `ls -1 $diff_dir/*.html` ; do
    repo_name=`basename $html_file | sed 's/\.html$//g'`
    diff_file=`echo $html_file | sed 's/\.html$/\.diff/g'`

    boundary="==`echo $repo_name | md5sum | cut -d ' ' -f -1`"
    alt_boundary="==`echo $boundary | md5sum | cut -d ' ' -f -1`"

    echo "Send Repo Diff $repo_name - $html_file"
    (
        echo "MIME-Version: 1.0"
        echo "Subject: Repo-Diff: $repo_name"
        echo "To: $mail_address"
        echo "Content-Type: multipart/mixed; boundary=$boundary"
        echo    # an empty line separates the headers from the body

        echo "--$boundary"
        echo "Content-Type: multipart/alternative; boundary=$alt_boundary"
        echo

        echo "--$alt_boundary"
        echo "Content-Type: text/plain"
        echo
        cat $diff_file

        echo "--$alt_boundary"
        echo "Content-Type: text/html"
        echo
        cat $html_file

        echo
        echo "--$alt_boundary--"
        echo "--$boundary"
        echo "Content-Type: Application/Binary_Attachment; 
                            name=\"`basename $diff_file`\""
        echo "Content-Disposition: attachment; 
                                   filename=\"`basename $diff_file`\""
        echo "Content-Transfer-Encoding: uuencode"
        echo
        uuencode $diff_file $diff_file
        echo "--$boundary--"
    ) | sendmail $mail_address
done

For each file generated by the repo-diff script, this script sends you a mail with the diff as both body and attachment.
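If you prefer not to assemble the MIME parts by hand, the same message structure (a multipart/alternative body plus the diff as attachment) can be sketched with Python's standard email package. This is an alternative to the bash script, not what the repository contains, and build_diff_mail() and its arguments are names I made up for the example.

```python
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_diff_mail(repo_name, to_address, diff_text, html_text):
    """Build a mail with plain/html bodies and the diff attached."""
    msg = MIMEMultipart('mixed')
    msg['Subject'] = 'Repo-Diff: %s' % repo_name
    msg['To'] = to_address

    # The mail client picks the plain or the html version of the body.
    body = MIMEMultipart('alternative')
    body.attach(MIMEText(diff_text, 'plain', 'utf-8'))
    body.attach(MIMEText(html_text, 'html', 'utf-8'))
    msg.attach(body)

    # The raw diff again, this time as a downloadable attachment.
    attachment = MIMEApplication(diff_text.encode('utf-8'), 'octet-stream')
    attachment.add_header('Content-Disposition', 'attachment',
                          filename='%s.diff' % repo_name)
    msg.attach(attachment)
    return msg


mail = build_diff_mail('lucene', 'th30z@localhost',
                       '--- a/file\n+++ b/file\n',
                       '<pre>--- a/file\n+++ b/file\n</pre>')
```

The result of mail.as_string() can be piped to sendmail just like the output of the subshell in the bash version; boundaries and encodings are generated by the library.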

Scripts are available on my github repository under blog-code/repo-mail-diff.

https://github.com/matteobertozzi/blog-code.git
