Th30z (Matteo Bertozzi Code)

New Blog Url
<p><a href="https://matteobertozzi.github.io/"> https://matteobertozzi.github.io/</a></p>

Data-center Rolling Upgrades coordinated by ZooKeeper
<div style="text-align: left;">
Still playing around trying to improve the daily deploy work in the data-centers.<br />
<br />
The idea is to replace a sequential, semi-manual process with something more automatic that doesn't need human intervention unless a failure happens.<br />
<br />
Services and Deploy rules:</div>
<div style="text-align: left;">
<ul>
<li>Services have dependencies (Service B depends on Service A), so deploy order matters!</li>
<li>You can't bring down all the machines at the same time! </li>
<li>One or more machines can be unreachable during the deploy (network problems, hw failures, ...).</li>
<li>Each machine needs to be self-sufficient!</li>
</ul>
<div>
Must Have (Monitoring)</div>
<div>
<ul>
<li>Current service state of each machine (online/offline, service v1, v2)</li>
<li>Current "deploy" state (Ready to roll?)</li>
</ul>
</div>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj797cnm6cLAS1PueKIlo_0ytN6BzxTy97B-aFQF45zL5feTau_4eeLYlx_8vXEFWMMjfwMMaxpJY9jaHGmIO1zZ-dfTZhe98atCPg8PuUNgjSaJNM6jmqPIb_AT5dP69XdHUmkh6pZOg0/s1600/Zk-Deploy.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj797cnm6cLAS1PueKIlo_0ytN6BzxTy97B-aFQF45zL5feTau_4eeLYlx_8vXEFWMMjfwMMaxpJY9jaHGmIO1zZ-dfTZhe98atCPg8PuUNgjSaJNM6jmqPIb_AT5dP69XdHUmkh6pZOg0/s1600/Zk-Deploy.png" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
The idea is quite simple: use ZooKeeper to keep track of each Service (A, B, ..., K) with the list of machines available (<i>ephemeral znodes</i>) and to keep track of the deploy state <i>("staging")</i>.<br />
<ul>
<li>/dc/current: Contains a list of services with the list of online machines <i>(and their respective service versions)</i>.</li>
<li>/dc/staging: Contains a list of services with the list of machines ready to roll.</li>
<li>/dc/deploy: The deploy order queue; each node represents a service to upgrade.</li>
</ul>
When you're ready to deploy something new you can create the new znodes:<br />
<ul>
<li><span style="background-color: white;">Add services to "staging" with the useful metadata (version, download path, ...)</span></li>
<li><span style="background-color: white;">Define a deploy "order" queue</span></li>
</ul>
<div>
Each service is notified about the new staging version and starts downloading <i style="background-color: white;">(see the "<a href="http://th30z.blogspot.se/2012/06/data-center-deploy-using-torrent-and.html">data-center deploy using torrent and mlock()</a>" post)</i><span style="background-color: white;">. Once the download is completed, the service registers itself in the "staging" queue.</span></div>
</div>
<div>
<br />
Now the tricky part: when can I start switching to the new version? The idea is to specify a quorum for each service. The first machine in the "Staging" queue for the first service in the "Deploy" queue looks for the quorum, and when it's time it shuts itself down and restarts with the new service version. Once done, it adds itself to the "Current" list and removes itself from the staging queue.<br />
<br />
And one by one each machine upgrades itself, until the deploy queue is empty. If a machine was down during the deploy, when it comes back the "Current" node is checked to find which version is the most popular, and that version of the service is started.</div>
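<br />
A minimal sketch of the per-machine agent, assuming the kazoo Python client (the znode paths follow the layout above, but the names and the download hook are made up, and the quorum logic is omitted):<br />
<pre><code># Sketch of a per-machine deploy agent (assumes the kazoo library;
# quorum logic omitted, download_new_version() is a hypothetical hook).
from kazoo.client import KazooClient

SERVICE, MACHINE, VERSION = 'service-a', 'machine-01', 'v1'

zk = KazooClient(hosts='zk1:2181,zk2:2181')
zk.start()

# Register this machine as online (ephemeral: the znode goes away if we die).
zk.create('/dc/current/%s/%s' % (SERVICE, MACHINE),
          VERSION.encode(), ephemeral=True, makepath=True)

@zk.DataWatch('/dc/staging/%s' % SERVICE)
def on_staging_change(data, stat):
    # New staging metadata (version, download path, ...) showed up:
    # fetch the build, then register this machine as ready to roll.
    if data and data.decode() != VERSION:
        download_new_version(data)
        zk.create('/dc/staging/%s/%s' % (SERVICE, MACHINE),
                  ephemeral=True, makepath=True)
</code></pre>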
</div>
Data-center deploy using torrent and mlock()
<i>Every morning you come into the office hoping that the nightly job that produces your blobs has finished... and if everything went fine, you spend the rest of the day hoping that none of the machines fails during the transfer...</i><br />
If you have a service that consumes static data and you have more than one datacenter, you probably face the problem of distributing data to all your service machines every day.<br />
<br />
<div style="text-align: right;">
<span style="text-align: center;">Remember: 60MiB/sec * 1hour = ~210GiB</span></div>
<div style="text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBYEPamLx4Bx5J15hLRWdcteokCExLjP0sQQOwGhl_drzUhtr7spj1PrEc8sGNHjnv_Ynzq-ef8DTcB7NcBIDHJSfV0tjTAlwEWjahTsrUaJetAPZFx676VtXuaG91YJCmBKEoJ68ijjU/s1600/BtDistMachines.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBYEPamLx4Bx5J15hLRWdcteokCExLjP0sQQOwGhl_drzUhtr7spj1PrEc8sGNHjnv_Ynzq-ef8DTcB7NcBIDHJSfV0tjTAlwEWjahTsrUaJetAPZFx676VtXuaG91YJCmBKEoJ68ijjU/s1600/BtDistMachines.png" /></a><br />
So what are the possible solutions to transfer these blobs?</div>
The first solution is copying all the data to one machine in each datacenter; each machine with the data is then responsible for copying everything to all the other machines inside its datacenter.<br />
<i>Note: prefer rsync over scp, since if you lose the connection with scp you need to retransfer everything from byte zero.</i><br />
<br />
But what happens if a machine is down?<br />
One solution is making all the machines part of the distribution, removing identities. Every machine is equal, and every machine needs to fetch these blobs. So, instead of using rsync from the "build" machine to the dist-host and from the dist-host to the service machines, the "build" machine announces "I've new data" and each machine starts fetching it in a collaborative way (using bittorrent).<br />
<br />
Each machine <i>(build-machine/dist-hosts/services)</i> needs to run a <a href="https://github.com/matteobertozzi/misc-common/blob/master/torrent/torrent.py">torrent client</a>; you can implement your own in a few lines of Python using <a href="http://libtorrent.com/">libtorrent</a>. The idea is to fetch the latest blobs.torrent from a feed hosted on the build machine and start downloading. The build machine will be the initial seeder, but then every machine becomes part of the data distribution. By writing your own <a href="https://github.com/matteobertozzi/misc-common/blob/master/torrent/tracker.py">tracker</a> you can also tune your peer selection, preferring machines inside your datacenter or inside your rack to avoid cross-site latency.<br />
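The core of such a client fits in a handful of lines; a rough sketch assuming the libtorrent Python bindings (torrent name and paths are made up):<br />
<pre><code># Minimal fetch-and-seed loop (assumes the libtorrent Python bindings).
import time
import libtorrent as lt

ses = lt.session()
ses.listen_on(6881, 6891)

info = lt.torrent_info('blobs.torrent')  # fetched from the build machine's feed
handle = ses.add_torrent({'ti': info, 'save_path': '/data/blobs'})

while not handle.is_seed():
    s = handle.status()
    print('%.2f%% complete (peers: %d)' % (s.progress * 100, s.num_peers))
    time.sleep(5)
# Don't exit here: once complete, this machine keeps seeding for the others.
</code></pre>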
<br />
Another important thing to remember, if your service relies on the buffer-cache to keep data in memory, is to tell the system to avoid evicting your pages, otherwise you'll probably see your service slow down once you start copying data to that machine... So make sure to mlock() your heavily used pages, or if your blobs fit in memory use <a href="http://hoytech.com/vmtouch/">vmtouch</a> to do the trick (vmtouch -l -d -m 64G blob). Remember to add a memlock entry for your user in /etc/security/limits.d/, otherwise you'll see mlock() fail.<br />
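If you prefer to do the locking from inside your own process instead of via vmtouch, a sketch with ctypes could look like this (Linux-only; the constants come from sys/mman.h):<br />
<pre><code># Lock all current and future pages of this process in RAM (Linux).
import ctypes
import os

MCL_CURRENT, MCL_FUTURE = 1, 2  # from sys/mman.h on Linux

libc = ctypes.CDLL('libc.so.6', use_errno=True)
if libc.mlockall(MCL_CURRENT | MCL_FUTURE) != 0:
    err = ctypes.get_errno()
    # Typically EPERM/ENOMEM: check the memlock limits in
    # /etc/security/limits.d/ mentioned above.
    raise OSError(err, os.strerror(err))
</code></pre>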
<br />
You can find the source code of a simple command-line bittorrent client and a tracker at <a href="https://github.com/matteobertozzi/misc-common/tree/master/torrent">https://github.com/matteobertozzi/misc-common/tree/master/torrent</a>.

Improve and Tune your service/app with some statistics
<i>One of the good things about being in a data-driven company is that every decision must be based on the data you've collected. For some this means just marketing decisions, but you can do the same thing to improve your services, applications and code.</i><br />
<br />
Think about these questions:<br />
<ul>
<li>Is my code faster/slower between version A and B?</li>
<li>Is my code using less/more memory between version A and B?</li>
<li>Is someone still using feature A?</li>
<li>Which are the most used features?</li>
</ul>
If you look at these questions, you can easily realize that these are not problems related just to big companies with lots of data; even your small application can benefit from some stats.<br />
<br />
One of the main stoppers is that it's really difficult to modify your application to add some stats support, because you don't really know what your questions are and you don't know what kind of output you want.<br />
<br />
What you want is just a tiny call like: collect("func(x) time", 10sec)<br />
And some time later you can decide: ok, I want to see the average of func(x) time between jan-feb (version A) and mar-apr (version B).<br />
Or if you want to keep track of features used, you can call something like: collect("feature-used", "My Feature A"). Later you can query for a specific feature X to see when it was last used, or query for the most used features, or something else... it's really difficult to know in advance what you'll want to keep track of.<br />
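To make it concrete, collect() can be as dumb as a fire-and-forget UDP datagram to the collector. A minimal sketch (the wire format and the collector address here are made up, not skvoz's actual protocol):<br />
<pre><code># One-line stat upload from the application's point of view.
# (Hypothetical wire format; not the actual skvoz protocol.)
import json
import socket
import time

_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def collect(key, value, collector=('stats-collector', 8125)):
    datum = {'key': key, 'value': value, 'timestamp': time.time()}
    try:
        _sock.sendto(json.dumps(datum).encode(), collector)
    except socket.error:
        pass  # stats must never break the application

collect('func(x) time', 10.0)
collect('feature-used', 'My Feature A')
</code></pre>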
<br />
<i>Ok, now that you've understood a bit the problem that we want to solve, the fun part begins.</i><br />
The main idea is to have a super lightweight "Stats Uploader" that collects your data with a single line of code and sends it to a collector; later on you can ask questions of your data (completely detached from your application).<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj3HCbQBlQkK6lKinL8Y7SV3buTvRU-nbpiepNLWJkTkvSscm1KqMzljbNEZxmZmLhHvKlgTydN91WE2Mzl65eflfXSTgXnKZnLCpGOo25SvwgEOzzjJNzLzj63JfngKQujWdgw4haY8s8/s1600/CollectionAggregationVisualization.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj3HCbQBlQkK6lKinL8Y7SV3buTvRU-nbpiepNLWJkTkvSscm1KqMzljbNEZxmZmLhHvKlgTydN91WE2Mzl65eflfXSTgXnKZnLCpGOo25SvwgEOzzjJNzLzj63JfngKQujWdgw4haY8s8/s1600/CollectionAggregationVisualization.png" /></a></div>
As you can see from the schema above, your application sends data to a "<i>Collector</i>" service that can store your information in different ways (you can write your own Sink to take care of specific keys and store them in the format that best fits your needs).<br />
The <i>Aggregator</i> fetches the data required to answer your question and applies your custom logic to extract the answer.<br />
The <i>Viewer</i> is just a front-end to display your data nicely, like a web service that plots some charts and tables. It asks questions to the aggregator and displays the results to you.<br />
<br />
<i>The code is available on github at <a href="https://github.com/matteobertozzi/skvoz">https://github.com/matteobertozzi/skvoz</a>.</i><br />
<i>Probably I'll give a talk about this at <a href="https://ep2012.europython.eu/">EuroPython</a> (<a href="https://ep2012.europython.eu/">EP2012</a>).</i>

Embedded Store, under the hood...
This week I've found an interesting bug that can be summarized in this way: <i>the user has no idea of what happens under the hood, and his usage is always against your design.</i><br />
<br />
To give you more context, the bug was related to embedded storage systems, something like bsddb, sqlite or just your simple B+Tree or your on-disk HashTable.<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj58JNPr7Gq0GyfPsrW_pgyH7LStBwsGXpUqeicjWXdMXdiXaHZW8FHBKkf7TqZ-ijEJppKKXSn8qEl8LjkFFCkwyChvEpatUQk2FoojBUgZT98Io0oBEcsNK_somA2vi6GHQlywX98VZE/s1600/SimpleEmbeddedStoreAPI.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj58JNPr7Gq0GyfPsrW_pgyH7LStBwsGXpUqeicjWXdMXdiXaHZW8FHBKkf7TqZ-ijEJppKKXSn8qEl8LjkFFCkwyChvEpatUQk2FoojBUgZT98Io0oBEcsNK_somA2vi6GHQlywX98VZE/s1600/SimpleEmbeddedStoreAPI.png" /></a><br />
So, how is an embedded storage system designed?<br />
As you can see from the simplified schema on the right:
<ul>
<li>The lowest level is the raw access to the on-disk data structure (e.g. B+Tree, HashTable, ...), so each request goes directly to disk.</li>
<li>On top of that, to speed things up, you add a cache to avoid fetching data from disk all the time.</li>
<li>And everything is packed in a nice API that provides some sort of get and put functionality, at maximum speed using the cache.</li>
</ul>
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj80alWy8PgMNOvvPPyuQHdhzKfJLBOjWjNrMEJ_ByKZbPgyVd2Cw3JqYlFV6zzU_FV8tPKVwJSQ8LPvyQNnYka3eSf4Aka-B3-b-6j7mbEh82xRLjKpx-eJ9rp-1Eybd2WeVel7X294bo/s1600/EmbedStoreUsage1.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj80alWy8PgMNOvvPPyuQHdhzKfJLBOjWjNrMEJ_ByKZbPgyVd2Cw3JqYlFV6zzU_FV8tPKVwJSQ8LPvyQNnYka3eSf4Aka-B3-b-6j7mbEh82xRLjKpx-eJ9rp-1Eybd2WeVel7X294bo/s1600/EmbedStoreUsage1.png" /></a>Everything seems fine, You can create an application that access your library capable of handling tons of request without accessing the disk due to the cache layer and so on, and you can think even to build a service to be able to query your storage from a remote host, and here the problems begin.<br />
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhbSBmJN8CmYvAQQXPsZjhxh-FoAftsQqgKt1O3_CiXqwTydCf96d99F7zdikKxHiXP4oi7V1BFIsL6gJGsF6fNU_E1Cu5WxctNug2PywTxKsjE6Riu8iR8wh92MfSc3vsw1FkQz1ddIWA/s1600/EmbedStoreUsage2.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhbSBmJN8CmYvAQQXPsZjhxh-FoAftsQqgKt1O3_CiXqwTydCf96d99F7zdikKxHiXP4oi7V1BFIsL6gJGsF6fNU_E1Cu5WxctNug2PywTxKsjE6Riu8iR8wh92MfSc3vsw1FkQz1ddIWA/s1600/EmbedStoreUsage2.png" /></a><br />
Your first super happy user arrives and decides to build his super scalable infrastructure with your embedded storage.<br />
..and what is the easy way to get a super scalable service? Obviously adding some threads... but threads are not a problem, because the user has learned the lesson and knows that he should not use shared variables. So the brilliant solution is that each thread has its own instance of the "storage object"; to be more clear, each thread does something like db.open("super-storage.db")<br />
<br />
Everything seems fine... but after a couple of days the user started crying... sometimes data is missing, logs contain strange page-not-found messages, some parts of the store are corrupted, and so on...<br />
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibuhaqbJvsiBlqD_VmDiLiy5A8-xOaOb9TsjoCW-UTDiVadouT5YNvVbJ0AJYXVegUuT0PRoSR-LOnyfgpY_E612hMGZZWVwzO14OjVJ3UySJPautJYq9LK4aiqLHui-qOIvKDLsK-CFY/s1600/EmbedStoreUsage3.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibuhaqbJvsiBlqD_VmDiLiy5A8-xOaOb9TsjoCW-UTDiVadouT5YNvVbJ0AJYXVegUuT0PRoSR-LOnyfgpY_E612hMGZZWVwzO14OjVJ3UySJPautJYq9LK4aiqLHui-qOIvKDLsK-CFY/s1600/EmbedStoreUsage3.png" /></a>Can you spot the problem? Yes, is the cache...<br />
No one is aware of the changes in the file, every thread use its own cache, and the first request to a block not in cache ends up to create some bad state.<br />
<br />
So the solution for the user is to use the library as you've designed it, with just one thread/process/whatever accessing the store file.<br />
<br />
But if you want to slow down your super cool library and make the user happy, you can always add an ID to the super-block: every time the user requests something, you fetch the super-block from disk, compare it with the one in cache, and if they differ you can just invalidate all the caches...
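<br />
A sketch of that workaround in Python (illustrative only, names made up): stamp the super-block with an ID that changes on every write, and have each reader compare it before trusting its cache.<br />
<pre><code># Sketch: drop the cache when another instance bumped the super-block ID.
# (Illustrative; _lookup() is a hypothetical cache-or-disk fetch.)
class Store(object):
    def __init__(self, path):
        self.path = path
        self.cache = {}
        self.cached_super_id = self._read_super_id()

    def _read_super_id(self):
        with open(self.path, 'rb') as f:
            return f.read(8)  # e.g. an 8-byte ID stored in the super-block

    def get(self, key):
        disk_id = self._read_super_id()  # the extra read that slows you down
        if disk_id != self.cached_super_id:
            self.cache.clear()           # someone else modified the file
            self.cached_super_id = disk_id
        return self._lookup(key)         # hits the cache, else reads from disk
</code></pre>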
Thoughts on bucket flushing, block size and compression...
<div class="separator" style="clear: both; text-align: left;">
<i>This post is just a collection of thoughts on how to store data. I hope to get some feedback and new ideas from you guys, thanks!</i></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Since I've started working on B*Trees, I've always assumed a fixed block size. With this "restriction" & "simplification", you can easily come up with a block in this format:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjRC3WuDQ91ThurTvGHzFIn0K7D-O7nv4-1K6bNzZbGJSdAxBZZ732B6pGr9m6cX9l6a4HSthkCssAVFV0R8PrfmAmYrSFA-Y1aeQeboXZIvllidrUOtD2LBiu00mMg_W3CidRWc57N4w8/s1600/Bucket-Fixed-BlkSize.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjRC3WuDQ91ThurTvGHzFIn0K7D-O7nv4-1K6bNzZbGJSdAxBZZ732B6pGr9m6cX9l6a4HSthkCssAVFV0R8PrfmAmYrSFA-Y1aeQeboXZIvllidrUOtD2LBiu00mMg_W3CidRWc57N4w8/s1600/Bucket-Fixed-BlkSize.png" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
Keys are packed together at the beginning of the block and values are packed together at the end of the block, growing toward the center. In this way you don't have to move data when a new key is added, or keys & values when a value grows.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Assuming that your inserts in the block are already sorted (e.g. flushing a "memstore" to disk), this way you can even compress keys & values with different algorithms.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<i>...but, with the fixed-size block and values starting from the end, you need to allocate a full block.</i></div>
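For what it's worth, here's a toy version of that fixed-size layout (keys packed from the front with small offset entries, values from the end; sizes made up, no compression):<br />
<pre><code># Toy fixed-size block: keys grow from the front, values from the end.
# (Illustrative sketch of the layout above, not real storage code.)
import struct

BLOCK_SIZE = 4096

def pack_block(items):
    # items: sorted list of (key, value) byte-string pairs
    block = bytearray(BLOCK_SIZE)
    koff, voff = 0, BLOCK_SIZE
    for key, value in items:
        if koff + 4 + len(key) > voff - len(value):
            raise ValueError('block full')
        voff -= len(value)
        block[voff:voff + len(value)] = value          # values: tail, growing down
        struct.pack_into('>HH', block, koff, len(key), voff)
        koff += 4
        block[koff:koff + len(key)] = key              # keys: head, growing up
        koff += len(key)
    return bytes(block)
</code></pre>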
<br />
In contrast, you can "stream" your data and flush it just a few KiB at a time. In this case you don't have to allocate a full block, but you lose the ability to use different compressions for keys & values and the ability to keep in memory only the keys without doing memcpys.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiQiKlBMsX-br2x-GgbyyT7GlWoocTouszL3CqEwfa6BRd4KHO_8Ov1RVVsclxK_DXnQPDADAu_CbBPjGex6_cv5HUcjQ0L9WNOEKB2ckW1V08aKMDowmctVbnou36LFprqr5StxxAyrvU/s1600/Bucket-Streming.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiQiKlBMsX-br2x-GgbyyT7GlWoocTouszL3CqEwfa6BRd4KHO_8Ov1RVVsclxK_DXnQPDADAu_CbBPjGex6_cv5HUcjQ0L9WNOEKB2ckW1V08aKMDowmctVbnou36LFprqr5StxxAyrvU/s1600/Bucket-Streming.png" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Another possibility is to traverse the data twice, the first time writing the keys and the second time writing the values. In this case you keep the different compression features, but if you're using compression you're not able to stop after a specified block-size threshold, because you don't know how much space each key & value takes.</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkD2cjcdcCG3cmG3WJ8KTsnJwjjyAx4q-_9vi00wjL4zX_z-VSBJdxyPo2pbGMxeCWnoHojxDJiXOWk0oSKkO1wABJuDH7YMokG25C-xRt5CYoYgvYVyJYR0jmRXm5iIv9M5nq6C2xQNc/s1600/Bucket-Fixed-Items.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkD2cjcdcCG3cmG3WJ8KTsnJwjjyAx4q-_9vi00wjL4zX_z-VSBJdxyPo2pbGMxeCWnoHojxDJiXOWk0oSKkO1wABJuDH7YMokG25C-xRt5CYoYgvYVyJYR0jmRXm5iIv9M5nq6C2xQNc/s1600/Bucket-Fixed-Items.png" /></a></div>
<br />
<div>
<i>Any comments, thoughts, ideas?</i></div>Matteo Bertozzihttp://www.blogger.com/profile/04630649852079843031noreply@blogger.com0tag:blogger.com,1999:blog-8136012147929167933.post-68694497465699035562012-02-26T04:39:00.001-08:002012-02-26T22:01:47.189-08:00Moved to Stockholm & Back to Music and Data!<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEge7YsLLC7VP5Chffe_e_FN-AIpq2pQ4t5BnpwQ7ELRT3BqSmvPZ9O1G5mnajBhP3rxbdnw0YzpRsiCFBnfKjJFvWU71KLUbub8oatwBMiLEmZhXJL66Tr6511_7ySyAkw0qI_meDdmiqE/s1600/LDSM-Spotify.JPG" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEge7YsLLC7VP5Chffe_e_FN-AIpq2pQ4t5BnpwQ7ELRT3BqSmvPZ9O1G5mnajBhP3rxbdnw0YzpRsiCFBnfKjJFvWU71KLUbub8oatwBMiLEmZhXJL66Tr6511_7ySyAkw0qI_meDdmiqE/s400/LDSM-Spotify.JPG" width="297" /></a></div>
A couple of weeks ago I've started my new job at <a href="http://www.spotify.com/">Spotify AB</a> (<a href="http://maps.google.com/maps?q=Stockholm,+Sweden&z=5">Stockholm, Sweden</a>).<br />
<br />
The last two years at <a href="http://www.develer.com/">Develer</a> (<a href="http://maps.google.com/maps?q=Florence,+Italy&z=6">Florence, Italy</a>) were fantastic: great environment, great people, great conferences (<a href="http://www.pycon.it/pycon4">PyCon4</a>, <a href="http://ep2012.pycon.it/">Euro Python</a>, <a href="http://www.qtday.it/">QtDay</a>), and I have to especially thank <i>Giovanni Bajo</i>, <i>Lorenzo Mancini</i>, <i>Simone Roselli</i> and <i>Francesco Pallanti</i> (<a href="http://www.ariazero.it/">AriaZero</a>), and many more, for all the support in these two years. Thanks again guys!<br />
<br />
...but now I'm here, new company, new country, new language (funny language) and new challenges.<br />
<br />
Stockholm is beautiful and is not as cold as I had imagined (even at -18 degrees Celsius), but I'm still not able to find good biscuits. How can you live without biscuits?<br />
<br />
Since my new job is more about networking & data, new blog posts will be slightly different from the previous ones... less UI-oriented and more data & statistics oriented.<br />
<br />
Keep an eye on the interesting <a href="http://meetup.com/cities/se/stockholm/">meetups in Stockholm</a>; I will be there. <i>(Next one is <a href="http://www.meetup.com/pysthlm/">Python Stockholm</a>)</i>.<br />
<br />
<div style="text-align: right;">
<i>...And don't forget to use <a href="http://www.spotify.com/">Spotify</a> (Love, Discover, Share Music)!</i></div>

AesFS - Encrypted File-System layer
Last week I've spent a couple of hours playing with OpenSSL and CommonCrypto, and the result is a tiny layer on top of the file-system to encrypt/decrypt files using <a href="http://it.wikipedia.org/wiki/Advanced_Encryption_Standard">AES</a>.<br />
<br />
The source code is available on <a href="https://github.com/matteobertozzi">my github</a> at <a href="https://github.com/matteobertozzi/misc-common/tree/master/aesfs">misc-common/aesfs</a>. To build just run '<i>python build.py</i>' and the result is placed in <i>./build/aesfs/</i> and <i>./build/aespack/</i>. AesPack is a command line utility to pack and unpack single files, while aesfs is a fuse file-system.<br />
<br />
You can run aesfs with:<br />
<pre><code>aesfs -o root=/my/plaindir /mnt/encrypted</code></pre>
Since AesFS sits on top of your file-system, you have to specify (with -o root) the directory where the underlying files are stored, while /mnt/encrypted is the mount point where you can read/write your files in the clear.<br />
<br />
Files are written in blocks of 4k (8 bytes of header, 16 extra bytes for AES, and 4072 bytes of data). Check <a href="https://github.com/matteobertozzi/misc-common/blob/master/aesfs/src/block.h"><i>block.h</i></a> and <i><a href="https://github.com/matteobertozzi/misc-common/blob/master/aesfs/src/block.h">block.c</a></i> for more information.
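As a rough sketch of what packing one such block could look like (illustrative Python using PyCrypto; the header magic here is made up and the real layout in block.c differs):<br />
<pre><code># Rough sketch of one encrypted block: header + IV + AES-CBC payload.
# (Illustrative only; the real format is defined in aesfs's block.c.)
import os
import struct
from Crypto.Cipher import AES

def pack_block(key, data):
    # key: 16/24/32 raw bytes; data: at most one block worth of bytes
    header = struct.pack('>II', 0xAE5F0001, len(data))  # made-up magic + length
    iv = os.urandom(16)                                 # the 16 extra bytes
    pad = (16 - len(data) % 16) % 16                    # CBC needs 16-byte multiples
    cipher = AES.new(key, AES.MODE_CBC, iv)
    return header + iv + cipher.encrypt(data + b'\0' * pad)
</code></pre>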
Drawing Charts with Python & Qt4
Back again with some graphics stuff that can be useful in monitoring applications.<br />
As you can imagine, these charts are 100% QPainter. The source code is available on <a href="https://github.com/matteobertozzi/blog-code/blob/master/qt4-charts/chart.py">blog-code@github</a>.<br />
<br />
<b>Pie Chart</b><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgjgzItW4QxzwNDjZPTwgfxeeKe0grz5h86-PxUy0GPIQ5CnIXaCRYDVTPLsR55b3YOdUZCGkE-XXC3_BkGqdUYz2ZijJxGMd4UmnfhKkahI-Pyo1Aa3YMeU1fkggg02uxZ2WLcscgPhSY/s1600/pie.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="140" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgjgzItW4QxzwNDjZPTwgfxeeKe0grz5h86-PxUy0GPIQ5CnIXaCRYDVTPLsR55b3YOdUZCGkE-XXC3_BkGqdUYz2ZijJxGMd4UmnfhKkahI-Pyo1Aa3YMeU1fkggg02uxZ2WLcscgPhSY/s200/pie.png" width="200" /></a>
<br />
<pre>table = DataTable()
table.addColumn('Lang')
table.addColumn('Rating')
table.addRow(['Java', 17.874])
table.addRow(['C', 17.322])
table.addRow(['C++', 8.084])
table.addRow(['C#', 7.319])
table.addRow(['PHP', 6.096])
chart = PieChart(table)
chart.save('pie.png', QSize(240, 240), legend_width=100)
</pre>
<br />
The usage is really simple: you create your table with columns and data, create a chart object using that table, and then call the draw() or save() method to show/save the chart somewhere.<br />
<br />
<b>Scattered Chart</b>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhN7_uu2WPDCsEBA1xnOLBt813oaWOz0zcmAwGWYYTo3hr8PAYM2gWrLR2ylpiVRWWYnTEO6Q_cjggcwYRNVoA-xAsaKm_sMCQbLvIWZzV7xwov-HkF73uq5XxR1Um4Z7YR1I739YSixgA/s1600/scatter.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="152" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhN7_uu2WPDCsEBA1xnOLBt813oaWOz0zcmAwGWYYTo3hr8PAYM2gWrLR2ylpiVRWWYnTEO6Q_cjggcwYRNVoA-xAsaKm_sMCQbLvIWZzV7xwov-HkF73uq5XxR1Um4Z7YR1I739YSixgA/s320/scatter.png" width="320" /></a></div>
<pre>chart = ScatterChart(table)
chart.haxis_title = 'Proc input'
chart.haxis_vmin = 0
chart.haxis_vmax = 16
chart.haxis_step = 2
chart.vaxis_title = 'Quality'
chart.vaxis_vmin = 90
chart.vaxis_vmax = 104
chart.vaxis_step = 1
</pre>
<br />
<br />
You can customize the min/max values and the step of the horizontal and vertical axes, or you can use the defaults calculated from your data. You can also set the reference column with setHorizontalAxisColumn() or setVerticalAxisColumn().<br />
<br />
<b>Area Chart</b><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhouEGOVinWOYAUXp_BGPOCHSERP01aEXUdAVvoMUnibDqiQsREA1Aa6Mae0qeKDa8i-krDgjCb3kahbRCDWJzfDK8_iJKJFb0MGQG7Blg1OGRQUyJndybSwATWhNi3NK86f1yg9V3TH9U/s1600/area.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="153" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhouEGOVinWOYAUXp_BGPOCHSERP01aEXUdAVvoMUnibDqiQsREA1Aa6Mae0qeKDa8i-krDgjCb3kahbRCDWJzfDK8_iJKJFb0MGQG7Blg1OGRQUyJndybSwATWhNi3NK86f1yg9V3TH9U/s320/area.png" width="320" /></a></div>
<pre>table = DataTable()
table.addColumn('Time')
table.addColumn('Site 1')
table.addColumn('Site 2')
table.addRow([ 4.00, 120, 500])
table.addRow([ 6.00, 270, 460])
table.addRow([ 8.30, 1260, 1120])
table.addRow([10.15, 2030, 540])
table.addRow([12.00, 520, 890])
table.addRow([18.20, 1862, 1500])
chart = AreaChart(table)
chart.setHorizontalAxisColumn(0)
chart.haxis_title = 'Time'
chart.haxis_vmin = 0.0
chart.haxis_vmax = 20.0
chart.haxis_step = 5
chart.save('area.png', QSize(400, 240), legend_width=100)
</pre>
<br />
<br />
<b>Line Chart</b><br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgaHax31Jz930blX9bUBCymhDNoFPY9sSt3ummvj3YMUsqJS6d-b42FSQF2bvmmgVGqgCrFx0TEtxrD_JEjzk1MP7FhmknLhdEjDZu_f2tAVm4GlMi0Tl4E3ebR1hK_TzEGCWeF8nBrJKE/s1600/line.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="153" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgaHax31Jz930blX9bUBCymhDNoFPY9sSt3ummvj3YMUsqJS6d-b42FSQF2bvmmgVGqgCrFx0TEtxrD_JEjzk1MP7FhmknLhdEjDZu_f2tAVm4GlMi0Tl4E3ebR1hK_TzEGCWeF8nBrJKE/s320/line.png" width="320" /></a>
<br />
<br />
<pre>chart = LineChart(table)
chart.setHorizontalAxisColumn(0)
chart.haxis_title = 'Time'
chart.haxis_vmin = 0.0
chart.haxis_vmax = 20.0
chart.haxis_step = 2
</pre>
<br />
<br />
<br />
Once again, the code is available on github at <a href="https://github.com/matteobertozzi/blog-code/blob/master/qt4-charts/chart.py">blog-code/qt4-charts/chart.py</a>.

Don't miss the Florence Qt Day 2012
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh8z9lNhU7h1dxaXs8tSvzQlB3uUeOmdXdTctNbvHjNuDFSQzfpfwUlSdqsoIttKqISPh2BxMYk9saECZBjhyphenhyphenDAWIP5K4WpTF262bQKQDVrfbUZaWlM5I6MYLZt28eaB1Cpyvp4IlE7JhY/s1600/badge_320x192.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh8z9lNhU7h1dxaXs8tSvzQlB3uUeOmdXdTctNbvHjNuDFSQzfpfwUlSdqsoIttKqISPh2BxMYk9saECZBjhyphenhyphenDAWIP5K4WpTF262bQKQDVrfbUZaWlM5I6MYLZt28eaB1Cpyvp4IlE7JhY/s1600/badge_320x192.jpg" /></a></div>
The conference will take place on 27/28 January 2012 at the AC Hotel Firenze Porta al Prato (Florence, Italy). <b><i>And it is Free!</i></b><br />
<br />
<ul>
<li>The Qt Project</li>
<li>Qt 5.0</li>
<li>Qt Quick</li>
<li>Qt WebKit</li>
<li>Performance & Profiling</li>
<li>Qt in Use</li>
<li>...And many more</li>
</ul>
<br />
<br />
Take a look at <a href="http://www.qtday.it/">http://www.qtday.it</a> for more information.

RaleighFS to enter the in-memory key-value store market
<i>A couple of guys asked me about RaleighFS and why it is called a File-System instead of a Database. The answer is that the project started back in 2005 as a simple Linux kernel file-system, and then evolved into something different.</i><br />
<br />
<b>Abstract Storage Layer</b><br />
I like to say that RaleighFS is an Abstract Storage Layer, because its main components are designed to be pluggable. For example the namespace can be flat or hierarchical, and the other objects don't feel the difference.<br />
<ul>
<li>Store Multiple Objects with different Types (HashTable, SkipList, Tree, Extents, Bitmap, ...)</li>
<li>Each Object has its own on-disk format (Log, B*Tree, ...).</li>
<li>Observable Objects - Get notified when something changes.</li>
<li>Flexible Namespace & Semantic for Objects.</li>
<li>Various Plain-Text & Binary Protocol Support (Memcache, ...)</li>
</ul>
<br />
<b>A New Beginning...</b><br />
Starting a few weeks ago, I've decided to rewrite and refactor a bit of code, stabilize the API and, this time, try to bring the file-system and the network layer close to a stable release.<br />
<b>First Steps are:</b><br />
<ul>
<li>Release a functional network layer as soon as I can.</li>
<li>Providing a pluggable protocol interface.</li>
<li>Implement a <a href="http://memcachedb.googlecode.com/svn/trunk/doc/protocol.txt">memcache</a>-capable protocol and others.</li>
</ul>
So, these first steps are all about networking, and unfortunately this means dropping the sync part and keeping just the in-memory code (the file-system flushes on memory pressure).<br />
<br />
<b>Current Status:</b><br />
Starting from today, some code is available on github under the <a href="https://github.com/matteobertozzi/RaleighFS">raleighfs</a> project.<br />
<ul>
<li><i>src/zcl</i> contains the abstraction classes and some tools used by every piece of code.</li>
<li><i>src/raleighfs-core</i> contains the file-system core module.</li>
<li><i>src/raleighfs-plugins</i> contains all the file-system's pluggable objects and semantics layers.</li>
<li><i>src/raleigh-server</i> currently contains the entry point to run a memcache-compatible server (memcapable text protocol) and a redis get/set interface server. The in-memory storage is relegated to engine.{h,c} and is currently based on a Chained HashTable, a Skip List or a Binary Tree.</li>
</ul>
<br />
<b>How it Works</b><br />
As I said before, the entry point is the ioloop, which allows clients to interact through a specified protocol with the file-system's objects. Each "protocol handler" parses its own format, converts it to the file-system one, and enqueues the request to a RequestQ that dispatches it to the file-system. When the file-system has the answer, it pushes the response into the RequestQ and the client is notified. The inverse process is applied on the way back: the file-system protocol is parsed and converted into the client one.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhlG9Xc2QxVESbIr_P0_z7r15CEpjnKwKKfDqpaMX3qOr1eOxKEQF7QQs8LHZ9vbbT-q5fSP15iXm8lAbbNJhTU_jwr6-0ECI7itA3rKgsEi7a54Wr4_sZ2EcmXtMO3Tt8nIIl7QLGGQAU/s1600/RaleighFSv5-550.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhlG9Xc2QxVESbIr_P0_z7r15CEpjnKwKKfDqpaMX3qOr1eOxKEQF7QQs8LHZ9vbbT-q5fSP15iXm8lAbbNJhTU_jwr6-0ECI7itA3rKgsEi7a54Wr4_sZ2EcmXtMO3Tt8nIIl7QLGGQAU/s1600/RaleighFSv5-550.png" /></a></div>
<br />
Having the RequestQ has a couple of advantages: the first one is that you can easily wrap a protocol to communicate with the file-system, the other one is that the RequestQ can dispatch requests to different servers. Another advantage is that the RequestQ can operate as a read-write lock for each object, allowing the file-system to use fewer locks...<br />
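An easy way to poke at such a server is to speak the memcache text protocol directly, e.g. from Python (assuming the server listens on the standard memcache port 11211):<br />
<pre><code># Talk to the server with the raw memcache text protocol.
import socket

s = socket.create_connection(('localhost', 11211))
s.sendall(b'set mykey 0 0 5\r\nhello\r\n')
print(s.recv(100))  # expected: STORED
s.sendall(b'get mykey\r\n')
print(s.recv(100))  # expected: VALUE mykey 0 5, hello, END
s.close()
</code></pre>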
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj-7XvdEWCyTP3UVnA3UupV3gtDb30gYkDXYvQ_A4cbhAR4s0g7rhR8VCx8ZKgf7hvE5Il2xlR12vX9BtVMZ0JBHsO_Ivsbe4SdIOath6PCUe6Btk6vPRDaDeUaKPptjkVx0lRXMFQEOFE/s1600/mc-bench-i5.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj-7XvdEWCyTP3UVnA3UupV3gtDb30gYkDXYvQ_A4cbhAR4s0g7rhR8VCx8ZKgf7hvE5Il2xlR12vX9BtVMZ0JBHsO_Ivsbe4SdIOath6PCUe6Btk6vPRDaDeUaKPptjkVx0lRXMFQEOFE/s640/mc-bench-i5.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<i>For more information ask me at theo.bertozzi (at) gmail.com.</i>

Getting started with Avro
Apache <a href="http://avro.apache.org/">Avro</a> is a language-neutral data serialization system. Its own data format can be processed by many languages (currently C, C++, Python, Java, Ruby and PHP).<br />
<br />
Avro provides:<br />
<ul><li>Rich data structures.</li>
<li>A compact, fast, binary data format.</li>
<li>A container file, to store persistent data.</li>
<li>Remote procedure call (RPC).</li>
</ul><br />
<span class="Apple-style-span" style="font-size: large;">Data Schema</span><br />
Avro relies on schemas that specify which fields and types an object is made of. In this way, each datum is written with no per-value overhead. When Avro data is stored in a file, its schema is stored with it, so that files may be processed later by any program.<br />
<br />
Avro has traditional <a href="http://avro.apache.org/docs/current/spec.html#schema_primitive">Primitive Types</a> like int, float, string, ... and other <a href="http://avro.apache.org/docs/current/spec.html#schema_complex">Complex Types</a> like enum, record, array, ... You can use these types to create your own complex types, like in the example below:<br />
<pre><code>{
"type": "record",
"name": "Person",
"fields": [
{"name": "name", "type": "string"},
{"name": "age", "type": "int"},
{"name": "emails", "type": {"type": "array", "values": "string"}},
{"name": "father", "type": "Person"},
{"name": "mother", "type": "Person"},
]
}</code></pre>You can create schemas in code using the Schema class methods, or just parse a json file using the Schema.parse() method.<br />
<br />
<span class="Apple-style-span" style="font-size: large;">Reading & Writing</span><br />
Once you've written the schema, you can start to serialize your objects, generating the right data structure for your types.<br />
<br />
An example of serialization for the schema written above, can be something like:<br />
<pre><code>public GenericData.Record serialize() {
  GenericData.Record record = new GenericData.Record(schema);
  record.put("name", this.name);
  record.put("age", this.age);
  int nemails = this.mails.length;
  GenericData.Array emails = new GenericData.Array(nemails, emails_schema);
  for (int i = 0; i < nemails; ++i)
    emails.add(this.mails[i]);
  record.put("emails", emails);
  record.put("father", this.father.serialize());
  record.put("mother", this.mother.serialize());
  return record;
}
</code></pre>The same code written in python looks like this:<br />
<pre><code>def serialize(self):
    return {'name': self.name,
            'age': self.age,
            'emails': self.mails,
            'father': self.father.serialize(),
            'mother': self.mother.serialize()}
</code></pre>Now that you have the Avro object that reflects the schema, you just have to write it. To do that, use a DatumWriter that uses an Encoder to write the datum to an OutputStream.<br />
<pre><code>...
Encoder encoder = EncoderFactory.get().binaryEncoder(outputStream, null);
GenericDatumWriter datumWriter = new GenericDatumWriter(schema);
datumWriter.write(person.serialize(), encoder);
...
</code></pre>As you can imagine, reading data is quite similar. Pick up a DatumReader and a Decoder that can read from an InputStream and start reading your Avro Objects.<br />
<pre><code>...
Decoder decoder = DecoderFactory.get().binaryDecoder(inputStream, null);
GenericDatumReader datumReader = new GenericDatumReader(schema);
GenericData.Record record = new GenericData.Record(schema);
while (...) {
  datumReader.read(record, decoder);
  // record.get("name")
  // record.get("...");
}
...
</code></pre><br />
<span class="Apple-style-span" style="font-size: large;">Object Container Files</span><br />
Avro includes an object container file format. A file has a schema, and all objects stored in the file must be written according to that schema.<br />
<br />
Since Avro is designed to be used with Hadoop and Map-Reduce, the file format is similar to Hadoop's <a href="http://www.cloudera.com/blog/2011/01/hadoop-io-sequence-map-set-array-bloommap-files/">SequenceFile</a>. Objects are grouped in blocks and each block may be compressed. Between blocks a sync-marker is added to allow efficient splitting of files.<br />
<pre><code>void testWrite (File file, Schema schema) throws IOException {
  GenericDatumWriter datum = new GenericDatumWriter(schema);
  DataFileWriter writer = new DataFileWriter(datum);
  writer.setMeta("Meta-Key0", "Meta-Value0");
  writer.setMeta("Meta-Key1", "Meta-Value1");
  writer.create(schema, file);
  for (Person p : people)
    writer.append(p.serialize());
  writer.close();
}
</code></pre>Since the file contains the schema, you don't need to specify one for the reader. You can extract the schema used by calling the getSchema() method of the reader.<br />
<pre><code>void testRead (File file) throws IOException {
  GenericDatumReader datum = new GenericDatumReader();
  DataFileReader reader = new DataFileReader(file, datum);
  GenericData.Record record = new GenericData.Record(reader.getSchema());
  while (reader.hasNext()) {
    reader.next(record);
    System.out.println("Name " + record.get("name") +
                       " Age " + record.get("age"));
  }
  reader.close();
}
</code></pre><i>For a more "advanced" reading operation (see the File Evolution example), you can specify the expected file schema.</i><br />
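The same container-file round trip is only a few lines in Python; a sketch using the avro package (assuming a simplified Person schema, without the recursive father/mother fields, saved as person.avsc):<br />
<pre><code># Container-file round trip with the Python avro package.
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

schema = avro.schema.parse(open('person.avsc').read())

writer = DataFileWriter(open('people.avro', 'wb'), DatumWriter(), schema)
writer.append({'name': 'John', 'age': 42, 'emails': ['j@example.com']})
writer.close()

reader = DataFileReader(open('people.avro', 'rb'), DatumReader())
for person in reader:
    print(person['name'], person['age'])
reader.close()
</code></pre>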
<br />
<span class="Apple-style-span" style="font-size: large;">Data Evolution</span><br />
The first problem that you'll encounter working with a custom binary format, or even with an XML/JSON-based one, is dealing with data evolution.<br />
<br />
During your application development you will surely have to add, remove or rename some fields in the various data structures. To solve the compatibility problem you have to introduce a "versioning step" that transforms your "Version X" document into a "Version (X + 1)" format.<br />
<br />
Avro has this problem solved by applying some <a href="http://avro.apache.org/docs/current/spec.html#Schema+Resolution">Schema Resolution</a> rules. <i>In brief...</i><br />
<ul><li>If a field is <b>added</b>, old documents don't contain the field and new readers use the default value specified in the schema.</li>
<li>If a field is <b>removed</b>, new readers don't read it and no one cares whether it's present or not.</li>
<li>If a field is <b>renamed</b>, the new schema must contain the old name of the field in its alias list.</li>
</ul><br />
<i>Note that only forward compatibility is covered.</i><br />
<br />
<span class="Apple-style-span" style="font-size: large;">Inter-Process Calls</span><br />
Avro makes RPC really easy: with a few lines of code you can write a full client/server.<br />
<br />
First of all, you need to write your protocol schema, specifying each message with request and response objects.<br />
<pre><code>{
"namespace": "test.proto",
"protocol": "Test",
"messages": {
"xyz": {
"request": [{"name": "key", "type": "string"}],
"response": ["bytes", "null"]
}
}
}</code></pre>A simple Java server can be written in this way: you define the respond method, handling each message, and for each message you return the related response object.<br />
<pre><code>static class Responder extends GenericResponder {
  public Responder (Protocol protocol) {
    super(protocol);
  }

  public Object respond (Protocol.Message message, Object request) {
    String msgName = message.getName();
    if (msgName.equals("xyz")) {
      // Make a response for 'xyz', getting data from the request
      // GenericData.Record record = (GenericData.Record)request;
      // e.g. record.get("key")
      return(response_obj);
    }
    throw new AvroRuntimeException("unexpected message: " + msgName);
  }
}
public static void main (String[] args)
  throws InterruptedException, IOException
{
  Protocol protocol = Protocol.parse(new File("proto-schema.avpr"));
  HttpServer server = new HttpServer(new Responder(protocol), 8080);
  server.start();
  server.join();
}</code></pre>The client connects to the server and sends the message in the specified schema format.<br />
<pre><code>def main():
    proto = protocol.parse(file('proto-schema.avpr').read())
    client = ipc.HTTPTransceiver('localhost', 8080)
    requestor = ipc.Requestor(proto, client)
    result = requestor.request('xyz', {'key': 'Test Key'})
    print result
    client.close()
</code></pre><br />
I've made a couple of examples (written in C, Java and Python) that show how to use Avro Serialization and Avro IPC. Sources are available on my github at <a href="https://github.com/matteobertozzi/Hadoop/tree/master/avro-examples">avro-examples</a>.

Linux cgroups: Memory Threshold Notifier
Through the cgroups Notification API you can be notified about the changing status of a cgroup. The memory cgroup implements memory thresholds using the cgroups notification API: it allows you to register multiple memory and memsw thresholds and get notifications when they are crossed.<br />
<br />
This can be very useful if you want to maintain a cache but you don't want to exceed a certain size of memory.<br />
<br />
To register a threshold, an application needs to:<br />
<ul><li>create an eventfd using eventfd(2);</li>
<li>open memory.usage_in_bytes or memory.memsw.usage_in_bytes;</li>
<li>write a string like "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to cgroup.event_control</li>
</ul><pre>...
// Open cgroup file (e.g. "memory.oom_control" or "memory.usage_in_byte")
snprintf(path, PATH_MAX, "%s/%s", cgroup, file);
cgroup_ctrl->fd = open(path, O_RDONLY);
cgroup_ctrl->efd = eventfd(0, 0);
// Prepare ctrl_string e.g.
// <event_fd> <fd of memory.oom_control>
// <event_fd> <fd of memory.usage_in_bytes> <threshold>
snprintf(ctrl_string, CTRL_STRING_MAX,
"%d %d", cgroup_ctrl->efd, cgroup_ctrl->fd);
// Write ctrl_string to cgroup event_control
snprintf(path, PATH_MAX, "%s/cgroup.event_control", cgroup);
fd = open(path, O_WRONLY);
write(fd, ctrl_string, strlen(ctrl_string));
close(fd);
...
</pre>Now you can add the eventfd() descriptor to your epoll()/select() event loop and wait for notifications. There you can handle your cache release.<br />
<pre>...
nfds = epoll_wait(epfd, events, NEVENTS, TIMEOUT);
for (i = 0; i < nfds; ++i) {
if (events[i].data.fd == cgroup_ctrl->efd) {
/* Memory Threshold Notification */
read(cgroup_ctrl->efd, &result, sizeof(uint64_t));
/* free some memory */
}
}
...
</pre>A full demo source code is available on github at <a href="https://github.com/matteobertozzi/blog-code/tree/master/cgroup-mem-threshold">cgroup-mem-threshold demo</a>.

HBase I/O: HFile
<div class="p1" style="font: 12px Helvetica; margin: 0px;">In the beginning HBase used the MapFile class to store data persistently on disk, and then (from version 0.20) a new file format was introduced. HFile is a specific implementation of MapFile with HBase-related features.</div><div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjzRDnlF7yE739EPD1Q-IwVvO9s-GQXrzqlwpyfHb7Qp4AsPv9LHovRWvRUT3-5JwnaZPwgCX0-If3Uq7pfpEmMI_mtNY9ojrYXLb6P19SpLIVTz7cF-FRPlwffvkKgd0da-ZfEELoBAOU/s1600/HFile-Block-Record.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjzRDnlF7yE739EPD1Q-IwVvO9s-GQXrzqlwpyfHb7Qp4AsPv9LHovRWvRUT3-5JwnaZPwgCX0-If3Uq7pfpEmMI_mtNY9ojrYXLb6P19SpLIVTz7cF-FRPlwffvkKgd0da-ZfEELoBAOU/s1600/HFile-Block-Record.png" /></a></div><div class="p1" style="font: 12px Helvetica; margin: 0px;"><br />
HFile doesn't know anything about key and value structs/types (row key, qualifier, family, timestamp, …). As in Hadoop's SequenceFile (Block-Compressed), keys and values are grouped in blocks, and blocks contain records. Each record has two int values that contain the key length and the value length, followed by the key and value byte-arrays.<br />
<br />
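So a single record can be decoded with nothing more than its two length prefixes; a sketch (assuming 4-byte big-endian ints, as usual in Hadoop serialization):<br />
<pre><code># Decode one HFile-style record: key length, value length, key, value.
import struct

def read_record(buf, offset=0):
    klen, vlen = struct.unpack_from('>ii', buf, offset)
    offset += 8
    key = buf[offset:offset + klen]
    value = buf[offset + klen:offset + klen + vlen]
    return key, value, offset + klen + vlen
</code></pre>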
HFile.Writer has only a couple of append overload methods, one for the KeyValue class and the other for byte-arrays. As for SequenceFile, each key added must be greater than the previous one; if this condition is not satisfied an IOException is raised.</div><div class="p1" style="font: 12px Helvetica; margin: 0px;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEifJGy7ximRVX0JKf0yAlV3Wu889NselGI8xKAKmFYP8rGJHBSAt8mwAK_H1FYEY_RU8KsyGTWq9MAnOK5qVnfDKa4aFVRtpLS2rYqHe9nJd5ojF6fgnZF_iqUKj4NvmdtZTul0EZ9y1js/s1600/HFile.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEifJGy7ximRVX0JKf0yAlV3Wu889NselGI8xKAKmFYP8rGJHBSAt8mwAK_H1FYEY_RU8KsyGTWq9MAnOK5qVnfDKa4aFVRtpLS2rYqHe9nJd5ojF6fgnZF_iqUKj4NvmdtZTul0EZ9y1js/s320/HFile.png" width="153" /></a></div><div class="p1" style="font: 12px Helvetica; margin: 0px;"></div><div class="p1" style="font: 12px Helvetica; margin: 0px;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgLX9Bwxr2xk8doHP42rBoQDruAnyI0DIgrZ8FdN_oGXhmpWx2ZQKnXlTL1AEKY2OhtXYsiD94VcV7Cn-p2CsTvFUO-axFiYGhf5qD9jRgDIjVp0NlaBhsXom4Mv19sDtoriVeqhc5df08/s1600/HFile-Block.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"></a><span class="Apple-style-span" style="color: black;"> </span><br />
<span class="Apple-style-span" style="color: black;">By default each 64k of data (key + value) records are squeezed together in a block and the block is written to the HFile OutputStream with the specified compression, if specified. Compression Algorithm and Block size are both (long)constructor arguments.</span><br />
<br />
<span class="Apple-style-span" style="color: black;">One thing that SequenceFile is not good at, is adding Metadata. Metadata can be added to SequenceFile just from the constructor, so you need to prepare all your metadata before creating the Writer.</span><br />
<br />
<span class="Apple-style-span" style="color: black;">HFile adds two "metadata" type. One called Meta-Block and the other called FileInfo. Both metadata types are kept in memory until close() is called. </span><br />
<br />
<span class="Apple-style-span" style="color: black;">Meta-Block is designed to keep large amount of data and its key is a String, while FileInfo is a simple Map and is preferred for small information and keys and values are both byte-array. Region-server' StoreFile uses Meta-Blocks to store a BloomFilter, and FileInfo for Max SequenceId, Major compaction key and Timerange info.</span><br />
<br />
<span class="Apple-style-span" style="color: black;">On close(), Meta-Blocks and FileInfo is written to the OutputStream. To speedup lookups an Index is written for Data-Blocks and Meta-Blocks, Those indices contains n records (where n is the number of blocks) with block information (block offset, size and first key). </span><br />
<span class="Apple-style-span" style="color: black;">At the end a Fixed File Trailer is written, this block contains offsets and counts for all the HFile Indices, HFile Version, Compression Codec and other few information.</span><br />
<br />
<span class="Apple-style-span" style="color: black;">Once the file is written, the next step is reading it. You've to start by loading FileInfo, the loadFileInfo() of </span><span class="Apple-style-span" style="color: black;">HFile.Reader </span><span class="Apple-style-span" style="color: black;">loads in memory the Trailer-block and all the indices, that allows to easily query keys. Through the HFileScanner you can seek to a specified key, and iterate over.</span></div><div class="p1" style="font: 12px Helvetica; margin: 0px;"><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTXRA0QcWQ_dut8SuiAFapFtEXMvqxdsSMsaIAFLowu4mPSUutUdNWI2ZjBWqG6fTtNUMhgaXXl7NJCh4nHFPhokGmeMMXlxzSeUIQN3PKCoVpY2bhgfPBrPa5hXcY6_2R6kHBNpRtIAc/s1600/HFile-Structure.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTXRA0QcWQ_dut8SuiAFapFtEXMvqxdsSMsaIAFLowu4mPSUutUdNWI2ZjBWqG6fTtNUMhgaXXl7NJCh4nHFPhokGmeMMXlxzSeUIQN3PKCoVpY2bhgfPBrPa5hXcY6_2R6kHBNpRtIAc/s1600/HFile-Structure.png" /></a></div>The picture above, describe the internal format of HFile...</div>Matteo Bertozzihttp://www.blogger.com/profile/04630649852079843031noreply@blogger.com6tag:blogger.com,1999:blog-8136012147929167933.post-60160985736645375242011-01-30T09:32:00.000-08:002011-01-30T09:32:22.754-08:00Zero-Copy Memory<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg11X_gYiYDt1f1d4P7MKicIxce3t4AZRIw_thEg__qe-GrfdXK00Qy4w6-rygz9u52XJZkEAi1JL8ntqo7WN0sFCdNHc30o_qrWAjRo1b2B3xnrK4ncgUqzaZes_-QPP1O34mwM5LPy-s/s1600/RefCount.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="139" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg11X_gYiYDt1f1d4P7MKicIxce3t4AZRIw_thEg__qe-GrfdXK00Qy4w6-rygz9u52XJZkEAi1JL8ntqo7WN0sFCdNHc30o_qrWAjRo1b2B3xnrK4ncgUqzaZes_-QPP1O34mwM5LPy-s/s320/RefCount.png" width="320" /></a>Looking at my code paths, I've seen many methods that read from a data-structure to a buffer and then write back it to another data structure, and so on...<br />
<br />
<i>For example: read from disk, write to a block cache, read a piece of data and put it in another data-structure for processing. </i><br />
<br />
One way to avoid all this memcpy and data duplication is to use a copy-on-write data-structure. For strings or "single-block" data it's pretty easy. As you can see in the picture above, you have a chunk of memory that is referenced by different objects. Each object keeps a pointer to the data and, if needed, an offset and a length to reference just a part of the data.<br />
<br />
If someone calls write() on one of those objects, internally the data-blob is duplicated and modified. In this way the modified object points to a new block and the others keep pointing to the old block.<br />
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2PjpfDAsayjD-R1qTFtXYN4-JBKogJ_ANHHpdpraEb-L4yVeVbMPZsdcTayul-kWLStzTRqZnawkeDb-Lw_JNhXsdps6MEhA9shdFtbkZSuf52hfrNQ8LE9qj2Jot7ca6uoXDdxjI4to/s1600/ZeroMemCopy.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="172" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2PjpfDAsayjD-R1qTFtXYN4-JBKogJ_ANHHpdpraEb-L4yVeVbMPZsdcTayul-kWLStzTRqZnawkeDb-Lw_JNhXsdps6MEhA9shdFtbkZSuf52hfrNQ8LE9qj2Jot7ca6uoXDdxjI4to/s320/ZeroMemCopy.png" width="320" /></a>This is a really simple copy-on-write technique and the string class is really suitable for that. But looking at my code, this case doesn't happen often... In the common case, I've object that references two or more blocks and sometimes happens that i need to remove or inject data in the middle of the buffer.<br />
<br />
The extent-tree that I use for the Flow Objects in RaleighFS, plus copy-on-write on the blocks, gives me the behavior that I need to avoid all this memcpy and data duplication.<br />
<br />
An extent-tree allows you to read from a list of blocks as if it were a single block. Each block has an offset and a length, and blocks are sorted by offset in the tree. This allows you to reach a specified offset fast, and to insert or remove at any specified offset.<br />
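A toy version of the idea (an illustrative sketch using a sorted list instead of a tree, not the RaleighFS code):<br />
<pre><code># Toy extent list: view a list of blocks as one contiguous buffer.
import bisect

class ExtentList(object):
    def __init__(self):
        self.offsets = []  # starting offset of each extent, kept sorted
        self.blocks = []   # the backing byte strings

    def append(self, data):
        start = self.offsets[-1] + len(self.blocks[-1]) if self.blocks else 0
        self.offsets.append(start)
        self.blocks.append(data)

    def read(self, offset, size):
        # Find the extent containing `offset`, then walk forward.
        i = bisect.bisect_right(self.offsets, offset) - 1
        out = []
        while size > 0 and i < len(self.blocks):
            skip = offset - self.offsets[i]
            chunk = self.blocks[i][skip:skip + size]
            out.append(chunk)
            size -= len(chunk)
            offset += len(chunk)
            i += 1
        return b''.join(out)
</code></pre>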
<br />
Using this extent-tree, you avoid memcpying large blocks from one data-structure to another: you can merge two extent-trees or do other fancy things without allocating new memory and copying data into it. Insert and split do require a malloc, so you have to back your node and extent allocation with an object pool, which is pretty simple given the fixed size of those objects.<br />
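As an illustration only (a flat sorted list instead of a real tree, and not the RaleighFS code), here is a Python sketch of reading through an extent list as if it were one contiguous buffer:<br />
<pre>import bisect

class ExtentList(object):
    """Sorted (logical offset, block) extents, readable as one buffer."""
    def __init__(self):
        self.offsets = []   # logical start offset of each extent
        self.blocks = []    # data block referenced by each extent

    def append(self, block):
        offset = self.offsets[-1] + len(self.blocks[-1]) if self.blocks else 0
        self.offsets.append(offset)
        self.blocks.append(block)

    def read(self, offset, size):
        # Locate the extent containing 'offset', then walk forward.
        i = bisect.bisect_right(self.offsets, offset) - 1
        out = []
        while size > 0 and i < len(self.blocks):
            skip = offset - self.offsets[i]
            chunk = self.blocks[i][skip:skip + size]
            out.append(chunk)
            size -= len(chunk)
            offset += len(chunk)
            i += 1
        return b"".join(out)

extents = ExtentList()
extents.append(b"hello ")
extents.append(b"world")
assert extents.read(3, 5) == b"lo wo"   # crosses an extent boundary
</pre>
In a real extent-tree, an insert in the middle just splits one extent and adjusts the offsets of the following ones, with no memmove of the data itself.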
<br />
<i>I've made a draft implementation that you can find on my <a href="https://github.com/matteobertozzi/blog-code/tree/master/zero-copy-mem">github repo</a>.</i>Matteo Bertozzihttp://www.blogger.com/profile/04630649852079843031noreply@blogger.com0tag:blogger.com,1999:blog-8136012147929167933.post-86565088369186976232011-01-16T03:44:00.000-08:002011-01-16T05:12:18.507-08:00Hadoop I/O: Sequence, Map, Set, Array, BloomMap FilesHadoop's SequenceFile provides a persistent data structure for binary key-value pairs. In contrast with other persistent key-value data structures like B-Trees, you can't seek to a specified key to edit, add or remove it. This file is append-only.<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgn7v5W2RizzuokbGKevdfcPlSjsovxiZjeb7te81zwxN8hKRpGAakAFaKvc1puStzopfmz_QfV-P-m8lFGG0d0zHgurQcAKwYAV06ElEI7CyBEU3T9OVKnUXWKe0Njd9peo711qPNtHxs/s1600/seqfile-layout.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="57" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgn7v5W2RizzuokbGKevdfcPlSjsovxiZjeb7te81zwxN8hKRpGAakAFaKvc1puStzopfmz_QfV-P-m8lFGG0d0zHgurQcAKwYAV06ElEI7CyBEU3T9OVKnUXWKe0Njd9peo711qPNtHxs/s400/seqfile-layout.png" width="400" /></a></div><br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjvjF-9sPmNMz3nqqgW5qmnL6IHN9W11EaSBGE1MDwIggljKpe-F4jeVAaoL7O__FOxruXxbEmDeyi9rHawQuiJ3BT-bK6E4czC4goDtUulKVqF-5JTOIRbpMVYx1qK_3QQlt6DLsEiPx4/s1600/SequenceFileHeader.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjvjF-9sPmNMz3nqqgW5qmnL6IHN9W11EaSBGE1MDwIggljKpe-F4jeVAaoL7O__FOxruXxbEmDeyi9rHawQuiJ3BT-bK6E4czC4goDtUulKVqF-5JTOIRbpMVYx1qK_3QQlt6DLsEiPx4/s200/SequenceFileHeader.png" width="158" /></a>SequenceFile has 3 available formats: An "Uncompressed" format, A "Record Compressed" format and a "Block-Compressed". All of them share a header that contains a couple of information that allows the reader to recognize is format. There're Key and Value Class Name that allows the Reader to instantiate those classes, via reflection, for reading. The version number and format (Is Compressed, Is Block Compressed), if compression is enabled the Compression Codec class name field is added to the header.<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgHLuZ5rl-3XW1yMtIKlafKsWl-sDaNLFpHn0347rZhbw2YYfdeMFwC_hNRlqZ83M0RnccytvFIyCk8vCAJPaXPVxDH-cVsQ0k_HWhrfK0CplhS9Z5NngAGupky6OZPh35QWLDX4wfEL08/s1600/SequenceFileMeta.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgHLuZ5rl-3XW1yMtIKlafKsWl-sDaNLFpHn0347rZhbw2YYfdeMFwC_hNRlqZ83M0RnccytvFIyCk8vCAJPaXPVxDH-cVsQ0k_HWhrfK0CplhS9Z5NngAGupky6OZPh35QWLDX4wfEL08/s200/SequenceFileMeta.png" width="150" /></a><br />
<br />
The sequence file can also contain a "secondary" key-value list to be used as file metadata. This key-value list is a set of Text/Text pairs, and it is written to the file during the initialization that happens in the SequenceFile.Writer constructor, so you can't edit your metadata afterwards.<br />
<div><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh_bx_xOCKnfudLJCzmctwief84oxszksIkW5LB6-jAcAEabbDI7ekv3DtTzngFppnSF_5HchuuR0hlxHhGXE9FZMfrAX9i8R7om_sGKBwTu_IWXsej-1hd04jaFVTcr73uDgmDV4bpUbc/s1600/SequenceFileRecord.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="146" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh_bx_xOCKnfudLJCzmctwief84oxszksIkW5LB6-jAcAEabbDI7ekv3DtTzngFppnSF_5HchuuR0hlxHhGXE9FZMfrAX9i8R7om_sGKBwTu_IWXsej-1hd04jaFVTcr73uDgmDV4bpUbc/s200/SequenceFileRecord.png" width="200" /></a><br />
As mentioned, SequenceFile has 3 available formats, and the "Uncompressed" and the "Record Compressed" ones are really similar. Each call to the append() method adds a record to the sequence file. The record contains the length of the whole record (key length + value length), the length of the key, and the raw data of key and value. The only difference is that in the compressed version the value raw data is compressed with the specified codec.<br />
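Based on the layout just described, here is a hedged Python sketch of how a single uncompressed record could be packed (big-endian 32-bit ints, as Java's DataOutput writes them; the header and the periodically-inserted sync markers are left out):<br />
<pre>import struct

def pack_record(key, value):
    # record length = key length + value length, then the key length,
    # then the raw key and value bytes.
    return (struct.pack('>i', len(key) + len(value)) +
            struct.pack('>i', len(key)) +
            key + value)

record = pack_record(b'key-0', b'value-0')
assert len(record) == 4 + 4 + 5 + 7
</pre>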
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgviBgSVFLFH-orQgMnE4fBgp0fENyW3604K3uVshj332xT9jckQ3ARNXZC5UeHNLBQM7KbYsCGD5lBFkxVBIBGY-3OlueEOM4gVrcXwqwbHrV4753HQMIblTVReYu023zkjBoMIhypxOo/s1600/SequenceFileBlock.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="166" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgviBgSVFLFH-orQgMnE4fBgp0fENyW3604K3uVshj332xT9jckQ3ARNXZC5UeHNLBQM7KbYsCGD5lBFkxVBIBGY-3OlueEOM4gVrcXwqwbHrV4753HQMIblTVReYu023zkjBoMIhypxOo/s200/SequenceFileBlock.png" width="200" /></a>In contrast the "Block-Compressed" format is more compression-aggressive. Data is not written until it reach a threshold, and when the threshold is reached all keys are compressed together, the same happens for the values and the auxiliary lists of key and value lengths.<br />
As you can see in the figure on the left, a block record contains a VInt with the number of buffered records and 4 compressed blocks that contain the list of key lengths, the list of keys, the list of value lengths and finally the list of values. Before each block, a sync marker is written.<br />
<br />
Hadoop SequenceFile is the base data structure for the other types of files, like MapFile, SetFile, ArrayFile and BloomMapFile.<br />
<br />
</div><div><div>The <b>MapFile</b> is a directory that contains two SequenceFiles: the data file ("/data") and the index file ("/index"). The data file contains all the key/value records, but key N + 1 must be greater than or equal to key N. This condition is checked during the append() operation: if checkKey fails, it throws an IOException "Key out of order".</div><div><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhaQak47yHjJj1cyrSEV76LeCDTQ0XZsAKpeg0UQHz70IHfy5VwzUbkHx_E1Q1x4oEYXq059Dadt8eGmMTKanxNosI2LP7n4LA3qi8CEvQM1iI5Ez0rQYCu79Spr7K8QnMloPKMEQJSWsM/s1600/bloommap.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="220" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhaQak47yHjJj1cyrSEV76LeCDTQ0XZsAKpeg0UQHz70IHfy5VwzUbkHx_E1Q1x4oEYXq059Dadt8eGmMTKanxNosI2LP7n4LA3qi8CEvQM1iI5Ez0rQYCu79Spr7K8QnMloPKMEQJSWsM/s320/bloommap.png" width="320" /></a>The index file is populated with the key and a LongWritable that contains the starting byte position of the record. The index doesn't contain all the keys but just a fraction of them; you can specify the indexInterval by calling the setIndexInterval() method. The index is read entirely into memory, so if you have a large map you can set an index skip value that allows you to keep in memory just a fraction of the index keys, as in the lookup sketch below.</div><div><br />
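To see why a sparse index is enough, here is an illustrative Python sketch of the lookup strategy (this is not the Hadoop code; a sorted in-memory list stands in for the data file): binary-search the index for the closest preceding key, seek to its position, then scan forward.<br />
<pre>import bisect

def mapfile_get(index_keys, index_positions, records, search_key):
    # index_keys/index_positions: every Nth key and its record position.
    i = bisect.bisect_right(index_keys, search_key) - 1
    if i < 0:
        return None
    # Scan forward from the indexed position until we pass the key.
    for key, value in records[index_positions[i]:]:
        if key == search_key:
            return value
        if key > search_key:
            break
    return None

records = [('k%02d' % n, 'v%02d' % n) for n in range(20)]
index_keys = [records[n][0] for n in range(0, 20, 5)]   # indexInterval = 5
index_positions = list(range(0, 20, 5))
assert mapfile_get(index_keys, index_positions, records, 'k07') == 'v07'
</pre>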
</div><div>SetFile and ArrayFile are based on MapFile, and their implementations are just a few lines of code. The <b>SetFile</b>, instead of append(key, value), has just the key field, append(key), and the value is always the NullWritable instance. The <b>ArrayFile</b> has just the value field, append(value), and the key is a LongWritable that contains the record number (count + 1). The <b>BloomMapFile</b> extends the MapFile by adding another file, the bloom file "/bloom", which contains a serialization of the DynamicBloomFilter filled with the appended keys. The bloom file is written entirely during the close operation.<br />
<br />
<i>If you want to play with SequenceFile, MapFile, SetFile and ArrayFile without using Java, I've written a naive implementation in Python. You can find it in my github repository <a href="https://github.com/matteobertozzi/Hadoop/tree/master/python-hadoop">python-hadoop</a>.</i></div></div>Matteo Bertozzihttp://www.blogger.com/profile/04630649852079843031noreply@blogger.com4tag:blogger.com,1999:blog-8136012147929167933.post-19059521785092033812011-01-08T20:52:00.000-08:002011-01-08T20:52:57.323-08:00[TIP] Daily Repository Diff via MailIf your 200 mails every morning are still not enough, you can add these scripts to your daily cron.<br />
<br />
The following scripts, which you can find <a href="https://github.com/matteobertozzi/blog-code/blob/master/repo-mail-diff/">here</a>, allow you to mail yourself diffs of the repositories that you follow. The bash functions below update your git/svn/hg repositories and keep a diff in both plain and HTML format (using <a href="http://www.pixelbeat.org/scripts/ansi2html.sh">ansi2html</a>).<br />
<br />
<pre>git_diff() {
cd $repo_url/$1
git_repo_url=`git remote show origin | grep "Fetch URL" | cut -d ' ' -f 5-`
echo "GIT Diff $1 ($2) - $git_repo_url"
git fetch
git diff --color HEAD origin/HEAD | $ansi2html > $diff_dir/$2.html
git diff HEAD origin/HEAD > $diff_dir/$2.diff
git merge origin/HEAD
}
hg_diff() {
cd $repo_url/$1
hg_repo_url=`hg showconfig | grep paths\.default | cut -d '=' -f 2-`
echo "HG Diff $1 ($2) - $hg_repo_url"
hg incoming --patch --git | $ansi2html > $diff_dir/$2.html
hg incoming --patch --git > $diff_dir/$2.diff
hg pull -u
}
svn_diff() {
cd $repo_url/$1
svn_repo_url=`svn info | grep URL | cut -d ' ' -f 2-`
svn_repo_rev=`svn info | grep "Last Changed Rev" | cut -d ' ' -f 4-`
echo "SVN Diff $1 ($2) - $svn_repo_url"
svn di $svn_repo_url -r$svn_repo_rev | $ansi2html > $diff_dir/$2.html
svn di $svn_repo_url -r$svn_repo_rev > $diff_dir/$2.diff
svn up
}
# Fetch my repos (xxx_diff repo_path diff_name)
git_diff "linux/linux-2.6" "linux-2.6"
svn_diff "apache/lucene" "lucene"
hg_diff "java/jdk" "hotspot-jdk7"
</pre><br />
After running the repo-diff script, which updates your favorite repositories and saves the diff files, you can send them using the send-mail script.<br />
<br />
<pre>diff_dir="~/.repo-diffs"
mail_address="th30z@localhost"
for html_file in `ls -1 $diff_dir/*.html` ; do
repo_name=`basename $html_file | sed 's/\.html$//g'`
diff_file=`echo $html_file | sed 's/\.html$/\.diff/g'`
boundary="==`echo $repo_name | md5sum | cut -d ' ' -f -1`"
alt_boundary="==`echo $boundary | md5sum | cut -d ' ' -f -1`"
echo "Send Repo Diff $repo_name - $html_file"
(
echo "MIME-Version: 1.0"
echo "Subject: Repo-Diff: $repo_name"
echo "To: $mail_address"
echo "Content-Type: multipart/mixed; boundary=$boundary"
echo "--$boundary"
echo "Content-Type: multipart/alternative; boundary=$alt_boundary"
echo
echo "--$alt_boundary"
echo "Content-Type: text/plain"
echo
cat $diff_file
echo "--$alt_boundary"
echo "Content-Type: text/html"
echo
cat $html_file
echo
echo "--$alt_boundary--"
echo "--$boundary"
echo "Content-Type: Application/Binary_Attachment;
name=\"`basename $diff_file`\""
echo "Content-Disposition: attachment;
filename=\"`basename $diff_file`\""
echo "Content-Transfer-Encoding: uuencode"
echo
uuencode $diff_file $diff_file
) | sendmail $mail_address
done
</pre><br />
This script, for each file generated by the repo-diff script, sends you a mail with the diff as body and as attachment.<br />
<br />
Scripts are available on my <a href="https://github.com/matteobertozzi/blog-code.git">github repository</a> under blog-code/repo-mail-diff.<br />
<div style="text-align: center;"><a href="https://github.com/matteobertozzi/blog-code/tree/master/repo-mail-diff">https://github.com/matteobertozzi/blog-code.git</a></div>Matteo Bertozzihttp://www.blogger.com/profile/04630649852079843031noreply@blogger.com0tag:blogger.com,1999:blog-8136012147929167933.post-26949935241074864742010-12-29T11:00:00.000-08:002010-12-29T23:52:27.568-08:00Disk Failure, Raid Corruption and CRCsThis week is started with a raid corruption due to a disk failure. Disk has continued to work without notifying anyone about the corruption, but this is somehow acceptable. The inacceptable thing is that the file-system is not aware about its data state. The "Old File-Systems" are not interested in user data and even with journaling, user data is not a priority.<br />
<br />
<i>As a spare-time file-system developer, it's really fun to say "With my file-system, this failure would not have happened!"</i><br />
<br />
<b><span style="font-size: small;">Simple Solution: B-Tree and CRCs</span></b><br />
A really simple way to solve this problem is adding a CRC to the user data or, better (from the file-system point of view), adding a CRC to every block. If the file-system is entirely based on a B-Tree (when I say B-Tree, I mean B*Tree or B+Tree) you can simply store the CRC of the block in the node header, and the CRC of each child in the block pointers.<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgQSV8KLGIMoz_EnhH5jDBltp_tewWtUTIbzAqxVcRoPqUqfbwpqEPzGDOlB4vtJCSJVMLZ6NTvuShpFh119lTrCPz-yzSlR6CIOrehtZ-wtUKMkCKpoJ8Zj5WibXv_jaaxj_W_YEzA2_Y/s1600/BTreeCRC.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgQSV8KLGIMoz_EnhH5jDBltp_tewWtUTIbzAqxVcRoPqUqfbwpqEPzGDOlB4vtJCSJVMLZ6NTvuShpFh119lTrCPz-yzSlR6CIOrehtZ-wtUKMkCKpoJ8Zj5WibXv_jaaxj_W_YEzA2_Y/s1600/BTreeCRC.jpg" /></a></div><br />
Excluding for a moment the super-block and other things... You start by reading the root of the tree (1st block); if the node CRC check fails, there's something wrong with your disk read (or maybe your memory, but this is less probable). When you start traversing the tree, the check is even better and more "secure".<br />
<br />
<i>Checking the CRC on the block read is good, but having the CRC stored in a different location and read at a different time is even better.</i><br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8ZiGh7onu_0NNuwesIAttrXWyi6brzkKYQQ1q7Om1XkkU-ZI7wAvAiMbiu-puusPBlvHOD6c4IdcEi3m4T7E1spZP54F9Rb4r4VWADJT1XNPeUhTh3l_-bsJfugLSxFGaV9xiU0UYL2A/s1600/BTreeCRCExtents.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8ZiGh7onu_0NNuwesIAttrXWyi6brzkKYQQ1q7Om1XkkU-ZI7wAvAiMbiu-puusPBlvHOD6c4IdcEi3m4T7E1spZP54F9Rb4r4VWADJT1XNPeUhTh3l_-bsJfugLSxFGaV9xiU0UYL2A/s1600/BTreeCRCExtents.jpg" /></a></div>Storing CRC of the child node within the pointer gives you the ability to do a double integrity check. When data becomes larger and cannot be stored in a tree node, you can do the same thing with the extents (Pointers and CRC).<br />
<br />
<i>Another maybe more "intrusive" solution is storing a CRC for every N bytes of data, but I think that this more acceptable for a user-space "crc-fs" implementation (This approach is implemented in Hadoop's ChecksumFileSystem class).</i><br />
<br />
Last night I implemented a quick and dirty B+Tree; it's not tuned for speed but for didactic value. It has a couple of nice features, like pointers with CRCs and variable-length nodes that make it possible to compress nodes on disk. You can find the source code on my <a href="https://github.com/matteobertozzi/carthage">github repository</a> (<a href="https://github.com/matteobertozzi/carthage/tree/master/btree">B+Tree Code</a>).Matteo Bertozzihttp://www.blogger.com/profile/04630649852079843031noreply@blogger.com0tag:blogger.com,1999:blog-8136012147929167933.post-11749666057297104192010-11-06T10:22:00.000-07:002010-11-06T10:22:25.030-07:00Cocoa: Black Window Titlebar<div class="separator" style="clear: both; text-align: left;">Starting with QuickTime X and now with FaceTime, Apple has introduced a new black UI titlebar. There's no NS component for it at the moment, nor any NSWindow flag, so I decided to spend ten minutes writing my own black titlebar.</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjCpHV8jxw6RyF_5jb6P52EW3kWbdm7Tkkx2kNJxj8jJNL_g5d3Bq-uFYt8bn1IzOiQKvVkkJnmkFr8wJn9deheAFfagFnuZ-xEnQKoyC5HXBiFjV-RxhnEhQVhneT3wNxm82gmhAucRgM/s1600/MacBlackWindowTitleBar.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjCpHV8jxw6RyF_5jb6P52EW3kWbdm7Tkkx2kNJxj8jJNL_g5d3Bq-uFYt8bn1IzOiQKvVkkJnmkFr8wJn9deheAFfagFnuZ-xEnQKoyC5HXBiFjV-RxhnEhQVhneT3wNxm82gmhAucRgM/s1600/MacBlackWindowTitleBar.png" /></a></div>I've created an NSWindow subclass (<a href="https://github.com/matteobertozzi/blog-code/blob/master/MacBlackWindow/BlackWindow.h">BlackWindow.h</a>/<a href="https://github.com/matteobertozzi/blog-code/blob/master/MacBlackWindow/BlackWindow.m">BlackWindow.m</a>) that initializes a window without borders (styleMask NSBorderlessWindowMask), and an NSView that draws the titlebar (<a href="https://github.com/matteobertozzi/blog-code/blob/master/MacBlackWindow/BlackWindowTitleView.h">BlackWindowTitleView.h</a>/<a href="https://github.com/matteobertozzi/blog-code/blob/master/MacBlackWindow/BlackWindowTitleView.m">BlackWindowTitleView.m</a>).<br />
The titlebar redraws itself when changes are applied to the window: with key-value observing for the title, the document-edited state, ... and NSNotificationCenter for the key-window notifications, the titlebar knows what to do.<br />
<br />
Source code can be found on my <a href="https://github.com/matteobertozzi/">GitHub</a> under <a href="https://github.com/matteobertozzi/blog-code/tree/master/MacBlackWindow/">blog-code/MacBlackWindow</a>.<br />
<div style="text-align: center;">git clone https://github.com/matteobertozzi/blog-code.git</div>Matteo Bertozzihttp://www.blogger.com/profile/04630649852079843031noreply@blogger.com3tag:blogger.com,1999:blog-8136012147929167933.post-10411634176022565262010-10-15T22:07:00.000-07:002010-10-15T22:36:26.320-07:00Python: Inverted Index for dummiesAn Inverted Index is an index data structure storing a mapping from content, such as words or numbers, to its document locations and is generally used to allow fast full text searches.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjw_2v58mAahN7d9z_4iZeyPAD2U1bhtpPz1AJZ9wA2vp2j_6W5Ytc1g-nQ0xH278SzMbRVuatsKwuFLRJZsl3T9oQ-nsOrQb8bPC0lxtQJ7-LgmwPycJogRt8it3PE4-kiosRG4MtQKvY/s1600/invertedIndex.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjw_2v58mAahN7d9z_4iZeyPAD2U1bhtpPz1AJZ9wA2vp2j_6W5Ytc1g-nQ0xH278SzMbRVuatsKwuFLRJZsl3T9oQ-nsOrQb8bPC0lxtQJ7-LgmwPycJogRt8it3PE4-kiosRG4MtQKvY/s1600/invertedIndex.jpg" /></a></div><br />
<br />
The first step of inverted index creation is document processing, in our case word_index(), which consists of word_split(), normalization and the removal of stop words ("the", "then", "that"...).<br />
<pre>def word_split(text):
word_list = []
wcurrent = []
windex = None
for i, c in enumerate(text):
if c.isalnum():
wcurrent.append(c)
windex = i
elif wcurrent:
word = u''.join(wcurrent)
word_list.append((windex - len(word) + 1, word))
wcurrent = []
if wcurrent:
word = u''.join(wcurrent)
word_list.append((windex - len(word) + 1, word))
return word_list
</pre>word_split() is quite a long function that does a really simple job: splitting words. You could rewrite it in just one line using something like re.split('\W+', text).<br />
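Note that re.split() also throws away the word positions, which the index needs; a one-liner that keeps them could use re.finditer() instead:<br />
<pre>import re

def word_split_re(text):
    return [(m.start(), m.group()) for m in re.finditer(r'\w+', text)]
</pre>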
<pre>def words_cleanup(words):
cleaned_words = []
for index, word in words:
if len(word) < _WORD_MIN_LENGTH or word in _STOP_WORDS:
continue
cleaned_words.append((index, word))
return cleaned_words
def words_normalize(words):
normalized_words = []
for index, word in words:
wnormalized = word.lower()
normalized_words.append((index, wnormalized))
return normalized_words
</pre>Cleanup and normalize are just two filter functions to apply after word_split(). In this case cleanup removes the stop words (a frozenset of strings) and the words that are too short, and normalize converts each word to lower case. But you can do more, like removing accents, transforming plurals to singular, or something else. <br />
<pre>def word_index(text):
words = word_split(text)
words = words_normalize(words)
words = words_cleanup(words)
return words
</pre>word_index() is just a helper: it takes an input text and does all the word splitting/normalization job; the result is a list of tuples containing position and word, e.g. [(1, u'niners'), (13, u'coach')].<br />
<pre>def inverted_index(text):
inverted = {}
for index, word in word_index(text):
locations = inverted.setdefault(word, [])
locations.append(index)
return inverted
</pre>Finally we have our inverted_index() method that takes a text as input and returns a dictionary with words as keys and locations (positions of the words in the document) as values.<br />
<pre>def inverted_index_add(inverted, doc_id, doc_index):
for word, locations in doc_index.iteritems():
indices = inverted.setdefault(word, {})
indices[doc_id] = locations
return inverted
</pre>The previous method, inverted_index(), returns a dictionary with just the information for the specified document, so inverted_index_add() adds a document's inverted index to a multi-document inverted index. Here the words are the keys of the dictionary and the values are dictionaries with doc_id as key and the document locations as values, e.g. {'week': {'doc2': [149], 'doc1': [179, 187]}}.<br />
<pre>def search(inverted, query):
words = [word for _, word in word_index(query) if word in inverted]
results = [set(inverted[word].keys()) for word in words]
return reduce(lambda x, y: x & y, results) if results else []
</pre>Now that we have the inverted index, we're able to run queries on it. The function above takes a multi-document inverted index and a query, and returns the set of documents that contain all the searched words.<br />
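Putting the pieces together, here is a quick usage sketch (the document texts are made up, and it assumes the stop-word list doesn't swallow any of the query terms):<br />
<pre>inverted = {}
documents = {
    'doc1': 'The Niners coach praised the team this week',
    'doc2': 'Another week, another win for the Niners',
}
for doc_id, text in documents.items():
    inverted_index_add(inverted, doc_id, inverted_index(text))

print(search(inverted, 'niners week'))   # -> set(['doc1', 'doc2'])
print(search(inverted, 'coach'))         # -> set(['doc1'])
</pre>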
<br />
<i>Obviously, to make a serious search function you need to add ranking, phrase matching, stemming and other full-text search features, and you need to store your dictionary in an on-disk B-tree. But this is a basic example of how to build an inverted index.</i><br />
<br />
Source code can be found on my GitHub repository under <a href="http://github.com/matteobertozzi/blog-code/blob/master/py-inverted-index/invindex.py">py-inverted-index</a>:<br />
<div style="text-align: center;">git clone http://github.com/matteobertozzi/blog-code.git</div>Matteo Bertozzihttp://www.blogger.com/profile/04630649852079843031noreply@blogger.com1tag:blogger.com,1999:blog-8136012147929167933.post-50559616206177765222010-10-09T08:35:00.000-07:002010-10-09T08:35:35.824-07:00Unsigned Int of specified bit lengthFinally I've added 128+ bit support to RaleighFS Table Object Row Index. I don't need much operation on indices only inc, dec and compare, but I've implemented other couple of methods (add, and, or, xor) and now there's a mini <a href="http://github.com/matteobertozzi/carthage/tree/master/uintx/">uintX</a> library available at <a href="http://github.com/matteobertozzi">GitHub</a>.<br />
<br />
<pre>...
uint8_t index[16]; // 128bit unsigned integer
uintx_init(index, 128U); // Initialize 128bit index to zero.
uintx_inc(index, 128U); // Increment Index (Now is One)
uintx_dec(index, 128U); // Decrement Index (Now is Zero)
uintx_from_u64(index, 128U, 8192U); // Now index is 8192
uintx_add64(index, 128U, 5U); // Add 5 to index
uintx_compare(index, 128U, 8197U); // Return 0
uintx_compare(index, 128U, 9197U); // Return -1
uintx_compare(index, 128U, 0U); // Return 1
...
</pre><br />
The API is quite simple: pass your object (a uint8_t vector), its bit size (uint8_t[16] is 128 bits) and the other args needed by the method. Of course you can replace 128 with 256, 512 or anything else.<br />
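For comparison, the same fixed-width wrap-around behavior can be sketched in a couple of lines of Python, masking after every operation (a conceptual counterpart, not part of the C library):<br />
<pre>def uintx_add(value, bits, n):
    # Wrap around at 2**bits, like a fixed-width unsigned integer.
    return (value + n) & ((1 << bits) - 1)

assert uintx_add((1 << 128) - 1, 128, 1) == 0   # 128-bit overflow wraps to 0
</pre>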
<br />
The source code can be found at my <a href="http://github.com/matteobertozzi/carthage/tree/master/uintx/">GitHub</a> in the <a href="http://github.com/matteobertozzi/carthage/tree/master/uintx/">uintx</a> folder:<br />
<div style="text-align: center;">http://github.com/matteobertozzi/carthage.git</div>Matteo Bertozzihttp://www.blogger.com/profile/04630649852079843031noreply@blogger.com0tag:blogger.com,1999:blog-8136012147929167933.post-16354375371590052522010-09-25T01:50:00.000-07:002010-09-25T01:50:06.815-07:00iOS4: Core Motion ViewerI'm playing with Core Motion (Accelerometer and Gyroscope) of my new iPod Touch 4th. And I've written a simple app to look at the values of Motion Sensors. Code is not so much interesting, but it an useful app to check motion values.<div><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMBLACn42v7b8jGuPKcilJbc4IkxPQ8w6pd752CicConXfIw7lSAjxlWiGiGuZKraR1tNqIrqiehnUeSQMDSwfymZxxIfuW0hSI9DPmWUeAKmP9E2o07nHoGevypP2JpEdkxE5UFNMO4M/s1600/MotionViewer.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="213" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMBLACn42v7b8jGuPKcilJbc4IkxPQ8w6pd752CicConXfIw7lSAjxlWiGiGuZKraR1tNqIrqiehnUeSQMDSwfymZxxIfuW0hSI9DPmWUeAKmP9E2o07nHoGevypP2JpEdkxE5UFNMO4M/s320/MotionViewer.png" width="320" /></a></div><div><br />
</div><div>Source code can be found on <a href="http://github.com/matteobertozzi/blog-code">GitHub</a> at:</div><div style="text-align: center;"><a href="http://github.com/matteobertozzi/blog-code/tree/master/MotionViewer/">http://github.com/matteobertozzi/blog-code/tree/master/MotionViewer/</a></div></div>Matteo Bertozzihttp://www.blogger.com/profile/04630649852079843031noreply@blogger.com0tag:blogger.com,1999:blog-8136012147929167933.post-60941901258677226872010-09-19T01:39:00.000-07:002010-09-19T01:39:45.452-07:00Python: NetStack/CoreAsyncToday I've added to my GitHub repositories <a href="http://github.com/matteobertozzi/netstack-coreasync">NetStack/CoreAsync</a>, a Python "package" (it's more a bunch of utility classes) that allows you to code in an async/parallel way, and that I use to build my networking apps.<br />
<pre>def concurrent_func(text):
for i in range(5):
print text, 'STEP', i
yield
coreasync.dispatch_concurrent(lambda: concurrent_func("Context 1"))
coreasync.dispatch_concurrent(lambda: concurrent_func("Context 2"))
coreasync.runloop()
</pre><br />
The package also contains a small async HTTP server implementation that you can easily use:<br />
<pre>def handle_response(socket, addr, request, headers, body):
yield socket.send(...)
def handle_error(socket, addr, error):
yield socket.send(...)
coreasync.httpserver.httpServerLoop(HOST, PORT, handle_response, handle_error)
print 'HTTP Server is Running on', HOST, PORT
coreasync.runloop()
</pre><br />
You can find <a href="http://github.com/matteobertozzi/netstack-coreasync/tree/master/coreasync/">Source Code</a> and <a href="http://github.com/matteobertozzi/netstack-coreasync/tree/master/examples/">Examples</a> at GitHub:<br />
<div style="text-align: center;">git clone http://github.com/matteobertozzi/netstack-coreasync.git</div>Matteo Bertozzihttp://www.blogger.com/profile/04630649852079843031noreply@blogger.com1tag:blogger.com,1999:blog-8136012147929167933.post-87117495238298158002010-08-15T00:04:00.002-07:002010-09-17T21:55:34.159-07:00Qt4 Http Request ParserQt 4.4 introduces <a href="http://doc.trolltech.com/4.6/qnetworkrequest.html" title="Qt Network Request">QNetworkRequest</a> and <a href="http://doc.trolltech.com/4.6/qnetworkaccessmanager.html" title="Qt Network Access Manager">QNetworkAccessManager</a> to help you with your HTTP client request. But if you want parse an HTTP Request, because you're writing an HTTP server, it seems that there's nothing already done (comment here, if I've missed it).<br />
<br />
So, this morning I've written a little class that helps you parse HTTP requests:<br />
<br />
<ul><li>static HttpRequest fromData (const QByteArray& data);</li>
<li>static HttpRequest fromStream (QIODevice *buffer);</li>
</ul><br />
These are the two methods that you can use to parse your request data; there are some other methods to retrieve the headers, the method type, the URL and the body data.<br />
<br />
<em>Qt is used only for the QHash, QByteArray and QIODevice classes; you have to implement all the "socket" logic to handle your client requests and responses.</em><br />
<br />
The source code is available here: <a href="http://th30z.netsons.org/wp-content/uploads/qt-http-request.zip" title="Qt Http Request Parser Source Code">Qt4 Http Request Parser Source Code</a>.Matteo Bertozzihttp://www.blogger.com/profile/04630649852079843031noreply@blogger.com6tag:blogger.com,1999:blog-8136012147929167933.post-37054659295044611642010-08-13T23:40:00.002-07:002010-09-18T06:10:16.597-07:00Self Contained C Data Structures and Utils RepositoryI've rewritten a couple of data structures and utils (in C) that I use for my RaleighFS project, and I've released them at <a href="http://github.com/matteobertozzi/carthage">github</a> under the BSD License.<br />
<br />
At the moment there are some data structures and helper functions:<br />
<ul><li>Memory Pool - Simple Memory Allocator.</li>
<li>Memory Utils - memcpy(), memset(), memcmp(), memswap() with some variants.</li>
<li>Memory Block - copy-on-write block.</li>
<li>Hashed Containers: Hashtable, Set, Bag.</li>
<li>Queue - FIFO/Circular Queue.</li>
<li>Cache - LRU/MRU Cache.</li>
<li>WorkQueue - Lightweight pthread workqueue.</li>
<li>Sort/Merge functions.</li>
</ul><br />
Every "object" is self contained, you can grab the "hashtable.c/hashtable.h" file to be ready to use the hashtable, no other dependence is required. And if data structure require a malloc() you can specify your own allocator during object initialization.<br />
<div style="text-align: center;">git clone <a href="http://github.com/matteobertozzi/carthage" title="GIT Repository">http://github.com/matteobertozzi/carthage.git</a></div>Matteo Bertozzihttp://www.blogger.com/profile/04630649852079843031noreply@blogger.com0