Friday, October 15, 2010

Python: Inverted Index for dummies

An Inverted Index is an index data structure storing a mapping from content, such as words or numbers, to its document locations and is generally used to allow fast full text searches.



The first step of Inverted Index creation is Document Processing In our case is word_index() that consist of word_split(), normalization and the deletion of stop words ("the", "then", "that"...).
def word_split(text):
    word_list = []
    wcurrent = []
    windex = None

    for i, c in enumerate(text):
        if c.isalnum():
            wcurrent.append(c)
            windex = i
        elif wcurrent:
            word = u''.join(wcurrent)
            word_list.append((windex - len(word) + 1, word))
            wcurrent = []

    if wcurrent:
        word = u''.join(wcurrent)
        word_list.append((windex - len(word) + 1, word))

    return word_list
word_split() is quite a long function that does a really simple job split words. You can rewrite it with just one line using something like re.split('\W+', text).
def words_cleanup(words):
    cleaned_words = []
    for index, word in words:
        if len(word) < _WORD_MIN_LENGTH or word in _STOP_WORDS:
            continue
        cleaned_words.append((index, word))
    return cleaned_words

def words_normalize(words):
    normalized_words = []
    for index, word in words:
        wnormalized = word.lower()
        normalized_words.append((index, wnormalized))
    return normalized_words
Cleanup and Normalize are just to function filters to apply after word_split(). In this case cleanup remove the stopwords (frozenset of strings) and normalize convert word in lower case. But you can do something more like removing accents, transform to singular or something else.
def word_index(text):
    words = word_split(text)
    words = words_normalize(words)
    words = words_cleanup(words)
    return words
word_index() is just an helper, take an input text and does all the word splitting/normalization job and the result is a list of tuple that contains position and word. [(1, u'niners'), (13, u'coach')].
def inverted_index(text):
    inverted = {}

    for index, word in word_index(text):
        locations = inverted.setdefault(word, [])
        locations.append(index)

    return inverted
Finally we've our invertex_index() method that take a text as input and returns a dictionary with words as keys and locations (position of the words in the document) as values.
def inverted_index_add(inverted, doc_id, doc_index):
    for word, locations in doc_index.iteritems():
        indices = inverted.setdefault(word, {})
        indices[doc_id] = locations
    return inverted
The Previous method, inverted_index(), returns a dictionary with just the information for the specified document. So inverted_index_add() add Document's inverted index to a Multi-Document Inverted Index. Here we've words that are keys of the dictionary and values are dictionary with doc_id as key and document location as values. {'week':{'doc2': [149], 'doc1': [179, 187]}}.
def search(inverted, query):
    words = [word for _, word in word_index(query) if word in inverted]
    results = [set(inverted[word].keys()) for word in words]
    return reduce(lambda x, y: x & y, results) if results else []
Now that we've the inverted index, we're able to do queries on it. The function above takes a Multi-Document Inverted index and a query, and returns the set of documents that contains all the words that you've searched.

Obviously to make a serious search function you need to add ranking, phrase matches, stemming and other features of full-text search. And you need to write your dictionary using an on disk btree. But this is a basis example, of how to build an inverted index.

Source code can be found on my GitHub repository under py-inverted-index:
git clone http://github.com/matteobertozzi/blog-code.git

Saturday, October 9, 2010

Unsigned Int of specified bit length

Finally I've added 128+ bit support to RaleighFS Table Object Row Index. I don't need much operation on indices only inc, dec and compare, but I've implemented other couple of methods (add, and, or, xor) and now there's a mini uintX library available at GitHub.

...
uint8_t index[16];    // 128bit unsigned integer

uintx_init(index, 128U);   // Initialize 128bit index to zero.

uintx_inc(index, 128U);    // Increment Index (Now is One)
uintx_dec(index, 128U);    // Decrement Index (Now is Zero)

uintx_from_u64(index, 128U, 8192U);    // Now index is 8192

uintx_add64(index, 128U, 5U);          // Add 5 to index

uintx_compare(index, 128U, 8197U);     // Return 0
uintx_compare(index, 128U, 9197U);     // Return -1
uintx_compare(index, 128U, 0U);        // Return 1
...

The API is quite simple pass your object (uint8_t vector) and its bit size (uint8_t[16] is 128bit) and other args needed to the method. Of course you can replace 128 with 256, 512 or everything else.

The source code can be found at my GitHub in the uintx folder:
http://github.com/matteobertozzi/carthage.git

Saturday, September 25, 2010

iOS4: Core Motion Viewer

I'm playing with Core Motion (Accelerometer and Gyroscope) of my new iPod Touch 4th. And I've written a simple app to look at the values of Motion Sensors. Code is not so much interesting, but it an useful app to check motion values.

Sunday, September 19, 2010

Python: NetStack/CoreAsync

Today I've added to my GitHub Respositories NetStack/CoreAsync a python "package" (It's much more a bunch of utility classes) that allows you to code in async/parallel way, that I use to build my networking apps.
def concurrent_func(text):
    for i in range(5):
        print text, 'STEP', i
        yield

coreasync.dispatch_concurrent(lambda: concurrent_func("Context 1"))
coreasync.dispatch_concurrent(lambda: concurrent_func("Context 2"))
coreasync.runloop()

Package contains a small Async HTTP Server implementation that you can easily use:
def handle_response(socket, addr, request, headers, body):
   yield socket.send(...)

def handle_error(socket, addr, error):
   yield socket.send(...)

coreasync.httpserver.httpServerLoop(HOST, PORT, handle_response, handle_error)
print 'HTTP Server is Running on', HOST, PORT
coreasync.runloop()

You can find Source Code and Examples at GitHub:
git clone http://github.com/matteobertozzi/netstack-coreasync.git

Sunday, August 15, 2010

Qt4 Http Request Parser

Qt 4.4 introduces QNetworkRequest and QNetworkAccessManager to help you with your HTTP client request. But if you want parse an HTTP Request, because you're writing an HTTP server, it seems that there's nothing already done (comment here, if I've missed it).

So, this morning I've written a little class that help you with HTTP Request parse:

  • static HttpRequest fromData (const QByteArray& data);
  • static HttpRequest fromStream (QIODevice *buffer);

There's a couple of method that you can use to parse your request data, and some others method to retrieve headers, method type, url and body data.

Qt is used only for QHash, QByteArray and QIODevice classes, you've to implement all the "socket" logic, to handle your client request and response.

The Source code is available here: Qt4 Http Request Parser Source Code.

Friday, August 13, 2010

Self Contained C Data Structures and Utils Repository

I've rewritten a couple of data structure and utils (C lang) that I use for my RaleighFS Project, and I've released it at github under BSD License.

At the moment there are some data structures and helper functions:
  • Memory Pool - Simple Memory Allocator.
  • Memory Utils - memcpy(), memset(), memcmp(), memswap() with some variants.
  • Memory Block - copy-on-write block.
  • Hashed Containers: Hashtable, Set, Bag.
  • Queue - FIFO/Circular Queue.
  • Cache - LRU/MRU Cache.
  • WorkQueue - Lightweight pthread workqueue.
  • Sort/Merge functions.

Every "object" is self contained, you can grab the "hashtable.c/hashtable.h" file to be ready to use the hashtable, no other dependence is required. And if data structure require a malloc() you can specify your own allocator during object initialization.

Saturday, July 24, 2010

iPhone/Qt: UDP Voice using AudioQueue/QAudioOutput

Qt 4.6 Introduces QtMultimedia with support for raw audio input/output, while Mac OS X and iPhone OS has well known CoreAudio/AudioQueue from a long time.

This example show how to implement a really rudimentary VOIP between an iOS (iPhone/iPod Touch/iPad) and a Desktop (using Qt). To simplify the example is just unidirectional iOS to Qt, because doing Qt to iOS part is just the specular code.

The idea is very simple, Fetch the sample data from the microphone, and send over UDP. No compression no packet number just send raw data in simple way as possible.

On iOS we've AudioQueueNewInput() that creates a new recording audio queue object, and AudioQueueNewOutput() that creates a new playback audio queue object.

The specular Qt 4.6+ classes are QAudioInput class that provides an interface for receiving audio data from an audio input device, and QAudioOutput class that provides an interface for sending audio data to an audio output device. To handle audio samples we can easily wrap QIODevice read() and write() calls.

The source code is really easy. iOS code contains AudioRecorder.h/AudioRecorder.m that is a C code that wrap AudioQueue Input to record a stream, with a callback handler to intercept packet data. The Qt4 Source code contains just  a QIODevice Wrapper that read from UDP and prepare for QAudioOutputStream. It's really really simple, but gives you an idea.

The Source code is available here, and contains both Qt 4.6+ and iOS sources: iPhone/Qt UDP Voice Source Code.

Note: Due to a QTBUG-8878 we cannot use 8, 16, 24 KHz Frequency. Will be fixed in Qt 4.6.4.