Friday, October 15, 2010

Python: Inverted Index for dummies

An Inverted Index is an index data structure storing a mapping from content, such as words or numbers, to its document locations and is generally used to allow fast full text searches.



The first step of Inverted Index creation is Document Processing In our case is word_index() that consist of word_split(), normalization and the deletion of stop words ("the", "then", "that"...).
def word_split(text):
    word_list = []
    wcurrent = []
    windex = None

    for i, c in enumerate(text):
        if c.isalnum():
            wcurrent.append(c)
            windex = i
        elif wcurrent:
            word = u''.join(wcurrent)
            word_list.append((windex - len(word) + 1, word))
            wcurrent = []

    if wcurrent:
        word = u''.join(wcurrent)
        word_list.append((windex - len(word) + 1, word))

    return word_list
word_split() is quite a long function that does a really simple job split words. You can rewrite it with just one line using something like re.split('\W+', text).
def words_cleanup(words):
    cleaned_words = []
    for index, word in words:
        if len(word) < _WORD_MIN_LENGTH or word in _STOP_WORDS:
            continue
        cleaned_words.append((index, word))
    return cleaned_words

def words_normalize(words):
    normalized_words = []
    for index, word in words:
        wnormalized = word.lower()
        normalized_words.append((index, wnormalized))
    return normalized_words
Cleanup and Normalize are just to function filters to apply after word_split(). In this case cleanup remove the stopwords (frozenset of strings) and normalize convert word in lower case. But you can do something more like removing accents, transform to singular or something else.
def word_index(text):
    words = word_split(text)
    words = words_normalize(words)
    words = words_cleanup(words)
    return words
word_index() is just an helper, take an input text and does all the word splitting/normalization job and the result is a list of tuple that contains position and word. [(1, u'niners'), (13, u'coach')].
def inverted_index(text):
    inverted = {}

    for index, word in word_index(text):
        locations = inverted.setdefault(word, [])
        locations.append(index)

    return inverted
Finally we've our invertex_index() method that take a text as input and returns a dictionary with words as keys and locations (position of the words in the document) as values.
def inverted_index_add(inverted, doc_id, doc_index):
    for word, locations in doc_index.iteritems():
        indices = inverted.setdefault(word, {})
        indices[doc_id] = locations
    return inverted
The Previous method, inverted_index(), returns a dictionary with just the information for the specified document. So inverted_index_add() add Document's inverted index to a Multi-Document Inverted Index. Here we've words that are keys of the dictionary and values are dictionary with doc_id as key and document location as values. {'week':{'doc2': [149], 'doc1': [179, 187]}}.
def search(inverted, query):
    words = [word for _, word in word_index(query) if word in inverted]
    results = [set(inverted[word].keys()) for word in words]
    return reduce(lambda x, y: x & y, results) if results else []
Now that we've the inverted index, we're able to do queries on it. The function above takes a Multi-Document Inverted index and a query, and returns the set of documents that contains all the words that you've searched.

Obviously to make a serious search function you need to add ranking, phrase matches, stemming and other features of full-text search. And you need to write your dictionary using an on disk btree. But this is a basis example, of how to build an inverted index.

Source code can be found on my GitHub repository under py-inverted-index:
git clone http://github.com/matteobertozzi/blog-code.git

1 comment: