Information Retrieval Systematic 1

Following along with Prof. Dr. Hannah Bast to go through her IR course, scheduled as

In the very first lecture, she taught the inverted index and the concept of Zipf’s law.

We start with naïve search: scanning 2000 movie scripts for keywords, where we want to learn in which movies each word occurs. An inverted index (also referred to as a postings file or inverted file) is a database index storing a mapping from content, such as words or numbers, to its locations in a table (per Wikipedia). In her words, it is stored as a map from strings (words) to arrays of ints (ids). Now comes the key part: the live demo. In a Unix environment, type vim inverted_index.py after the $ prompt.

"""
copyright
chair of data science
author:hannah bast <bast@cs.uni-freiburg.de>
"""
import re
import sys

file_name = 'example.txt'
class InvertedIndex:
    """ A very simple inverted index. """
    def __init__(self):
        """ Create an empty inverted index. """
        self.inverted_lists = {}

    def read_from_file(self, file_name):
        """ Construct index from given file.

        >>> ii = InvertedIndex()
        >>> ii.read_from_file("example.txt")
        >>> sorted(ii.inverted_lists.items())
        [('document', [1, 2, 3]), ('first', [1]), ('second', [2]), ('third', [3])]
        """
        record_id = 0
        with open(file_name) as file:
            for line in file:
                record_id += 1
                words = re.split("[^a-zA-Z]+", line)
                for word in words:
                    if len(word) > 0:
                        word = word.lower()
                        if word not in self.inverted_lists:
                            self.inverted_lists[word] = []
                        self.inverted_lists[word].append(record_id)

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python3 inverted_index.py <file name>")
        sys.exit(1)
    file_name = sys.argv[1]

    ii = InvertedIndex()
    ii.read_from_file(file_name)
    for word, inverted_list in ii.inverted_lists.items():
        print("%s\t%d" % (word, len(inverted_list)))

# Run from the shell, after the $ prompt: python3 inverted_index.py movies.txt
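
The script prints each word with the length of its inverted list, which is the raw material for looking at Zipf’s law: the frequency of the r-th most frequent word is roughly proportional to 1/r, so rank times frequency should stay in the same ballpark. Here is a minimal sketch of how one might rank that output. The function name rank_frequency and the toy index are my own illustration, not code from the lecture, and the numbers are made up rather than taken from the movie data.

```python
def rank_frequency(inverted_lists):
    """Return (rank, word, frequency) tuples, most frequent word first."""
    # Frequency here means the length of the inverted list (number of postings).
    counts = [(word, len(ids)) for word, ids in inverted_lists.items()]
    # Sort by descending frequency; break ties alphabetically for stable output.
    counts.sort(key=lambda pair: (-pair[1], pair[0]))
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(counts, start=1)]


# Hypothetical toy index (assumption for illustration, not real movie data).
toy_index = {
    "the": [1, 2, 3, 4, 5, 6],
    "movie": [1, 3, 5],
    "index": [2, 4],
    "zipf": [6],
}

for rank, word, freq in rank_frequency(toy_index):
    # Under Zipf's law, rank * freq stays roughly constant across ranks.
    print("%d\t%s\t%d\t%d" % (rank, word, freq, rank * freq))
```

To try it on the real data, one could pass ii.inverted_lists from the script above instead of toy_index.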
