Following along with Prof. Dr. Hannah Bast to go through IR (information retrieval), scheduled as

In the very first lecture, she introduced the inverted index and the concept of Zipf’s law.
We start with a naïve keyword search over 2000 movie scripts: for each word, we want to know in which movies it occurs. An inverted index (also referred to as a postings file or inverted file) is a database index storing a mapping from content, such as words or numbers, to its locations in a table (per Wikipedia). In her words, it is stored as a map from strings (words) to arrays of ints (ids). Now comes the key part: the live demo. In a Unix environment, at the $ prompt, type vim inverted_index.py.
"""
copyright
chair of data science
author: hannah bast <bast@cs.uni-freiburg.de>
"""

import re
import sys

# Default file name; overridden by the command-line argument below.
file_name = 'example.txt'


class InvertedIndex:
    """ A very simple inverted index. """

    def __init__(self):
        """ Create an empty inverted index. """
        # Maps each word to the list of record ids where it occurs.
        self.inverted_lists = {}

    def read_from_file(self, file_name):
        """ Construct index from given file (one record per line).

        >>> ii = InvertedIndex()
        >>> ii.read_from_file("example.txt")
        >>> sorted(ii.inverted_lists.items())
        [('document', [1, 2, 3]), ('first', [1]), ('second', [2]), ('third', [3])]
        """
        record_id = 0
        with open(file_name) as file:
            for line in file:
                record_id += 1
                # Split on runs of non-letters; this can yield empty strings.
                words = re.split("[^a-zA-Z]+", line)
                for word in words:
                    if len(word) > 0:
                        word = word.lower()
                        if word not in self.inverted_lists:
                            self.inverted_lists[word] = []
                        self.inverted_lists[word].append(record_id)


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python3 inverted_index.py <file name>")
        sys.exit(1)
    file_name = sys.argv[1]
    ii = InvertedIndex()
    ii.read_from_file(file_name)
    # Print each word together with the length of its inverted list.
    for word, inverted_list in ii.inverted_lists.items():
        print("%s\t%d" % (word, len(inverted_list)))
# To run, at the $ prompt: python3 inverted_index.py movies.txt
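
To connect the word counts printed above with Zipf's law (roughly: the n-th most frequent word occurs about 1/n times as often as the most frequent one), here is a minimal sketch that builds the same kind of index from an in-memory list and ranks words by the length of their inverted lists. The function names build_inverted_index and rank_by_frequency and the toy records are my own assumptions for illustration, not part of the lecture code. On a real corpus one would look at whether rank times count stays roughly constant; three toy lines are far too few to show that.

```python
import re
from collections import defaultdict


def build_inverted_index(records):
    """Map each lowercased word to the 1-based ids of the records it occurs in."""
    inverted_lists = defaultdict(list)
    for record_id, line in enumerate(records, start=1):
        # Same tokenization as in the lecture code: split on runs of non-letters.
        for word in re.split("[^a-zA-Z]+", line):
            if word:
                inverted_lists[word.lower()].append(record_id)
    return dict(inverted_lists)


def rank_by_frequency(inverted_lists):
    """Return (rank, word, count) tuples, most frequent word first.

    Ties are broken alphabetically so the order is deterministic.
    """
    ranked = sorted(inverted_lists.items(),
                    key=lambda item: (-len(item[1]), item[0]))
    return [(rank, word, len(ids))
            for rank, (word, ids) in enumerate(ranked, start=1)]


if __name__ == "__main__":
    # Hypothetical toy records standing in for the lines of movies.txt.
    records = ["First document.", "Second document.", "Third document."]
    index = build_inverted_index(records)
    # Columns: rank, word, count, rank * count.
    for rank, word, count in rank_by_frequency(index):
        print("%d\t%s\t%d\t%d" % (rank, word, count, rank * count))
```

Note that, like the code above, this counts one entry per occurrence, so a word appearing twice in the same record contributes its record id twice.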
