Question:

Consider the following document collection/corpus D={D1,D2} (given as one document per line, in that sequence):

D1:-> Asterix Asterix and the Goths
D2:-> Asterix and Cleopatra

Assume that the stopword list contains the words "the" and "and", which are eliminated during pre-processing. No other change is made to tokens to get terms (that is, the words are neither stemmed nor case folded).

For the given example, show (1) the dictionary and (2) the postings lists. Include all the relevant statistics, that is, the TF-IDF values as '(TF,IDF)' with each document id in the postings list, which are necessary for implementing the inverted index structure for Vector Space Model-based ranked retrieval.

The dictionary contains the terms, their (corpus) cumulative frequency, their document frequency, and a pointer to the postings list. The pointer can be just a label Pi for the ith pointer in your answer. Your ith postings list can begin with Pi, followed by the list of document ids with the associated TF-IDF statistics. The dictionary terms are in lexicographic order, and so are the document ids in the postings lists.

The normalized length of a document is defined as the number of (non-stopword) term occurrences in the document. The term frequency factor (TF) is the number of occurrences of the term in a document divided by the normalized length of the document. The inverse document frequency factor (IDF) is defined as the reciprocal of the number of documents that contain the term. For example, the TF for the term "Buffalo" in the document "Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo" is 3/8, and the IDF for the term "Buffalo" is 1/1 because it is present in just one document (which is the entire corpus).

For concreteness, format the dictionary entries and postings lists as:

(3) Show the "relative" ranking of the documents for the query Asterix, justifying it in terms of the relevance scores.
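The definitions above can be checked mechanically. The following is a minimal Python sketch, not an official solution: the names `build_index` and `rank` are hypothetical, tokens are split on whitespace, and the relevance score for a single-term query is assumed to be the product TF × IDF, which the question does not state explicitly. Exact fractions are used so the values match the '(TF,IDF)' notation.

```python
from collections import Counter
from fractions import Fraction

STOPWORDS = {"the", "and"}
DOCS = {1: "Asterix Asterix and the Goths", 2: "Asterix and Cleopatra"}


def build_index(docs, stopwords):
    """Return (dictionary, postings) following the question's definitions."""
    # Per-document term counts after stopword removal (no stemming, no case folding).
    counts = {
        doc_id: Counter(t for t in text.split() if t not in stopwords)
        for doc_id, text in docs.items()
    }
    # Dictionary terms in lexicographic order.
    vocabulary = sorted({term for c in counts.values() for term in c})

    dictionary = {}  # term -> (cumulative frequency, document frequency, pointer label)
    postings = {}    # pointer label -> [(doc id, TF, IDF), ...] sorted by doc id
    for i, term in enumerate(vocabulary, start=1):
        doc_ids = sorted(d for d in counts if counts[d][term] > 0)
        cf = sum(counts[d][term] for d in doc_ids)
        df = len(doc_ids)
        label = f"P{i}"
        dictionary[term] = (cf, df, label)
        idf = Fraction(1, df)  # IDF = 1 / (number of documents containing the term)
        postings[label] = [
            # TF = occurrences in the document / normalized document length
            (d, Fraction(counts[d][term], sum(counts[d].values())), idf)
            for d in doc_ids
        ]
    return dictionary, postings


def rank(query_term, dictionary, postings):
    """Rank documents for a one-term query by TF * IDF (an assumed scoring rule)."""
    if query_term not in dictionary:
        return []
    label = dictionary[query_term][2]
    scores = [(d, tf * idf) for d, tf, idf in postings[label]]
    # Highest score first; ties broken by document id.
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    dictionary, postings = build_index(DOCS, STOPWORDS)
    for term, (cf, df, label) in dictionary.items():
        print(term, cf, df, label, postings[label])
    print(rank("Asterix", dictionary, postings))
```

Under these assumptions, the dictionary is Asterix (cumulative frequency 3, document frequency 2, P1), Cleopatra (1, 1, P2), and Goths (1, 1, P3). The postings lists are P1: D1 (2/3, 1/2), D2 (1/2, 1/2); P2: D2 (1/2, 1/1); P3: D1 (1/3, 1/1). For the query Asterix, D1 scores 2/3 × 1/2 = 1/3 and D2 scores 1/2 × 1/2 = 1/4, so D1 ranks above D2. Because the IDF is the same for both documents, the ranking is decided by TF alone.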
