Question: 2 Vector Space Ranking (6+8+4 pts)

Consider the following document collection/corpus D={D1,D2} (given as one document per line in that sequence):

D1: BettyBotter bought some Butter
D2: Then to Better Butter BettyBotter bought more Butter

Assume that the stopword list contains words that begin with a lower case letter, and that stopwords are eliminated during pre-processing. No other change is made to tokens to get terms (e.g., the words are neither stemmed nor case-folded).

For the given example, show (1) the dictionary and (2) the postings lists. Include all the relevant statistics, including the TF-IDF value as '(TF,IDF)' associated with each document id in the postings list, as detailed below.

The dictionary contains terms, their (corpus) cumulative frequency, their document frequency, and a pointer labelled Pi for the ith postings list. The dictionary terms must be in lexicographic order, and so must the document ids in the postings lists. A postings list can start with a label Pi (to denote the target ith postings list), followed by the list of document ids with the associated TF-IDF statistics.

The normalized length of a document is defined as the number of (non-stopword) term occurrences in the document. The term frequency factor (TF) is the number of term occurrences in a document divided by the normalized length of the document. (You can just write the two numbers separated by a '/'.) The inverse document frequency factor (IDF) is defined as the reciprocal of the number of documents that contain the term. For example, TF for the term "Buffalo" in the document "Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo" is 3/6, and the IDF for the term "Buffalo" is 1/1 because it is present in just one document (which is the entire corpus).

For concreteness, format the dictionary entries and postings lists under the headings shown below:

Dictionary:
TERM CUMULATIVE-FREQ DOCUMENT-FREQ Label-Pi (for ptr)

Postings lists:
(Target) Label-Pi DOC-ID: (TF,IDF) ... DOC-ID: (TF,IDF)

(3) Show the "relative" ranking of the documents for the query Butter, justifying it in terms of the relevance scores.
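The definitions in the question can be checked mechanically. The following is a minimal Python sketch, not an official solution, and it rests on a few assumptions: tokens are split on whitespace, so "BettyBotter" is treated as a single term exactly as it is written in the question; the relevance score for a single-term query is taken to be TF × IDF; and the names `terms`, `build_index`, and `rank` are hypothetical helpers introduced only for illustration.

```python
from collections import Counter
from fractions import Fraction

# Corpus as written in the question (assumption: whitespace tokenization,
# so "BettyBotter" is one token).
DOCS = {
    "D1": "BettyBotter bought some Butter",
    "D2": "Then to Better Butter BettyBotter bought more Butter",
}


def terms(text):
    """Drop stopwords, i.e. any token that begins with a lower-case letter."""
    return [tok for tok in text.split() if not tok[0].islower()]


def build_index(docs):
    """Return (dictionary, postings).

    dictionary: term -> (cumulative_freq, document_freq), terms in sorted order
    postings:   term -> [(doc_id, TF, IDF), ...], doc ids in sorted order
    """
    counts = {doc_id: Counter(terms(text)) for doc_id, text in docs.items()}
    # Normalized length = number of non-stopword term occurrences.
    lengths = {doc_id: sum(c.values()) for doc_id, c in counts.items()}
    vocab = sorted({t for c in counts.values() for t in c})

    dictionary, postings = {}, {}
    for term in vocab:
        holders = sorted(d for d in counts if term in counts[d])
        df = len(holders)
        cf = sum(counts[d][term] for d in holders)
        dictionary[term] = (cf, df)
        # Fraction reduces automatically (e.g. 2/4 is stored as 1/2).
        postings[term] = [
            (d, Fraction(counts[d][term], lengths[d]), Fraction(1, df))
            for d in holders
        ]
    return dictionary, postings


def rank(postings, query_term):
    """Assumed score: TF * IDF for a single-term query, highest first."""
    scored = [(d, tf * idf) for d, tf, idf in postings.get(query_term, [])]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


dictionary, postings = build_index(DOCS)
ranking = rank(postings, "Butter")

for i, (term, (cf, df)) in enumerate(dictionary.items(), start=1):
    print(term, cf, df, f"P{i}")
for i, (term, plist) in enumerate(postings.items(), start=1):
    print(f"P{i}", " ".join(f"{d}: ({tf},{idf})" for d, tf, idf in plist))
print(ranking)
```

Under these assumptions, the postings list for Butter holds D1 with (1/2, 1/2) and D2 with (2/5, 1/2), so the assumed scores are 1/4 for D1 and 1/5 for D2, and D1 ranks above D2. If "BettyBotter" were instead tokenized as two words, the normalized lengths and therefore the scores would change.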
