2 Vector Space Ranking (6+8+4 pts)

Consider the following document collection/corpus D={D1,D2} (given as one document per line in that sequence):

D1: BettyBotter bought some Butter
D2: Then to Better Butter BettyBotter bought more Butter

Assume that the stopword list contains words that begin with a lower case letter, and that stopwords are eliminated during pre-processing. No other change is made to tokens to get terms (e.g., the words are neither stemmed nor case-folded).

For the given example, show (1) the dictionary and (2) the postings lists. Include all the relevant statistics, including the TF-IDF value as '(TF,IDF)' associated with each document id in the postings list, as detailed below.

The dictionary contains terms, their (corpus) cumulative frequency, their document frequency, and a pointer labelled Pi for the ith postings list. The dictionary terms must be in lexicographic order, and so must the document ids in the postings lists. A postings list can start with a label Pi (to denote the target ith postings list), followed by the list of document ids with the associated TF-IDF statistics.

The normalized length of a document is defined as the number of (non-stopword) term occurrences in the document. The term frequency factor (TF) is the number of occurrences of the term in a document divided by the normalized length of the document. (You can just write the two numbers separated by a '/'.) The inverse document frequency factor (IDF) is defined as the reciprocal of the number of documents that contain the term. For example, TF for the term "Buffalo" in the document "Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo" is 3/6, and the IDF for the term "Buffalo" is 1/1 because it is present in just one document (which is the entire corpus).

For concreteness, format the dictionary entries and postings lists under the headings shown below:

Dictionary:
TERM  CUMULATIVE-FREQ  DOCUMENT-FREQ  Label-Pi (for ptr)

Postings lists:
(Target) Label-Pi  DOC-ID: (TF,IDF) ... DOC-ID: (TF,IDF)

(3) Show the "relative" ranking of the documents for the query Butter, justifying it in terms of the relevance scores.
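The definitions in the question can be checked mechanically. The following is a minimal Python sketch, not an official solution, and it rests on several assumptions: tokens are split on whitespace exactly as the documents are printed above (so "BettyBotter" is treated as a single term), a stopword is any token whose first character is lower case, and the relevance score of a document is taken to be the sum of TF × IDF over the query terms, which the question does not state explicitly. The names `terms`, `build_index`, and `rank` are illustrative.

```python
from collections import Counter, defaultdict
from fractions import Fraction

# Corpus as printed in the question (assumption: "BettyBotter" is one token).
DOCS = {
    "D1": "BettyBotter bought some Butter",
    "D2": "Then to Better Butter BettyBotter bought more Butter",
}

def terms(text):
    """Whitespace-tokenize and drop stopwords (tokens starting in lower case)."""
    return [tok for tok in text.split() if not tok[0].islower()]

def build_index(docs):
    """Return (dictionary, postings) following the question's definitions."""
    counts = {doc_id: Counter(terms(text)) for doc_id, text in docs.items()}
    # Normalized length = number of non-stopword term occurrences.
    lengths = {doc_id: sum(c.values()) for doc_id, c in counts.items()}

    raw = defaultdict(list)
    for doc_id in sorted(counts):            # document ids in sorted order
        for term, n in counts[doc_id].items():
            raw[term].append((doc_id, n))

    dictionary, postings = [], {}
    for i, term in enumerate(sorted(raw), start=1):   # lexicographic terms
        plist = raw[term]
        doc_freq = len(plist)
        cum_freq = sum(n for _, n in plist)
        label = f"P{i}"
        dictionary.append((term, cum_freq, doc_freq, label))
        # Fractions are stored in lowest terms (e.g. 2/6 would become 1/3).
        postings[label] = [
            (doc_id, Fraction(n, lengths[doc_id]), Fraction(1, doc_freq))
            for doc_id, n in plist
        ]
    return dictionary, postings

def rank(query, dictionary, postings):
    """Score = sum of TF * IDF over query terms (an assumed scoring rule)."""
    labels = {term: label for term, _, _, label in dictionary}
    scores = defaultdict(Fraction)
    for term in terms(query):
        for doc_id, tf, idf in postings.get(labels.get(term), []):
            scores[doc_id] += tf * idf
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

dictionary, postings = build_index(DOCS)
for term, cum_freq, doc_freq, label in dictionary:
    print(term, cum_freq, doc_freq, label)
for label, plist in postings.items():
    entries = " ".join(
        f"{d}: ({tf.numerator}/{tf.denominator},{idf.numerator}/{idf.denominator})"
        for d, tf, idf in plist
    )
    print(label, entries)
print(rank("Butter", dictionary, postings))
```

Under these assumptions, "Butter" has TF 1/2 in D1 and 2/5 in D2 with IDF 1/2 in both, so D1 scores 1/4 and D2 scores 1/5, and D1 ranks first. The result is sensitive to tokenization: if "Betty" and "Botter" were separate terms, the normalized lengths would be 3 and 6 and the two documents would tie at TF 1/3.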
