Question 2

a) Write code that opens the file "term_data.txt" and loads data into the following variables, in this order:

termCount = number of times the term appeared in the document

length = total word count for the document

docCount = number of documents the term appears in at least once

totalDocs = total number of documents in the collection

Hint: You will need to include the right header file to complete this question.
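A minimal sketch of part a), assuming the assignment is in C++ and that the header the hint refers to is `<fstream>`. The `TermData` struct and the `loadTermData` function are hypothetical names introduced for illustration. The question itself only names the four variables.

```cpp
#include <fstream>  // std::ifstream (assumed to be the header the hint means)
#include <string>

// Hypothetical container for the four values named in the question.
struct TermData {
    int termCount = 0;  // times the term appeared in the document
    int length = 0;     // total word count for the document
    int docCount = 0;   // documents the term appears in at least once
    int totalDocs = 0;  // total documents in the collection
};

// Reads four whitespace-separated integers, in the order the question lists.
// Returns false if the file cannot be opened or a value is missing or malformed.
bool loadTermData(const std::string& path, TermData& data) {
    std::ifstream in(path);
    if (!in) {
        return false;
    }
    in >> data.termCount >> data.length >> data.docCount >> data.totalDocs;
    return !in.fail();
}
```

A `main` function could call `loadTermData("term_data.txt", data)` and report an error when it returns false, rather than continuing with zeroed values.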

b) Continue by adding code that calculates tf, idf, and tf-idf, and prints all three to the console.
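One possible sketch of part b) in C++. The question does not define the three quantities, so this assumes the common definitions tf = termCount / length and idf = totalDocs / docCount (a raw ratio, since the logarithm is only introduced in Question 3), with tf-idf as their product. The function names are hypothetical.

```cpp
#include <iostream>

// Assumed definition: tf = termCount / length.
// The cast forces floating-point division; int / int would truncate to 0 here.
double termFrequency(int termCount, int length) {
    return static_cast<double>(termCount) / length;
}

// Assumed definition: idf = totalDocs / docCount (no logarithm yet).
double inverseDocFrequency(int docCount, int totalDocs) {
    return static_cast<double>(totalDocs) / docCount;
}

// Prints tf, idf, and tf-idf to the console, one per line.
void printScores(int termCount, int length, int docCount, int totalDocs) {
    const double tf = termFrequency(termCount, length);
    const double idf = inverseDocFrequency(docCount, totalDocs);
    std::cout << "tf: " << tf << '\n'
              << "idf: " << idf << '\n'
              << "tf-idf: " << tf * idf << '\n';
}
```

The detail most likely to matter is the conversion to `double` before dividing: all four inputs are integers, and 12 / 745 evaluated in integer arithmetic is 0.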

Question 3

It turns out that when calculating tf-idf, we frequently run into problems with underflow. Also, in large document collections, the idf value is so big that it dominates the calculation. So, we take the logarithm of the idf to dampen its effect. Which logarithm? It doesn't really matter. Base 2, 10, or e are all used.

Unfortunately, there's still a problem. The log of 0 is undefined. If you are getting your document frequencies from your collection, then this is not a problem. Any word in a document in your collection will appear in at least one document in your collection, by definition. But sometimes instead we take the document frequency from somewhere else, and a word that appears in our document is not in that collection at all. So, to compensate for this, we also add a "smoothing" term, which is a fancy way of saying we add 1. So now, you would calculate idf like this:

idf = log (totalDocs/(docCount + 1))

Duplicate your program from question 2, and then modify it to calculate idf, and therefore tf-idf, using this new formula.

Hint: You may need to include a new header file to make this work!
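A hedged sketch of the Question 3 change, assuming C++ and that the new header the hint refers to is `<cmath>`. It uses the natural logarithm (`std::log`), which is one of the bases the question says is acceptable. The function names are hypothetical, and tf is assumed to be termCount / length, which the question does not state.

```cpp
#include <cmath>  // std::log (assumed to be the header the hint means)

// idf = log(totalDocs / (docCount + 1)), using the natural logarithm.
// The +1 keeps the denominator nonzero even when docCount is 0.
double smoothedIdf(int docCount, int totalDocs) {
    return std::log(static_cast<double>(totalDocs) / (docCount + 1));
}

// tf-idf with the smoothed, log-scaled idf; tf = termCount / length is assumed.
double smoothedTfIdf(int termCount, int length, int docCount, int totalDocs) {
    const double tf = static_cast<double>(termCount) / length;
    return tf * smoothedIdf(docCount, totalDocs);
}
```

Switching to base 2 or base 10 would only mean replacing `std::log` with `std::log2` or `std::log10`. That rescales every idf value by the same constant factor.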

Question 4 (Bonus): Flowchart

Make a flowchart for the program from question 1 or 3; your choice which one.

Contents of term_data.txt: 12 745 1459 1000000

