
For this assignment, you will be reading text data from a file, counting term frequency per document and document frequency, and displaying the results on the screen. The full list of operations your program must support and other specific requirements are outlined below.

Input Data: A sample input data file (hobbies-1.txt) that your program should process is distributed together. This sample file contains descriptions of 34 students' hobbies. Each student record spans 2 lines in the data file. The first line contains the ID of the student (e.g., student_1, student_2, student_3); the student IDs have been anonymized. The second line contains the description of the student's hobby.

Your program will need to count term frequency per document (the number of times each term occurs in each document) and document frequency (the number of documents in which each term appears) and record them in dictionaries. For instance, assume there are hobby descriptions of three students:

Student_1: I love soccer.
Student_2: I play basketball every day and play soccer sometimes.
Student_3: I love playing the violin.

The term frequency per document can be recorded in a dictionary as follows:

{Student_1: {i: 1, love: 1, soccer: 1}, Student_2: {i: 1, play: 2, basketball: 1, every: 1, day: 1, and: 1, soccer: 1, sometimes: 1}, Student_3: {i: 1, love: 1, playing: 1, the: 1, violin: 1}}

The document frequency can be recorded in another dictionary as follows:

{i: 3, love: 2, play: 1, basketball: 1, every: 1, day: 1, and: 1, soccer: 2, sometimes: 1, playing: 1, the: 1, violin: 1}

In the example above, the term frequencies are recorded as dictionaries nested inside a dictionary, and all alphabetic characters have been converted to lowercase. In your actual submission, the stopwords (e.g., "i" and "the") must also be removed.
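Assuming the tokens have already been cleaned (stopwords removed, as required for the actual submission), the two dictionaries above can be built with plain counting loops. This is only a sketch of the counting logic using the function names the assignment requires later, not a complete solution:

```python
def calTermFreq(docs):
    """Count how many times each term occurs in each document.

    docs maps a student ID to a list of cleaned tokens.
    Returns a nested dictionary: {student_id: {term: count}}.
    """
    term_freq = {}
    for student_id, tokens in docs.items():
        counts = {}
        for token in tokens:
            counts[token] = counts.get(token, 0) + 1
        term_freq[student_id] = counts
    return term_freq


def calDocFreq(term_freq):
    """Count the number of documents in which each term appears."""
    doc_freq = {}
    for counts in term_freq.values():
        for term in counts:  # each term counted at most once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return doc_freq


# The three-student example with stopwords (i, and, the) already removed.
docs = {
    "Student_1": ["love", "soccer"],
    "Student_2": ["play", "basketball", "every", "day",
                  "play", "soccer", "sometimes"],
    "Student_3": ["love", "playing", "violin"],
}
tf = calTermFreq(docs)
df = calDocFreq(tf)
```

Note that "play" occurs twice in Student_2's description, so its term frequency there is 2, but its document frequency is 1 because it appears in only one document.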
The examples above are only meant to explain the structure of the sample dictionaries and the concepts of term frequency and document frequency.

Output: Your program needs to show the counting results on the screen. An example of the required output (6_sample_output.txt) is distributed together.

Requirements: Your program should perform the following functionality:

1. Prompt the user to enter the name of the input file, making sure that the file exists and asking the user to re-enter a filename if needed. Then read the file (student IDs and hobby descriptions).
2. Tokenize the hobby descriptions (i.e., divide each string of written language into its component words) using the NLTK word_tokenize function. For this step, you need to install the NLTK library and then import it to actually use it (e.g., import nltk). For example, tokens = nltk.word_tokenize(hobby_text) takes hobby_text as input and returns a list of tokens.
3. Remove periods and commas from the tokens.
4. Convert all tokens to lowercase.
5. Remove stopwords (the most common words in a language, which usually need to be removed before natural language processing) from the tokens.
6. Calculate term frequency for each document (a nested dictionary) and save it as a value in another dictionary, keyed by student ID.
7. Calculate document frequency and save it in another dictionary.
8. Ask the user for a word whose term frequency and document frequency should be looked up. If the user enters a word that does not appear in the hobby descriptions, the program should keep asking until the user enters a valid word. The user must be able to search term frequency and document frequency as many times as desired. If the user enters a blank line, the program should terminate.

Developing the solution for this program would be quite challenging without using functions. To make your job easier, think about how functions can be used to simplify the design.
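Steps 3-5 of the cleaning pipeline might look like the sketch below. The short stopword list here is only an illustrative stand-in so the example is self-contained; the actual assignment requires the list from nltk.corpus.stopwords (noted in the comments), and the tokens would come from nltk.word_tokenize:

```python
def removePeriodsCommas(tokens):
    """Drop '.' and ',' tokens (word_tokenize emits punctuation separately)."""
    return [t for t in tokens if t not in (".", ",")]


def convertToLower(tokens):
    """Convert every token to lowercase."""
    return [t.lower() for t in tokens]


def removeStopWords(tokens, stop_words):
    """Drop tokens that appear in the stopword list.

    In the assignment, stop_words would come from NLTK:
        from nltk.corpus import stopwords
        stop_words = stopwords.words("english")
    """
    return [t for t in tokens if t not in stop_words]


# Illustrative stand-in stopword list (NLTK's real list is much longer).
STOP = ["i", "the", "and"]

tokens = ["I", "play", "basketball", "every", "day", "and",
          "play", "soccer", "sometimes", "."]
cleaned = removeStopWords(convertToLower(removePeriodsCommas(tokens)), STOP)
```

Chaining the functions in this order also mirrors step 4: lowercasing before stopword removal matters, because NLTK's stopword list is lowercase ("I" must become "i" before it can match).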
Your solution should have, at a minimum, the following functions:

main: the main function, which should control the flow of the program.
removePeriodsCommas: removes periods and commas from a list of tokens.
convertToLower: converts all tokens to lowercase.
removeStopWords: removes stopwords from the tokens. You need to import the stopwords from NLTK (e.g., from nltk.corpus import stopwords); you can then retrieve the stopword list with stopwords.words("english"), which returns a list of stopwords.
calTermFreq: calculates term frequency per document.
calDocFreq: calculates document frequency.
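Reading the two-line record format described earlier (an ID line followed by a hobby-description line) could be sketched as follows. The function name readRecords is hypothetical, not one of the required functions, and it assumes the file alternates strictly between ID and description lines:

```python
def readRecords(lines):
    """Pair each student ID line with the hobby description that follows it.

    Returns {student_id: hobby_text}. Assumes non-blank lines alternate
    strictly: ID, description, ID, description, ...
    """
    records = {}
    stripped = [line.strip() for line in lines if line.strip()]
    for i in range(0, len(stripped) - 1, 2):
        records[stripped[i]] = stripped[i + 1]
    return records


# In the real program the lines would come from the file the user named,
# e.g. with open(filename) as f: records = readRecords(f).
sample = [
    "Student_1\n",
    "I love soccer.\n",
    "Student_2\n",
    "I play basketball every day and play soccer sometimes.\n",
]
records = readRecords(sample)
```

Each description string in the resulting dictionary is then ready to be passed to nltk.word_tokenize and the cleaning functions.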
**** PLEASE WRITE THE ABOVE PROGRAM USING PYTHON ****
