
For this assignment, you will be reading text data from a file, counting term frequency per document and document frequency, and displaying the results on the screen. The full list of operations your program must support and other specific requirements are outlined below.

Input Data: A sample input data file (hobbies-1.txt) that your program should process is distributed together. This sample file contains descriptions of 34 students' hobbies. Each student record spans 2 lines in the data file. The first line contains the ID of the student (e.g., student_1, student_2, student_3); the student IDs have been anonymized. The second line contains the description of the student's hobby.

Your program will need to count term frequency per document (the number of times each term occurs in each document) and document frequency (the number of documents in which each term appears) and record them in dictionaries. For instance, assume there are hobby descriptions of three students:

Student_1: I love soccer.
Student_2: I play basketball every day and play soccer sometimes.
Student_3: I love playing the violin.

The term frequency per document can be recorded in a dictionary as follows:

{Student_1: {i: 1, love: 1, soccer: 1}, Student_2: {i: 1, play: 2, basketball: 1, every: 1, day: 1, and: 1, soccer: 1, sometimes: 1}, Student_3: {i: 1, love: 1, playing: 1, the: 1, violin: 1}}

The document frequency can be recorded in another dictionary as follows:

{i: 3, love: 2, play: 1, basketball: 1, every: 1, day: 1, and: 1, soccer: 2, sometimes: 1, playing: 1, the: 1, violin: 1}

In the example above, the term frequencies are recorded as dictionaries nested inside a dictionary, and all alphabetic characters have been converted to lowercase. In your actual submission, the stopwords (e.g., "i" and "the") must also be removed.
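Assuming the tokens have already been cleaned (stopwords removed, as required for the actual submission), the two dictionaries above can be built with plain counting loops. This is only a sketch of the counting logic using the function names the assignment requires later, not a complete solution:

```python
def calTermFreq(docs):
    """Count how many times each term occurs in each document.

    docs maps a student ID to a list of cleaned tokens.
    Returns a nested dictionary: {student_id: {term: count}}.
    """
    term_freq = {}
    for student_id, tokens in docs.items():
        counts = {}
        for token in tokens:
            counts[token] = counts.get(token, 0) + 1
        term_freq[student_id] = counts
    return term_freq


def calDocFreq(term_freq):
    """Count the number of documents in which each term appears."""
    doc_freq = {}
    for counts in term_freq.values():
        for term in counts:  # each term counted at most once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return doc_freq


# The three-student example with stopwords (i, and, the) already removed.
docs = {
    "Student_1": ["love", "soccer"],
    "Student_2": ["play", "basketball", "every", "day",
                  "play", "soccer", "sometimes"],
    "Student_3": ["love", "playing", "violin"],
}
tf = calTermFreq(docs)
df = calDocFreq(tf)
```

Note that "play" occurs twice in Student_2's description, so its term frequency there is 2, but its document frequency is 1 because it appears in only one document.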
The examples above are only meant to explain the structure of the sample dictionaries and the concepts of term frequency and document frequency.

Output: Your program needs to show the counting results on the screen. An example of the required output (6_sample_output.txt) is distributed together.

Requirements: Your program should perform the following functionality:

1. Prompt the user to enter the name of the input file, making sure that the file exists and asking the user to re-enter a filename if needed. Then read the file (student IDs and hobby descriptions).
2. Tokenize the hobby descriptions (i.e., divide each string of written language into its component words) using the NLTK word_tokenize function. For this step, you need to install the NLTK library and then import it to actually use it (e.g., import nltk). For example, tokens = nltk.word_tokenize(hobby_text) takes hobby_text as input and returns a list of tokens.
3. Remove periods and commas from the tokens.
4. Convert all tokens to lowercase.
5. Remove stopwords (the most common words in a language, which usually need to be removed before natural language processing) from the tokens.
6. Calculate term frequency for each document (a nested dictionary) and save it as a value in another dictionary, keyed by student ID.
7. Calculate document frequency and save it in another dictionary.
8. Ask the user for a word whose term frequency and document frequency should be looked up. If the user enters a word that does not appear in the hobby descriptions, the program should keep asking until the user enters a valid word. The user must be able to search term frequency and document frequency as many times as desired. If the user enters a blank line, the program should terminate.

Developing the solution for this program would be quite challenging without using functions. To make your job easier, think about how functions can be used to simplify the design.
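Steps 3-5 of the cleaning pipeline might look like the sketch below. The short stopword list here is only an illustrative stand-in so the example is self-contained; the actual assignment requires the list from nltk.corpus.stopwords (noted in the comments), and the tokens would come from nltk.word_tokenize:

```python
def removePeriodsCommas(tokens):
    """Drop '.' and ',' tokens (word_tokenize emits punctuation separately)."""
    return [t for t in tokens if t not in (".", ",")]


def convertToLower(tokens):
    """Convert every token to lowercase."""
    return [t.lower() for t in tokens]


def removeStopWords(tokens, stop_words):
    """Drop tokens that appear in the stopword list.

    In the assignment, stop_words would come from NLTK:
        from nltk.corpus import stopwords
        stop_words = stopwords.words("english")
    """
    return [t for t in tokens if t not in stop_words]


# Illustrative stand-in stopword list (NLTK's real list is much longer).
STOP = ["i", "the", "and"]

tokens = ["I", "play", "basketball", "every", "day", "and",
          "play", "soccer", "sometimes", "."]
cleaned = removeStopWords(convertToLower(removePeriodsCommas(tokens)), STOP)
```

Chaining the functions in this order also mirrors step 4: lowercasing before stopword removal matters, because NLTK's stopword list is lowercase ("I" must become "i" before it can match).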
Your solution should have, at a minimum, the following functions:

main: the main function, which should control the flow of the program.
removePeriodsCommas: removes periods and commas from a list of tokens.
convertToLower: converts all tokens to lowercase.
removeStopWords: removes stopwords from the tokens. You need to import the stopwords from NLTK (e.g., from nltk.corpus import stopwords); you can then retrieve the stopword list with stopwords.words("english"), which returns a list of stopwords.
calTermFreq: calculates term frequency per document.
calDocFreq: calculates document frequency.
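Reading the two-line record format described earlier (an ID line followed by a hobby-description line) could be sketched as follows. The function name readRecords is hypothetical, not one of the required functions, and it assumes the file alternates strictly between ID and description lines:

```python
def readRecords(lines):
    """Pair each student ID line with the hobby description that follows it.

    Returns {student_id: hobby_text}. Assumes non-blank lines alternate
    strictly: ID, description, ID, description, ...
    """
    records = {}
    stripped = [line.strip() for line in lines if line.strip()]
    for i in range(0, len(stripped) - 1, 2):
        records[stripped[i]] = stripped[i + 1]
    return records


# In the real program the lines would come from the file the user named,
# e.g. with open(filename) as f: records = readRecords(f).
sample = [
    "Student_1\n",
    "I love soccer.\n",
    "Student_2\n",
    "I play basketball every day and play soccer sometimes.\n",
]
records = readRecords(sample)
```

Each description string in the resulting dictionary is then ready to be passed to nltk.word_tokenize and the cleaning functions.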
**** PLEASE WRITE THE ABOVE PROGRAM USING PYTHON ****
