Question:
For this assignment, you will read text data from a file, count term frequency per document and document frequency, and display the results on the screen. The full list of operations your program must support and other specific requirements are outlined below.

Input Data: A sample input data file, hobbies.txt, that your program should process is distributed with the assignment. This sample file contains descriptions of students' hobbies. Each student record spans two lines in the data file. The first line contains the ID of the student (e.g., student 1, student 2, and student 3; I anonymized the student IDs). The second line contains the description of the student's hobby.

Your program will need to count term frequency per document (the number of times each term occurs in each document) and document frequency (the number of documents in which each term appears) and record them in dictionaries. For instance, assume that there are hobby descriptions of three students:

Student 1: I love soccer.
Student 2: I play basketball every day and play soccer sometimes.
Student 3: I love playing the violin.

The term frequency per document can be recorded in a dictionary as follows:

Student 1: {i: 1, love: 1, soccer: 1}
Student 2: {i: 1, play: 2, basketball: 1, every: 1, day: 1, and: 1, soccer: 1, sometimes: 1}
Student 3: {i: 1, love: 1, playing: 1, the: 1, violin: 1}

The document frequency can be recorded in another dictionary as follows:

{i: 3, love: 2, play: 1, basketball: 1, every: 1, day: 1, and: 1, soccer: 2, sometimes: 1, playing: 1, the: 1, violin: 1}

In the above example, the term frequency is recorded as a dictionary of dictionaries, and all alphabetic characters are changed to lower case. In the actual submission, the stopwords (e.g., "I" and "the") need to be removed. The examples above only explain the structure of the sample dictionaries and the concepts of term frequency and document frequency.

Output: Your program needs to show the results of the counting on the screen. An example of the required output, sample output.txt, is distributed with the assignment.
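To make the two dictionaries concrete, here is a minimal sketch that builds them for the three sample descriptions quoted above. It assumes the IDs are simply labelled "Student 1" to "Student 3", uses a plain whitespace split instead of NLTK, and keeps the stopwords, so it mirrors the illustrative example rather than the final submission. The helper name simple_tokens is hypothetical.

```python
from collections import Counter

# Sample descriptions from the assignment text; the ID labels are assumed.
hobbies = {
    "Student 1": "I love soccer.",
    "Student 2": "I play basketball every day and play soccer sometimes.",
    "Student 3": "I love playing the violin.",
}


def simple_tokens(text):
    """Lower-case the text, drop periods and commas, split on whitespace."""
    cleaned = text.lower().replace(".", "").replace(",", "")
    return cleaned.split()


# Term frequency: student ID -> {term: count within that description}
term_freq = {
    student_id: dict(Counter(simple_tokens(text)))
    for student_id, text in hobbies.items()
}

# Document frequency: term -> number of descriptions that contain the term.
# Each description contributes at most 1 per term because we count its keys.
doc_freq_counter = Counter()
for counts in term_freq.values():
    doc_freq_counter.update(counts.keys())
doc_freq = dict(doc_freq_counter)

print(term_freq["Student 2"])
print(doc_freq)
```

Counting the keys of each per-document dictionary, rather than the raw tokens, is what keeps a word that is repeated inside one description (such as "play" for the second student) from inflating its document frequency.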
Requirement: Your program should perform the following functionalities:

1. Prompt the user to enter the name of the input file, making sure that the file exists and asking the user to re-enter a filename if needed. Then read the file (student IDs and hobby descriptions).
2. Tokenize the hobby descriptions (tokenizing means dividing a string of written language into its component words) by using the NLTK word_tokenize function. For this process, you need to install the NLTK library and then import it to actually use it. For example, after import nltk, an example statement is tokens = nltk.word_tokenize(hobby_text). The word_tokenize function takes hobby_text as input and returns a list of tokens.
3. Remove periods and commas from the tokens.
4. Convert all tokens to lower case.
5. Remove stopwords (the most common words in a language, which usually need to be removed before natural language processing) from the tokens.
6. Calculate the term frequency for each document (a nested dictionary) and save it as a value in another dictionary. The key paired with each term-frequency dictionary should be the student ID.
7. Calculate document frequency and save it in another dictionary.
8. Ask the user for a word whose term frequency and document frequency they want to look up. If the user enters a word that does not appear in the hobby descriptions, the program should keep asking until the user enters a word that does. The user needs to be able to search term frequency and document frequency as many times as they like. If the user enters a blank, the program should terminate.

Developing the solution for this program would be quite challenging without using functions. To make your job easier, think about how functions can be used to simplify the design. Your solution should have, at a minimum, the following functions:

main: the main function, which should control the flow of the program.
removePeriodsCommas: removes periods and commas from a list of tokens.
converToLower: converts all tokens to lower case.
removeStopWords: removes stopwords from tokens. You need to import stopwords from NLTK (e.g., from nltk.corpus import stopwords), and then you can retrieve the stopwords list with stopwords.words('english'). This statement returns a list of stopwords.
calTermFreq: calculates term frequency per document.
calDocFreq: calculates document frequency.
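The three token-cleaning functions named above could be sketched as follows. This is one possible reading of the requirements, not a reference solution: passing the stopword list in as a second parameter is an assumption made here so the sketch runs without downloading NLTK data, and the token list is hand-written to stand in for what nltk.word_tokenize would return. In a real submission the list would come from stopwords.words('english').

```python
def removePeriodsCommas(tokens):
    """Drop tokens that are exactly a period or a comma."""
    return [token for token in tokens if token not in {".", ","}]


def converToLower(tokens):
    """Return the tokens converted to lower case (name spelled as in the assignment)."""
    return [token.lower() for token in tokens]


def removeStopWords(tokens, stop_words):
    """Drop every token found in stop_words (assumed to be lower-case)."""
    stop_set = set(stop_words)  # set membership is faster than scanning a list
    return [token for token in tokens if token not in stop_set]


# Hand-written stand-in for tokenizer output; not produced by NLTK here.
raw_tokens = ["I", "play", "basketball", "every", "day", ",", "and",
              "play", "soccer", "sometimes", "."]

# Tiny illustrative stopword list; the real one is much longer.
demo_stop_words = ["i", "and", "the"]

cleaned = removeStopWords(
    converToLower(removePeriodsCommas(raw_tokens)),
    demo_stop_words,
)
print(cleaned)
```

Lower-casing before stopword removal matters in this ordering, because a stopword list written in lower case would otherwise miss a capitalised token such as "I".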
Please write this program in Python.
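As a starting point for the file-handling and search requirements, the sketch below shows one way to prompt for an existing file, read the two-line student records, and run the lookup loop. All three function names are hypothetical, the ask parameter is an assumption added so the prompts can be scripted in tests, and skipping blank lines in the input file is a guess about the file layout rather than something the assignment states.

```python
import os


def prompt_for_existing_file(ask=input):
    """Keep asking for a filename until it names an existing file."""
    while True:
        name = ask("Enter the input file name: ").strip()
        if os.path.isfile(name):
            return name
        print(f"'{name}' was not found; please try again.")


def read_records(path):
    """Read two-line records: an ID line followed by a description line."""
    with open(path, encoding="utf-8") as handle:
        # Assumption: blank lines carry no data and can be skipped.
        lines = [line.strip() for line in handle if line.strip()]
    # Pair each ID line (even index) with the description line after it.
    return dict(zip(lines[0::2], lines[1::2]))


def lookup_loop(term_freq, doc_freq, ask=input):
    """Answer word queries until the user enters a blank line."""
    while True:
        word = ask("Enter a word (blank to quit): ").strip().lower()
        if not word:
            break
        if word not in doc_freq:
            print(f"'{word}' does not appear in the hobby descriptions.")
            continue
        print(f"Document frequency of '{word}': {doc_freq[word]}")
        for student_id, counts in term_freq.items():
            if word in counts:
                print(f"  {student_id}: {counts[word]}")
```

A main function would then call prompt_for_existing_file, pass the result to read_records, build the two frequency dictionaries from the cleaned tokens, and hand them to lookup_loop.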
