Question: (Python) Please follow the template provided at the bottom.
Your task is to write a Python program that opens and reads a very large text file.
The program prompts the user to enter the file name.
The program then computes some language statistics based on the contents of the file.
The longest word used in the file. If there is more than one, just print one of the longest. There is no need to find all the longest words.
The five most common words in the file with the number of times they appear in the file.
The word count of every word in the file, sorted alphabetically. This last output has to be written to a file named out.txt in the current working directory. Open this file for writing in 'w' mode.
Your program must work with any input text file that uses 'UTF-8' encoding. The program must also be efficient, computing statistics for a large file (pride.txt) in seconds.
Make sure that you include docstrings and comments where necessary, and follow the general Python style guidelines: no long lines in your code, Pythonic variable and function names, etc.
Use functions to structure your code.
You don't have to worry about validating the input at this point. Assume that the file name provided by the user exists.
Assume that the file is really large, so you don't want to read it all at once. However, you should open the file only once and read each line only once.
Make sure that you ignore capitalization when you process the file, so 'This' and 'this' should be counted as the same word.
Also make sure that you remove leading and trailing punctuation characters and numbers from your words.
'here' and 'here.' should both be counted as the same word. The same goes for 'Hi,' and 'Hi!'.
Punctuation characters inside words should be kept so that hyphenated words such as 'arm-in-arm' and contractions such as "don't" are left intact.
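The cleaning rules above can be sketched with str.strip, which removes characters only from the two ends of a string. This is a minimal sketch, not a required implementation: the helper name clean_word is made up for illustration, and it assumes that "punctuation" means the ASCII characters in string.punctuation (typographic quotes or dashes in a UTF-8 file would not be stripped).

```python
import string

# Characters removed from both ends of a token (assumption: ASCII
# punctuation plus the digits 0-9).
STRIP_CHARS = string.punctuation + string.digits


def clean_word(token):
    """Return token in lowercase with edge punctuation/digits removed."""
    # strip() only touches the ends, so inner hyphens and apostrophes
    # are kept intact.
    return token.strip(STRIP_CHARS).lower()
```

A token made up only of punctuation or digits becomes an empty string, so the caller would need to skip empty results before counting.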
Testing:
Start testing with a small text file (5-10 lines). Once you have it working, try the larger files.
Note that your program must generate console output as well as a file output.
Below is the console output for the text file provided (the novel Pride and Prejudice by Jane Austen). Note that your program has to print one of the longest words, not all of them.
The longest word is: disinterestedness
or:
The longest word is: misrepresentation
or:
The longest word is: communicativeness
The 5 most common words are:
the: 4332
to: 4138
of: 3612
and: 3580
her: 2225
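Console output in the format above might be produced from a word-count dictionary along the following lines. This is a minimal sketch: the helper names longest_word and most_common and the sample dictionary are made up for illustration and are not part of the template, where this logic would presumably live inside report.

```python
def longest_word(word_dict):
    """Return one of the longest keys in word_dict."""
    # max with key=len compares the words by length; on a tie it
    # returns the first one it encounters.
    return max(word_dict, key=len)


def most_common(word_dict, n=5):
    """Return the n (word, count) pairs with the highest counts."""
    by_count = sorted(
        word_dict.items(),
        key=lambda item: item[1],  # sort on the count, not the word
        reverse=True,
    )
    return by_count[:n]  # slicing keeps only the first n pairs


# Small made-up dictionary standing in for the real word counts.
sample = {'the': 20, 'ate': 1, 'morning': 2, 'to': 7, 'of': 5, 'and': 4}

print('The longest word is:', longest_word(sample))
print('The 5 most common words are:')
for word, count in most_common(sample):
    print(f'{word}: {count}')
```

Sorting the whole dictionary once is fast enough for a novel-sized vocabulary, so no special data structure should be needed to meet the "in seconds" requirement.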
(*****USE THIS TEMPLATE*****)
"""
Docstring: Enter your one-line overview here
and your detailed description
"""
import string
import random


def count_words(filename):
    """ Enter your function docstring here """
    # build and return the dictionary for the given filename


def report(word_dict):
    """ Enter your function docstring here """
    # report on various statistics based on the given word count dictionary


def main():
    # get the input filename and save it in a variable
    # call count_words to build the dictionary for the given file
    # save the dictionary in the variable word_count
    # call report to report on the contents of the dictionary word_count
    # If you want to generate a word cloud, uncomment the line below.
    # draw_cloud(word_count)


if __name__ == '__main__':
    main()
Hints:
Write a function count_words that reads a given file, line by line, then word by word, and returns a dictionary. The dictionary should have an entry for each word in the file. The value corresponding to a given word should be the number of times the word appears in the file.
The dictionary will be of the form:
{'the': 20, 'ate': 1, 'morning': 2, etc...}
As you process a given word from a given line in the file you can check whether that word is already in the dictionary: if it is, you update the count. Otherwise you create a new dictionary entry and initialize it.
Use the sorted function on the dictionary to get the different sorts. For some, you'll need to specify the key argument.
Take advantage of Python built-in function max.
Take advantage of list slicing: my_list[0:10] is the list of the first 10 items in my_list.
Make sure that you open the input file only once and read it one line at a time.
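The counting approach described in these hints can be sketched as follows. It is one possible reading of the hints, not the required solution: stripping with string.punctuation + string.digits and splitting on whitespace are assumptions about how words should be separated and cleaned.

```python
import string

# Assumption: strip ASCII punctuation and digits from word edges.
STRIP_CHARS = string.punctuation + string.digits


def count_words(filename):
    """Return a dict mapping each cleaned word in filename to its count."""
    word_dict = {}
    # The file is opened once; iterating over the file object yields
    # one line at a time, so the whole file is never held in memory.
    with open(filename, encoding='utf-8') as infile:
        for line in infile:
            for token in line.split():
                word = token.strip(STRIP_CHARS).lower()
                if not word:
                    continue  # token was only punctuation or digits
                if word in word_dict:
                    word_dict[word] += 1  # seen before: update count
                else:
                    word_dict[word] = 1  # first occurrence
    return word_dict
```

Dictionary lookups take constant time on average, so the running time grows roughly linearly with the size of the file.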
This is what the output in 'out.txt' is supposed to look like:
a: 1948
abatement: 1
abhorrence: 6
abhorrent: 1
abide: 1
abiding: 1
abilities: 6
able: 54
ablution: 1
abode: 8
abominable: 6
abominably: 4
abominate: 2
abound: 1
about: 122
above: 21
abroad: 4
abrupt: 1
abruptly: 2
etc..
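A file in the format shown above might be written as in the sketch below. The function name write_counts and its out_name parameter are hypothetical, and passing encoding='utf-8' is an assumption made so that the output matches the encoding of the input.

```python
def write_counts(word_dict, out_name='out.txt'):
    """Write 'word: count' lines to out_name in alphabetical order."""
    # 'w' mode creates the file, or overwrites it if it already exists.
    with open(out_name, 'w', encoding='utf-8') as outfile:
        # sorted() on a dict returns its keys in alphabetical order.
        for word in sorted(word_dict):
            outfile.write(f'{word}: {word_dict[word]}\n')
```

Called as write_counts(word_count), this would create out.txt in the current working directory.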
