Question:
In this problem you will be using pandas.
The txt file information is below
Attribute Information:
The format of the docword.*.txt file is 3 header lines, followed by NNZ triples:
---
D
W
NNZ
docID wordID count
docID wordID count
docID wordID count
docID wordID count
...
docID wordID count
docID wordID count
docID wordID count
---
The format of the vocab.*.txt file is one word per line; line n contains the word whose wordID is n.
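The vocabulary file is not needed for the questions below, but if you want to map wordIDs back to words, a minimal sketch (assuming the companion file is named vocab.nytimes.txt) could look like this:

import pandas as pd

# one word per line; after shifting, the index matches the 1-based wordID
vocab = pd.read_csv('vocab.nytimes.txt', header=None, names=['word'])
vocab.index = vocab.index + 1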
Here are the first several lines of the docword.nytimes.txt file. (They are pasted here as one run of numbers; in the actual file, the three header values and each docID wordID count triple are on separate lines.)
300000 102660 69679427 1 413 1 1 534 1 1 2340 1 1 2806 1 1 3059 1 1 3070 1 1 3294 1 1 3356 1 1 4056 1 1 4930 1 1 5255 1 1 6888 1 1 6946 1 1 6974 2 1 7296 1 1 7402 1 1 7405 1 1 7409 1 1 7544 1 1 7790 1 1 9085 1 1 9385 2 1 9959 1 1 9983 1 1 10126 1 1 10474 1 1 10787 1 1 11762 1 1 12610 1 1 12961 2 1 13359 1 1 13992 1 1 14255 1 1 14753 2 1 14815 1 1 14852 1 1 15503 1 1 15713 2 1 15886 1 1 16253 4 1 16385 2 1 16581 1 1 16743 1 1 16852 2 1 17654 4 1 17660 1 1 17820 1 1 18072 1 1 18200 1 1 18353 1 1 18566 1 1 18704 1 1 18990 1 1 22127 1 1 22128 1 1 22147 1 1 22291 1 1 22313 2 1 22872 1 1 23507 1 1 24489 1 1 24858 1 1 25611 2 1 25723 2 1 25724 1 1 25952 3 1 26114 1 1 26404 1 1 28051 1 1 28409 2 1 28410 1 1 29167 6 1 29176 3 1 29363 4 1 29609 1 1 30679 1 1 31013 1 1 31440 4 1 31586 1 1 31588 1 1 31745 2 1 31748 1 1 31943 1 1 32557 1 1 33023 1 1 33472 1 1 33946 2 1 34463 1 1 34498 1 1 34563 5 1 34784 1 1 34892 1 1 35491 1 1 35495 1 1 35542 1 1 37904 1 1 37961 1 1 38055 1 1 39124 2 1 39139 1 1 39144 1 1 39349 5 1 40235 1 1 40385 2 1 40479 1 1 41055 1 1 41062 2 1 41161 1 1 42526 1 1 42861 1 1 43569 1 1 43808 3 1 43981 1 1 43982 1 1 49517 1 1 49518 1 1 51749 4 1 52088 1 1 54197 1 1 56963 1 1 58669 1 1 63370 1 1 68127 1 1 78319 3 1 80416 7 1 82034 2 1 83139 1 1 83177 6 2 442 1 2 623 1 2 1020 1 2 1698 1 2 1700 1 2 1706 1 2 1894 2 2 1938 1 2 2006 1 2 2042 1 2 2188 1 2 2197 1 2 3193 1 2 3356 1 2 4203 1 2 4353 1 2 5244 1 2 5249 1 2 5796 1 2 5921 1 2 6003 1 2 6005 1 2 6111 1 2 6837 1 2 6870 1 2 7633 1 2 8428 1 2 8468 1 2 9316 1 2 9606 1 2 9746 1 2 10480 2 2 11334 1 2 12261 1 2 12263 1 2 12420 1 2 13153 1 2 13407 3 2 13850 1 2 14110 1 2 14631 1 2 14911 1 2 15722 1 2 15953 3 2 16287 1 2 16698 1
First --> You will need to read the file docword.nytimes.txt into a DataFrame.
Next --> Make sure that you skip the header information. Use only pandas functionality; in particular, you may not use loops. Store your answers in the variables that are provided in the file.
(a) How many documents contain more than 500 words?
(b) How many documents contain more than 100 unique words?
(c) How many words occur in more than 1000 documents?
(d) What is the id of the word that appears the most times across all documents?
(e) What is the average number of total words per document?
Here are the first two lines of code:
file = r'docword.nytimes.txt'
df = pd.read_csv(file, sep=' ', skiprows=3, header=None)
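Here skiprows=3 drops the D, W, and NNZ header lines, and header=None is needed because the file itself has no column names. Naming the columns is optional, but it makes the group-by steps easier to read; the names below are an assumption, not something given in the assignment:

df.columns = ['docID', 'wordID', 'count']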
The full solution should be fewer than 20 lines in total.
Step by Step Solution
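One possible approach, offered as a sketch rather than the official answer key: every part reduces to a group-by on the triples. Here "words per document" is read as the sum of the count column per docID, and "unique words" as the number of distinct wordIDs per docID; the variables a through e are placeholders for whatever names the assignment file actually provides, and the column names are the assumed ones introduced above.

import pandas as pd

file = r'docword.nytimes.txt'
df = pd.read_csv(file, sep=' ', skiprows=3, header=None)
df.columns = ['docID', 'wordID', 'count']           # assumed names, not given in the assignment

words_per_doc = df.groupby('docID')['count'].sum()         # total words in each document
unique_per_doc = df.groupby('docID')['wordID'].nunique()   # distinct words in each document
docs_per_word = df.groupby('wordID')['docID'].nunique()    # documents each word appears in

a = (words_per_doc > 500).sum()                    # (a) documents with more than 500 words
b = (unique_per_doc > 100).sum()                   # (b) documents with more than 100 unique words
c = (docs_per_word > 1000).sum()                   # (c) words occurring in more than 1000 documents
d = df.groupby('wordID')['count'].sum().idxmax()   # (d) wordID with the largest total count
e = words_per_doc.mean()                           # (e) average total words per document

This stays under the 20-line limit and uses only pandas operations, with no explicit loops. Note that (e) averages over documents that actually appear in the file; if some of the D documents contained no words at all, dividing df['count'].sum() by D would be the alternative reading.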
