
In this problem you will be using pandas.

The txt file information is below

Attribute Information:

The format of the docword.*.txt file is 3 header lines, followed by NNZ triples:

---
D
W
NNZ
docID wordID count
docID wordID count
docID wordID count
...
docID wordID count
---

The format of the vocab.*.txt file is one word per line; line n contains the word with wordID = n.

Here are several lines of the txt file:

300000 102660 69679427 1 413 1 1 534 1 1 2340 1 1 2806 1 1 3059 1 1 3070 1 1 3294 1 1 3356 1 1 4056 1 1 4930 1 1 5255 1 1 6888 1 1 6946 1 1 6974 2 1 7296 1 1 7402 1 1 7405 1 1 7409 1 1 7544 1 1 7790 1 1 9085 1 1 9385 2 1 9959 1 1 9983 1 1 10126 1 1 10474 1 1 10787 1 1 11762 1 1 12610 1 1 12961 2 1 13359 1 1 13992 1 1 14255 1 1 14753 2 1 14815 1 1 14852 1 1 15503 1 1 15713 2 1 15886 1 1 16253 4 1 16385 2 1 16581 1 1 16743 1 1 16852 2 1 17654 4 1 17660 1 1 17820 1 1 18072 1 1 18200 1 1 18353 1 1 18566 1 1 18704 1 1 18990 1 1 22127 1 1 22128 1 1 22147 1 1 22291 1 1 22313 2 1 22872 1 1 23507 1 1 24489 1 1 24858 1 1 25611 2 1 25723 2 1 25724 1 1 25952 3 1 26114 1 1 26404 1 1 28051 1 1 28409 2 1 28410 1 1 29167 6 1 29176 3 1 29363 4 1 29609 1 1 30679 1 1 31013 1 1 31440 4 1 31586 1 1 31588 1 1 31745 2 1 31748 1 1 31943 1 1 32557 1 1 33023 1 1 33472 1 1 33946 2 1 34463 1 1 34498 1 1 34563 5 1 34784 1 1 34892 1 1 35491 1 1 35495 1 1 35542 1 1 37904 1 1 37961 1 1 38055 1 1 39124 2 1 39139 1 1 39144 1 1 39349 5 1 40235 1 1 40385 2 1 40479 1 1 41055 1 1 41062 2 1 41161 1 1 42526 1 1 42861 1 1 43569 1 1 43808 3 1 43981 1 1 43982 1 1 49517 1 1 49518 1 1 51749 4 1 52088 1 1 54197 1 1 56963 1 1 58669 1 1 63370 1 1 68127 1 1 78319 3 1 80416 7 1 82034 2 1 83139 1 1 83177 6 2 442 1 2 623 1 2 1020 1 2 1698 1 2 1700 1 2 1706 1 2 1894 2 2 1938 1 2 2006 1 2 2042 1 2 2188 1 2 2197 1 2 3193 1 2 3356 1 2 4203 1 2 4353 1 2 5244 1 2 5249 1 2 5796 1 2 5921 1 2 6003 1 2 6005 1 2 6111 1 2 6837 1 2 6870 1 2 7633 1 2 8428 1 2 8468 1 2 9316 1 2 9606 1 2 9746 1 2 10480 2 2 11334 1 2 12261 1 2 12263 1 2 12420 1 2 13153 1 2 13407 3 2 13850 1 2 14110 1 2 14631 1 2 14911 1 2 15722 1 2 15953 3 2 16287 1 2 16698 1

First --> You will need to read the file docword.nytimes.txt into a dataframe.

Next --> Make sure that you skip the header information. Use only pandas functionality. In particular you may not use loops. Store your answers in the variables that are provided in the file.

(a) How many documents contain more than 500 words?

(b) How many documents contain more than 100 unique words?

(c) How many words occur in more than 1000 documents?

(d) What is the id of the word that appears the most times across all documents?

(e) What is the average number of total words per document?

Here are the first two lines of code (assuming pandas has been imported as pd):

file = r'docword.nytimes.txt'
df = pd.read_csv(file, sep=' ', skiprows=3, header=None)

The full solution should be fewer than 20 lines in total.
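Building on the starter lines, one way to answer (a)–(e) is a groupby per docID or per wordID. The sketch below is an assumption, not the graded answer key: the column names docID/wordID/count are chosen by us (the file has no header row), and because docword.nytimes.txt is not available here, it writes a tiny synthetic file named docword.sample.txt (hypothetical, same 3-header-lines-plus-triples layout) purely so the code runs end to end. Point `file` at the real docword.nytimes.txt to answer the actual question.

```python
import pandas as pd

# Hypothetical stand-in for docword.nytimes.txt: 3 header lines
# (D, W, NNZ), then "docID wordID count" triples.
sample = "\n".join([
    "3",            # D: number of documents
    "5",            # W: vocabulary size
    "7",            # NNZ: number of triples
    "1 1 600",
    "1 2 2",
    "2 1 50",
    "2 3 30",
    "3 1 400",
    "3 4 200",
    "3 5 10",
])
with open("docword.sample.txt", "w") as f:
    f.write(sample + "\n")

# For the real problem, use file = r'docword.nytimes.txt'.
file = "docword.sample.txt"
df = pd.read_csv(file, sep=" ", skiprows=3, header=None,
                 names=["docID", "wordID", "count"])

# (a) documents whose total word count exceeds 500
words_per_doc = df.groupby("docID")["count"].sum()
a = (words_per_doc > 500).sum()

# (b) documents with more than 100 unique words
unique_per_doc = df.groupby("docID")["wordID"].nunique()
b = (unique_per_doc > 100).sum()

# (c) words that occur in more than 1000 documents
docs_per_word = df.groupby("wordID")["docID"].nunique()
c = (docs_per_word > 1000).sum()

# (d) id of the word with the largest total count
d = df.groupby("wordID")["count"].sum().idxmax()

# (e) average number of total words per document
e = words_per_doc.mean()
```

Note the distinction the questions hinge on: `sum()` of `count` gives total word occurrences, while `nunique()` counts distinct docID/wordID pairings; mixing these up changes the answers to (a) and (b).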
