Question:
In this problem you will be using pandas.
The txt file information is below
Attribute Information:
The format of the docword.*.txt file is 3 header lines, followed by NNZ triples:
---
D
W
NNZ
docID wordID count
docID wordID count
docID wordID count
docID wordID count
...
docID wordID count
docID wordID count
docID wordID count
---
The format of the vocab.*.txt file is one word per line; line n contains the word whose wordID is n.
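The vocabulary file is not needed for the questions below, but if you want to map wordIDs back to words, a minimal sketch (assuming the companion file is named vocab.nytimes.txt) could look like this:

import pandas as pd

# one word per line; after shifting, the index matches the 1-based wordID
vocab = pd.read_csv('vocab.nytimes.txt', header=None, names=['word'])
vocab.index = vocab.index + 1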
Here are the first several lines of the docword.nytimes.txt file. (They are pasted here as one run of numbers; in the actual file, the three header values and each docID wordID count triple are on separate lines.)
300000 102660 69679427 1 413 1 1 534 1 1 2340 1 1 2806 1 1 3059 1 1 3070 1 1 3294 1 1 3356 1 1 4056 1 1 4930 1 1 5255 1 1 6888 1 1 6946 1 1 6974 2 1 7296 1 1 7402 1 1 7405 1 1 7409 1 1 7544 1 1 7790 1 1 9085 1 1 9385 2 1 9959 1 1 9983 1 1 10126 1 1 10474 1 1 10787 1 1 11762 1 1 12610 1 1 12961 2 1 13359 1 1 13992 1 1 14255 1 1 14753 2 1 14815 1 1 14852 1 1 15503 1 1 15713 2 1 15886 1 1 16253 4 1 16385 2 1 16581 1 1 16743 1 1 16852 2 1 17654 4 1 17660 1 1 17820 1 1 18072 1 1 18200 1 1 18353 1 1 18566 1 1 18704 1 1 18990 1 1 22127 1 1 22128 1 1 22147 1 1 22291 1 1 22313 2 1 22872 1 1 23507 1 1 24489 1 1 24858 1 1 25611 2 1 25723 2 1 25724 1 1 25952 3 1 26114 1 1 26404 1 1 28051 1 1 28409 2 1 28410 1 1 29167 6 1 29176 3 1 29363 4 1 29609 1 1 30679 1 1 31013 1 1 31440 4 1 31586 1 1 31588 1 1 31745 2 1 31748 1 1 31943 1 1 32557 1 1 33023 1 1 33472 1 1 33946 2 1 34463 1 1 34498 1 1 34563 5 1 34784 1 1 34892 1 1 35491 1 1 35495 1 1 35542 1 1 37904 1 1 37961 1 1 38055 1 1 39124 2 1 39139 1 1 39144 1 1 39349 5 1 40235 1 1 40385 2 1 40479 1 1 41055 1 1 41062 2 1 41161 1 1 42526 1 1 42861 1 1 43569 1 1 43808 3 1 43981 1 1 43982 1 1 49517 1 1 49518 1 1 51749 4 1 52088 1 1 54197 1 1 56963 1 1 58669 1 1 63370 1 1 68127 1 1 78319 3 1 80416 7 1 82034 2 1 83139 1 1 83177 6 2 442 1 2 623 1 2 1020 1 2 1698 1 2 1700 1 2 1706 1 2 1894 2 2 1938 1 2 2006 1 2 2042 1 2 2188 1 2 2197 1 2 3193 1 2 3356 1 2 4203 1 2 4353 1 2 5244 1 2 5249 1 2 5796 1 2 5921 1 2 6003 1 2 6005 1 2 6111 1 2 6837 1 2 6870 1 2 7633 1 2 8428 1 2 8468 1 2 9316 1 2 9606 1 2 9746 1 2 10480 2 2 11334 1 2 12261 1 2 12263 1 2 12420 1 2 13153 1 2 13407 3 2 13850 1 2 14110 1 2 14631 1 2 14911 1 2 15722 1 2 15953 3 2 16287 1 2 16698 1
First --> You will need to read the file docword.nytimes.txt into a DataFrame.
Next --> Make sure that you skip the header information. Use only pandas functionality; in particular, you may not use loops. Store your answers in the variables that are provided in the file.
(a) How many documents contain more than 500 words?
(b) How many documents contain more than 100 unique words?
(c) How many words occur in more than 1000 documents?
(d) What is the id of the word that appears the most times across all documents?
(e) What is the average number of total words per document?
Here are the first two lines of code:
file = r'docword.nytimes.txt'
df = pd.read_csv(file, sep=' ', skiprows=3, header=None)
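Here skiprows=3 drops the D, W, and NNZ header lines, and header=None is needed because the file itself has no column names. Naming the columns is optional, but it makes the group-by steps easier to read; the names below are an assumption, not something given in the assignment:

df.columns = ['docID', 'wordID', 'count']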
The full solution should be fewer than 20 lines in total.
Step by Step Solution
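One possible approach, offered as a sketch rather than the official answer key: every part reduces to a group-by on the triples. Here "words per document" is read as the sum of the count column per docID, and "unique words" as the number of distinct wordIDs per docID; the variables a through e are placeholders for whatever names the assignment file actually provides, and the column names are the assumed ones introduced above.

import pandas as pd

file = r'docword.nytimes.txt'
df = pd.read_csv(file, sep=' ', skiprows=3, header=None)
df.columns = ['docID', 'wordID', 'count']           # assumed names, not given in the assignment

words_per_doc = df.groupby('docID')['count'].sum()         # total words in each document
unique_per_doc = df.groupby('docID')['wordID'].nunique()   # distinct words in each document
docs_per_word = df.groupby('wordID')['docID'].nunique()    # documents each word appears in

a = (words_per_doc > 500).sum()                    # (a) documents with more than 500 words
b = (unique_per_doc > 100).sum()                   # (b) documents with more than 100 unique words
c = (docs_per_word > 1000).sum()                   # (c) words occurring in more than 1000 documents
d = df.groupby('wordID')['count'].sum().idxmax()   # (d) wordID with the largest total count
e = words_per_doc.mean()                           # (e) average total words per document

This stays under the 20-line limit and uses only pandas operations, with no explicit loops. Note that (e) averages over documents that actually appear in the file; if some of the D documents contained no words at all, dividing df['count'].sum() by D would be the alternative reading.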
