Question: Dataset: The posts.csv file holds actual user posts, with each post consisting of eight fields: ` id ` ( a unique identifier, int ) ,

Dataset: The posts.csv file holds actual user posts, with each post consisting of eight fields:

`

` (

unique identifier, int

), `

blog

` (

q unique identifier of the community to which the post had been

submitted, int

), `

user

` (

the author

s unique identifier, int

), `

posted

` (

the date of the post, string in

the YYYY

-

-

DD format

), `

url

` (

the URL of the post, string

), `

comments

` (

the number of

comments, int

), `

title

` (

the post title, string

),

and

`

body

` (

the post body, string, possibly with

HTML

) .

Instructions:

1 .

Data Cleaning and Preparation:

Import the posts.csv file using Pandas. Treat the

\

N strings as missing values. Read the

`

posted

`

column as date

(

use

`

parse

_

dates

`

and

`

date

_

format

`

options

) .

Save the cleaned DataFrame to a pickle file named "posts.p

" .

Ensure that your script loads

the clean version of the file from the cache if it exists

(

try

/

except FileNotFoundError

) .

2 .

Data Analysis and Visualization:

Report

(

and record

)

the number of posts and unique users, and the fraction of posts

without comments.

Calculate and scatter

-

plot the distribution of number of comments in logarithmic x

-

and y

-

coordinates. The x

-

axis shall be the number of comments, and the y

-

axis shall be the

number of posts with that many comments. Save the image as posts

-

zipf.pdf

.

Hint: Add

1

to the number of comments to avoid missing posts without comments on the logarithmic

scale.

Calculate and plot the histogram of the logarithm of the plots

bodies

lengths. Save the

image as posts

-

length.pdf

.

Calculate and create a horizontal bar chart showing the distribution of the number of posts

by calendar month. Label the bars with three

-

letter abbreviated months

names. Save the

image as posts

-

months.pdf

.

3 .

Content Analysis:

Preprocess each post body:

1)

Remove HTML

(

with Beautiful Soup

)

2)

Tokenize the text into words

3)

Calculate Part

-

-

Speech tags for each word

Data Science Spring

2024

CMPSC

- 310

Suffolk University A

4

Math & CS Department

4)

Lemmatize the words

5)

Convert the lemmas to lowercase and remove those on the stopwords list

Find the

400

most frequently used lemmas

(

use the Counter object

)

For each lemma, calculate the mean engagement

(

the mean number of comments

)

for the

posts that do and do not contain this lemma

(

two numbers

) .

Calculate the difference

between the numbers

it is the effect of the word on the response to the post.

Report two lists of words that cause the strongest responses: those with an effect higher

than mean

+

std or lower than mean

-

std

.

Apply VADER to each of the lists and compare the reported sentiment levels

Step by Step Solution

There are 3 Steps involved in it

1 Expert Approved Answer

Step: 1 Unlock blur-text-image

Question Has Been Solved by an Expert!

Get step-by-step solutions from verified subject matter experts

Step: 2 Unlock

Step: 3 Unlock

Students Have Also Explored These Related Accounting Questions!

I have to create a program in C and I can't figure it out. The program has to read a source file. Please help. /******************************************************************** PROJECT: Glossary...

candy_reader.py candy.csv im not really sure how to code this at all if anyone could help U4A - Application of Functional Programming Assignment Task :: Data Exploration Given the dataset and the...

In this TP, you will implement two different Map classes to serve basic natural language processing (NLP) functions. Particularly, you will constitute datasets of numerous documents. You will use the...

NIT3171 Group Assignment, Business Analysis Case Study - 35 Marks (35%) Report Submission Date through the drop box : (10% will be deducted for each day of delay) Presentation (up to 15 minutes per...

Problem Summary A cell phone company specializes in cell phones for single people (i.e. plans for a single line). They would like to hire you to write a program to help them calculate cell phone...

* Load the binary file, filename into a Dataset and return a pointer to * the Dataset. The binary file format is as follows: 4 bytes : 'N: Number of images / labels in the file 1 byte : Image 1 label...

10. Classifying Processed Images into Multiple Classes Introduction We have processed images to represent as a vector of 300 attributes. The first 294 are decimal values in the range 0 to 1...

Accounts Receivable Data Analysis Project ACCT 303 - Intermediate Accounting! Spring 2022 The objective of this project is to develop and practice data management and analyties, skills that will be...

MatLab question, Why is the code not working? Determining Quartiles In this lab you will use specific functions to determine the quartiles of a set of data. Quartiles of a set is defined as the 25%,...

Part 1 : Program Create an application as described below to store and retrieve data using the AVL binary search tree data structure. Write an application GenericsKbAVLApp to read in the attached...

Discuss the advantages and disadvantages of histograms versus stem-and-leaf plots.

The Sales and Stock Management Information System has to be designed and developed for any shop. The main aim of this project should be to computerization of daily activities in the shop. The project...

A will prepared on a pre - printed form from a stationery store is called a ( n ) :

Can you elucidate the role of molecular chaperones and protein quality control systems in maintaining the integrity and homeostasis of the cytoplasm, particularly under conditions of cellular stress...