Question:

Dataset: The posts.csv file holds actual user posts, with each post consisting of eight fields: `id` (a
unique identifier, int), `blog` (a unique identifier of the community to which the post had been
submitted, int), `user` (the author's unique identifier, int), `posted` (the date of the post, string in
the YYYY-MM-DD format), `url` (the URL of the post, string), `comments` (the number of
comments, int), `title` (the post title, string), and `body` (the post body, string, possibly with
HTML).
Instructions:
1. Data Cleaning and Preparation:
Import the posts.csv file using Pandas. Treat the \N strings as missing values. Read the
`posted` column as a date (use the `parse_dates` and `date_format` options).
Save the cleaned DataFrame to a pickle file named "posts.p". Ensure that your script loads
the clean version of the file from the cache if it exists (try/except FileNotFoundError).
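
One possible shape for this step, as a minimal sketch rather than a reference solution. It assumes posts.csv has a header row naming the eight fields and that pandas 2.0 or later is installed (the `date_format` option is not available in older releases). The function name `load_posts` and its parameters are illustrative, not part of the assignment.

```python
import pandas as pd


def load_posts(csv_path="posts.csv", cache_path="posts.p"):
    """Return the cleaned posts, preferring the pickle cache when it exists."""
    try:
        # Fast path: a previous run already saved the cleaned DataFrame.
        return pd.read_pickle(cache_path)
    except FileNotFoundError:
        posts = pd.read_csv(
            csv_path,
            na_values=["\\N"],        # the literal two-character string \N
            parse_dates=["posted"],
            date_format="%Y-%m-%d",   # assumption: requires pandas >= 2.0
        )
        posts.to_pickle(cache_path)
        return posts
```

Calling `load_posts()` twice reads the CSV only the first time; deleting posts.p forces a fresh import.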
2. Data Analysis and Visualization:
Report (and record) the number of posts and unique users, and the fraction of posts
without comments.
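
A small sketch of how these three numbers could be computed. It assumes "without comments" means a comment count of exactly 0 (rows with a missing count are not treated as zero); the function name `summarize` and the dictionary keys are illustrative.

```python
import pandas as pd


def summarize(posts):
    """Collect the headline statistics in a dict so they can be reported and recorded."""
    return {
        "posts": len(posts),
        "users": posts["user"].nunique(),
        # Mean of a boolean column is the fraction of True values.
        "no_comments": float((posts["comments"] == 0).mean()),
    }
```

The returned dictionary can be printed for the report and also written to a file to keep a record.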
Calculate and scatter-plot the distribution of the number of comments in logarithmic x- and
y-coordinates. The x-axis shall be the number of comments, and the y-axis shall be the
number of posts with that many comments. Save the image as posts-zipf.pdf. Hint: Add 1
to the number of comments to avoid missing posts without comments on the logarithmic
scale.
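
A hedged sketch of the log-log scatter plot. It follows the hint by shifting every count by 1, so the x-axis actually shows comments + 1. The function names, marker size, and axis labels are illustrative choices, and the Agg backend is selected only so the script can run without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; assumption: no GUI is needed
import matplotlib.pyplot as plt
import pandas as pd


def comment_distribution(posts):
    """Map (comments + 1) to the number of posts with that many comments."""
    shifted = posts["comments"].dropna().astype(int) + 1
    return shifted.value_counts().sort_index()


def plot_zipf(posts, path="posts-zipf.pdf"):
    dist = comment_distribution(posts)
    fig, ax = plt.subplots()
    ax.scatter(dist.index, dist.values, s=10)
    ax.set_xscale("log")
    ax.set_yscale("log")
    ax.set_xlabel("Number of comments + 1")
    ax.set_ylabel("Number of posts")
    fig.savefig(path)
    plt.close(fig)
    return dist
```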
Calculate and plot the histogram of the logarithm of the post bodies' lengths. Save the
image as posts-length.pdf.
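
A possible sketch for the length histogram. The assignment does not say which logarithm base to use or how to treat missing or empty bodies, so this version assumes base 10, drops missing bodies, and skips empty ones (their logarithm is undefined). The names `log_body_lengths` and `plot_lengths` and the bin count are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; assumption: no GUI is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def log_body_lengths(posts):
    """Return log10 of the character length of every non-empty body."""
    lengths = posts["body"].dropna().str.len()
    lengths = lengths[lengths > 0]  # log of 0 is undefined
    return np.log10(lengths)


def plot_lengths(posts, path="posts-length.pdf", bins=50):
    fig, ax = plt.subplots()
    ax.hist(log_body_lengths(posts), bins=bins)
    ax.set_xlabel("log10(body length in characters)")
    ax.set_ylabel("Number of posts")
    fig.savefig(path)
    plt.close(fig)
```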
Calculate and create a horizontal bar chart showing the distribution of the number of posts
by calendar month. Label the bars with three-letter abbreviated month names. Save the
image as posts-months.pdf.
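
One way this chart could be built, as a sketch. It assumes `posted` was already parsed as a datetime column, pools posts from all years into the twelve calendar months, and takes the labels from the standard library's `calendar.month_abbr` (which follows the current locale). The function names are illustrative.

```python
import calendar

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; assumption: no GUI is needed
import matplotlib.pyplot as plt
import pandas as pd


def posts_per_month(posts):
    """Count posts per calendar month, indexed Jan..Dec (months with no posts get 0)."""
    counts = posts["posted"].dt.month.value_counts().reindex(range(1, 13), fill_value=0)
    counts.index = [calendar.month_abbr[m] for m in counts.index]
    return counts


def plot_months(posts, path="posts-months.pdf"):
    counts = posts_per_month(posts)
    fig, ax = plt.subplots()
    ax.barh(list(counts.index), counts.values)
    ax.invert_yaxis()  # put Jan at the top
    ax.set_xlabel("Number of posts")
    fig.savefig(path)
    plt.close(fig)
```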
3. Content Analysis:
Preprocess each post body:
1) Remove HTML (with Beautiful Soup)
2) Tokenize the text into words
3) Calculate Part-of-Speech tags for each word
Data Science Spring 2024 CMPSC-310
Suffolk University A4 Math & CS Department
4) Lemmatize the words
5) Convert the lemmas to lowercase and remove those on the stopwords list
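
The five preprocessing steps above could be sketched as follows, using Beautiful Soup and NLTK. This is one possible arrangement, not the only valid one: the helper names `penn_to_wordnet` and `preprocess` are illustrative, the English stopword list is an assumption about the posts' language, and the required NLTK resources must be downloaded once beforehand (their names differ between NLTK versions).

```python
from bs4 import BeautifulSoup
from nltk import pos_tag, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time setup, not run here. Resource names vary by NLTK version, e.g.:
#   nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
#   nltk.download("wordnet"); nltk.download("stopwords")


def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    return {"J": "a", "V": "v", "R": "r"}.get(tag[:1], "n")


def preprocess(body, lemmatizer=None, stop_words=None):
    """Return the lowercase, stopword-free lemmas of one post body."""
    lemmatizer = lemmatizer or WordNetLemmatizer()
    if stop_words is None:
        stop_words = set(stopwords.words("english"))  # assumption: English posts
    text = BeautifulSoup(body, "html.parser").get_text(" ")   # 1) strip HTML
    tagged = pos_tag(word_tokenize(text))                     # 2) tokenize, 3) POS-tag
    lemmas = (
        lemmatizer.lemmatize(word, penn_to_wordnet(tag)).lower()  # 4) lemmatize, 5) lowercase
        for word, tag in tagged
    )
    return [lemma for lemma in lemmas if lemma not in stop_words]
```

Punctuation tokens are kept because the steps do not ask for their removal; creating the lemmatizer and stopword set once and passing them in avoids rebuilding them for every post.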
Find the 400 most frequently used lemmas (use the Counter object).
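
A short sketch of the counting step with `collections.Counter`. It assumes each post has already been reduced to a list of lemmas and that "most frequently used" means total occurrences across all posts (not the number of posts containing the lemma). The function name `top_lemmas` is illustrative.

```python
from collections import Counter


def top_lemmas(post_lemmas, n=400):
    """Return the n most frequent lemmas over all posts, most common first."""
    counts = Counter()
    for lemmas in post_lemmas:
        counts.update(lemmas)  # adds one count per occurrence
    return [lemma for lemma, _ in counts.most_common(n)]
```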
For each lemma, calculate the mean engagement (the mean number of comments) for the
posts that do and do not contain this lemma (two numbers). Calculate the difference
between the numbers; it is the effect of the word on the response to the post.
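
The effect of a single lemma could be computed along these lines. The sketch assumes each post is represented by a set of its lemmas alongside its comment count, and it returns `None` when one of the two groups is empty, since the difference is then undefined. The function name `lemma_effect` is illustrative.

```python
from statistics import mean


def lemma_effect(lemma, post_lemmas, comments):
    """Mean comments of posts containing the lemma minus mean comments of the rest."""
    with_lemma = [c for lemmas, c in zip(post_lemmas, comments) if lemma in lemmas]
    without = [c for lemmas, c in zip(post_lemmas, comments) if lemma not in lemmas]
    if not with_lemma or not without:
        return None  # one group is empty, so no difference can be formed
    return mean(with_lemma) - mean(without)
```

For example, with posts `[{"cat", "dog"}, {"cat"}, {"bird"}]` and comment counts `[10, 20, 3]`, the posts containing "cat" average 15 comments and the remaining post has 3, giving an effect of 12.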
Report two lists of words that cause the strongest responses: those with an effect higher
than mean+std or lower than mean-std.
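
A sketch of the thresholding, assuming the effects are held in a dictionary that maps each lemma to a number. The assignment does not say whether the sample or the population standard deviation is meant; this version uses the sample one (`statistics.stdev`), which needs at least two values. The function name `strongest_words` is illustrative.

```python
from statistics import mean, stdev


def strongest_words(effects):
    """Split lemmas into those above mean+std and those below mean-std."""
    values = list(effects.values())
    mu, sigma = mean(values), stdev(values)  # assumption: sample standard deviation
    high = sorted(w for w, e in effects.items() if e > mu + sigma)
    low = sorted(w for w, e in effects.items() if e < mu - sigma)
    return high, low
```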
Apply VADER to each of the lists and compare the reported sentiment levels.
