Question:
Dataset: The posts.csv file holds actual user posts, with each post consisting of eight fields:
id (a unique identifier, int), blog (a unique identifier of the community to which the post had
been submitted, int), user (the author's unique identifier, int), posted (the date of the post,
string in the YYYYMMDD format), url (the URL of the post, string), comments (the number of
comments, int), title (the post title, string), and body (the post body, string, possibly with
HTML).
Instructions:
Data Cleaning and Preparation:
Import the posts.csv file using Pandas. Treat the N strings as missing values. Read the
posted column as a date (use the parse_dates and date_format options).
Save the cleaned DataFrame to a pickle file named "posts.p". Ensure that your script loads
the clean version of the file from the cache if it exists (try/except FileNotFoundError).
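The loading and caching steps above could be sketched as follows. This is only one possible reading of the task: the function name load_posts and its parameters are made up for illustration, the default date format "%Y%m%d" is a guess based on the "YYYYMMDD" wording, and the missing-value marker is taken literally as "N" (the original assignment may have used a different marker, so it is left as a parameter). The date_format option needs pandas 2.0 or newer.

```python
import pandas as pd


def load_posts(csv_path="posts.csv", cache_path="posts.p",
               na_marker="N", date_format="%Y%m%d"):
    """Load the cleaned posts, preferring the pickle cache when it exists.

    na_marker and date_format are assumptions; adjust them to the real file.
    """
    try:
        # Fast path: the cleaned DataFrame was saved by an earlier run.
        return pd.read_pickle(cache_path)
    except FileNotFoundError:
        # Slow path: parse the CSV, then cache the cleaned result.
        posts = pd.read_csv(
            csv_path,
            na_values=[na_marker],      # extra string treated as missing
            parse_dates=["posted"],     # read the column as a date
            date_format=date_format,    # requires pandas >= 2.0
        )
        posts.to_pickle(cache_path)
        return posts
```

Calling load_posts() twice in a row should parse the CSV only the first time; deleting posts.p forces a fresh parse.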
Data Analysis and Visualization:
Report and record the number of posts and unique users, and the fraction of posts
without comments.
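A minimal sketch of the three summary numbers, assuming the DataFrame has the user and comments columns described in the dataset section. The helper name summarize and the dictionary keys are illustrative only.

```python
import pandas as pd


def summarize(posts):
    """Return the number of posts, unique users, and the zero-comment fraction."""
    return {
        "posts": len(posts),
        "unique_users": posts["user"].nunique(),
        # The mean of a boolean column is the fraction of True values.
        "fraction_without_comments": (posts["comments"] == 0).mean(),
    }
```

Note that rows with a missing comment count are not treated as "without comments" here; whether they should be is a judgment call the assignment does not settle.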
Calculate and scatter-plot the distribution of the number of comments in logarithmic x and y
coordinates. The x-axis shall be the number of comments, and the y-axis shall be the
number of posts with that many comments. Save the image as postszipf.pdf. Hint: Add 1
to the number of comments to avoid missing posts without comments on the logarithmic
scale.
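One way to approach the log-log scatter plot is to count posts per comment value first and plot afterwards. The function names below are hypothetical, and the shift by 1 follows the hint so that posts with zero comments stay visible on the logarithmic axis. matplotlib is imported inside the plotting function so that the counting helper can be used on its own.

```python
import pandas as pd


def comment_distribution(posts):
    """Count how many posts have each number of comments (x shifted by 1)."""
    counts = posts["comments"].value_counts().sort_index()
    return pd.DataFrame({
        "comments_plus_1": counts.index.to_numpy() + 1,  # avoids log(0)
        "posts": counts.to_numpy(),
    })


def plot_zipf(posts, out_path="postszipf.pdf"):
    """Scatter-plot the distribution on log-log axes and save it."""
    import matplotlib.pyplot as plt  # only needed for plotting

    dist = comment_distribution(posts)
    fig, ax = plt.subplots()
    ax.scatter(dist["comments_plus_1"], dist["posts"])
    ax.set_xscale("log")
    ax.set_yscale("log")
    ax.set_xlabel("Number of comments + 1")
    ax.set_ylabel("Number of posts")
    fig.savefig(out_path)
    plt.close(fig)
```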
Calculate and plot the histogram of the logarithm of the posts' bodies' lengths. Save the
image as postslength.pdf.
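A possible sketch for the body-length histogram. Several choices here are assumptions, since the task does not specify them: the length is measured in characters of the raw body (HTML included), the logarithm is base 10, missing and empty bodies are skipped, and 50 bins are used. The function names are illustrative.

```python
import numpy as np
import pandas as pd


def log_body_lengths(posts):
    """Return log10 of each non-empty body's length in characters."""
    lengths = posts["body"].dropna().str.len()
    lengths = lengths[lengths > 0]  # log of zero is undefined
    return np.log10(lengths)


def plot_lengths(posts, out_path="postslength.pdf", bins=50):
    """Plot and save the histogram of log body lengths."""
    import matplotlib.pyplot as plt  # only needed for plotting

    fig, ax = plt.subplots()
    ax.hist(log_body_lengths(posts), bins=bins)
    ax.set_xlabel("log10(body length in characters)")
    ax.set_ylabel("Number of posts")
    fig.savefig(out_path)
    plt.close(fig)
```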
Calculate and create a horizontal bar chart showing the distribution of the number of posts
by calendar month. Label the bars with three-letter abbreviated month names. Save the
image as postsmonths.pdf.
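The monthly chart could be built along these lines, assuming the posted column was already parsed as a date. The function names are hypothetical; the month labels come from the standard calendar module, which gives English three-letter names under the default locale.

```python
import calendar

import pandas as pd


def posts_per_month(posts):
    """Count posts per calendar month, January to December, with 3-letter labels."""
    months = posts["posted"].dropna().dt.month
    # reindex keeps all twelve months, even those without any posts.
    counts = months.value_counts().reindex(range(1, 13), fill_value=0)
    counts.index = [calendar.month_abbr[m] for m in counts.index]
    return counts


def plot_months(posts, out_path="postsmonths.pdf"):
    """Draw and save the horizontal bar chart."""
    import matplotlib.pyplot as plt  # only needed for plotting

    counts = posts_per_month(posts)
    fig, ax = plt.subplots()
    ax.barh(list(counts.index), counts.to_numpy())
    ax.invert_yaxis()  # put January at the top
    ax.set_xlabel("Number of posts")
    fig.savefig(out_path)
    plt.close(fig)
```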
Content Analysis:
Preprocess each post body:
Remove HTML with Beautiful Soup
Tokenize the text into words
Calculate Part-of-Speech tags for each word
Data Science Spring CMPSC
Suffolk University A Math & CS Department
Lemmatize the words
Convert the lemmas to lowercase and remove those on the stopwords list
Find the most frequently used lemmas (use the Counter object)
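The preprocessing steps above might be chained as in the sketch below, using Beautiful Soup and NLTK. The helper names are made up, the English stopword list is assumed, and dropping non-alphabetic tokens (punctuation, numbers) is an extra step not stated in the task. The number of lemmas to keep is not given in the text, so it is left as the parameter n. NLTK's tokenizer, tagger, WordNet, and stopword data have to be downloaded beforehand with nltk.download; the exact resource names depend on the NLTK version.

```python
from collections import Counter

import nltk
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer


def strip_html(body):
    """Remove HTML markup and return the visible text."""
    return BeautifulSoup(body, "html.parser").get_text(" ")


def wordnet_pos(penn_tag):
    """Map a Penn Treebank tag to a WordNet POS letter (noun by default)."""
    return {"J": "a", "V": "v", "R": "r"}.get(penn_tag[:1], "n")


def lemmas_of(body, lemmatizer, stop_words):
    """Tokenize, tag, lemmatize, lowercase, and filter one post body."""
    tokens = nltk.word_tokenize(strip_html(body))
    tagged = nltk.pos_tag(tokens)
    lemmas = (lemmatizer.lemmatize(w, wordnet_pos(t)).lower() for w, t in tagged)
    # Assumption: keep alphabetic lemmas only, then drop stopwords.
    return [l for l in lemmas if l.isalpha() and l not in stop_words]


def top_lemmas(bodies, n):
    """Return the n most frequent lemmas across all bodies."""
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words("english"))
    counts = Counter()
    for body in bodies:
        counts.update(lemmas_of(body, lemmatizer, stop_words))
    return counts.most_common(n)
```

With a cleaned DataFrame, top_lemmas(posts["body"].dropna(), n) would give (lemma, count) pairs for whatever n the assignment asks for.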
For each lemma, calculate the mean engagement (the mean number of comments) for the
posts that do and do not contain this lemma (two numbers). Calculate the difference
between the numbers: it is the effect of the word on the response to the post.
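The effect calculation can be illustrated with plain Python. This sketch assumes each post has already been reduced to a set of lemmas, and it takes the difference as "with the lemma" minus "without the lemma"; the sign convention is not stated in the task. The function name and argument layout are hypothetical.

```python
from statistics import mean


def lemma_effect(lemma, post_lemmas, comments):
    """Mean comments of posts containing the lemma minus mean of the rest.

    post_lemmas: one set of lemmas per post.
    comments:    the comment count of each post, in the same order.
    Returns None when either group is empty, since no mean exists then.
    """
    with_lemma = [c for ls, c in zip(post_lemmas, comments) if lemma in ls]
    without = [c for ls, c in zip(post_lemmas, comments) if lemma not in ls]
    if not with_lemma or not without:
        return None
    return mean(with_lemma) - mean(without)
```

For example, with posts [{"cat", "dog"}, {"cat"}, {"fish"}, {"dog"}] and comment counts [4, 2, 0, 2], the posts with "cat" average 3 comments and the others average 1, so the effect of "cat" is 2.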
Report two lists of words that cause the strongest responses: those with an effect higher
than mean + std or lower than mean - std.
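Selecting the two lists might look like the sketch below. It assumes the effects are held in a dictionary from lemma to effect, and it exposes the multiplier on the standard deviation as a parameter k (defaulting to 1), because the exact threshold in the task text is unclear. The population standard deviation is used; the sample version (statistics.stdev) would be an equally defensible choice.

```python
from statistics import mean, pstdev


def strongest_lemmas(effects, k=1.0):
    """Split lemmas into those far above and far below the mean effect.

    effects: dict mapping lemma -> effect.
    k:       assumed multiplier on the standard deviation.
    """
    values = list(effects.values())
    mu, sigma = mean(values), pstdev(values)
    high = sorted(w for w, e in effects.items() if e > mu + k * sigma)
    low = sorted(w for w, e in effects.items() if e < mu - k * sigma)
    return high, low
```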
Apply VADER to each of the lists and compare the reported sentiment levels
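A rough sketch of the VADER comparison. It assumes that "applying VADER to a list" means scoring the words joined into one string and reading the compound score; other readings (for example, averaging per-word scores) are possible. The function names are hypothetical. The analyzer is passed in as an argument, and make_nltk_analyzer shows one way to obtain it from NLTK, which needs the VADER lexicon downloaded first.

```python
def list_sentiment(words, analyzer):
    """Score a list of words as one text and return VADER's compound value."""
    return analyzer.polarity_scores(" ".join(words))["compound"]


def compare_sentiment(high, low, analyzer):
    """Return the compound sentiment of both word lists side by side."""
    return {
        "high": list_sentiment(high, analyzer),
        "low": list_sentiment(low, analyzer),
    }


def make_nltk_analyzer():
    """Build NLTK's VADER analyzer (requires the vader_lexicon resource)."""
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    return SentimentIntensityAnalyzer()
```

A typical call would be compare_sentiment(high, low, make_nltk_analyzer()), where high and low are the two word lists from the previous step.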
