Question:

First, research/gather the data:
1. Choose one Stack Exchange site dealing with topics that you find interesting; see https://stackexchange.com/sites?view=list#traffic for a list. The site cannot be too small, but also avoid selecting any of the largest ones (especially Stack Overflow, Mathematics) unless you really want to challenge yourself. As a rule of thumb, let's say that the site must have at least 10,000 questions and 10,000 answers.
This document was originally developed by Dr. Marek Gagolewski. It was subsequently revised by Dr. Yang Li (Kelvin) during the work at the School of Information Technology, Deakin University, for the unit SIT220/731 Data Wrangling, Trimester 1, 2024.
2. Download the site's most recent data dump from https://archive.org/details/stackexchange/.
3. Read the description of all the data tables published at https://meta.stackexchange.com/questions/2677/.
Then, create a single Quarto .qmd file that you will be rendering to a PDF report (how to do that, you will have to learn yourself; this is part of this HD-level task), where you perform what follows.
1. Convert all the data tables (Badges, Comments, PostHistory, PostLinks, Posts, Tags, Users, Votes) from XML to CSV, using custom code that you write yourself. Ideally, you should write a Python function that takes a single input file name (.xml) and output file name (.csv) and performs the conversion of a single dataset.
2. Load the CSV files as pandas data frames.
3. Create at least five nontrivial data visualisations and/or tables, at least three of which are based on the extraction of information from text (e.g., tags, keywords, locations, etc.). You must demonstrate that you have learned how to write your own regular expressions (regexes).
4. Draw insightful and interesting conclusions. Do not forget to reflect on the potential data privacy and ethics issues that arise during the data analysis process.
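The conversion function described in step 1 could be sketched roughly as follows. This is only one possible approach, not a reference solution. It assumes each dump file consists of a root element containing `<row ... />` elements whose attributes hold the column values, and that optional attributes are simply missing from some rows. The name `xml_to_csv` is made up for illustration.

```python
import csv
import xml.etree.ElementTree as ET


def xml_to_csv(xml_path, csv_path):
    """Convert one dump table (a file of <row .../> elements) to CSV.

    Assumption: all data lives in the attributes of <row> elements.
    Returns the number of rows written.
    """
    # Pass 1: collect every attribute name in first-seen order, because
    # rows may omit optional attributes.
    columns = []
    seen = set()
    for _, elem in ET.iterparse(xml_path, events=("end",)):
        if elem.tag == "row":
            for name in elem.attrib:
                if name not in seen:
                    seen.add(name)
                    columns.append(name)
        elem.clear()  # drop the element's content to limit memory use

    # Pass 2: write the header and one CSV line per <row>.
    n_rows = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=columns, restval="")
        writer.writeheader()
        for _, elem in ET.iterparse(xml_path, events=("end",)):
            if elem.tag == "row":
                writer.writerow(elem.attrib)  # missing keys become ""
                n_rows += 1
            elem.clear()
    return n_rows
```

Reading the file twice keeps memory use low for large tables, at the cost of a second parse. A call such as `xml_to_csv("Posts.xml", "Posts.csv")` would then produce a file that `pandas.read_csv` can load for step 2.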
This HD-level task is purposely under-defined: you will not be told precisely what to do. Your aim is to generate some interesting insights into data featuring lots of textual information.
In the course of the report preparation, you should apply a wide range of data frame wrangling and text processing techniques. In particular, you must demonstrate that you have mastered regular expressions.
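As a small illustration of regex-based text extraction, the sketch below pulls individual tags out of a post's tag field and counts them. It assumes the field stores tags in the form `<tag1><tag2>`; if your dump uses a different delimiter, the pattern needs adjusting. The helper names `extract_tags` and `count_tags` are hypothetical.

```python
import re
from collections import Counter

# One tag = the text between a "<" and the next ">", with no nested brackets.
TAG_RE = re.compile(r"<([^<>]+)>")


def extract_tags(tags_field):
    """Return the list of tag names found in a single tag field."""
    if not tags_field:  # handles None and empty strings (e.g. answers)
        return []
    return TAG_RE.findall(tags_field)


def count_tags(tag_fields):
    """Count how often each tag occurs across many tag fields."""
    counts = Counter()
    for field in tag_fields:
        counts.update(extract_tags(field))
    return counts
```

The resulting counter can be turned into a table or a bar chart of the most frequent tags, for example via `count_tags(fields).most_common(20)`.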
Do not use pie charts (as we discussed during the lecture). Go beyond the basic plots that we have covered in this course. Draw at least one map (e.g., of the world) and a word cloud.
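A word cloud needs word frequencies as input. The following is a minimal sketch of how they might be computed from post bodies, assuming the bodies are HTML strings. The stop-word list is a tiny illustrative placeholder and `word_frequencies` is a hypothetical helper name.

```python
import html
import re
from collections import Counter

HTML_TAG_RE = re.compile(r"<[^>]+>")        # any HTML tag such as <p> or </code>
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")  # lowercase words, optional apostrophe part

# Illustrative only; a real analysis would use a fuller stop-word list.
STOP_WORDS = {"the", "and", "for", "that", "this", "with", "you", "are", "not"}


def word_frequencies(bodies, min_length=3):
    """Count words across HTML post bodies, skipping short and stop words."""
    counts = Counter()
    for body in bodies:
        # Remove tags first, then decode entities such as &amp;.
        text = HTML_TAG_RE.sub(" ", body or "")
        text = html.unescape(text).lower()
        for word in WORD_RE.findall(text):
            if len(word) >= min_length and word not in STOP_WORDS:
                counts[word] += 1
    return counts
```

The resulting mapping from word to count can then be passed to a plotting tool of your choice, for instance the third-party `wordcloud` package, which is an assumption here and not something the task prescribes.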
