Question:
Must be done in Python 3 in Jupyter Notebook using Pandas
Link to privacy.html needed for this question: https://mega.nz/#!Ur4BUKoJ!kelCZAdSDUv6tIltswcvNsh8KehhXkME03ZO-Zv_Vns
Part 1:
Install bs4/BeautifulSoup, and give it a try on extracting just the text (and not the html) from the file privacy.html. This file is a simple web server landing page. Think of it as containing just a long string of characters. If you look at it in a text editor you'll see a lot of html tags. Share your code and your results.
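One possible approach to Part 1 is sketched below. In a Jupyter cell, the package can be installed with `%pip install beautifulsoup4`. The helper name `html_to_text` and the `sample_html` string are assumptions made for illustration; the sample is only a stand-in used when privacy.html is not in the working directory, and it is not the real file's content.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def html_to_text(raw: str) -> str:
    """Return the text of an HTML document with all tags removed."""
    soup = BeautifulSoup(raw, "html.parser")
    # Drop <script> and <style> elements: their contents are code, not page text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Join the remaining text nodes with newlines, trimming each one.
    return soup.get_text(separator="\n", strip=True)


# Hypothetical stand-in document, used only if privacy.html is missing.
sample_html = (
    "<html><head><title>Privacy</title><style>p {color: red;}</style></head>"
    "<body><h1>Privacy Policy</h1><p>We respect your <b>privacy</b>.</p></body></html>"
)

path = Path("privacy.html")
raw = path.read_text(encoding="utf-8", errors="replace") if path.exists() else sample_html
print(html_to_text(raw))
```

With the real file in place, the printed result is the landing page's text, one text node per line.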
Part 2:
Use Deldycke's html tag regex (link here: https://kevin.deldycke.com/2008/07/python-ultimate-regular-expression-to-catch-html-tags/) or another expression that you like better, with Pandas or by just using Python to strip out all the html from privacy.html. What's in this file is just a long string of characters, as mentioned above. Share your code and your results.
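A minimal sketch for Part 2 using plain Python follows. The patterns are deliberately simple ones chosen for this example and are not Deldycke's expression; his regex could be substituted for `TAG_RE`. The function name `strip_tags` and the `sample_html` string are illustrative assumptions, and the sample stands in for privacy.html when the file is absent.

```python
import html
import re
from pathlib import Path

# Simple patterns chosen for this sketch (not Deldycke's expression).
BLOCK_RE = re.compile(r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(raw: str) -> str:
    """Remove HTML markup from a string using regular expressions only."""
    text = BLOCK_RE.sub(" ", raw)       # script/style blocks, contents included
    text = COMMENT_RE.sub(" ", text)    # HTML comments
    text = TAG_RE.sub(" ", text)        # every remaining tag
    text = html.unescape(text)          # entities such as &amp; become characters
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace


# Hypothetical stand-in document, used only if privacy.html is missing.
sample_html = (
    "<html><head><style>p {color: red;}</style></head><body><!-- note -->"
    "<h1>Privacy Policy</h1><p>We respect your <b>privacy</b> &amp; data.</p></body></html>"
)

path = Path("privacy.html")
raw = path.read_text(encoding="utf-8", errors="replace") if path.exists() else sample_html
print(strip_tags(raw))
```

For the Pandas variant, the same pattern can be applied to a column of strings with `Series.str.replace(r"<[^>]+>", " ", regex=True)`. Regex stripping is fragile on malformed markup, for example a `>` inside an attribute value, which is one reason a real parser is usually preferred.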
Part 3: Can you find other Python packages that you think might be more useful or easier to use than the above?
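For Part 3, third-party candidates include html2text and lxml. One option that needs no installation at all is the standard library's `html.parser` module, sketched below as an alternative. The class name `TextExtractor`, the helper `html_to_text_stdlib`, and the `sample_html` string are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return "\n".join(self._chunks)


def html_to_text_stdlib(raw: str) -> str:
    parser = TextExtractor()
    parser.feed(raw)
    parser.close()
    return parser.text()


# Hypothetical stand-in for the contents of privacy.html.
sample_html = (
    "<html><head><style>p {color: red;}</style></head><body>"
    "<h1>Privacy Policy</h1><p>We respect your <b>privacy</b> &amp; data.</p></body></html>"
)
print(html_to_text_stdlib(sample_html))
```

Whether this is easier than BeautifulSoup is a judgment call: it avoids a dependency, but it takes more code to do the same job.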
