Text Mining

Python Code Question is in attached screenshot. Please help to see.

QUESTION

In this assignment, we practice document similarity. First, we are going to scrape philosophers' biographies from their Wikipedia pages and construct a corpus of documents. Then, we'll match every philosopher to the most similar one based on their Wikipedia biographies.

We will use Wikipedia's List of ancient Greek philosophers page (you can find it here). First, we'll scrape the names of the authors and get their file paths (Question 1). Then, we will create a function to get the content of an article given only its file path (Question 2). Finally, we'll build the LSI model to find the most similar other philosopher for each of the authors in our database (Question 3).

Question 1

Write a function that takes the file name of the Wikipedia page containing the list of all ancient Greek philosophers and returns a list of tuples containing the name of each philosopher and the path to their individual article file.

To avoid scraping directly from Wikipedia, we downloaded a page (Index.html) and put it in your starting code. This file contains the HTML code of the page listing ancient Greek philosophers (which you can find here).

Your function should be get_philosophers(filename), where 'filename' is the file containing the HTML of a page with a list of philosophers, 'Index.html' in our case. The function should return a list of tuples containing the name of the philosopher and its associated file path. Note that all philosophers' articles are in the 'Philosophers' folder, hence the path should consist of 'Philosophers', '/', 'author_name' and '.html'.

The output should look like this:

[('Acrion', 'Philosophers/Acrion.html'), ('Adrastus of Aphrodisias', 'Philosophers/Adrastus of Aphrodisias.html'), ('Aedesia', 'Philosophers/Aedesia.html'), ('Aedesius', 'Philosophers/Aedesius.html'), ('Aeneas of Gaza', 'Philosophers/Aeneas of Gaza.html'), ('Aenesidemus', 'Philosophers/Aenesidemus.html'), ...]

Some additional advice:

One of the problems with Wikipedia is that page structure is not always consistent, and some cells in the table might be empty. There are many tags to look for in the actual table, but probably looking for the first occurrence of 'tr' is the safest option.
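The advice above can be sketched as follows. This is a minimal illustration, not the expected solution: it assumes that each philosopher sits in the first `<td>` of a `<tr>` and that the name is the `title` attribute of the first link in that cell, which may not match the real Index.html. The starter code uses BeautifulSoup; this sketch swaps in the standard library's `html.parser` so that it runs without extra packages. The names `FirstLinkPerRow`, `philosophers_from_html` and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class FirstLinkPerRow(HTMLParser):
    """Collect the title of the first titled <a> in the first <td> of each <tr>."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._cell = 0          # index of the current <td> within the row (0 = none yet)
        self._row_done = False  # True once the current row has yielded a name

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cell, self._row_done = 0, False
        elif tag == "td":
            self._cell += 1
        elif tag == "a" and self._cell == 1 and not self._row_done:
            title = dict(attrs).get("title")
            if title:  # rows with an empty first cell never reach this point
                self.names.append(title)
                self._row_done = True

    def handle_endtag(self, tag):
        if tag == "tr":
            self._cell = 0


def philosophers_from_html(html, folder="Philosophers"):
    """Return (name, path) tuples, building each path as folder + '/' + name + '.html'."""
    parser = FirstLinkPerRow()
    parser.feed(html)
    return [(name, f"{folder}/{name}.html") for name in parser.names]


# Made-up sample table: a header row, a normal row, an empty row, a row with one link.
sample = """
<table>
  <tr><th>Name</th><th>School</th></tr>
  <tr><td><a href="/wiki/Acrion" title="Acrion">Acrion</a></td>
      <td><a href="/wiki/Pythagoreanism" title="Pythagoreanism">Pythagorean</a></td></tr>
  <tr><td></td><td></td></tr>
  <tr><td><a href="/wiki/Aedesia" title="Aedesia">Aedesia</a></td><td></td></tr>
</table>
"""
pairs = philosophers_from_html(sample)
print(pairs)
```

Restricting the search to the first cell means a row whose name cell is empty is skipped rather than picking up a link from a later column. With BeautifulSoup the same idea would be expressed by iterating over the `tr` elements of the parsed page.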

Question 2

Write a function that scrapes the text on a philosopher's page and returns it as a text string.

As before, we saved a page in a file so that you don't have to scrape Wikipedia; see Philosophers/Acrion.html. Your function should be called get_text(file). It takes the file path of the saved HTML file as input and returns a string. Scrape all the text that is inside a <p> tag. Assuming that page_soup contains the entire bs4 object for the page, the following code should get you all the relevant text on the page:

    all_text = ""
    for tag in page_soup.find_all('p'):
        all_text += tag.get_text()
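To make the behaviour of that loop concrete, here is a small standard-library sketch that does the same kind of extraction on an HTML string: it keeps only text found inside `<p>` elements and concatenates it without separators. It is an illustration under assumptions, not the assignment's solution: it assumes every paragraph is explicitly closed with `</p>`, and the names `ParagraphText`, `paragraph_text` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Accumulate text that appears between <p> and </p>."""

    def __init__(self):
        super().__init__()
        self._in_p = False  # True while inside a <p> element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        # Text nested in inline tags such as <b> still counts as paragraph text.
        if self._in_p:
            self.chunks.append(data)


def paragraph_text(html):
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


# Made-up page: the heading and the <div> are ignored, the two paragraphs are kept.
sample = (
    "<html><body><h1>Acrion</h1>"
    "<p><b>Acrion</b> was a Locrian.</p>"
    "<div>menu</div>"
    "<p>He is mentioned by Cicero.</p>"
    "</body></html>"
)
text = paragraph_text(sample)
print(text)
```

For a saved file, the HTML string would first be read from disk with UTF-8 decoding, as the starter code for Question 1 does.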

We downloaded all of the philosophers' pages from the list, and extracted and saved their text. They are in the 'Philosophers' folder.

Question 3

Use the files under the "Philosophers" folder to construct an LSI model. Then, use the LSI model to find the most similar philosopher for each of the philosophers found in Question 1, based on the content of their Wikipedia articles. You MUST NOT go online to scrape the data; everything you need is in your Jupyter notebook working directory.

The function should have as input the list of tuples created in Question 1.

The output format should be a list of tuples too. Each tuple should contain a philosopher's name and its most similar other philosopher. Please note both names can't be the same.

The output should look like this:

[('Acrion', 'Athenodoros Cananites'), ('Adrastus of Aphrodisias', 'Andronicus of Rhodes'), ('Aedesia', 'Ammonius of Athens'), ('Aedesius', 'Arete of Cyrene'), ('Aeneas of Gaza', 'Ammonius Hermiae'), ...]

The name of the author must be the 'title' tag found in 'Index.html'. Be careful: if it is not, the grader will not detect your answer and you'll get no points.
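The final matching step, picking the most similar other document for each document, can be sketched independently of how the LSI vectors are produced. The example below is a minimal illustration that assumes each philosopher has already been mapped to a dense vector in the LSI topic space (building the model itself, for example with a library such as gensim, is not shown). The function names and the toy vectors are hypothetical and do not come from the assignment data.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar_other(names, vectors):
    """Pair each name with the name of the closest vector, never with itself."""
    pairs = []
    for i, name in enumerate(names):
        best_j = max(
            (j for j in range(len(names)) if j != i),  # exclude the document itself
            key=lambda j: cosine(vectors[i], vectors[j]),
        )
        pairs.append((name, names[best_j]))
    return pairs


# Toy two-dimensional "topic" vectors for three placeholder documents.
names = ["A", "B", "C"]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
result = most_similar_other(names, vectors)
print(result)  # [('A', 'B'), ('B', 'A'), ('C', 'B')]
```

Excluding index `i` before taking the maximum is what guarantees that the two names in a tuple differ; without it every document would match itself with similarity 1. The sketch needs at least two documents, since `max` raises `ValueError` on an empty sequence.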

NOTE_1: For processing speed purposes, the table in "Index.html" has been shortened compared to the one online on wikipedia.org. Do not worry if you do not find some philosophers in your results; this is intentional.

There are some instructions you need to follow:

You only need to write code in the comment area "Your Code Here".

Do not upload your own file. Please make the necessary changes in the Jupyter notebook file already present in the server.

Please note, there are several cells in the Assignment Jupyter notebook that are empty and read only. Do not attempt to remove them or edit them. They are used in grading your notebook. Doing so might lead to 0 points.

PYTHON FILE - See in google drive link:

"""
Question 1

Write a function that takes the file name of the Wikipedia page containing all ancient Greek
philosophers (saved as "Index.html" in your workspace) and returns a list of tuples containing
the name of the philosopher and the path to its individual article file.

Example of use: get_philosophers("Index.html")

The output should be a list of tuples:
[('Acrion', 'Philosophers/Acrion.html'),
 ('Adrastus of Aphrodisias', 'Philosophers/Adrastus of Aphrodisias.html'),
 ('Aedesia', 'Philosophers/Aedesia.html'),
 ('Aedesius', 'Philosophers/Aedesius.html'),
 ('Aeneas of Gaza', 'Philosophers/Aeneas of Gaza.html'),
 ('Aenesidemus', 'Philosophers/Aenesidemus.html'),
 ...]

NOTE: For processing speed purposes, the table in "Index.html" has been shortened compared
to the one online on wikipedia.org. Do not worry if you do not find some philosophers in
your results; this is intentional.
"""

def get_philosophers(filename):
    import codecs
    from bs4 import BeautifulSoup
    f = codecs.open(filename, 'r', 'utf-8')
    soup = BeautifulSoup(f.read(),'lxml')
    ###
    ### YOUR CODE HERE
    ###

# Once done, try this:
filenames = get_philosophers("Index.html")
filenames

"""
Question 2

Write a function that scrapes the text on a philosopher's page and returns it as a text
string. The input is the name of the file that contains the philosopher's page.

Example of use: get_text('Philosophers/Acrion.html')
should output the text of the page.

'Acrion was a Locrian and a Pythagorean philosopher...'
"""

def get_text(file):
    ###
    ### YOUR CODE HERE
    ###

# Once done, try this:
get_text("Philosophers/Acrion.html")

"""
Question 3

Use the files under "Philosophers" folder to construct an LSI model.
Then, use the LSI model to find the most similar philosopher for each of the philosophers
found in Question 1, based on the content of their Wikipedia articles. You should not go
online to scrape the data; everything you need is in your Jupyter notebook working directory.

The function should have as input the list of tuples created in Question 1.
The output format should be a list of tuples too. Each tuple should contain a philosopher's name
and its most similar other philosopher. Please note both names can't be the same.

The output should look like that:
[('Acrion', 'Athenodoros Cananites'),
 ('Adrastus of Aphrodisias', 'Andronicus of Rhodes'),
 ('Aedesia', 'Ammonius of Athens'),
 ('Aedesius', 'Arete of Cyrene'),
 ('Aeneas of Gaza', 'Ammonius Hermiae'),
 ...]
"""

def run(filenames):
    ###
    ### YOUR CODE HERE
    ###

# Once done, try this:
run(filenames)
