Question:

Python. In this question you are tasked with writing a modified version of the web crawler that we covered in class.

Your code should visit child URLs only if the parent webpage contains predefined keywords. For this purpose, define the score of a URL as the number of occurrences of the keywords in its HTML body. A higher score implies more relevance to the keywords. More specifically, make the following modifications:

1. Create dictionaries named urls and seen, as well as a list or dictionary named opened.

2. Populate urls with the single given url with an initial score of 1.

3. While the length of urls is greater than 0 and the length of opened is less than maxNumUrl:

a) Remove the highest scoring url from urls and assign it to curr_url

b) Add this curr_url to opened if it can be opened

c) If this url cannot be opened, simply continue; otherwise assign the URL's text body to htmltext

d) Count how many occurrences of the words in our list of keywords are in htmltext

e) Add this url with its score to seen

f) If the score is greater than zero, find all child href tags that are not in seen and add them along with the current (parent) score to urls

4. Once complete, print the number of URLs seen and opened.

5. Print the top 25 scoring URLs seen.
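
The steps above can be sketched roughly as follows. This is one possible reading of the task, not the class solution: the helper names (fetch, keyword_score, crawl, top_scoring, report) are made up for illustration, links are pulled out with a simple regular expression rather than whatever parser the class crawler used, and the score is assumed to be a case-insensitive substring count. The fetch_page parameter is an added convenience so the loop can be tried without network access.

```python
import re
from urllib.parse import urljoin
from urllib.request import urlopen


def fetch(url):
    """Return the page body as text, or None if the URL cannot be opened."""
    try:
        with urlopen(url, timeout=10) as response:
            return response.read().decode("utf-8", errors="ignore")
    except Exception:  # any failure is treated as "cannot be opened"
        return None


def keyword_score(htmltext, keywords):
    """Assumed scoring rule: case-insensitive substring counts, summed."""
    text = htmltext.lower()
    return sum(text.count(word.lower()) for word in keywords)


def crawl(start_url, keywords, max_num_url, fetch_page=fetch):
    urls = {start_url: 1}  # steps 1-2: frontier, url -> score passed down by parent
    seen = {}              # url -> its own keyword score
    opened = []            # URLs that were opened successfully

    while len(urls) > 0 and len(opened) < max_num_url:  # step 3
        # a) remove the highest-scoring URL from the frontier
        curr_url = max(urls, key=urls.get)
        del urls[curr_url]

        # b), c) skip the URL if it cannot be opened
        htmltext = fetch_page(curr_url)
        if htmltext is None:
            continue
        opened.append(curr_url)

        # d), e) score the page and record it
        score = keyword_score(htmltext, keywords)
        seen[curr_url] = score

        # f) only relevant pages contribute children, at the parent's score
        if score > 0:
            for href in re.findall(r'href=["\'](.*?)["\']', htmltext, re.I):
                child = urljoin(curr_url, href)  # resolve relative links
                # assumption: keep http(s) links only (drops mailto:, javascript:)
                if child.startswith("http") and child not in seen:
                    urls[child] = score
    return seen, opened


def top_scoring(seen, n=25):
    """Return the n highest-scoring (url, score) pairs, best first."""
    return sorted(seen.items(), key=lambda item: item[1], reverse=True)[:n]


def report(seen, opened):
    # steps 4-5: counts first, then the top 25 scoring URLs
    print("URLs seen:", len(seen), "URLs opened:", len(opened))
    for link, score in top_scoring(seen):
        print(score, link)
```

With the settings given in the question, the call would be seen, opened = crawl(url, keywords, maxNumUrl) followed by report(seen, opened). Read literally, steps b) and e) place the same URLs in opened and seen, so the two counts coincide in this sketch; a URL that fails to open is never recorded and may be retried if another page links to it.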

Your code should have the following settings:

url="https://www.hbs.edu/Pages/default.aspx" # initial URL

maxNumUrl=50 # maximum number of URLs to visit

keywords=['finance', 'engineering', 'business', 'research']
