Question: Python. In this question you are tasked with writing a modified version of the web crawler that we covered in class. Your code should visit
Python. In this question you are tasked with writing a modified version of the web crawler that we covered in class.
Your code should visit child URLs only if the parent webpage contains predefined key words. For this purpose, define the score of an URL as the number of occurrences of key words in the HTML body. A higher score implies more relevance to the key words. More specifically, make the following modifications:
1. Create dictionaries named urls and seen as well as a list or dictionary for opened
2. Populate urls with the single given url with an initial score of 1.
3. While the length of urls is greater than 0 and the length of opened is less than
maxNumUrl:
a) Remove the highest scoring url from urls and assign it to curr_url
b) Add this curr_url to opened if it can be opened
c) If this url cannot be opened, simply continue, otherwise assign the urls text body to htmltext
d) Count how many occurances of the words in our list of keywords are in htmltext
e) Add this url with its score to seen
f) If the score is greater than zero, find all child href tags that are not in seen and add them along with the current (parent) score to urls
4. Once complete, print the number of URLs seen and opened.
5. Print the top 25 scoring URLs seen.
Your code should have the following settings:
url="https://www.hbs.edu/Pages/default.aspx" # initial URL
maxNumUrl=50 # maximum number of URLs to visit
keywords=['finance', 'engineering', 'business', 'research']
Step by Step Solution
There are 3 Steps involved in it
Get step-by-step solutions from verified subject matter experts
