Question: Python. In this question you are tasked with writing a modified version of the web crawler that we covered in class. Your code should visit

Python. In this question you are tasked with writing a modified version of the web crawler that we covered in class.

Your code should visit child URLs only if the parent webpage contains predefined key words. For this purpose, define the score of an URL as the number of occurrences of key words in the HTML body. A higher score implies more relevance to the key words. More specifically, make the following modifications:

1. Create dictionaries named urls and seen as well as a list or dictionary for opened

2. Populate urls with the single given url with an initial score of 1.

3. While the length of urls is greater than 0 and the length of opened is less than

maxNumUrl:

a) Remove the highest scoring url from urls and assign it to curr_url

b) Add this curr_url to opened if it can be opened

c) If this url cannot be opened, simply continue, otherwise assign the urls text body to htmltext

d) Count how many occurances of the words in our list of keywords are in htmltext

e) Add this url with its score to seen

f) If the score is greater than zero, find all child href tags that are not in seen and add them along with the current (parent) score to urls

4. Once complete, print the number of URLs seen and opened.

5. Print the top 25 scoring URLs seen.

Your code should have the following settings:

url="https://www.hbs.edu/Pages/default.aspx" # initial URL

maxNumUrl=50 # maximum number of URLs to visit

keywords=['finance', 'engineering', 'business', 'research']

Step by Step Solution

There are 3 Steps involved in it

1 Expert Approved Answer

Step: 1 Unlock blur-text-image

Question Has Been Solved by an Expert!

Get step-by-step solutions from verified subject matter experts

Step: 2 Unlock

Step: 3 Unlock

Students Have Also Explored These Related Databases Questions!

Instructions for submission One of the topics covered in Analysis of Algorithms are algorithms for traversing graphs. The structure of the world-wide-web is an example of a directed graph with each...

Python and most Python libraries are free to download or use, though many users use Python through a paid service. Paid services help IT organizations manage the risks associated with the use of...

Read: Chapter Nineof Introduction to Documentary , Third Edition by Bill Nichols. 9 HOW CAN WE WRITE EFFECTIVEL ABOUT DOCUMENTARY? This chapter assumes some general familiarity with cway writing for...

Programming Assignment #1: Six Days of Christmas (15 points) Program Description: (Comment on each line to indicate what the code is doing ) This program tests your understanding of using static...

A creative engineer suggests structuring the TLB so that not all the bits of the presented address need match to result in a hit. Suggest how this might be achieved, and what might be the costs and...

Hi, I'm having an issue with this Simnet project and I was wondering if anyone could help me. There is only one issue: Step 3a, in the instruction pdf that I have attached. When my project is graded...

\fThis is an electronic version of the print textbook. Due to electronic rights restrictions, some third party content may be suppressed. Editorial review has deemed that any suppressed content does...

PLG-120 Week 4 Lecture Notes To begin your analysis of the legal issue you have been asked to address, you must identify the governing rule of law. Articulating the legal rule is, of course, the...

Please read chapter 5 and answer the questions and see the ( guide to answer number 3) For each case study, you will view the material as the student's teacher, read the information provided and...

Googles ease of use and superior search results have propelled the search engine to its num- ber one status, ousting the early dominance of competitors such as WebCrawler and Infos- eek. Even later...

A researcher uses a repeated-measures ANOVA to evaluate the results from a research study and reports an F-ratio with df = 2, 30. a. How many treatment conditions were compared in the study? b. How...

The income statement is like a moving picture; in contrast, a balance sheet is like a snapshot. Explain.

a callable bond may be structures to pay bondholders the current value of the bond on the date of call, is frequently called at price that is less than the par value

Haynes, Inc., obtained 100 percent of Turner Company's common stock on January 1, 2017, by issuing 8,300 shares of $10 par value common stock. Haynes's shares had a $15 per share fair value. On that...

KEY QUESTION To the right are hypothetical production possibilities tables for New Zealand and Spain. Each country can produce apples and plums. Plot the production possibilities data for each of the...

Other things equal, who is more likely to be a union member: Stephen, who works as a salesperson at a furniture store, or Susan, who works as a machinist for an aircraft manufacturer? Explain.

Explain how discrimination reduces domestic output and income. Demonstrate that loss using production possibilities analysis.