Question: Python 3: Develop a crawler that collects the email addresses in the visited web pages. You can use function emails() from Problem 11.22 to find email addresses in a web page. To get your program to terminate, you may use the approach from Problem 11.15 or Problem 11.17.
(((For context, Problem 11.22 is as follows: Write function emails() that takes a document (as a string) as input and returns the set of email addresses (i.e., strings) appearing in it. You should use a regular expression to find the email addresses in the document. A sketch of such a function appears after this context block.

>>> from urllib.request import urlopen
>>> url = 'http://www.cdm.depaul.edu'
>>> content = urlopen(url).read().decode()
>>> emails(content)
{'advising@cdm.depaul.edu', 'wwwfeedback@cdm.depaul.edu', 'admission@cdm.depaul.edu', 'webmaster@cdm.depaul.edu'}
Problem 11.15 is as follows: Modify the crawler function crawl1() so that the crawler does not visit web pages that are more than n clicks (hyperlinks) away. To do this, the function should take an additional input, a nonnegative integer n. If n is 0, then no recursive calls should be made. Otherwise, the recursive calls should pass n - 1 as the argument to the crawl1() function.

Problem 11.17 is as follows: Modify the crawler function crawl2() so that the crawler only follows links hosted on the same host as the starting web page.
)))
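One way to write emails() is with a single regular expression. A minimal sketch, assuming a simplified address pattern is acceptable (this is not the textbook's own implementation):

import re

def emails(content):
    'return the set of email addresses appearing in string content'
    # simplified pattern: word characters, dots, plus signs, and hyphens,
    # an @, then dot-separated labels; not a full RFC 5322 matcher
    pattern = r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+'
    return set(re.findall(pattern, content))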
The crawl1() listing supplied with the question:

def crawl1(url):
    'recursive web crawler that calls analyze() on every web page'
    # analyze() returns a list of hyperlink URLs in web page url
    links = analyze(url)

    # recursively continue crawl from every link in links
    for link in links:
        try:            # try block because link may not be a valid HTML file
            crawl1(link)
        except:         # if an exception is thrown,
            pass        # ignore and move on
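crawl1() relies on a helper analyze() that is not reproduced above. A minimal sketch of such a helper, assuming it only needs to return the absolute URLs of a page's anchor tags (the class name LinkCollector is illustrative, not from the text):

from urllib.request import urlopen
from urllib.parse import urljoin
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    'HTML parser that collects the href values of anchor (<a>) tags'
    def __init__(self, base):
        super().__init__()
        self.base = base      # URL of the page, used to resolve relative links
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for attr, value in attrs:
                if attr == 'href' and value is not None:
                    self.links.append(urljoin(self.base, value))

def analyze(url):
    'return a list of hyperlink URLs appearing in the web page at url'
    content = urlopen(url).read().decode()
    collector = LinkCollector(url)
    collector.feed(content)
    return collector.links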
Step by Step Solution
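A sketch of one possible solution, combining emails() with the depth-limited termination of Problem 11.15. The names crawl_emails, visited, and found are illustrative, and emails() and analyze() are assumed to be defined as sketched above (or as in the textbook):

from urllib.request import urlopen

visited = set()     # URLs already crawled, so no page is analyzed twice
found = set()       # every email address collected so far

def crawl_emails(url, n):
    'crawl up to n clicks away from url, collecting email addresses'
    visited.add(url)

    # collect the addresses appearing on this page (Problem 11.22)
    content = urlopen(url).read().decode()
    found.update(emails(content))

    if n == 0:          # Problem 11.15: no further clicks allowed
        return

    # recursively continue the crawl from every link on this page
    for link in analyze(url):
        if link not in visited:
            try:                          # link may not be a valid HTML page
                crawl_emails(link, n - 1)
            except Exception:
                pass                      # ignore broken links and move on

A call such as crawl_emails('http://www.cdm.depaul.edu', 2) followed by printing found would report every address collected within two clicks of the starting page. If the approach of Problem 11.17 is preferred instead, the depth parameter can be dropped and the recursive call guarded with a same-host test such as urlparse(link).netloc == urlparse(url).netloc, with urlparse imported from urllib.parse.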
