Question:
Python web crawler:
import urllib2
from bs4 import BeautifulSoup
import time
def crawl(seeds):
    frontier = [seeds]
    visited_urls = set()
    for crawl_url in frontier:
        print "Crawling:", crawl_url
        visited_urls.add(crawl_url)
        try:
            resp = urllib2.urlopen(crawl_url)
        except:
            print "Could not access ", crawl_url
            continue
        content_type = resp.info().get('Content-Type')
        if not content_type.startswith('text/html'):
            print "Skipping %s content" % content_type
            continue
        contents = resp.read()
        soup = BeautifulSoup(contents)
        discovered_urls = set()
        links = soup('a')  # Get all anchor tags
        for link in links:
            if ('href' in dict(link.attrs)):
                url = urllib2.urlparse.urljoin(crawl_url, link['href'])
                if (url[0:4] == 'http' and url not in visited_urls
                        and url not in discovered_urls and url not in frontier):
                    discovered_urls.add(url)
        frontier += discovered_urls
        time.sleep(2)
Assignment:
Add an optional parameter limit, with a default of 10, to the crawl() function; it is the maximum number of web pages to download. Save the downloaded files to the pages directory, using the MD5 hash of the page's URL as the filename, e.g.:

import hashlib
filename = 'pages/' + hashlib.md5(url.encode()).hexdigest() + '.html'
Only crawl URLs that are in the landmark.edu domain (*.landmark.edu).
Use a regular expression when examining discovered links (a sketch combining these requirements follows below), e.g.:

import re
p = re.compile('ab*')
if p.match('abc'):
    print("yes")
The solution should be coded in Python.
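The hints above can be combined into the pieces the assignment asks for. Below is a minimal sketch, written in the same Python 2 style as the crawler in the question, of a domain filter built from a regular expression and a helper that saves a page under the MD5 hash of its URL; the helper names (is_landmark_url, save_page) and the exact pattern are illustrative assumptions, not part of the assignment.

import hashlib
import os
import re

# Matches http(s) URLs whose host is landmark.edu or any subdomain of it.
# The exact pattern is an assumption; adjust it if the assignment requires more.
LANDMARK_RE = re.compile(r'^https?://([\w-]+\.)*landmark\.edu(/|$)', re.IGNORECASE)

def is_landmark_url(url):
    # Illustrative helper: keep only links inside *.landmark.edu
    return LANDMARK_RE.match(url) is not None

def save_page(url, contents):
    # Illustrative helper: name the file after the MD5 hash of the URL,
    # as the hint shows, and write the raw HTML into the pages directory.
    if not os.path.isdir('pages'):
        os.makedirs('pages')
    filename = 'pages/' + hashlib.md5(url.encode()).hexdigest() + '.html'
    with open(filename, 'wb') as f:
        f.write(contents)

With crawl(seeds, limit=10), the loop would then keep a counter of saved pages and break once it reaches limit, call save_page(crawl_url, contents) after reading the response, and test is_landmark_url(url) before adding a discovered link to discovered_urls.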
Primary data structures:
Frontier
  - Links that have not yet been visited
  - Implement as a list to simulate a queue
Visited
  - Links that have been visited
  - Implement as a set to quickly check for inclusion
Discovered
  - Links that have been discovered
  - Implement as a set to quickly check for inclusion
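One detail worth noting about the frontier-as-list choice: a Python for loop over a list also reaches elements that are appended while the loop is running, which is how crawl() above gets queue-like behaviour from a plain list. A tiny standalone demonstration (Python 2 print syntax, matching the code in the question):

frontier = ['a']
visited = set()
for item in frontier:
    visited.add(item)
    if item == 'a':
        frontier += ['b', 'c']  # newly discovered items are appended to the end...
    print item                  # ...and the loop reaches them later: prints a, b, c
print 'b' in visited            # set membership checks are fast: prints True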