Question:
Python web crawler:
import urllib2
from bs4 import BeautifulSoup
import time
def crawl(seeds):
    frontier = [seeds]
    visited_urls = set()
    for crawl_url in frontier:
        print "Crawling:", crawl_url
        visited_urls.add(crawl_url)
        try:
            resp = urllib2.urlopen(crawl_url)
        except:
            print "Could not access ", crawl_url
            continue
        content_type = resp.info().get('Content-Type')
        if not content_type.startswith('text/html'):
            print "Skipping %s content" % content_type
            continue
        contents = resp.read()
        soup = BeautifulSoup(contents)
        discovered_urls = set()
        links = soup('a')  # Get all anchor tags
        for link in links:
            if ('href' in dict(link.attrs)):
                url = urllib2.urlparse.urljoin(crawl_url, link['href'])
                if (url[0:4] == 'http' and url not in visited_urls
                        and url not in discovered_urls and url not in frontier):
                    discovered_urls.add(url)
        frontier += discovered_urls
        time.sleep(2)
Assignment:
Add an optional parameter limit, with a default of 10, to the crawl() function; it is the maximum number of web pages to download. Save the downloaded files to the pages directory, using the MD5 hash of the page's URL as the filename, e.g.:

import hashlib
filename = 'pages/' + hashlib.md5(url.encode()).hexdigest() + '.html'
Only crawl URLs that are in the landmark.edu domain (*.landmark.edu).
Use a regular expression when examining discovered links (a sketch combining these requirements follows below), e.g.:

import re
p = re.compile('ab*')
if p.match('abc'):
    print("yes")
The solution should be coded in Python.
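The hints above can be combined into the pieces the assignment asks for. Below is a minimal sketch, written in the same Python 2 style as the crawler in the question, of a domain filter built from a regular expression and a helper that saves a page under the MD5 hash of its URL; the helper names (is_landmark_url, save_page) and the exact pattern are illustrative assumptions, not part of the assignment.

import hashlib
import os
import re

# Matches http(s) URLs whose host is landmark.edu or any subdomain of it.
# The exact pattern is an assumption; adjust it if the assignment requires more.
LANDMARK_RE = re.compile(r'^https?://([\w-]+\.)*landmark\.edu(/|$)', re.IGNORECASE)

def is_landmark_url(url):
    # Illustrative helper: keep only links inside *.landmark.edu
    return LANDMARK_RE.match(url) is not None

def save_page(url, contents):
    # Illustrative helper: name the file after the MD5 hash of the URL,
    # as the hint shows, and write the raw HTML into the pages directory.
    if not os.path.isdir('pages'):
        os.makedirs('pages')
    filename = 'pages/' + hashlib.md5(url.encode()).hexdigest() + '.html'
    with open(filename, 'wb') as f:
        f.write(contents)

With crawl(seeds, limit=10), the loop would then keep a counter of saved pages and break once it reaches limit, call save_page(crawl_url, contents) after reading the response, and test is_landmark_url(url) before adding a discovered link to discovered_urls.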
Primary data structures:
Frontier
  - Links that have not yet been visited
  - Implement as a list to simulate a queue
Visited
  - Links that have been visited
  - Implement as a set to quickly check for inclusion
Discovered
  - Links that have been discovered
  - Implement as a set to quickly check for inclusion
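One detail worth noting about the frontier-as-list choice: a Python for loop over a list also reaches elements that are appended while the loop is running, which is how crawl() above gets queue-like behaviour from a plain list. A tiny standalone demonstration (Python 2 print syntax, matching the code in the question):

frontier = ['a']
visited = set()
for item in frontier:
    visited.add(item)
    if item == 'a':
        frontier += ['b', 'c']  # newly discovered items are appended to the end...
    print item                  # ...and the loop reaches them later: prints a, b, c
print 'b' in visited            # set membership checks are fast: prints True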