Question:

Python makes it easy to access websites programmatically, which can be very useful for applications such as web crawling or interacting with websites from a Python program. See chapter 12 of the Severance text.

Your assignment is to write a Python script that is able to read an arbitrary URL provided by the user and count the number of unique web links on that page. For this purpose, you may assume that a web link can be identified with a regular expression that recognizes http and https references, e.g.:

http://my.favorite.url

HTTP://MY.FAVORITE_URL.ORG

https://123.456.788.111/foo.txt

HTTPS://foo.bar

You should NOT include local file references in your distinct count, e.g.

ews/guitar-great-carlos-alomar-technology-offers-new-opportunities-todays-musicians

/events/innovation-expo-2016

/events/alumni-weekend-2016

URLs occur in a variety of contexts in some sites including:

For this assignment, you may assume that a URL begins with a double quote followed by "http" and ends with a double quote. This won't catch unusual cases like

<meta name="generator" content="Drupal 7 (http://drupal.org)" />

but don't worry about that case. (Don't forget that HTTP and http are equivalent).
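The quoting rule above can be sketched with the re module. This is only one possible pattern, not the expected answer: the function name find_quoted_links and the sample HTML are made up for illustration, and the pattern assumes every link of interest sits directly inside double quotes.

```python
import re

# A double quote, then http:// or https:// in any letter case,
# then everything up to (but not including) the next double quote.
LINK_PATTERN = re.compile(r'"(https?://[^"]+)"', re.IGNORECASE)


def find_quoted_links(html):
    """Return every double-quoted http/https reference, in page order."""
    return LINK_PATTERN.findall(html)


# Hypothetical sample page used only to exercise the pattern.
sample = '''<a href="http://my.favorite.url">a</a>
<a href="HTTPS://foo.bar">b</a>
<a href="/events/innovation-expo-2016">c</a>
<meta name="generator" content="Drupal 7 (http://drupal.org)" />'''

print(find_quoted_links(sample))
```

The local reference and the Drupal generator tag are both skipped, because in neither case does "http" come immediately after a double quote.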

You may also assume that the arguments to the URL should be included in the URL when counting distinct values. E.g.

"http://www.google.com/maps?login=jrr" and "http://www.google.com/maps?login=rcohen2" should be counted as two distinct URLs. You do NOT need to strip the arguments from the end of the URL when detecting duplicates.
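One way to count distinct values while keeping the arguments is to put the links in a set. The sketch below also lowercases the scheme and host so that HTTP and http compare equal. That normalization is an assumption about how far "equivalent" should go, and the helper names normalize and count_distinct are hypothetical.

```python
from urllib.parse import urlsplit


def normalize(link):
    """Lowercase only the scheme and host; keep the path and arguments as-is."""
    try:
        parts = urlsplit(link)
    except ValueError:
        return link  # leave strings that cannot be parsed unchanged
    return parts._replace(scheme=parts.scheme.lower(),
                          netloc=parts.netloc.lower()).geturl()


def count_distinct(links):
    """Count unique links after normalization; arguments stay significant."""
    return len({normalize(link) for link in links})


links = [
    "http://www.google.com/maps?login=jrr",
    "http://www.google.com/maps?login=rcohen2",
    "HTTP://WWW.GOOGLE.COM/maps?login=jrr",  # same as the first after normalization
]
print(count_distinct(links))
```

A simpler variant would lowercase the whole string, at the cost of merging URLs whose paths or arguments differ only in case.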

The purpose of this assignment is for you to understand how to interact with web sites from Python and to get more experience with regular expressions. I don't expect you to write the ultimate regular expression for detecting valid URLs.

How many distinct http/https references do you find on www.google.com? How about www.mtv.com? (Beware, websites change frequently, so don't be surprised if your answer changes over time.)

HINTS:

There are a number of Python packages for parsing web pages, but for this assignment, please use the urllib and re (regular expression) packages to implement your solution.

The request module in urllib is of great value here.

You may need to specify 'http://www.google.com' rather than 'www.google.com' in your urllib.request.urlopen() call.

Some websites require Secure Sockets Layer (SSL) certificates. If you encounter one of these (such as www.stevens.edu), try a different website.
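The hints above might translate into something like the following sketch. The names with_scheme and fetch_html are hypothetical, as is the opener parameter, which exists only so the function can be tried without a network connection. The sketch assumes the page decodes as UTF-8 and reports failures such as certificate errors instead of crashing, so the user can try another site.

```python
import re
import urllib.error
import urllib.request


def with_scheme(address):
    """Prefix 'http://' when the user typed a bare host such as 'www.google.com'."""
    if re.match(r"https?://", address, re.IGNORECASE):
        return address
    return "http://" + address


def fetch_html(address, opener=urllib.request.urlopen):
    """Return the page text, or None if the site could not be read."""
    url = with_scheme(address)
    try:
        with opener(url, timeout=10) as response:
            # Assumption: UTF-8 with replacement is good enough for link counting.
            return response.read().decode("utf-8", errors="replace")
    except urllib.error.URLError as err:
        # Certificate problems and unreachable hosts both surface as URLError.
        print(f"Could not read {url}: {err.reason}")
        return None
```

A call such as fetch_html(input("URL: ")) would then hand back text that can be searched with a regular expression, or None when the site should be skipped.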
