urllib module

urllib is the Python library that helps with URLs. You should read the module's documentation, since you never know when you'll find something useful, but what we need from it is straightforward.

import urllib.request # Imports should be done at the top of your program

request = urllib.request.Request("http://cnn.com") # First create a request object

response = urllib.request.urlopen(request) # Open the request to get a response object

page_data = response.read() # page_data holds the page as bytes

page_str = page_data.decode("utf-8") # Convert the bytes to a str using UTF-8

response.close() # Remember to close the response

You should play with this code and get comfortable using it. Come to think of it, since you'll be calling this with multiple URLs over and over, it would make a great function. HINT: it's a great thing to functionalize. Seriously, you should just write it now. What happens when you pass a bad URL to the request? If it raises an error, you probably want to use our error-handling powers to solve that issue.
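One possible way to wrap the steps above in a function is sketched below. The name fetch_page, the 10-second timeout, and the choice to return None on failure are assumptions made for illustration; they are not part of the assignment. A string with no scheme, such as invalid_url, makes urllib.request.Request raise ValueError, while network and HTTP failures raise urllib.error.URLError, which is a subclass of OSError.

```python
import urllib.request


def fetch_page(url):
    """Return the text of the page at url, or None if it cannot be fetched.

    fetch_page is a hypothetical helper name, not one given by the assignment.
    """
    try:
        request = urllib.request.Request(url)  # ValueError if url has no scheme
        response = urllib.request.urlopen(request, timeout=10)
    except (ValueError, OSError):
        # OSError also covers urllib.error.URLError and HTTPError.
        return None
    try:
        page_data = response.read()
        # errors="replace" is an assumption: it avoids crashing on pages
        # that are not valid UTF-8.
        return page_data.decode("utf-8", errors="replace")
    finally:
        response.close()  # always close the response, even if read() fails
```

A caller can then test the result, for example printing "invalid_url does not seem to be a valid url" when fetch_page("invalid_url") returns None.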

How can we tell what an email is?

We're going to be looking for portions of text that start with mailto: (including the colon). The email address follows that. How do we know where the email address stops? It stops when you reach any character that is not one of . @ & # ; or a digit 0-9 or a letter a-z, upper or lower case. This isn't the most resilient way, but it will give you some good practice working with strings.

The strings you are going to get from websites will be extremely large, so making a function that you can pass smaller strings to and experiment with is crucial to debugging and finding errors in a timely and efficient manner.

Encoded Emails

Some email addresses are encoded. If you get an email address that is encoded, you'll want to parse it and create the real email address.

webmaster@umk&#099;.edu

This email address is HTML encoded. We've already seen that characters are stored as decimal numbers.

chr(119) # returns 'w'

chr(101) # returns 'e'

Decoding the entire string this way results in the email address webmaster@umkc.edu. Clearly, when you have an email address that looks like this, you are looking for sections that start with &#, contain a number of up to 3 digits, and end with a semicolon. Again, you may find it easier to write a function that does this one thing by itself and returns an unencoded string.
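The two string rules above (where an address stops, and how an encoded section is decoded) can be sketched with plain string operations and no imports. The names ALLOWED, decode_entities, and find_emails are hypothetical, and this is only one way to do it. Note that the simple stopping rule keeps a trailing period if one directly follows the address.

```python
MARKER = "mailto:"
# Characters the rule above allows inside an address, including the
# pieces of an encoded section such as &#099;
ALLOWED = ("abcdefghijklmnopqrstuvwxyz"
           "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
           "0123456789.@&#;")


def decode_entities(text):
    """Replace each &#NNN; section (1 to 3 digits) with its character."""
    result = ""
    i = 0
    while i < len(text):
        if text.startswith("&#", i):
            end = text.find(";", i + 2)
            digits = text[i + 2:end] if end != -1 else ""
            if 1 <= len(digits) <= 3 and all(ch in "0123456789" for ch in digits):
                result += chr(int(digits))
                i = end + 1  # skip past the semicolon
                continue
        result += text[i]  # ordinary character: copy it unchanged
        i += 1
    return result


def find_emails(text):
    """Return the decoded address after each mailto: in text, in order."""
    emails = []
    start = text.find(MARKER)
    while start != -1:
        pos = start + len(MARKER)
        end = pos
        while end < len(text) and text[end] in ALLOWED:
            end += 1  # advance until the first disallowed character
        if end > pos:
            emails.append(decode_entities(text[pos:end]))
        start = text.find(MARKER, end)
    return emails
```

For example, find_emails('<a href="mailto:webmaster@umk&#099;.edu">Mail</a>') returns ['webmaster@umkc.edu'], because the address stops at the closing quote and &#099; decodes to c.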

Our program's goals

We want to write a program that asks the user for a file that has URLs in it, one URL on each line. If the user gives us a file that doesn't exist or can't be opened, you must be able to handle those errors. Once you have a file, open each URL, get the contents, and find all the email addresses. Once you are done, eliminate the duplicates and ask the user for a file to write the email addresses out to.
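The file-handling parts of these goals might look like the following sketch. The function names read_urls, remove_duplicates, and write_emails are assumptions, and the two error messages are copied from the sample run in this document. Duplicates are removed with a list rather than a set so that the addresses keep the order in which they were found.

```python
def read_urls(filename):
    """Return the non-blank lines of filename, or None if it can't be read."""
    try:
        with open(filename) as url_file:
            return [line.strip() for line in url_file if line.strip()]
    except FileNotFoundError:
        print("Could not open the file " + filename + ". It doesn't exist.")
    except OSError:
        # Covers other failures, such as being handed a directory name.
        print("Could not open the file " + filename + ". There was an IOError")
    return None


def remove_duplicates(items):
    """Return items with repeats dropped, keeping first-seen order."""
    unique = []
    for item in items:
        if item not in unique:
            unique.append(item)
    return unique


def write_emails(filename, emails):
    """Write one email address per line to filename."""
    with open(filename, "w") as out_file:
        for email in emails:
            out_file.write(email + "\n")
```

FileNotFoundError is itself a subclass of OSError, so it has to be listed first for the more specific message to be used.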

Program Specifications

The requirements are below, but one additional requirement has to be observed. There are many tools that can make much of this easier to do. In fact, many of them make it trivially easy. This isn't a course about finding and using libraries and modules, so you'll be stuck using strings and your wits (besides urllib, of course). However, once you've solved it, you may be interested in looking at third-party modules like BeautifulSoup (terrible name), which helps in parsing and working with HTML and XML. Another built-in module that is quite useful is re, for regular expressions. Spending some time learning regular expressions at some point will pay off for you. Regular expressions are extremely powerful, flexible, and useful for validating data and finding matching strings. Another function that is useful for unescaping the encoded email addresses described above is html.unescape, from the built-in html module. Learning to do these things by hand will pay off later when you don't have a tool that can do it for you. These are the skills that will allow you to build your own solutions.

In summary you are not allowed to use

Beautiful Soup

re ( regular expressions )

cgi

Any imported module other than urllib

Sample Program

Python 3.4.3 (v3.4.3:9b73f1c3e601, Feb 24 2015, 22:43:06) [MSC v.1600 32 bit (Intel)] on win32

Type "copyright", "credits" or "license()" for more information.

>>> =============================== RESTART ================================

>>>

Welcome to email scraper!

Enter the filename containting URLs to read ==> invalid.txt

Could not open the file invalid.txt. It doesn't exist.

Enter the filename containting URLs to read ==> subdir

Could not open the file subdir. There was an IOError

Enter the filename containting URLs to read ==> emails.txt

Enter a file to save the emails to ==> output.txt

Do you want to run this application again? Y/YES/N/NO ==> e

You must enter only Y/YES/N or NO only.

Do you want to run this application again? Y/YES/N/NO ==> y

Welcome to email scraper!

Enter the filename containting URLs to read ==> emails2.txt

We did not find any emails in the provided urls to save

Do you want to run this application again? Y/YES/N/NO ==> y

Welcome to email scraper!

Enter the filename containting URLs to read ==> emails3.txt

sce.umkc.edu does not seem to be a valid url

invalid_url does not seem to be a valid url

We did not find any emails in the provided urls to save

Do you want to run this application again? Y/YES/N/NO ==> n
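The run-again prompt in the sample above accepts Y, YES, N, or NO in any letter case and repeats the question otherwise. A minimal sketch of that loop follows. The name ask_run_again and the ask parameter are assumptions; ask defaults to the built-in input and exists only so the function can be tried with canned answers.

```python
def ask_run_again(ask=input):
    """Return True for Y/YES and False for N/NO, asking again otherwise."""
    prompt = "Do you want to run this application again? Y/YES/N/NO ==> "
    while True:
        answer = ask(prompt).strip().upper()  # accept any letter case
        if answer in ("Y", "YES"):
            return True
        if answer in ("N", "NO"):
            return False
        print("You must enter only Y/YES/N or NO only.")
```

The main program could then be wrapped in a loop that keeps going while ask_run_again() returns True.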
