Question:
urllib module
urllib is the Python library that helps with URLs. You should view the documentation for the module, since you never know when you'll find something useful, but for what we need, it is straightforward.
import urllib.request  # Imports should be done at the top of your program
request = urllib.request.Request("http://cnn.com")  # First create a request object
response = urllib.request.urlopen(request)  # Open the request to get a response object
page_data = response.read()  # page_data holds the raw bytes of the page
page_str = page_data.decode("utf-8")  # Convert the bytes to a str, assuming UTF-8
response.close()  # Remember to close the response
You should play with this code and get comfortable using it. Come to think of it, since you'll be calling this with multiple URLs over and over, it might make a great function. HINT: It's a great thing to functionalize. Seriously, you should just write it now. What happens when you pass a bad URL to the request? If it raises an error, you probably want to use our error handling powers to solve that issue.
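One way the snippet above could be wrapped into a function is sketched below. The name fetch_page, the 10 second timeout, and the choice to return None on failure are illustrative assumptions, not part of the assignment. The error message is borrowed from the sample run.

```python
import urllib.error
import urllib.request


def fetch_page(url):
    """Return the page at url as a str, or None if it cannot be fetched."""
    try:
        request = urllib.request.Request(url)
        # The with block closes the response even if read() fails.
        # Assumption: a 10 second timeout is long enough for these pages.
        with urllib.request.urlopen(request, timeout=10) as response:
            page_data = response.read()
    except (ValueError, urllib.error.URLError):
        # ValueError: the string is not a usable URL (for example, no scheme).
        # URLError: the server could not be reached or returned an HTTP error.
        print(url, "does not seem to be a valid url")
        return None
    # Assumption: undecodable bytes are replaced instead of raising an error.
    return page_data.decode("utf-8", errors="replace")
```

Returning None lets the caller skip a bad URL and move on to the next line of the file. Other failures, such as a timeout in the middle of a read, are not handled in this sketch.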
How can we tell what an email is?
We're going to be looking for portions of text that start with mailto: (including the colon). The email address follows that. How do we know where the email address stops? It stops when you reach any character that is not a period, an @ sign, a digit 0-9, or a letter a-z in upper or lower case. This isn't the most resilient way, but it will give you some good practice working with strings. The strings you are going to get from websites will be extremely large, so making a function that you can pass smaller strings to and experiment with is crucial to debugging and finding errors in a timely and efficient manner.
Encoded Emails
Some email addresses are encoded. If you get an email address that is encoded, then you'll want to parse it and create the real email address.
webmaster@umk&#099;.edu
This email address is HTML encoded. We've already seen that characters are a decimal number.
chr(119) # Returns w
chr(101) # Returns e
In the same way, &#099; stands for chr(99), which is c. Decoding this entire string like this would result in an email address of webmaster@umkc.edu. Clearly, when you have an email address that looks like this, you are looking for sections that start with &#, have a number of up to 3 digits, and end with a semicolon. Again, you may find it easier to write a function that does this one thing by itself and returns an unencoded string.
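The two rules above (scan after mailto: until a disallowed character, and turn each &#NNN; into its character) could be sketched with plain string operations as follows. The names ALLOWED, decode_entities, and find_emails are hypothetical, and decoding the whole text before searching is one possible ordering rather than a required one.

```python
# Characters that may appear in an email address under the rule above.
ALLOWED = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.@")
DIGITS = "0123456789"


def decode_entities(text):
    """Replace each &#NNN; (1 to 3 digits) with the character it encodes."""
    result = ""
    i = 0
    while i < len(text):
        if text.startswith("&#", i):
            end = text.find(";", i + 2)
            digits = text[i + 2:end] if end != -1 else ""
            if 1 <= len(digits) <= 3 and all(c in DIGITS for c in digits):
                result += chr(int(digits))
                i = end + 1  # Skip past the semicolon.
                continue
        # Not a valid entity: copy the character through unchanged.
        result += text[i]
        i += 1
    return result


def find_emails(text):
    """Return every run of allowed characters that follows 'mailto:'."""
    marker = "mailto:"
    emails = []
    start = text.find(marker)
    while start != -1:
        begin = start + len(marker)
        end = begin
        while end < len(text) and text[end] in ALLOWED:
            end += 1
        if end > begin:
            emails.append(text[begin:end])
        start = text.find(marker, end)
    return emails
```

Because both functions accept any string, they can be tried on short inputs such as find_emails(decode_entities('mailto:webmaster@umk&#099;.edu">')) before being pointed at a full page.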
Our program's goals
We want to write a program that asks the user for a file that has URLs in it, one URL on each line. If the user gives us a file that doesn't exist or can't be opened, then you must be able to handle those errors. Once you have a file, open each URL, get the contents, and find all the email addresses. Once you are done, eliminate the duplicates and ask the user for a file to write the email addresses out to.
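The file-handling and de-duplication steps could look roughly like the sketch below. The function names read_urls and unique_in_order are hypothetical, the messages mirror the sample run, and keeping the first occurrence of each address in order is an assumption (the assignment only asks that duplicates be eliminated).

```python
def read_urls(filename):
    """Return the non-blank lines of filename, or None if it cannot be read."""
    try:
        with open(filename) as url_file:
            return [line.strip() for line in url_file if line.strip()]
    except FileNotFoundError:
        print("Could not open the file " + filename + ". It doesn't exist.")
    except OSError:
        # IOError is an alias of OSError in Python 3. This branch covers
        # directories, permission problems, and similar failures.
        print("Could not open the file " + filename + ". There was an IOError")
    return None


def unique_in_order(items):
    """Return items with duplicates removed, keeping first occurrences."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result
```

FileNotFoundError is caught before OSError because it is a subclass of OSError. With the order reversed, the more specific message would never be printed.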
Program Specifications
The requirements are below, but an additional requirement has to be observed. There are many tools that can make much of this easier to do. In fact, many of them make it trivially easy. This isn't a course about finding and using libraries and modules, so you'll be stuck using strings and your wits (besides urllib, of course). However, once you've solved it, you may be interested in looking at third-party modules like BeautifulSoup (terrible name). It helps in parsing and working with HTML and XML. A built-in module that is quite useful is re, or regular expressions. Spending some time learning regular expressions at some point will pay off for you. Regular expressions are extremely powerful, flexible, and useful for validating data and finding matching strings. Another built-in tool that is useful for unescaping the encoded email addresses described above is cgi.html.unescape (the standard html.unescape function). Learning to do these things by hand will pay off later when you don't have a tool that can do it for you. These are the skills that will allow you to build your own solutions.
In summary, you are not allowed to use:
Beautiful Soup
re (regular expressions)
cgi
Any imported module other than urllib
Sample Program
Python 3.4.3 (v3.4.3:9b73f1c3e601, Feb 24 2015, 22:43:06) [MSC v.1600 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> =============================== RESTART ================================
>>>
Welcome to email scraper!
Enter the filename containing URLs to read ==> invalid.txt
Could not open the file invalid.txt. It doesn't exist.
Enter the filename containing URLs to read ==> subdir
Could not open the file subdir. There was an IOError
Enter the filename containing URLs to read ==> emails.txt
Enter a file to save the emails to ==> output.txt
Do you want to run this application again? Y/YES/N/NO ==> e
You must enter only Y/YES/N or NO only.
Do you want to run this application again? Y/YES/N/NO ==> y
Welcome to email scraper!
Enter the filename containing URLs to read ==> emails2.txt
We did not find any emails in the provided urls to save
Do you want to run this application again? Y/YES/N/NO ==> y
Welcome to email scraper!
Enter the filename containing URLs to read ==> emails3.txt
sce.umkc.edu does not seem to be a valid url
invalid_url does not seem to be a valid url
We did not find any emails in the provided urls to save
Do you want to run this application again? Y/YES/N/NO ==> n
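The run-again prompt in the session above could be handled along these lines. This is only a sketch: the names parse_yes_no and ask_run_again are hypothetical, and accepting any letter case and ignoring surrounding spaces are assumptions based on the lowercase y and n in the sample. The read parameter exists so the loop can be exercised without a keyboard.

```python
def parse_yes_no(answer):
    """Return True for Y/YES, False for N/NO, and None for anything else."""
    answer = answer.strip().upper()
    if answer in ("Y", "YES"):
        return True
    if answer in ("N", "NO"):
        return False
    return None


def ask_run_again(read=input):
    """Keep prompting until the user gives a valid answer, then return it."""
    prompt = "Do you want to run this application again? Y/YES/N/NO ==> "
    while True:
        choice = parse_yes_no(read(prompt))
        if choice is not None:
            return choice
        print("You must enter only Y/YES/N or NO only.")
```

Keeping the parsing separate from the prompting means the validation rule can be checked with short strings, in the same spirit as the small-string experiments suggested earlier.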
