Question: from urllib.request import urlopen, urljoin from html.parser import HTMLParser from re import findall class Collector(HTMLParser): def _-init-(self): initializes parser, the url, and a list HTMLParser.__init

from urllib.request import urlopen, urljoin from html.parser import HTMLParser from re import findall class Collector(HTMLParser): def _-init-(self): initializes parser, the url, and a list HTMLParser.__init (self) self. linksset() - # Solution to Practice Problem 11.3 self.text def handle_starttag(self, tag, attrs): #print(tag) collects hyperlink URLs in their absolute format if tag = a: for attr in attrs: if att r [0] = href: self. links.add (attr[1]) for a in. def getLinks (content): parserCollector) parser.feed (content) newLink = parser. links #https = [] #for link in newLink: #if link = href: Write a function getWebInfo( ) that takes as input a URL and calls three functions, to print the following information: 1) The set of all absolute links already in the page, that is links that start with 'http://'. Must use HTMLparser class methods. Do not copy code from the book, which is using urljoin to *make* every link absolute. (6 points) 2) A set that contains all e-mail addresses appearing in the page. Must use regular expressions to detect e-mail addresses on the web page. Must remove duplicates. E-mails should be matching general e-mails, not just depaul.edu emails; do not use cdm.depaul.edu in your pattern. Your program should work for any e-mail address on any web page. (6 points) 3) A list of tuples (derived from a dictionary) that contains the 20 most frequent words and their frequencies, in order of frequency. Words must contain 5 or more characters. Discard any words of 4 characters or less. (6 points) There are several steps to follow on this part. a) You need to construct a dictionary first, containing words and their frequencies. b) Then the dictionary has to be reversed. c) The reversed dictionary has to be sorted. Please note that the sorting method returns a list of tuples. d) Print the first 20 tuples of the list of tuples. Write one function for each of the three pieces of information, a total of three functions. Then assemble the three function calls (and headings) inside your main function getWebInfo( ). Include the call to getWebInfo at the bottom of your module (file.) (2 points) Function Calls getWebInfo ('http://www.cdm.depaul.edu/') getWebInfo ('http://www.cs.uiowa.edu/') In python, use from html.parser import HTMLParser and print absolute links

Step by Step Solution

There are 3 Steps involved in it

1 Expert Approved Answer
Step: 1 Unlock blur-text-image
Question Has Been Solved by an Expert!

Get step-by-step solutions from verified subject matter experts

Step: 2 Unlock
Step: 3 Unlock

Students Have Also Explored These Related Databases Questions!