Question:

In Python:

    from requests import get
    from requests.exceptions import RequestException
    from contextlib import closing
    from bs4 import BeautifulSoup
    from nltk.tokenize import word_tokenize
    import re
    import json


    def fetchFromURL(url):
        """Fetch content from URL via HTTP GET request."""
        try:
            with closing(get(url, stream=True)) as resp:
                if is_good_response(resp):
                    return resp.content
                else:
                    print("Error retrieving information")
        except RequestException as e:
            log_error('Error during request to {0}:{1}'.format(url, str(e)))
        return None


    def is_good_response(resp):
        """Return True if the response looks like HTML."""
        # .get() avoids a KeyError when the Content-Type header is missing
        content_type = resp.headers.get('Content-Type', '').lower()
        return resp.status_code == 200 and 'html' in content_type


    def log_error(e):
        """Log the errors or you'll regret it later..."""
        print(e)


    def main():
        url = 'http://shakespeare.mit.edu'
        rawHTML = fetchFromURL(url)
        if rawHTML is None:
            return  # nothing to parse if the request failed

        # Save a local copy of the page, then read it back
        with open('main.html', 'wb') as f:
            f.write(rawHTML)
        with open('main.html', 'r') as f:
            data = f.read()

        soup = BeautifulSoup(data, 'html.parser')
        extractedText = soup.get_text()
        tokenizedText = word_tokenize(extractedText)
        print(tokenizedText)

        # Note: this also removes whitespace, so term is one long string
        reg = re.compile('[^a-zA-Z]')
        term = reg.sub('', extractedText)


    main()

Add the following:

  • Once we have the documents, we want to strip out the HTML markup (using Beautiful Soup) so that we have the raw text, and then tokenize it using the NLTK library.
  • Once we have our base tokenization, we need to perform our normalization steps (such as removing tokens or special characters we don't want to include, case folding, and sorting).
  • Finally, we need to add the resulting normalized tokens to our index along with their corresponding postings lists, and output the resulting index as a JSON file.
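
The normalization and indexing steps listed above can be sketched roughly as follows. This is a minimal illustration, not a definitive solution. It assumes the tokens for each document have already been produced by `word_tokenize(soup.get_text())` as in the question's code, that "normalization" means case folding plus removing non-letter characters, and that a postings list is simply a sorted list of document IDs. The helper names `normalize`, `build_index`, and `write_index`, the sample documents, and the output filename `index.json` are all hypothetical.

```python
import json
import re

# Assumption: only the letters a-z are kept after case folding.
NON_ALPHA = re.compile(r'[^a-z]')


def normalize(tokens):
    """Case-fold each token, strip non-letters, and drop empty results."""
    normalized = []
    for token in tokens:
        term = NON_ALPHA.sub('', token.lower())
        if term:
            normalized.append(term)
    return normalized


def build_index(docs):
    """Map each term to a sorted postings list of document IDs.

    docs: dict of document ID -> list of raw tokens (hypothetical layout).
    """
    postings = {}
    for doc_id, tokens in docs.items():
        for term in normalize(tokens):
            postings.setdefault(term, set()).add(doc_id)
    # Sort terms and postings so the output is deterministic
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


def write_index(index, path):
    """Write the index to a JSON file."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(index, f, indent=2)


if __name__ == '__main__':
    # Stand-in token lists; in the real script these would come from
    # word_tokenize() applied to each fetched page.
    docs = {
        1: ['To', 'be', ',', 'or', 'not', 'to', 'be', '!'],
        2: ['Not', 'so', ',', 'my', 'lord', '.'],
    }
    index = build_index(docs)
    write_index(index, 'index.json')
    print(index)
```

To connect this to the script in the question, `main()` could collect the `tokenizedText` of each fetched page into a dictionary keyed by a document ID (or the URL) and pass it to `build_index`. Using a set while collecting postings keeps duplicate document IDs out of each list before it is sorted.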
