Question: in Python

    from requests import get
    from requests.exceptions import RequestException
    from contextlib import closing
    from bs4 import BeautifulSoup
    from nltk.tokenize import word_tokenize
    import re
    import json

    def fetchFromURL(url):
        """ fetch content from URL via HTTP GET request. """
        try:
            with closing(get(url, stream=True)) as resp:
                if is_good_response(resp):
                    return resp.content
                else:
                    print("Error retrieving information")
        except RequestException as e:
            log_error('Error during request to {0}:{1}'.format(url, str(e)))

    def is_good_response(resp):
        """ Returns true if response looks like HTML """
        content_type = resp.headers['Content-Type'].lower()
        return (resp.status_code == 200
                and content_type is not None
                and content_type.find('html') > -1)

    def log_error(e):
        """ log the errors or you'll regret it later... """
        print(e)

    def main():
        url = ('http://shakespeare.mit.edu')
        #print("Hello World")
        rawHTML = fetchFromURL(url)
        #print(rawHTML)
        soup = BeautifulSoup(rawHTML, 'html.parser')
        #print(soup)
        f = open('main.html', 'wb')
        f.write(rawHTML)
        f.close()
        f = open('main.html', 'r')
        data = f.read()
        #print(data)
        soup = BeautifulSoup(data, 'html.parser')
        extractedText = soup.get_text()
        tokenizedText = word_tokenize(extractedText)
        type(tokenizedText)
        print(tokenizedText)
        reg = re.compile('[^a-zA-Z]')
        term = reg.sub('', extractedText)

    main()

Add the following:

Once we have the documents, we want to strip out the HTML markup (using Beautiful Soup) so that we have the raw text, and then tokenize it using the NLTK library.
Once we have our base tokenization, we need to perform our normalization steps (such as removing tokens or special characters we don't want to include, case folding, sorting, etc.).
Finally, we need to add our resulting normalized tokens to our index along with their corresponding postings list and output our resulting index as a JSON file.
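
One possible way to approach the normalization and indexing steps is sketched below. It is not a definitive solution: it assumes each document has already been reduced to a list of tokens (for example, the output of word_tokenize on soup.get_text()), and the names normalize, build_index, and write_index are hypothetical helpers introduced only for illustration. Normalization here reuses the question's '[^a-zA-Z]' pattern per token and lowercases the result; the postings list for each term is simply the sorted list of document IDs that contain it.

```python
import json
import re
from collections import defaultdict

# Same character class as in the question's code: anything that is not a letter.
NON_ALPHA = re.compile('[^a-zA-Z]')


def normalize(tokens):
    """Strip non-letter characters, case-fold, and drop tokens left empty."""
    cleaned = (NON_ALPHA.sub('', tok).lower() for tok in tokens)
    return [tok for tok in cleaned if tok]


def build_index(docs):
    """Build an inverted index from {doc_id: list_of_raw_tokens}.

    Returns a dict whose keys (terms) are in sorted order and whose
    values are sorted postings lists of document IDs.
    """
    postings = defaultdict(set)
    for doc_id, tokens in docs.items():
        for term in normalize(tokens):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


def write_index(index, path):
    """Write the index to a JSON file (assumed output format: term -> doc IDs)."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(index, f, indent=2)
```

Inside main(), this could be called as index = build_index({url: tokenizedText}) followed by write_index(index, 'index.json'), where 'index.json' is an assumed output filename. Using the URL as the document ID is also an assumption; with several documents, any stable identifier would work.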
