Question: PYTHON 3
Upload a file called yourlastname.py to Canvas for this assignment. Note it's a .py file, not .ipynb. Use Jupyter all you want for the assignment, but the goal is to create a file we can import to help us with text processing (normalization and tokenization) and inverted index construction. If you don't want to use Jupyter, VS Code and PyCharm are very good Python IDEs. Poorly organized, difficult-to-read files with code extraneous to the requirements of this assignment will lose points. Develop as messily as needed, then clean it up before handing in a final copy!
Create a class DocumentIndex to act as an abstract data type for the document index (inverted index) data structure. It should include the following member functions and support these operations on its data (a hedged sketch of one possible implementation appears after this list):
A normalize(term) method that takes a str object term and returns a stemmed and/or lemmatized, lowercased version of that word suitable for use as a key in the inverted index.
An update_entry(normalized_term, doc_id) method that adds the normalized str object normalized_term to the index if it's not already there and records that the document with integral doc_id contains that term.
A tokenize(document) method that takes a document as a str object and returns a list of unnormalized tokens contained in that document. Use nltk rather than split() for tokenization, and be sure to ignore stopwords.
An index_document(document, doc_id) method that takes a document as a str object with integral doc_id and adds a tokenized, normalized version of the document to the inverted index. Stopwords in the document are not indexed. Note that when the spec says a tokenized, normalized version of the document gets indexed, that doesn't imply this method implements all of that itself. It implies this method causes that to happen by calling other methods that already implement the needed functionality. Since the methods above carry out the small tasks needed to index a document, they will be called by index_document().
An index_corpus(corpus) method that takes corpus as a list of str whose items are the HTML of each document. Note that this corpus has no document IDs, so use a document's index in the list as its ID here.
An object of type DocumentIndex should support the operator [term] for term lookup. In other words, if object ii was constructed via ii = DocumentIndex() and a suitable index built with index_corpus(), then ii["Kimmer"] would return the set of document IDs containing the search term Kimmer. Hint: use a magic method.
By default, looking up a term that is not in the index should return the empty set.
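Here is a minimal sketch of one possible shape for the class, assuming nltk's punkt tokenizer models and stopword list are installed (nltk.download("punkt"), nltk.download("stopwords")). The dict-of-sets layout, the PorterStemmer choice, and the internal names are assumptions for illustration, not requirements of the spec:

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

class DocumentIndex:
    def __init__(self):
        self._index = {}  # normalized term -> set of doc IDs (posting set)
        self._stemmer = PorterStemmer()
        self._stopwords = set(stopwords.words("english"))

    def normalize(self, term):
        # Lowercase first, then stem, so surface variants share one key.
        return self._stemmer.stem(term.lower())

    def update_entry(self, normalized_term, doc_id):
        # setdefault inserts an empty posting set the first time a term appears.
        self._index.setdefault(normalized_term, set()).add(doc_id)

    def tokenize(self, document):
        # nltk tokenization rather than split(); stopwords are dropped here.
        return [t for t in word_tokenize(document)
                if t.lower() not in self._stopwords]

    def index_document(self, document, doc_id):
        # Compose the smaller methods rather than re-implementing their work.
        for token in self.tokenize(document):
            self.update_entry(self.normalize(token), doc_id)

    def index_corpus(self, corpus):
        # The corpus carries no IDs, so each document's list position is its ID.
        for doc_id, document in enumerate(corpus):
            self.index_document(document, doc_id)

    def __getitem__(self, term):
        # The magic method behind ii[term]; unknown terms yield the empty set.
        return self._index.get(term, set())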
There is a pickled list of documents you can use for testing. It's located at /usr/local/share/doc_index_corpus.p. You can load it via pickle.load() on ada. Below is a starting point with example code:
import pickle
import nltk.tokenize

corpus = pickle.load(open("/usr/local/share/doc_index_corpus.p", "rb"))

class DocumentIndex:
    # You do this part
    pass

doc_index = DocumentIndex()
doc_index.index_corpus(corpus)

query = ""  # come up with a query here!

normalized_query_tokens = []
for token in doc_index.tokenize(query):
    normalized_query_tokens.append(doc_index.normalize(token))

for term in normalized_query_tokens:
    print(doc_index[term])
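A natural follow-up once the per-term sets print correctly (an optional extension, not required by the spec): intersect the posting sets so only documents containing every query term are reported.

# Optional extension (an assumption, not part of the assignment): AND together
# the per-term posting sets to find documents containing all query terms.
if normalized_query_tokens:
    matching = set.intersection(*(doc_index[t] for t in normalized_query_tokens))
else:
    matching = set()
print(matching)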
Here's the code I used to create the test corpus. You can make your own corpus with different URLs and search terms if you'd like.
import pickle
import urllib.request
import bs4

urls = ["https://starwars.fandom.com/wiki/Cloud_City",
        "https://screenrant.com/star-wars-bespin-facts/",
        "https://en.wikipedia.org/wiki/Cloud",
        "https://en.wikipedia.org/wiki/City"]

corpus = []
for url in urls:
    with urllib.request.urlopen(url) as response:  # request
        html_document = response.read()  # read response from server
    soup = bs4.BeautifulSoup(html_document, "lxml")
    # The first replace target was likely a non-breaking space ('\xa0') that
    # rendered as a plain space in the original handout; shown explicitly here.
    text_content = soup.get_text().replace('\xa0', '').replace('\t', '')
    corpus.append(text_content)

pickle.dump(corpus, open("doc_index_corpus.p", "wb"))
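For a quick sanity check (my suggestion, not part of the handout), you can reload the pickle you just wrote and query the index for a term the pages above are likely to contain, assuming your DocumentIndex (or the sketch earlier) is in scope:

# Hypothetical sanity check: reload the freshly written pickle and look up
# a term the Cloud City pages probably mention.
corpus = pickle.load(open("doc_index_corpus.p", "rb"))
ii = DocumentIndex()
ii.index_corpus(corpus)
print(ii[ii.normalize("Bespin")])  # set of doc IDs containing "Bespin", if any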