Question: PYTHON 3
Upload a file called yourlastname.py to Canvas for this assignment. Note it's a .py file, not .ipynb. Use Jupyter all you want for the assignment, but the goal is to create a file we can import to help us with text processing (normalization and tokenization) and inverted index construction. If you don't want to use Jupyter, VS Code and PyCharm are very good Python IDEs. Poorly organized, difficult-to-read files with code extraneous to the requirements of this assignment will lose points. Develop as messily as needed, then clean it up before handing in a final copy!
Create a class DocumentIndex to act as an abstract data type for the document index (inverted index) data structure. It should include the following member functions and support these operations on its data (a hedged sketch of one possible implementation appears after this list):
A normalize(term) method that takes a str object term and returns a stemmed and/or lemmatized, lowercased version of that word suitable for use as a key in the inverted index.
An update_entry(normalized_term, doc_id) method that adds the normalized str object normalized_term to the index if it's not already there and records that the document with integral doc_id contains that term.
A tokenize(document) method that takes a document as a str object and returns a list of unnormalized tokens contained in that document. Use nltk rather than split() for tokenization, and be sure to ignore stopwords.
An index_document(document, doc_id) method that takes a document as a str object with integral doc_id and adds a tokenized, normalized version of the document to the inverted index. Stopwords in the document are not indexed. Note that when the spec says a tokenized, normalized version of the document gets indexed, that doesn't imply this method implements all of that itself. It implies this method causes that to happen by calling other methods that already implement the needed functionality. Since the methods above carry out the small tasks needed to index a document, they will be called by index_document().
An index_corpus(corpus) method that takes corpus as a list of str whose items are the HTML of each document. Note that this corpus has no document IDs, so use a document's index in the list as its ID here.
An object of type DocumentIndex should support the operator [term] for term lookup. In other words, if object ii was constructed via ii = DocumentIndex() and a suitable index built with index_corpus(), then ii["Kimmer"] would return the set of document IDs containing the search term Kimmer. Hint: use a magic method.
By default, looking up a term that is not in the index should return the empty set.
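Here is a minimal sketch of one possible shape for the class, assuming nltk's punkt tokenizer models and stopword list are installed (nltk.download("punkt"), nltk.download("stopwords")). The dict-of-sets layout, the PorterStemmer choice, and the internal names are assumptions for illustration, not requirements of the spec:

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

class DocumentIndex:
    def __init__(self):
        self._index = {}  # normalized term -> set of doc IDs (posting set)
        self._stemmer = PorterStemmer()
        self._stopwords = set(stopwords.words("english"))

    def normalize(self, term):
        # Lowercase first, then stem, so surface variants share one key.
        return self._stemmer.stem(term.lower())

    def update_entry(self, normalized_term, doc_id):
        # setdefault inserts an empty posting set the first time a term appears.
        self._index.setdefault(normalized_term, set()).add(doc_id)

    def tokenize(self, document):
        # nltk tokenization rather than split(); stopwords are dropped here.
        return [t for t in word_tokenize(document)
                if t.lower() not in self._stopwords]

    def index_document(self, document, doc_id):
        # Compose the smaller methods rather than re-implementing their work.
        for token in self.tokenize(document):
            self.update_entry(self.normalize(token), doc_id)

    def index_corpus(self, corpus):
        # The corpus carries no IDs, so each document's list position is its ID.
        for doc_id, document in enumerate(corpus):
            self.index_document(document, doc_id)

    def __getitem__(self, term):
        # The magic method behind ii[term]; unknown terms yield the empty set.
        return self._index.get(term, set())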
There is a pickled list of documents you can use for testing. It's located at /usr/local/share/doc_index_corpus.p. You can load it via pickle.load() on ada. Below is a starting point with example code:
import pickle
import nltk.tokenize

corpus = pickle.load(open("/usr/local/share/doc_index_corpus.p", "rb"))

class DocumentIndex:
    # You do this part
    pass

doc_index = DocumentIndex()
doc_index.index_corpus(corpus)

query = ""  # come up with a query here!

normalized_query_tokens = []
for token in doc_index.tokenize(query):
    normalized_query_tokens.append(doc_index.normalize(token))

for term in normalized_query_tokens:
    print(doc_index[term])
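A natural follow-up once the per-term sets print correctly (an optional extension, not required by the spec): intersect the posting sets so only documents containing every query term are reported.

# Optional extension (an assumption, not part of the assignment): AND together
# the per-term posting sets to find documents containing all query terms.
if normalized_query_tokens:
    matching = set.intersection(*(doc_index[t] for t in normalized_query_tokens))
else:
    matching = set()
print(matching)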
Here's the code I used to create the test corpus. You can make your own corpus with different URLs and search terms if you'd like.
import pickle
import urllib.request
import bs4

urls = ["https://starwars.fandom.com/wiki/Cloud_City",
        "https://screenrant.com/star-wars-bespin-facts/",
        "https://en.wikipedia.org/wiki/Cloud",
        "https://en.wikipedia.org/wiki/City"]

corpus = []
for url in urls:
    with urllib.request.urlopen(url) as response:  # request
        html_document = response.read()  # read response from server
    soup = bs4.BeautifulSoup(html_document, "lxml")
    # The first replace target was likely a non-breaking space ('\xa0') that
    # rendered as a plain space in the original handout; shown explicitly here.
    text_content = soup.get_text().replace('\xa0', '').replace('\t', '')
    corpus.append(text_content)

pickle.dump(corpus, open("doc_index_corpus.p", "wb"))
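For a quick sanity check (my suggestion, not part of the handout), you can reload the pickle you just wrote and query the index for a term the pages above are likely to contain, assuming your DocumentIndex (or the sketch earlier) is in scope:

# Hypothetical sanity check: reload the freshly written pickle and look up
# a term the Cloud City pages probably mention.
corpus = pickle.load(open("doc_index_corpus.p", "rb"))
ii = DocumentIndex()
ii.index_corpus(corpus)
print(ii[ii.normalize("Bespin")])  # set of doc IDs containing "Bespin", if any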