Exercise: build_corpus

Create a function named build_corpus with a single parameter count

- count has a default value of 1
- return a list with 'count' number of Harry Potter books. So books = build_corpus(2) would return the text from hp1.txt and hp2.txt.
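A minimal sketch of what build_corpus could look like, assuming the books are stored as hp1.txt, hp2.txt, ... in a single folder. The data_dir parameter is a hypothetical addition for illustration only; the exercise itself asks for just count.

```python
from pathlib import Path

def build_corpus(count=1, data_dir="."):
    """Return the text of the first `count` books as a list of strings.

    Assumes files named hp1.txt, hp2.txt, ... live in `data_dir`
    (`data_dir` is a hypothetical parameter, not part of the exercise).
    """
    books = []
    for i in range(1, count + 1):
        path = Path(data_dir) / f"hp{i}.txt"
        books.append(path.read_text(encoding="utf-8"))
    return books
```

Each list element is the full text of one book, so the list can be handed to the vectorizer directly as the set of documents.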

Create the TFIDF Model

Create a function named build_tf_idf_model which has two parameters:

docs a list of documents to consider
stopwords a list of stopwords to consider

When you create the model, use the following default values (which we will dabble with later):

use_idf=True
smooth_idf=True
sublinear_tf=True
norm=None
stop_words=stopwords (the incoming parameter)

As a reminder (you should re-check the sklearn documentation too):

    smooth_idf=True     # True:  idf = 1 + log( (doc_N+1)/(df_n+1) )
                        # False: idf = 1 + log(doc_N/df_n)
    sublinear_tf=True   # True:  tf = 1 + log(term_count)
                        # False: tf = raw term count
    norm=None           # don't normalize tfidf results
    use_idf=True        # False: only consider the counts
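To make the formulas above concrete, here is a small worked sketch in plain Python. The helper names idf and tf are hypothetical and only mirror the reminder; they are not sklearn functions. The natural logarithm is used, and the sublinear form only applies to terms that occur at least once.

```python
import math

def idf(n_docs, doc_freq, smooth=True):
    """Inverse document frequency, following the reminder above (natural log)."""
    if smooth:
        return 1.0 + math.log((n_docs + 1) / (doc_freq + 1))
    return 1.0 + math.log(n_docs / doc_freq)

def tf(term_count, sublinear=True):
    """Term frequency for a term that occurs at least once in the document."""
    if sublinear:
        return 1.0 + math.log(term_count)
    return float(term_count)

# A term that appears 10 times in one document out of 3, and in no other.
n_docs, doc_freq, term_count = 3, 1, 10
print(tf(term_count) * idf(n_docs, doc_freq))                # sublinear tf, smoothed idf
print(tf(term_count, False) * idf(n_docs, doc_freq, False))  # raw tf, unsmoothed idf
```

A term that appears in every document gets an idf of exactly 1 under both variants, so with these settings it is down-weighted rather than dropped.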

Fit & Transform

Once you have the model built, simply call fit_transform on the incoming set of documents. As a reminder, fit_transform is an efficient implementation of calling fit followed by transform:

fit builds the model. It creates the idf vector
transform creates the tfidf vectors
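The relationship between fit, transform, and fit_transform can be checked on a toy corpus. This is only a sketch: the three short documents are stand-ins invented for illustration, not the Harry Potter texts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in documents (not the Harry Potter texts).
docs = [
    "the owl delivered the letter",
    "the wand chose the wizard",
    "a letter for the wizard",
]
params = dict(use_idf=True, smooth_idf=True, sublinear_tf=True, norm=None)

# One step: learn the vocabulary and idf vector, then build the tf-idf matrix.
one_step = TfidfVectorizer(**params).fit_transform(docs)

# Two steps: fit() learns the vocabulary and idf; transform() builds the matrix.
vec = TfidfVectorizer(**params)
vec.fit(docs)
two_step = vec.transform(docs)

print(one_step.shape)                   # (number of documents, vocabulary size)
print(abs(one_step - two_step).max())   # largest difference between the two results
```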

Return a tuple (vectorizer, matrix)

Finally return BOTH the Vectorizer created and the matrix returned from fit_transform

    import warnings

    warnings.simplefilter(action='ignore', category=FutureWarning)
    warnings.simplefilter(action='ignore', category=UserWarning)

    from sklearn.feature_extraction.text import TfidfVectorizer

    # stopwords defaults to None (no stop-word filtering) rather than a mutable []
    def build_tf_idf_model(docs, stopwords=None):
        # stopwords = Util.load_stopwords()

        # build the model & matrix
        vectorizer = TfidfVectorizer(use_idf=True, smooth_idf=True, sublinear_tf=True,
                                     norm=None, stop_words=stopwords)
        X = vectorizer.fit_transform(docs)
        print(vectorizer)

        # return the vectorizer, matrix
        return vectorizer, X

    def test_build(corpus):
        # corpus = build_corpus()
        vec, matrix = build_tf_idf_model(corpus)

    test_build(build_corpus(1))

Testing the Query

We have most of the pieces in place. The next thing you will want to do is build a function that accepts a query (as a string) and 'transforms' it into another matrix. Essentially, you are taking your 'test' data (the query, that is, data that the model hasn't been formally trained on) and pushing it through the tfidf machinery you built. The output will be another matrix.

Finish the following implementation:

prepare_query_v1 has 2 named parameters: corpus and query
- corpus is the list of documents; default value: None
- query is a string; default value: empty string
if corpus is None, build one (using build_tf_idf_model) with the first 3 Harry Potter books
return the transformed query (another sparse matrix)

A few notes and hints:

the query only needs to be transformed, NOT fitted. If you fit and transform, you are building and training a new model.
what you pass into the vectorizer to be transformed is an array of one element: the query string. Each item in the array is considered a separate document. The query (a group of words) is the only document.
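The two hints above can be sketched as follows. The corpus here is a toy stand-in, and transform_query is a hypothetical helper name; a real prepare_query_v1 would also handle the corpus=None case described in the exercise.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in corpus; the exercise would use the Harry Potter texts instead.
docs = ["the owl delivered the letter", "the wand chose the wizard"]

vectorizer = TfidfVectorizer(use_idf=True, smooth_idf=True, sublinear_tf=True, norm=None)
matrix = vectorizer.fit_transform(docs)   # fit + transform on the corpus only

def transform_query(vectorizer, query=""):
    """Push a query string through an already-fitted vectorizer.

    `transform_query` is a hypothetical helper name. The query is wrapped in
    a one-element list because each list item is treated as one document.
    """
    return vectorizer.transform([query])

q_vector = transform_query(vectorizer, "owl letter")
print(q_vector.shape)   # one row, same number of columns as `matrix`
```

Because the vectorizer is not refitted, query words that never appeared in the corpus have no column and are simply ignored.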

Luckily we can use sklearn's metrics for similarity. The nice thing about sklearn's metrics is that they automatically work with sparse matrices. The following shows how to use the metric:

    from sklearn.metrics.pairwise import cosine_similarity

    def print_matching_document(matrix, q_vector):
        assert q_vector.shape[0] == 1, "bad query vector (wrong size)"
        for m_idx, m in enumerate(matrix):
            for q_idx, q in enumerate(q_vector):
                print(cosine_similarity(m, q))

This is the first time we are seeing the assert statement. It raises an AssertionError at run time if the condition is not met (and reports the corresponding message). However, in production code, assertions are usually stripped automatically (for example, Python skips them when run with the -O flag). They are a great tool while developing code to help capture the semantics of a possible error.
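A small illustration of how a failing assert behaves, using a plain integer in place of a real query vector. The function name check_query_rows is hypothetical.

```python
def check_query_rows(n_rows):
    """Mimic the shape check used above on a plain integer."""
    assert n_rows == 1, "bad query vector (wrong size)"
    return True

message = None
try:
    check_query_rows(2)          # condition is False, so AssertionError is raised
except AssertionError as err:
    message = str(err)           # the text after the comma becomes the error message

print(message)
```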

Also note (when you run this code) the returned value from cosine_similarity. Be sure to type the above code in and run it. It will help.
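As a hint about what to look for, the sketch below calls cosine_similarity on tiny hand-made sparse rows (invented for illustration). The result is a 2-D array with one entry per pair of rows, not a bare number.

```python
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

a = csr_matrix([[1.0, 2.0, 0.0]])
b = csr_matrix([[2.0, 4.0, 0.0]])   # same direction as a
c = csr_matrix([[0.0, 0.0, 3.0]])   # shares no non-zero column with a

same = cosine_similarity(a, b)
none = cosine_similarity(a, c)

print(type(same), same.shape)       # a 2-D array, one row per row of a, one column per row of b
print(same[0, 0], none[0, 0])       # index into the array to get the plain score
```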

With that as an example, finish the following code block:

find_matching_document returns a tuple of the index and distance of the row of the matrix that is closest to q_vector
- the first value of the tuple is the index, the second is the distance
find_match builds the query vector from the query string
- use the TfidfVectorizer's transform method
- it's essentially the same as prepare_query_v1
- returns the value from find_matching_document

    from sklearn.metrics.pairwise import cosine_similarity

    def find_matching_document(matrix, q_vector):
        # return a tuple of
        #   the index of the matrix (row) which is most similar to q_vector
        #   the cosine distance
        return None, None

    def find_match(query, corpus_size=3):
        corpus = build_corpus(corpus_size)
        vec, matrix = build_tf_idf_model(corpus)
        # build q_vector here
        return find_matching_document(matrix, q_vector)
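One possible way to fill in find_matching_document is sketched below, under a few assumptions: the corpus is a toy stand-in, and the second tuple value is the cosine similarity (higher means closer). If a distance is wanted instead, 1 - similarity is a common choice. This is a sketch, not necessarily the intended course solution.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def find_matching_document(matrix, q_vector):
    """Return (row index, cosine similarity) of the row most similar to q_vector."""
    assert q_vector.shape[0] == 1, "bad query vector (wrong size)"
    # cosine_similarity gives one score per document: shape (n_documents, 1)
    scores = cosine_similarity(matrix, q_vector).ravel()
    idx = int(np.argmax(scores))
    return idx, float(scores[idx])

# Toy stand-in corpus (the exercise would call build_corpus instead).
docs = [
    "the owl delivered the letter",
    "the wand chose the wizard",
    "a dragon guards the vault",
]
vec = TfidfVectorizer(use_idf=True, smooth_idf=True, sublinear_tf=True, norm=None)
matrix = vec.fit_transform(docs)

# transform only: the query goes through the already-fitted vectorizer
q_vector = vec.transform(["which wand chose him"])
print(find_matching_document(matrix, q_vector))
```

Passing the whole matrix to cosine_similarity in one call avoids the nested loops of print_matching_document, since the metric works on sparse matrices directly.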

The User Interface (UI)

Like shopping at Honeydukes (HP; Prisoner of Azkaban -- go ahead and confirm it), the excitement of being so close to the end is hard to contain.

Think of the user interface (UI) as a simple functional prototype. You just need to show it can be done.

You have already written most of the code. We will guide you through the missing parts. Be sure to read the section after this code block for detailed information:

    import matplotlib.pyplot as plt
    import LessonUtil as Util

    def show_image(path):
        with open(path, 'rb') as fd:
            fig, ax = plt.subplots()
            ax.set_axis_off()
            imgdata = plt.imread(fd)
            im = ax.imshow(imgdata)
        return fig

    def test_UI(vec=None, matrix=None, query='', debug=False):
        # build vec, matrix if either parameter is None
        # use the full harry potter corpus if you need to build

        # transform the query string

        # find the matching document
        # if no document matches, use the first document

        # get path for the image

        # display the image (uncomment when ready)
        # show_image(path)

        # return the winning index
        return idx
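The "if no document matches, use the first document" step in the comments above could be handled along these lines. This is a sketch with a hypothetical helper, pick_index, and it assumes the similarity scores have already been collected into a plain list with one entry per document.

```python
def pick_index(scores):
    """Return the index of the highest score, or 0 when nothing matches.

    `pick_index` is a hypothetical helper; `scores` is assumed to be a list
    of cosine similarities, one per document, where 0 means no shared terms.
    """
    if not scores or max(scores) <= 0:
        return 0
    return scores.index(max(scores))

print(pick_index([0.0, 0.0, 0.0]))   # no document matches, so fall back to the first
print(pick_index([0.1, 0.7, 0.3]))   # otherwise the best-scoring document wins
```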

The bold part is all the code I have.
