Question:
Exercise: build_corpus
Create a function named build_corpus with a single parameter count
- count has a default value of 1
- return a list containing the text of the first 'count' Harry Potter books. For example, books = build_corpus(2) would return the text from hp1.txt and hp2.txt.
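One possible shape for build_corpus, as a minimal sketch. It assumes the book files are named hp1.txt, hp2.txt, and so on (as in the example above) and sit in the current working directory. The DATA_DIR constant and the UTF-8 encoding are assumptions, not part of the exercise.

```python
from pathlib import Path

# Assumption: the hpN.txt files live in the current working directory.
DATA_DIR = Path(".")


def build_corpus(count=1):
    """Return a list with the text of the first `count` books."""
    books = []
    for number in range(1, count + 1):
        path = DATA_DIR / f"hp{number}.txt"
        # Assumption: the files are UTF-8 encoded.
        books.append(path.read_text(encoding="utf-8"))
    return books
```

With the files in place, build_corpus() returns a one-element list and build_corpus(2) returns a two-element list, in book order.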
Create the TFIDF Model
Create a function named build_tf_idf_model which has two parameters:
- docs a list of documents to consider
- stopwords a list of stopwords to consider
When you create the model, use the following default values (which we will dabble with later):
- use_idf=True
- smooth_idf=True
- sublinear_tf=True
- norm=None
- stop_words=stopwords (the incoming parameter)
As a reminder (you should re-check the sklearn documentation too):
smooth_idf=True    # True:  idf = 1 + log( (doc_N+1)/(df_n+1) )
                   # False: idf = 1 + log(doc_N/df_n)
sublinear_tf=True  # True:  tf = 1 + log(term_count)
                   # False: tf = raw term count
norm=None          # don't normalize tfidf results
use_idf=True       # False: only consider the counts
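The formulas in the reminder can be checked by hand with a few lines of plain Python. This is only a sketch of the arithmetic (using the natural logarithm, as sklearn does), not sklearn's own code. The helper names idf and tf and the sample numbers are made up for illustration.

```python
import math


def idf(doc_n, df_n, smooth=True):
    """Inverse document frequency, following the reminder above."""
    if smooth:
        return 1 + math.log((doc_n + 1) / (df_n + 1))
    return 1 + math.log(doc_n / df_n)


def tf(term_count, sublinear=True):
    """Term frequency: damped with a logarithm when sublinear is True."""
    if sublinear and term_count > 0:
        return 1 + math.log(term_count)
    return term_count


# A term that occurs 5 times in one document and appears in 1 of 3 documents.
weight = tf(5) * idf(doc_n=3, df_n=1)
print(round(weight, 3))
```

Because norm=None, the tfidf value for a term is just this product, with no rescaling of the document vector afterwards.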
Fit & Transform
Once you have the model built, simply call fit_transform on the incoming set of documents. As a reminder, fit_transform is an efficient implementation of calling fit followed by transform:
- fit builds the model. It creates the idf vector
- transform creates the tfidf vectors
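To see that fit_transform and fit followed by transform give the same result, here is a small sketch on three toy documents (the documents are made up for illustration and stand in for the books):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents standing in for the books (made up for illustration).
docs = ["the boy who lived", "the chamber of secrets", "the prisoner of azkaban"]
settings = dict(use_idf=True, smooth_idf=True, sublinear_tf=True, norm=None)

# One step: learn the vocabulary and idf vector, then build the tfidf matrix.
one_step = TfidfVectorizer(**settings).fit_transform(docs)

# Two steps: fit first, then transform the same documents.
vectorizer = TfidfVectorizer(**settings)
vectorizer.fit(docs)
two_step = vectorizer.transform(docs)

print(one_step.shape)
print(np.allclose(one_step.toarray(), two_step.toarray()))
```

After fit, the learned idf vector is available as vectorizer.idf_, with one entry per vocabulary term.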
Return a tuple (vectorizer, matrix)
Finally return BOTH the Vectorizer created and the matrix returned from fit_transform
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)

from sklearn.feature_extraction.text import TfidfVectorizer

def build_tf_idf_model(docs, stopwords=None):
    # stopwords = Util.load_stopwords()
    # None (instead of a mutable [] default) means "no stopwords"
    # build the model & matrix
    vectorizer = TfidfVectorizer(use_idf=True, smooth_idf=True, sublinear_tf=True, norm=None, stop_words=stopwords)
    X = vectorizer.fit_transform(docs)
    print(vectorizer)
    # return the vectorizer, matrix
    return vectorizer, X

def test_build(corpus):
    # corpus = build_corpus()
    vec, matrix = build_tf_idf_model(corpus)

test_build(build_corpus(1))
Testing the Query
We have most of the pieces in place. The next step is to build a function that accepts a query (as a string) and transforms it into another matrix. Essentially, you are taking your 'test' data (the query, which the model has not been trained on) and pushing it through the tfidf machinery you built. The output will be another matrix.
Finish the following implementation:
- prepare_query_v1 has 2 named parameters: corpus and query
- corpus is the list of documents; default value: None
- query is a string; default value: empty string
- if corpus is None, build one (using build_tf_idf_model) with the first 3 Harry Potter books
- return the transformed query (another sparse matrix)
A few notes and hints:
- the query only needs to be transformed, NOT fitted. If you fit and transform, you are building and training a new model.
- what you pass into the vectorizer to be transformed is an array of one element: the query string. Each item in the array is considered a separate document. The query (a group of words) is the only document.
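Putting the notes above together, prepare_query_v1 could look roughly like the sketch below. The compact build_corpus and build_tf_idf_model are repeated only so the block runs on its own. The file location (current working directory) and encoding used in build_corpus are assumptions.

```python
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer


def build_corpus(count=1):
    # Assumption: hp1.txt, hp2.txt, ... sit in the current working directory.
    return [Path(f"hp{n}.txt").read_text(encoding="utf-8")
            for n in range(1, count + 1)]


def build_tf_idf_model(docs, stopwords=None):
    vectorizer = TfidfVectorizer(use_idf=True, smooth_idf=True,
                                 sublinear_tf=True, norm=None,
                                 stop_words=stopwords)
    return vectorizer, vectorizer.fit_transform(docs)


def prepare_query_v1(corpus=None, query=''):
    if corpus is None:
        # Default: the first 3 Harry Potter books.
        corpus = build_corpus(3)
    vec, _matrix = build_tf_idf_model(corpus)
    # transform only (no fit): the query is a one-document list.
    return vec.transform([query])
```

The result is a sparse matrix with a single row and one column per vocabulary term learned from the corpus.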
Luckily we can use sklearn's metrics for similarity. The nice thing about sklearn's metrics is that they automatically work with sparse matrices. The following shows how to use the metric:
from sklearn.metrics.pairwise import cosine_similarity

def print_matching_document(matrix, q_vector):
    assert q_vector.shape[0] == 1, "bad query vector (wrong size)"
    for m_idx, m in enumerate(matrix):
        for q_idx, q in enumerate(q_vector):
            print(cosine_similarity(m, q))
This is the first time we are seeing the assert statement. It raises an AssertionError at run time if the condition is not met (and reports the corresponding message). In production code, however, assertions are usually stripped automatically (for example, by running Python with the -O flag). They are a great tool while developing code to help capture the semantics of a possible error.
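A tiny, self-contained illustration of assert. The function name check_query_shape is made up for this example.

```python
def check_query_shape(rows):
    # Raises AssertionError carrying the message when the condition is False.
    assert rows == 1, "bad query vector (wrong size)"
    return True


try:
    check_query_shape(2)
except AssertionError as err:
    message = str(err)
    print("caught:", message)
```

Calling check_query_shape(1) passes silently and returns True.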
Also note (when you run this code) the returned value from cosine_similarity. Be sure to type the above code in and run it. It will help.
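As a hint about what to look for, this sketch compares a row-by-row call with a single call over the whole matrix. The toy documents and the query are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the boy who lived", "the chamber of secrets", "the prisoner of azkaban"]
vectorizer = TfidfVectorizer(norm=None)
matrix = vectorizer.fit_transform(docs)
q_vector = vectorizer.transform(["chamber secrets"])

# One document against the query: the result is a 1 x 1 array, not a float.
single = cosine_similarity(matrix[1:2], q_vector)

# All documents at once: one similarity per document, shape (n_docs, 1).
scores = cosine_similarity(matrix, q_vector)

print(single.shape, scores.shape)
```

Because the return value is always a 2-D array, you need to index into it (or flatten it) to get a plain number.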
With that as an example, finish the following code block:
- find_matching_document returns a tuple of the index and distance of the matrix row (document) that is closest to q_vector
- the first value of the tuple is the index, the second is the distance
- find_match builds the query vector from the query string
- use the TfidfVectorizer's transform method
- it's essentially the same as prepare_query_v1
- returns the value from find_matching_document
from sklearn.metrics.pairwise import cosine_similarity

def find_matching_document(matrix, q_vector):
    # return a tuple of
    #   the index of the matrix (row) which is most similar to q_vector
    #   the cosine distance
    return None, None

def find_match(query, corpus_size=3):
    corpus = build_corpus(corpus_size)
    vec, matrix = build_tf_idf_model(corpus)
    #
    # build q_vector here
    #
    return find_matching_document(matrix, q_vector)
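One way the two functions might be completed, as a sketch rather than the official solution. It returns the cosine similarity of the best row (higher means closer). If the exercise really wants a distance, 1 minus that value would be the cosine distance. The compact build_corpus and build_tf_idf_model are repeated so the block runs on its own, and the file location is an assumption.

```python
from pathlib import Path

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def build_corpus(count=1):
    # Assumption: hp1.txt, hp2.txt, ... sit in the current working directory.
    return [Path(f"hp{n}.txt").read_text(encoding="utf-8")
            for n in range(1, count + 1)]


def build_tf_idf_model(docs, stopwords=None):
    vectorizer = TfidfVectorizer(use_idf=True, smooth_idf=True,
                                 sublinear_tf=True, norm=None,
                                 stop_words=stopwords)
    return vectorizer, vectorizer.fit_transform(docs)


def find_matching_document(matrix, q_vector):
    assert q_vector.shape[0] == 1, "bad query vector (wrong size)"
    # One similarity per document; flatten (n_docs, 1) to (n_docs,).
    scores = cosine_similarity(matrix, q_vector).ravel()
    idx = int(np.argmax(scores))
    return idx, float(scores[idx])


def find_match(query, corpus_size=3):
    corpus = build_corpus(corpus_size)
    vec, matrix = build_tf_idf_model(corpus)
    # transform only: the query must not refit the model.
    q_vector = vec.transform([query])
    return find_matching_document(matrix, q_vector)
```

Note that np.argmax returns the first index when several rows tie, so a query that matches nothing comes back as index 0 with a score of 0.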
The User Interface (UI)
Like shopping at Honeydukes (HP; Prisoner of Azkaban -- go ahead and confirm it), the excitement of being so close to the end is hard to contain.
Think of the user interface (UI) as a simple functional prototype. You just need to show it can be done.
You have already written most of the code. We will guide you through the missing parts. Be sure to read the section after this code block for detailed information:
import matplotlib.pyplot as plt
import LessonUtil as Util

def show_image(path):
    with open(path, 'rb') as fd:
        fig, ax = plt.subplots()
        ax.set_axis_off()
        imgdata = plt.imread(fd)
        im = ax.imshow(imgdata)
    return fig

def test_UI(vec=None, matrix=None, query='', debug=False):
    # build vec, matrix if either parameter is None
    # use the full harry potter corpus if you need to build
    #
    # transform the query string
    #
    #
    # find the matching document
    #
    #
    # if no document matches, use the first document
    #
    #
    # get path for the image
    #
    # display the image (uncomment when ready)
    # show_image(path)
    # return the winning index
    return idx
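The matching part of test_UI could be sketched as below. The helper name pick_document is hypothetical, and the image lookup and display are deliberately left out because the section describing how to get the image path is not included in this post. A best score of 0 is treated as "no document matches".

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def pick_document(vec, matrix, query=''):
    """Hypothetical helper: index of the best match, or 0 if nothing matches."""
    # transform the query string (one-document list, no fitting)
    q_vector = vec.transform([query])
    # find the matching document
    scores = cosine_similarity(matrix, q_vector).ravel()
    idx = int(np.argmax(scores))
    # if no document matches, use the first document
    if scores[idx] <= 0:
        idx = 0
    return idx


# Toy documents standing in for the books (made up for illustration).
docs = ["the boy who lived", "the chamber of secrets", "the prisoner of azkaban"]
vec = TfidfVectorizer(norm=None)
matrix = vec.fit_transform(docs)
print(pick_document(vec, matrix, "azkaban"))
```

Inside test_UI, the returned index would then be used to look up the image path and passed to show_image.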
The bold part is all the code I have.
