Question:

Latent semantic indexing is an SVD-based technique that can be used to discover text documents similar to each other. Assume that we are given a set of m documents D₁, . . . , D_m. Using the “bag-of-words” technique described in Example 2.1, we can represent each document D_j by an n-vector d_j, where n is the total number of distinct words appearing in the whole corpus. In this exercise, we assume that the vectors d_j are constructed as follows: d_j(i) = 1 if word i appears in document D_j, and 0 otherwise. We refer to the n × m matrix M = [d₁, . . . , d_m] as the “raw” term-by-document matrix. We will also use a normalized version of that matrix: M̃ = [d̃₁, . . . , d̃_m], where d̃_j = d_j/‖d_j‖₂, j = 1, . . . , m.

Assume we are given another document, referred to as the “query document,” which is not part of the collection. We describe that query document as an n-dimensional vector q, with zeros everywhere except for a 1 at each index corresponding to a term that appears in the query. We seek to retrieve the documents that are “most similar” to the query, in some sense. We denote by q̃ the normalized vector q̃ = q/‖q‖₂.
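The construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy corpus, the query string, the whitespace tokenization, and the helper names `binary_vector` and `normalize` are all hypothetical and not part of the exercise. Query words that never occur in the corpus are simply ignored, which is one possible convention.

```python
import math

# Hypothetical toy corpus and query (not from the exercise), for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell on monday",
]
query = "cat on the mat"

# Vocabulary: the n distinct words of the corpus, in a fixed (sorted) order.
vocab = sorted({w for doc in docs for w in doc.split()})
index = {w: i for i, w in enumerate(vocab)}
n, m = len(vocab), len(docs)


def binary_vector(text):
    """Return the 0/1 n-vector marking which vocabulary words occur in text."""
    v = [0] * n
    for w in text.split():
        if w in index:  # assumption: words outside the vocabulary are ignored
            v[index[w]] = 1
    return v


def normalize(v):
    """Scale v to unit Euclidean norm (assumes v is not the zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


d = [binary_vector(doc) for doc in docs]  # document vectors d_1, ..., d_m
d_tilde = [normalize(col) for col in d]   # normalized document vectors
q = binary_vector(query)                  # query vector
q_tilde = normalize(q)                    # normalized query vector

# Raw term-by-document matrix stored row by row: M[i][j] = d_j(i), size n x m.
M = [[d[j][i] for j in range(m)] for i in range(n)]
```

Because every entry is 0 or 1, the squared norm of each vector is just the number of distinct words it marks, so normalizing divides by the square root of that count.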

1. A first approach is to select the documents that contain the largest number of terms in common with the query document. Explain how to implement this approach, based on a certain matrix-vector product, which you will determine.

2. Another approach is to find the closest document by selecting the index j such that ‖q − d_j‖₂ is the smallest. This approach can introduce some biases, for example if the query document is much shorter than the other documents. Hence a measure of similarity based on the normalized vectors, ‖q̃ − d̃_j‖₂, has been proposed under the name of “cosine similarity.” Justify the use of this name for that method, and provide a formulation based on a certain matrix-vector product, which you will determine.
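As a hedged sketch of how the two approaches might be computed, the Python below reads part 1 as the product of the transposed raw term-by-document matrix with q (entry j is d_jᵀq, the number of shared terms) and part 2 as the same product with the normalized columns and the normalized query (entry j is the cosine of the angle between d_j and q). The small 0/1 vectors and the helper names `dot` and `normalize` are made up for illustration, and this is one possible reading of the exercise rather than a reference solution.

```python
import math

# Hypothetical 0/1 document vectors d_1, d_2, d_3 over n = 5 terms (illustrative data).
docs = [
    [1, 1, 1, 1, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
]
q = [1, 1, 0, 0, 0]  # hypothetical query vector


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale v to unit Euclidean norm (assumes v is not the zero vector)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


# Part 1: with 0/1 vectors, d_j^T q counts the terms that document j shares with the query.
common = [dot(d_j, q) for d_j in docs]

# Part 2: the inner product of the normalized vectors is the cosine of their angle.
q_tilde = normalize(q)
d_tilde = [normalize(d_j) for d_j in docs]
cosine = [dot(dt, q_tilde) for dt in d_tilde]

# For unit vectors the squared distance is 2 - 2 * cosine, so the closest
# normalized document is the one with the largest cosine.
sq_dist = [sum((a - b) ** 2 for a, b in zip(dt, q_tilde)) for dt in d_tilde]

best_by_count = max(range(len(docs)), key=lambda j: common[j])
best_by_cosine = max(range(len(docs)), key=lambda j: cosine[j])
```

In this toy data the first two documents tie on the raw count, while the cosine score favors the second, shorter document, whose terms coincide exactly with the query. That is the kind of length bias the normalization is meant to reduce.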