Question:

PROBLEM 2
This exercise is about topic analysis. A topic is a latent variable that
represents or summarizes important concepts of a text, such as its meaning
or main ideas. A topic is made up of several words semantically related to
each other according to a certain context. In the area of natural language
processing (NLP), it is part of a general task called information retrieval
(IR). For us, from a machine learning perspective, we will consider it as an
unsupervised learning task based on a particular vector representation of
the texts. Consider a document-term representation like the ones we saw
in class. A simple way to extract latent structures between documents and
terms is using latent semantic analysis (LSA), which is based on appropriate
factorizations of that matrix. Let $A_{m \times n}$ be the TF-IDF matrix of rank $r$,
with $m$ rows (documents) and $n$ columns (terms). A rank-$k$ approximation
of this matrix is given by the SVD factorization
$A \approx A^{(k)} = U^{(k)} \Sigma^{(k)} V^{(k)\top}$, where $\Sigma^{(k)}$ is
diagonal [1] with the $k$ largest singular values of $A$, and $U^{(k)}$, $V^{(k)}$
contain the corresponding left and right singular vectors, which define
orthonormal bases for the column and row spaces, respectively. By applying this
factorization to document-term matrices, we can extract the semantic and
conceptual relationships between documents and terms, expressed in a set of
$k$ components (or topics), through dense, low-dimensional representations:
$V^{(k)}_{n \times k}$ and $U^{(k)}_{m \times k}$ give a representation of the
terms and documents, respectively, in terms of the $k$ topics, and
$\Sigma^{(k)}$ gives the importance of each topic. In Python, you can use the
sklearn.decomposition.TruncatedSVD implementation.

[1] We are considering the reduced representation of the SVD, where all zero
entries have been removed from $\Sigma$, along with the corresponding columns
of $U$ and $V$; furthermore, the $r - k$ smallest singular values have been
replaced by zeros.

In this exercise, you will perform an analysis of topics in the
transcripts of the morning conferences of the presidency of Mexico, which
you can access in this repository [2]. To build your topic model, consider
the texts of the conferences per week during the years 2019 to 2023, using
the transcripts that correspond to the president, contained in the files
"PRESIDENT ANDRES MANUEL LOPEZ OBRADOR.csv" [3].

a) Obtain
a TF-IDF representation of the texts. Define the size of the vocabulary
and carry out the preprocessing that you consider necessary on the texts,
bearing in mind that for topic analysis a very large vocabulary is not
recommended, and it is better to keep words whose use within the
text can be associated with topics. Document and justify your settings.

b)
Obtain $k$ topics using the SVD decomposition. Choose a suitable $k$ and justify
it. Represent each topic with a word cloud [4] of the terms that make
up the topic, according to the importance expressed in the magnitudes of
the rows of $V^{(k)}$. Can you assign a representative "name" to each topic?

c)
Using the topic model fitted in the previous step, obtain the corresponding
representation of each of the president's conferences during the years of
the study, calculating the document-topic matrix using the product $xV^{(k)}$
(or with TruncatedSVD's transform method). Assign each conference to its
corresponding topic, using the maximum value of each row of the matrix as the
criterion. Use low-dimensional visualizations based on PCA, Kernel PCA,
and t-SNE of the topic assignment you obtained. Do you see interesting
patterns? Briefly describe your findings.

d) A problem that arises when using
SVD is the lack of interpretability, since it is not clear how negative values
in the matrices $U$ and $V$ should be interpreted. One way to solve this problem
is to use non-negative matrix factorization (NMF), which is suitable for
matrices with non-negative entries, such as TF-IDF matrices. For a matrix $A$ of rank
$r$ with non-negative entries, NMF computes an approximation of rank