Question: PROBLEM 2
This exercise is about topic analysis. A topic is a latent variable that
represents or summarizes important concepts of a text, such as its meaning
or main ideas. A topic is made up of several words semantically related to
each other according to a certain context. In the area of natural language
processing (NLP), it is part of a general task called information retrieval
(IR). For us, from a machine learning perspective, we will consider it as an
unsupervised learning task based on a particular vector representation of
the texts. Consider a document-term representation like the ones we saw
in class. A simple way to extract latent structures between documents and
terms is to use latent semantic analysis (LSA), which is based on appropriate
factorizations of that matrix. Let $A$ be the TF-IDF matrix of rank $r$,
with rows corresponding to documents and columns to terms. A rank-$k$
approximation of this matrix is given by the SVD factorization

$$A \approx U_k \Sigma_k V_k^\top,$$

where $\Sigma_k$ is diagonal with the $k$ largest singular values of $A$,
and $U_k$ and $V_k$ contain the corresponding left and right singular
vectors, which define orthonormal bases for the column and row spaces,
respectively. (We are considering the reduced representation of the SVD,
where all zero entries have been removed from $\Sigma$ and the
corresponding columns from $U$ and $V$; moreover, the smallest singular
values have been replaced by zeros.) By applying this factorization to
document-term matrices, we can extract the semantic and conceptual
relationships between documents and terms, expressed in a set of components
or topics, through dense and low-dimensional representations, where $V_k$
and $U_k$ provide us with a representation of the terms and documents,
respectively, in terms of the topics, and $\Sigma_k$ gives us the
importance of each topic. In Python, you can use the
sklearn.decomposition.TruncatedSVD implementation.

In this exercise, you will perform an analysis of topics in the
transcripts of the morning conferences of the presidency of Mexico, which
you can access in this repository. To build your topic model, consider
the texts of the conferences per week during the years [...] to [...], using
the transcripts that correspond to the president, contained in the files
"PRESIDENT ANDRES MANUEL LOPEZ OBRADOR.csv".

(a) Obtain a TF-IDF representation of the texts. Define the size of the
vocabulary and carry out the preprocessing that you consider necessary,
keeping in mind that for topic analysis a very large vocabulary is not
recommended, and it is better to keep words whose use within the text can
be associated with topics. Document and justify your settings.

(b) Obtain the topics using the SVD decomposition. Choose a suitable $k$ and
justify it. Represent each topic through a word cloud of the terms that
make it up, according to the importance expressed in the magnitudes of the
rows of $V_k^\top$. Can you assign a representative "name" to each topic?

(c) Using the topic model fitted in the previous step, obtain the
corresponding representation of each of the president's conferences during
the years of the study, calculating the document-topic matrix using the
product $A V_k = U_k \Sigma_k$ or TruncatedSVD's transform method. Assign
each conference to its corresponding topic, using the maximum value of each
row of the matrix as the criterion. Use low-dimensional visualizations
based on PCA, Kernel PCA and t-SNE of the topic assignment you obtained. Do
you see interesting patterns? Briefly describe your findings.

(d) A problem that arises when using SVD is the lack of interpretability,
since it is not clear how negative values in the matrices $U_k$ and $V_k$
should be interpreted. One way to solve this problem is to use non-negative
matrix factorization (NMF), which is suitable for matrices with nonnegative
entries, such as TF-IDF matrices. For a matrix $A$ of rank $r$ with
nonnegative entries, NMF computes an approximation of rank [...]
