Question:

PROBLEM 2

This exercise is about topic analysis. A topic is a latent variable that represents or summarizes important concepts of a text, such as its meaning or main ideas. A topic is made up of several words that are semantically related to each other in a given context. In natural language processing (NLP), topic analysis is part of a general task called information retrieval (IR). From a machine learning perspective, we will treat it as an unsupervised learning task based on a particular vector representation of the texts.

Consider a document-term representation like the ones we saw in class. A simple way to extract latent structure between documents and terms is latent semantic analysis (LSA), which is based on appropriate factorizations of that matrix. Let A_{m \times n} be the TF-IDF matrix of rank r, with m rows (documents) and n columns (terms). A rank-k approximation of this matrix is given by the SVD factorization

A \approx A^{(k)} = U^{(k)} \Sigma^{(k)} V^{(k)'},

where \Sigma^{(k)} is diagonal^{1} with the k largest singular values of A, and U^{(k)}, V^{(k)} contain the corresponding left and right singular vectors, which define orthonormal bases for the column and row spaces, respectively.

^{1} We are considering the reduced representation of the SVD, where all zero entries have been removed from \Sigma, along with the corresponding columns of U and V; in addition, the smallest r - k singular values have been replaced by zeros.

By applying this factorization to document-term matrices, we can extract the semantic and conceptual relationships between documents and terms, expressed in a set of k components (or topics), through dense, low-dimensional representations, where V_{n \times k}^{(k)} and U_{m \times k}^{(k)} provide a representation of the terms and documents, respectively, in terms of the k topics, and \Sigma^{(k)} gives the importance of each topic. In Python, you can use the sklearn.decomposition.TruncatedSVD implementation. In this exercise, you will perform an analysis of topics in the transcripts of the morning conferences of the presidency of Mexico, which you can access in this repository^{2}.
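The rank-k truncation described above can be sketched with NumPy. This is a minimal illustration, not a reference solution: the toy matrix `A` and the helper name `truncated_svd` are assumptions made for the example, and a real TF-IDF matrix would be sparse and much larger, which is where sklearn.decomposition.TruncatedSVD becomes the practical choice.

```python
import numpy as np


def truncated_svd(A, k):
    """Return U_k, s_k, Vt_k for the rank-k truncation of the SVD of A."""
    # full_matrices=False gives the reduced SVD. Singular values are returned
    # in descending order, so the first k are the largest.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]


# Toy stand-in for a TF-IDF matrix: 4 documents (rows) x 5 terms (columns).
A = np.array([
    [0.9, 0.8, 0.0, 0.0, 0.1],
    [0.7, 0.9, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.8, 0.9, 0.0],
    [0.1, 0.0, 0.9, 0.7, 0.2],
])

k = 2
U_k, s_k, Vt_k = truncated_svd(A, k)

# A^{(k)} = U^{(k)} Sigma^{(k)} V^{(k)'}
A_k = U_k @ np.diag(s_k) @ Vt_k

# Document-topic matrix: A V^{(k)}, which equals U^{(k)} Sigma^{(k)}.
doc_topic = A @ Vt_k.T

# Relative Frobenius error of the rank-k approximation.
rel_error = np.linalg.norm(A - A_k) / np.linalg.norm(A)
```

Here `doc_topic` plays the role of the document representation in terms of the k topics, the rows of `Vt_k` hold the term weights of each topic, and `s_k` gives the importance of each topic.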

To build your topic model, consider the texts of the conferences per week during the years 2019-2023, using the transcripts that correspond to the president, contained in the files "PRESIDENT ANDRES MANUEL LOPEZ OBRADOR.csv"^{3}.

a) Obtain a TF-IDF representation of the texts. Define the size of the vocabulary and carry out whatever preprocessing of the texts you consider necessary, keeping in mind that for topic analysis a very large vocabulary is not recommended, and it is better to keep words whose use within the text can be associated with topics. Document and justify your settings.

b) Obtain k topics using the SVD decomposition. Choose a suitable k and justify your choice. Represent each topic through a word cloud^{4} of the terms that make up the topic, according to the importance expressed in the magnitudes of the rows of V^{(k)}. Can you assign a representative "name" to each topic?

c) Using the topic model fitted in the previous step, obtain the corresponding representation of each of the president's conferences during the years of the study, calculating the document-topic matrix using the product A V^{(k)} (or with TruncatedSVD's transform method). Assign each conference to its corresponding topic using the maximum value of each row of the matrix as the criterion. Use low-dimensional visualizations based on PCA, Kernel PCA and t-SNE of the topic assignment you obtained. Do you see interesting patterns? Briefly describe your findings.

d) A problem that arises when using SVD is the lack of interpretability, since it is not clear how negative values in the matrices U and V should be interpreted. One way to solve this problem is to use non-negative matrix factorization (NMF), which is suitable for matrices with non-negative entries, such as TF-IDF matrices. For a matrix A of rank r with non-negative entries, NMF computes an approximation of rank
