Introduction and Perspective
In the previous Assignment, you obtained two different sets of vectors (actually, resultant matrices) representing the corpus that we created:
TF-IDF: We had a TF-IDF weight for each term in the full set of terms extracted from the corpus. Each vector component (one vector per document) held the TF-IDF weight for that specific term in that document. A higher TF-IDF meant that the term was prominent (frequent within at least one document) while remaining relatively rare across the corpus as a whole. Additionally, we had the raw term frequency for each of those terms within each document. (A code sketch for rebuilding these representations appears after this list.)
Word embeddings and document embeddings: Word embeddings gave us a low-dimensional representation of individual words, and the document embeddings showed us how whole documents sit in that same kind of low-dimensional space.
ELMo: ELMo provided another way to create low-dimensional word vectors, but it relied on a more advanced architecture, a bidirectional language model.
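If you need to regenerate the first two representations, a minimal sketch is shown below; the `documents` list is a placeholder for your own raw texts, the hyperparameters are illustrative, and ELMo is omitted because it requires the TensorFlow Hub model.

```python
# Minimal sketch: rebuild the TF-IDF matrix and the Doc2Vec document vectors.
# `documents` is a placeholder corpus; replace it with your own list of raw strings.
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

documents = ["the plot was thrilling and the acting was superb",
             "a dull film with a predictable story",
             "great soundtrack but a weak and predictable script"]

# TF-IDF: one row per document, one column per extracted term.
tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(documents)      # sparse matrix, n_docs x n_terms
terms = tfidf.get_feature_names_out()

# Doc2Vec: one dense, low-dimensional vector per document.
tagged = [TaggedDocument(simple_preprocess(doc), [i]) for i, doc in enumerate(documents)]
d2v = Doc2Vec(tagged, vector_size=50, window=5, min_count=1, epochs=40)
doc_vectors = [d2v.dv[i] for i in range(len(documents))]
```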
In the last Assignment, you assessed these results against what you expected would be important terms, based on (1) your manual term extraction, (2) the results of passing your documents through two different term extraction engines, and (3) observing which terms your colleagues found important (which they posted to that week's Discussion).
Now you will take on either a classification task or a deeper exploration of clustering.
What to Do
Focus on EITHER clustering OR classification, and analyze, assess, and interpret the outputs. If you spent a lot of time analyzing clustering for Assignment 1, please explore classification for this assignment. Alternatively, use clustering to help you establish class labels that you then use to measure the performance of your classification method, and include this in your analysis; in that case it becomes a two-step process (see the sketch below).
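A rough sketch of that two-step option, assuming you already have a document-vector matrix (TF-IDF or Doc2Vec); the `X` array below is a random stand-in for your real vectors, and the number of clusters is an illustrative choice.

```python
# Sketch of the two-step option: derive labels by clustering, then evaluate a classifier.
# X stands in for your (n_docs x n_features) TF-IDF or Doc2Vec matrix.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

X = np.random.rand(200, 100)   # placeholder for your real document vectors

# Step 1: cluster the documents and treat the cluster ids as class labels.
labels = KMeans(n_clusters=5, random_state=42, n_init=10).fit_predict(X)

# Step 2: train a classifier against those labels and measure its performance.
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3,
                                                    random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```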
If You Select Clustering - use the movie review dataset
The clusters are probably not what you would want if you were manually clustering the documents. Very likely, you have one or two very large, amoeba-like clusters that seem to include all topics. You probably have a couple of outliers. You may have some clusters that make sense.
Even with the clusters that make sense, you can probably find:
Documents that are in a given cluster that you don't think should be there, and
Clusters that are missing certain documents that you think should have been included.
Your mission (if you decide to accept it) is to assess the clusters, figure out what is "right" and what is "wrong" (or what needs to be fixed), and trace back the cause (as much as you can) to what was happening with the input vectors.
You can work with clusters produced EITHER by the TF-IDF OR the word-embedding (Doc2Vec) algorithm.
Go back to the vector inputs corresponding to each of the documents. Did they contain sufficient terms and term-frequency strengths (or term representations, in the case of word embeddings) to give the results that you thought would make sense?
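One way to answer that question for the TF-IDF vectors is to print the highest-weighted terms of every document in a suspect cluster and judge whether those weights could plausibly support the grouping you expected. A self-contained sketch with stand-in data follows; swap in your own documents and your own clustering output.

```python
# Sketch: trace a suspect cluster back to its TF-IDF inputs by printing the
# highest-weighted terms of every member document.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Stand-in corpus: replace with your real movie reviews.
documents = (["a sample movie review about space battles and aliens"] * 10 +
             ["a sample review about a romantic comedy and its jokes"] * 10)

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(documents)
terms = tfidf.get_feature_names_out()
cluster_labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)

suspect_cluster = 0   # whichever cluster looked wrong to you
for doc_id in np.where(cluster_labels == suspect_cluster)[0]:
    row = X[doc_id].toarray().ravel()                 # dense TF-IDF weights for this doc
    top = np.argsort(row)[::-1][:10]                  # indices of the 10 strongest terms
    print(doc_id, [(terms[i], round(float(row[i]), 3)) for i in top if row[i] > 0])
```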
What was not working quite right?
This is the time to dig deeper and improve the results based on how you think these documents should cluster. (I strongly suggest you decide on the ground truth before performing this analysis, e.g., cluster by genre.)
For this assignment, I expect you to formally measure your method's performance (one possible approach is sketched below). We will talk more about this during the sync session.
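One reasonable option (not the only one) is to compare your cluster assignments against the ground-truth labels you chose, such as genre. In the sketch below, `true_genres`, `cluster_labels`, and `X` are placeholders for your own labels, clustering output, and document vectors.

```python
# Sketch: score cluster assignments against a chosen ground truth (e.g. genre).
import numpy as np
from sklearn.metrics import (adjusted_rand_score, homogeneity_score,
                             completeness_score, silhouette_score)

true_genres = np.array([0, 0, 1, 1, 2, 2] * 10)       # stand-in genre labels
cluster_labels = np.array([0, 1, 1, 1, 2, 2] * 10)    # stand-in cluster ids
X = np.random.rand(60, 50)                            # stand-in document vectors

print("Adjusted Rand Index:", adjusted_rand_score(true_genres, cluster_labels))
print("Homogeneity:        ", homogeneity_score(true_genres, cluster_labels))
print("Completeness:       ", completeness_score(true_genres, cluster_labels))
# Silhouette uses only the vectors (no ground truth) and describes cluster compactness.
print("Silhouette:         ", silhouette_score(X, cluster_labels))
```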
If You Select Classification - use the TripAdvisor dataset
Your process will be very similar to the above. However, if you are going to perform classification, you will need a labeled dataset. Luckily, we captured the labels in the metadata when we performed the data collection.
You will likely need to do more than a simple bag-of-words approach. You should explore phrase extraction, n-gram extraction, and the other pre-processing steps we reviewed.
Don't forget that you will need to measure performance. You will use your labeled data as your ground truth (see the sketch below).
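A sketch of a classification baseline that goes beyond plain bag of words by adding bigram features and that scores itself against the metadata labels; `reviews` and `labels` below are placeholders for your collected TripAdvisor data, and the vectorizer settings are illustrative.

```python
# Sketch: n-gram TF-IDF features + a linear classifier, evaluated against the
# labels captured in the TripAdvisor metadata (used here as ground truth).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

reviews = ["great food and friendly staff", "terrible service and cold food"] * 50
labels  = ["positive", "negative"] * 50               # stand-in metadata labels

X_train, X_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.2, random_state=42, stratify=labels)

model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2),     # unigrams + bigrams
                              min_df=2, stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```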
If you would like to explore using a pre-trained LLM for this task, you could experiment with feeding your raw text into such a model. We will provide you with code that you can run in Colab. If you choose this route, you will need to set it up as a binary classification problem. A great example would be sentiment classification, or you could classify types of restaurants (Italian vs. Chinese). It is up to you.
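A minimal sketch of the out-of-the-box route, using the default Hugging Face sentiment pipeline (a DistilBERT fine-tuned on SST-2) as a stand-in for the model the provided Colab code will use; the review texts, star ratings, and the 4-stars-and-up threshold are illustrative assumptions.

```python
# Sketch: out-of-the-box use of a pre-trained sentiment model on raw review text,
# scored against a binary ground truth derived from the star ratings in the metadata.
from transformers import pipeline
from sklearn.metrics import classification_report

reviews = ["The room was spotless and the staff were lovely.",
           "Dirty bathroom and rude reception, never again."]   # stand-in raw text
ratings = [5, 1]                                                 # stand-in star ratings

# Binary ground truth: 4-5 stars -> POSITIVE, 1-2 stars -> NEGATIVE (an assumption).
y_true = ["POSITIVE" if r >= 4 else "NEGATIVE" for r in ratings]

clf = pipeline("sentiment-analysis")
y_pred = [out["label"] for out in clf(reviews, truncation=True)]
print(classification_report(y_true, y_pred))
```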
Using Pre-trained BERT
If you choose to leverage a pre-trained BERT model to perform classification or sentiment analysis, please use the TripAdvisor reviews. TripAdvisor is a great dataset for sentiment analysis, but you could set up other types of classification problems. Your steps will be as follows (a fine-tuning sketch appears after the list):
Preprocess the TripAdvisor review data.
Split the data into train/validation/test (if fine-tuning) or just a test set.
Either use the BERT pre-trained model out of the box or fine-tune a pre-trained BERT model.
Evaluate the model's performance.
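A compact fine-tuning sketch using the standard Hugging Face Trainer recipe (install transformers and datasets in Colab first); the duplicated review lists are placeholders for your real preprocessed TripAdvisor data, and the hyperparameters (one epoch, batch size 8, max length 128) are illustrative assumptions rather than recommendations.

```python
# Sketch: fine-tune bert-base-uncased for binary sentiment on placeholder review data,
# then evaluate on the held-out split.
import numpy as np
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from sklearn.metrics import accuracy_score

texts = (["Wonderful stay, spotless rooms and helpful staff."] * 20 +
         ["Awful experience, noisy rooms and rude staff."] * 20)   # stand-in reviews
labels = [1] * 20 + [0] * 20                                        # 1 = positive

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                           num_labels=2)

ds = Dataset.from_dict({"text": texts, "label": labels})
ds = ds.map(lambda batch: tokenizer(batch["text"], truncation=True,
                                    padding="max_length", max_length=128),
            batched=True)
splits = ds.train_test_split(test_size=0.25, seed=42)   # train/validation split

def compute_metrics(eval_pred):
    logits, y = eval_pred
    return {"accuracy": accuracy_score(y, np.argmax(logits, axis=-1))}

args = TrainingArguments(output_dir="bert-tripadvisor", num_train_epochs=1,
                         per_device_train_batch_size=8, report_to="none")
trainer = Trainer(model=model, args=args, train_dataset=splits["train"],
                  eval_dataset=splits["test"], compute_metrics=compute_metrics)
trainer.train()
print(trainer.evaluate())   # reports eval_loss and eval_accuracy on the held-out split
```

Evaluating both the out-of-the-box model and the fine-tuned model on the same held-out split lets you report the comparison the steps above call for.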
Please give Python code for both.
