Introduction and Perspective
In the previous Assignment, you obtained two different sets of vectors (actually, resultant matrices) representing the corpus that we created:
TF-IDF: We had a TF-IDF weight for each term in the full set of terms extracted from the corpus. Each vector component (one vector per document) held the TF-IDF weight for that specific term in that document. A higher TF-IDF meant that the term was prominent (frequent within at least one document) while remaining relatively rare across the corpus as a whole. Additionally, we had the raw term frequency for each of those terms within each document. (A code sketch for rebuilding these representations appears after this list.)
Word embeddings and document embeddings: Word embeddings gave us a low-dimensional representation of individual words, and the document embeddings showed us how whole documents sit in that same kind of low-dimensional space.
ELMo: ELMo provided another way to create low-dimensional word vectors, but it relied on a more advanced architecture, a bidirectional language model.
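If you need to regenerate the first two representations, a minimal sketch is shown below; the `documents` list is a placeholder for your own raw texts, the hyperparameters are illustrative, and ELMo is omitted because it requires the TensorFlow Hub model.

```python
# Minimal sketch: rebuild the TF-IDF matrix and the Doc2Vec document vectors.
# `documents` is a placeholder corpus; replace it with your own list of raw strings.
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

documents = ["the plot was thrilling and the acting was superb",
             "a dull film with a predictable story",
             "great soundtrack but a weak and predictable script"]

# TF-IDF: one row per document, one column per extracted term.
tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(documents)      # sparse matrix, n_docs x n_terms
terms = tfidf.get_feature_names_out()

# Doc2Vec: one dense, low-dimensional vector per document.
tagged = [TaggedDocument(simple_preprocess(doc), [i]) for i, doc in enumerate(documents)]
d2v = Doc2Vec(tagged, vector_size=50, window=5, min_count=1, epochs=40)
doc_vectors = [d2v.dv[i] for i in range(len(documents))]
```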
In the last Assignment, you assessed these results against what you expected would be important terms, based on (1) your manual term extraction, (2) the results of passing your documents through two different term extraction engines, and (3) observing which terms your colleagues found important (which they posted to that week's Discussion).
Now you will take on either a classification task or a deeper exploration of clustering.
What to Do
Focus on EITHER clustering OR classification, and analyze, assess, and interpret the outputs. If you spent a lot of time analyzing clustering for Assignment 1, please explore classification for this assignment. Alternatively, use clustering to help you establish class labels that you then use to measure the performance of your classification method, and include this in your analysis; in that case it becomes a two-step process (see the sketch below).
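A rough sketch of that two-step option, assuming you already have a document-vector matrix (TF-IDF or Doc2Vec); the `X` array below is a random stand-in for your real vectors, and the number of clusters is an illustrative choice.

```python
# Sketch of the two-step option: derive labels by clustering, then evaluate a classifier.
# X stands in for your (n_docs x n_features) TF-IDF or Doc2Vec matrix.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

X = np.random.rand(200, 100)   # placeholder for your real document vectors

# Step 1: cluster the documents and treat the cluster ids as class labels.
labels = KMeans(n_clusters=5, random_state=42, n_init=10).fit_predict(X)

# Step 2: train a classifier against those labels and measure its performance.
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3,
                                                    random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```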
If You Select Clustering - use the movie review dataset
The clusters are probably not what you would want if you were manually clustering the documents. Very likely, you have one or two very large, amoeba-like clusters that seem to include all topics. You probably have a couple of outliers. You may have some clusters that make sense.
Even with the clusters that make sense, you can probably find:
Documents that are in a given cluster that you don't think should be there, and
Clusters that are missing certain documents that you think should have been included.
Your mission (if you decide to accept it) is to assess the clusters, figure out what is "right" and what is "wrong" (or what needs to be fixed), and trace back the cause (as much as you can) to what was happening with the input vectors.
You can work with clusters produced EITHER by the TF-IDF OR the word-embedding (Doc2Vec) algorithm.
Go back to the vector inputs corresponding to each of the documents. Did they contain sufficient terms and term-frequency strengths (or term representations, in the case of word embeddings) to give the results that you thought would make sense?
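One way to answer that question for the TF-IDF vectors is to print the highest-weighted terms of every document in a suspect cluster and judge whether those weights could plausibly support the grouping you expected. A self-contained sketch with stand-in data follows; swap in your own documents and your own clustering output.

```python
# Sketch: trace a suspect cluster back to its TF-IDF inputs by printing the
# highest-weighted terms of every member document.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Stand-in corpus: replace with your real movie reviews.
documents = (["a sample movie review about space battles and aliens"] * 10 +
             ["a sample review about a romantic comedy and its jokes"] * 10)

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(documents)
terms = tfidf.get_feature_names_out()
cluster_labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)

suspect_cluster = 0   # whichever cluster looked wrong to you
for doc_id in np.where(cluster_labels == suspect_cluster)[0]:
    row = X[doc_id].toarray().ravel()                 # dense TF-IDF weights for this doc
    top = np.argsort(row)[::-1][:10]                  # indices of the 10 strongest terms
    print(doc_id, [(terms[i], round(float(row[i]), 3)) for i in top if row[i] > 0])
```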
What was not working quite right?
This is the time to dig deeper and improve the results based on how you think these documents should cluster. (I strongly suggest you decide on the ground truth before performing this analysis, e.g., cluster by genre.)
For this assignment, I expect you to formally measure your method's performance (one possible approach is sketched below). We will talk more about this during the sync session.
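One reasonable option (not the only one) is to compare your cluster assignments against the ground-truth labels you chose, such as genre. In the sketch below, `true_genres`, `cluster_labels`, and `X` are placeholders for your own labels, clustering output, and document vectors.

```python
# Sketch: score cluster assignments against a chosen ground truth (e.g. genre).
import numpy as np
from sklearn.metrics import (adjusted_rand_score, homogeneity_score,
                             completeness_score, silhouette_score)

true_genres = np.array([0, 0, 1, 1, 2, 2] * 10)       # stand-in genre labels
cluster_labels = np.array([0, 1, 1, 1, 2, 2] * 10)    # stand-in cluster ids
X = np.random.rand(60, 50)                            # stand-in document vectors

print("Adjusted Rand Index:", adjusted_rand_score(true_genres, cluster_labels))
print("Homogeneity:        ", homogeneity_score(true_genres, cluster_labels))
print("Completeness:       ", completeness_score(true_genres, cluster_labels))
# Silhouette uses only the vectors (no ground truth) and describes cluster compactness.
print("Silhouette:         ", silhouette_score(X, cluster_labels))
```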
If You Select Classification - use the TripAdvisor dataset
Your process will be very similar to the above. However, if you are going to perform classification, you will need a labeled dataset. Luckily, we captured the labels in the metadata when we performed the data collection.
You will likely need to do more than a simple bag-of-words approach. You should explore phrase extraction, n-gram extraction, and the other pre-processing steps we reviewed.
Don't forget that you will need to measure performance. You will use your labeled data as your ground truth (see the sketch below).
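A sketch of a classification baseline that goes beyond plain bag of words by adding bigram features and that scores itself against the metadata labels; `reviews` and `labels` below are placeholders for your collected TripAdvisor data, and the vectorizer settings are illustrative.

```python
# Sketch: n-gram TF-IDF features + a linear classifier, evaluated against the
# labels captured in the TripAdvisor metadata (used here as ground truth).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

reviews = ["great food and friendly staff", "terrible service and cold food"] * 50
labels  = ["positive", "negative"] * 50               # stand-in metadata labels

X_train, X_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.2, random_state=42, stratify=labels)

model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2),     # unigrams + bigrams
                              min_df=2, stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```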
If you would like to explore using a pre-trained LLM for this task, you could experiment with feeding your raw text into such a model. We will provide you with code that you can run in Colab. If you choose this route, you will need to set it up as a binary classification problem. A great example would be sentiment classification, or you could classify types of restaurants (Italian vs. Chinese). It is up to you.
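A minimal sketch of the out-of-the-box route, using the default Hugging Face sentiment pipeline (a DistilBERT fine-tuned on SST-2) as a stand-in for the model the provided Colab code will use; the review texts, star ratings, and the 4-stars-and-up threshold are illustrative assumptions.

```python
# Sketch: out-of-the-box use of a pre-trained sentiment model on raw review text,
# scored against a binary ground truth derived from the star ratings in the metadata.
from transformers import pipeline
from sklearn.metrics import classification_report

reviews = ["The room was spotless and the staff were lovely.",
           "Dirty bathroom and rude reception, never again."]   # stand-in raw text
ratings = [5, 1]                                                 # stand-in star ratings

# Binary ground truth: 4-5 stars -> POSITIVE, 1-2 stars -> NEGATIVE (an assumption).
y_true = ["POSITIVE" if r >= 4 else "NEGATIVE" for r in ratings]

clf = pipeline("sentiment-analysis")
y_pred = [out["label"] for out in clf(reviews, truncation=True)]
print(classification_report(y_true, y_pred))
```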
Using Pre-trained BERT
If you choose to leverage a pre-trained BERT model to perform classification or sentiment analysis, please use the TripAdvisor reviews. TripAdvisor is a great dataset for sentiment analysis, but you could set up other types of classification problems. Your steps will be as follows (a fine-tuning sketch appears after the list):
Preprocess the TripAdvisor review data.
Split the data into train/validation/test (if fine-tuning) or just a test set.
Either use the BERT pre-trained model out of the box or fine-tune a pre-trained BERT model.
Evaluate the model's performance.
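A compact fine-tuning sketch using the standard Hugging Face Trainer recipe (install transformers and datasets in Colab first); the duplicated review lists are placeholders for your real preprocessed TripAdvisor data, and the hyperparameters (one epoch, batch size 8, max length 128) are illustrative assumptions rather than recommendations.

```python
# Sketch: fine-tune bert-base-uncased for binary sentiment on placeholder review data,
# then evaluate on the held-out split.
import numpy as np
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from sklearn.metrics import accuracy_score

texts = (["Wonderful stay, spotless rooms and helpful staff."] * 20 +
         ["Awful experience, noisy rooms and rude staff."] * 20)   # stand-in reviews
labels = [1] * 20 + [0] * 20                                        # 1 = positive

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                           num_labels=2)

ds = Dataset.from_dict({"text": texts, "label": labels})
ds = ds.map(lambda batch: tokenizer(batch["text"], truncation=True,
                                    padding="max_length", max_length=128),
            batched=True)
splits = ds.train_test_split(test_size=0.25, seed=42)   # train/validation split

def compute_metrics(eval_pred):
    logits, y = eval_pred
    return {"accuracy": accuracy_score(y, np.argmax(logits, axis=-1))}

args = TrainingArguments(output_dir="bert-tripadvisor", num_train_epochs=1,
                         per_device_train_batch_size=8, report_to="none")
trainer = Trainer(model=model, args=args, train_dataset=splits["train"],
                  eval_dataset=splits["test"], compute_metrics=compute_metrics)
trainer.train()
print(trainer.evaluate())   # reports eval_loss and eval_accuracy on the held-out split
```

Evaluating both the out-of-the-box model and the fine-tuned model on the same held-out split lets you report the comparison the steps above call for.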
Please give Python code for both.
