Question:

Use the group_1.csv data in the canvas. The dataset is a collection of 300 support
tickets manually labeled for semantic similarity, obtained from an IT support company in the Florianópolis
(Brazil) region. Each ticket is represented by an unstructured text field, typed by the user who
opened the call. The labeling process was performed in 2022 by three IT support professionals. The corpus
contains tickets in many languages, mainly English, German, Portuguese and Spanish. For this task, you
must use only the English-language support tickets.
First, preprocess the documents using tokenization (you can use x.split() in Python) and lowercasing (you can use
x.lower()). Then create a term-frequency document matrix and calculate the document similarity
matrix using:
a. Euclidean distance,
b. Manhattan distance,
c. Cosine similarity.
For this task, you may build the term-frequency matrix with a tool such as CountVectorizer in Python. However, to
calculate the similarity matrices you must not call any package function that computes them directly. The output of
this question should be three matrices. Also list, in rank order, the five documents most similar to document
id 98287 under each of the three measures, and give a comparative analysis.
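
One possible way to approach the task is sketched below in plain Python, with no package used for the distance calculations. It is only a sketch under stated assumptions: the layout of group_1.csv is not given in the question, so the code runs on a small made-up list of ids and texts instead of the real file, and every function name (preprocess, term_frequency_matrix, pairwise, top_k) is hypothetical. The term-frequency step is done by hand here; CountVectorizer could be used for that step instead, as the question allows.

```python
import math
from collections import Counter


def preprocess(text):
    """Lower-case the text and split it on whitespace."""
    return text.lower().split()


def term_frequency_matrix(docs):
    """Return (vocab, matrix); matrix[i][j] counts vocab[j] in docs[i]."""
    tokenized = [preprocess(d) for d in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {term: j for j, term in enumerate(vocab)}
    matrix = []
    for toks in tokenized:
        row = [0] * len(vocab)
        for term, count in Counter(toks).items():
            row[index[term]] = count
        matrix.append(row)
    return vocab, matrix


def euclidean(u, v):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """Sum of the absolute differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_u * norm_v)


def pairwise(matrix, metric):
    """Build the full n x n matrix for the given metric."""
    n = len(matrix)
    return [[metric(matrix[i], matrix[j]) for j in range(n)] for i in range(n)]


def top_k(ids, scores_row, query_id, k=5, higher_is_closer=False):
    """Rank all other documents against the query document.

    Distances rank ascending; similarities rank descending
    (set higher_is_closer=True for cosine similarity).
    """
    others = [(d, s) for d, s in zip(ids, scores_row) if d != query_id]
    others.sort(key=lambda pair: pair[1], reverse=higher_is_closer)
    return others[:k]


# Made-up stand-in data; the real ids and texts would come from group_1.csv
# after keeping only the English tickets.
ids = [1, 2, 3, 4]
docs = [
    "Printer not working",
    "printer not printing",
    "Cannot reset password",
    "password reset not working",
]

vocab, tf = term_frequency_matrix(docs)
euclid = pairwise(tf, euclidean)
manhat = pairwise(tf, manhattan)
cosine = pairwise(tf, cosine_similarity)

query_id = 1  # would be 98287 on the real data
q = ids.index(query_id)
print("Euclidean:", top_k(ids, euclid[q], query_id, k=2))
print("Manhattan:", top_k(ids, manhat[q], query_id, k=2))
print("Cosine:   ", top_k(ids, cosine[q], query_id, k=2, higher_is_closer=True))
```

For the comparative analysis, keep in mind that Euclidean and Manhattan distance both grow with the raw count differences, so a long ticket can look far from a short one even when they share most of their vocabulary. Cosine similarity divides by the vector norms and therefore compares the direction of the term-count vectors rather than their length. The three rankings for id 98287 may differ for that reason.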
