Question: Use the group_1.csv data in the canvas. The dataset is a collection of 300 support
tickets manually labeled for semantic similarity, obtained from an IT support company in the Florianópolis,
Brazil region. Each ticket is represented by an unstructured text field, which is typed by the user who
opened the call. The labeling process was performed by three IT support professionals. The corpus
contains tickets in many languages, mainly English, German, Portuguese and Spanish. For this task, you
must use only the English-language support tickets.
First, preprocess the documents using tokenization (you can use x.split() in Python) and lowercasing
(you can use x.lower()). Create a term-frequency document matrix and calculate the document similarity
matrix using:
a) Euclidean distance,
b) Manhattan distance,
c) Cosine similarity.
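
For reference, the three measures can be written out for two rows of the term-frequency matrix. The notation is an assumption, not part of the question: x and y are the count vectors of two documents, and V is the vocabulary size.

```latex
d_{\text{Euclid}}(x, y) = \sqrt{\sum_{i=1}^{V} (x_i - y_i)^2}
\qquad
d_{\text{Manhattan}}(x, y) = \sum_{i=1}^{V} \lvert x_i - y_i \rvert
\qquad
\operatorname{sim}_{\cos}(x, y) = \frac{\sum_{i=1}^{V} x_i \, y_i}
     {\sqrt{\sum_{i=1}^{V} x_i^2} \; \sqrt{\sum_{i=1}^{V} y_i^2}}
```

The first two are distances, so smaller values mean more similar documents. Cosine similarity runs the other way: larger values mean more similar documents.
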
For this task, you may build the term-frequency matrix with a tool such as CountVectorizer in Python. However, to
calculate the similarity matrix you must not use any direct package. The output of this question should be
three matrices. Also list, by rank, the five documents that are most similar to the given document id using
each of the three measures, and give a comparative analysis.
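
One way the task could be approached is sketched below, using only the Python standard library so that no distance package is involved. The toy corpus, the function names and the query index are all hypothetical stand-ins. The real input would be the English tickets from the CSV, whose column layout is not given here, and the term-frequency step could equally be done with CountVectorizer as the question allows.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text, then split on whitespace (x.lower() and x.split())."""
    return text.lower().split()


def term_frequency_matrix(docs):
    """Return (vocabulary, matrix); matrix[i][j] counts vocabulary term j in document i."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[t] for t in vocab] for c in counts]
    return vocab, matrix


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # An all-zero (empty) document has no direction; treat its similarity as 0.
    return dot / norm if norm else 0.0


def pairwise(matrix, measure):
    """Apply `measure` to every pair of rows, giving an n x n matrix."""
    return [[measure(u, v) for v in matrix] for u in matrix]


def top_k(scores, query, k=5, higher_is_closer=False):
    """Rank the other documents against `query`.

    Distances sort ascending; similarities sort descending.
    """
    others = [i for i in range(len(scores)) if i != query]
    return sorted(others, key=lambda i: scores[query][i], reverse=higher_is_closer)[:k]


# Hypothetical toy corpus standing in for the English tickets in the CSV.
docs = [
    "Printer not working",
    "printer is not working again",
    "Cannot reset my password",
    "password reset link not working",
]

vocab, tf = term_frequency_matrix(docs)
euclid = pairwise(tf, euclidean)
manhat = pairwise(tf, manhattan)
cosine = pairwise(tf, cosine_similarity)

query = 0  # hypothetical query document index
print("Euclidean:", top_k(euclid, query, k=3))
print("Manhattan:", top_k(manhat, query, k=3))
print("Cosine:   ", top_k(cosine, query, k=3, higher_is_closer=True))
```

For the comparative analysis, one point worth checking on the real data is document length. Euclidean and Manhattan distance operate on raw counts, so a long ticket can look far from a short one on the same topic, while cosine similarity normalizes by vector length and depends only on the relative term proportions. The three rankings may therefore differ for tickets of very different lengths.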
