Question 2 (30 points): Consider the following strings:
S1: ITCS 3162 Basic Data Mining
S2: ITIS Data Mining
S3: ITCS Basic Data Privacy
S4: ITIS Data Privacy
A. Construct a shingle-document matrix from the strings, treating each word as a
shingle.
B. Compute the signature matrix using different permutations of your choice.
C. Compute all pairwise column similarities for both the shingle-document matrix and the signature matrix.
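Parts A to C can be sketched in Python as follows. This is a minimal illustration, not a model answer: the strings are used without course numbers (an assumption), the three row permutations are hypothetical choices, and the helper names `minhash`, `jaccard`, and `signature_similarity` are made up for this sketch.

```python
from itertools import combinations

# Strings from the question, course numbers left out (assumption).
docs = {
    "S1": "ITCS Basic Data Mining",
    "S2": "ITIS Data Mining",
    "S3": "ITCS Basic Data Privacy",
    "S4": "ITIS Data Privacy",
}

# Part A: one row per word shingle, one column per document.
shingle_sets = {name: set(text.split()) for name, text in docs.items()}
vocab = sorted(set().union(*shingle_sets.values()))
matrix = {word: [int(word in shingle_sets[name]) for name in docs] for word in vocab}

# Part B: hypothetical permutations; each list gives row indices in permuted order.
permutations = [
    [0, 1, 2, 3, 4, 5],
    [5, 4, 3, 2, 1, 0],
    [2, 0, 4, 1, 5, 3],
]

def minhash(shingles, perm):
    """Position, in permuted order, of the first row where the column has a 1."""
    rows = {vocab.index(word) for word in shingles}
    return next(pos for pos, row in enumerate(perm) if row in rows)

signatures = {
    name: [minhash(shingles, perm) for perm in permutations]
    for name, shingles in shingle_sets.items()
}

# Part C: exact Jaccard similarity versus the estimate from the signatures.
def jaccard(a, b):
    return len(a & b) / len(a | b)

def signature_similarity(x, y):
    """Fraction of signature positions on which two columns agree."""
    return sum(i == j for i, j in zip(x, y)) / len(x)

for d1, d2 in combinations(docs, 2):
    exact = jaccard(shingle_sets[d1], shingle_sets[d2])
    estimate = signature_similarity(signatures[d1], signatures[d2])
    print(d1, d2, round(exact, 3), round(estimate, 3))
```

With only three permutations the signature estimate is coarse (it can only take the values 0, 1/3, 2/3, or 1), so it will usually differ from the exact Jaccard similarity.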
Question points: Evaluate the S-curve 1 - (1 - s^r)^b for similarity value s
for the following values of r and b:
a and
b and
c and
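A short Python sketch of the evaluation, assuming the S-curve has its usual LSH form 1 - (1 - s^r)^b, where r is the number of rows per band and b the number of bands. The (r, b) pair and the grid of similarity values below are hypothetical placeholders, not the values from the question.

```python
def s_curve(s, r, b):
    """Probability that two columns with similarity s agree in at least one band."""
    return 1 - (1 - s ** r) ** b

# Hypothetical parameters for illustration only.
r, b = 4, 5
similarities = [i / 10 for i in range(1, 10)]

for s in similarities:
    print(f"s={s:.1f}  S-curve={s_curve(s, r, b):.4f}")
```

The same function can be called once per (r, b) pair to tabulate each part of the question.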
Question points: For each of the (r, b) pairs in the previous question, compute the threshold, that is,
the value of s for which the value of 1 - (1 - s^r)^b is exactly 1/2.
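The threshold can be obtained in closed form. The sketch below assumes the S-curve is 1 - (1 - s^r)^b and the target value is 1/2; the (r, b) pairs are hypothetical, and `threshold` is a name chosen for this example.

```python
def threshold(r, b):
    """Solve 1 - (1 - s**r)**b = 1/2 for s.

    (1 - s**r)**b = 1/2  ->  s**r = 1 - (1/2)**(1/b)  ->  s = (1 - (1/2)**(1/b))**(1/r)
    """
    return (1 - 0.5 ** (1 / b)) ** (1 / r)

# Hypothetical (r, b) pairs for illustration only.
for r, b in [(4, 5), (2, 16)]:
    exact = threshold(r, b)
    rough = (1 / b) ** (1 / r)  # common rule-of-thumb approximation
    print(f"r={r}, b={b}: exact={exact:.4f}, approx={rough:.4f}")
```

The rule of thumb (1/b)^(1/r) is only an approximation; the exact value comes from inverting the S-curve as shown in the docstring.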
Question points: A plagiarism detection service uses locality-sensitive hashing (LSH)
to find similar documents. Suppose the database has documents that you need to
analyze to find similar documents. You have the memory capacity to compute document
signatures of length and you set the number of bands to be and the size of each
band to be rows:
A. What is the probability that two documents that are similar are identical in
one particular band?
B. What is the probability that two documents that are similar are non-identical
in all the bands?
C. What is the probability that two documents that are similar get assigned to
the same bucket?
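The three probabilities follow from the standard banding analysis. The sketch below is written under stated assumptions: s is the similarity of the two documents, r the rows per band, and b the number of bands; part C is read as sharing a bucket in at least one band; and the numeric values are hypothetical, since the question's own values are not shown here.

```python
def band_probabilities(s, r, b):
    """Return the probabilities for parts A, B, and C as a tuple."""
    identical_in_one_band = s ** r                       # part A
    differ_in_all_bands = (1 - identical_in_one_band) ** b  # part B
    share_some_bucket = 1 - differ_in_all_bands          # part C
    return identical_in_one_band, differ_in_all_bands, share_some_bucket

# Hypothetical values for illustration only.
p_a, p_b, p_c = band_probabilities(s=0.8, r=5, b=20)
print(f"A: {p_a:.5f}  B: {p_b:.5f}  C: {p_c:.5f}")
```

Parts B and C are complements, so computing one immediately gives the other.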
Question points: Explain the essential steps to find similarities among documents
using complexity.
Question points: How would you pick k when using the k-means algorithm? Explain
your reasoning.
