Question: Question 2 (30 points): Consider the following strings: S1: ITCS 3162 Basic Data Mining S2

Question 2 (30 points): Consider the following strings:
S1: ITCS 3162 Basic Data Mining
S2: ITIS 3162 Data Mining
S3: ITCS 3151 Basic Data Privacy
S4: ITIS 3151 Data Privacy
A. Construct a shingle-document matrix from the 4 strings, treating each word as a
shingle.
B. Compute the signature matrix using 4 different permutations of your choice.
C. Compute all pairwise column similarities for both Col/Col (shingle-document matrix) and Sig/Sig (signature matrix).
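The three parts of Question 2 can be sketched in Python as follows. This is a minimal illustration, not a reference solution: it assumes whitespace-separated words are the shingles, draws the 4 permutations with a seeded random shuffle (the question leaves the choice of permutations open), and uses helper names (`jaccard`, `minhash_signatures`, `sig_similarity`) that are hypothetical.

```python
import random
from itertools import combinations

docs = {
    "S1": "ITCS 3162 Basic Data Mining",
    "S2": "ITIS 3162 Data Mining",
    "S3": "ITCS 3151 Basic Data Privacy",
    "S4": "ITIS 3151 Data Privacy",
}

# Part A: each word is a shingle; rows are shingles, columns are documents.
shingle_sets = {name: set(text.split()) for name, text in docs.items()}
vocab = sorted(set().union(*shingle_sets.values()))
matrix = {w: [int(w in shingle_sets[d]) for d in docs] for w in vocab}


def jaccard(a, b):
    """Col/Col similarity: Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b)


def minhash_signatures(shingle_sets, vocab, num_perms=4, seed=0):
    """Part B: one signature row per random permutation of the shingle rows.

    The min-hash value of a document is the index of the first row, in
    permuted order, whose shingle occurs in that document.
    """
    rng = random.Random(seed)  # assumption: any 4 permutations are acceptable
    sigs = {d: [] for d in shingle_sets}
    for _ in range(num_perms):
        perm = vocab[:]
        rng.shuffle(perm)
        for d, s in shingle_sets.items():
            sigs[d].append(next(i for i, w in enumerate(perm) if w in s))
    return sigs


def sig_similarity(a, b):
    """Sig/Sig similarity: fraction of signature rows on which two columns agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


signatures = minhash_signatures(shingle_sets, vocab)

# Part C: all pairwise similarities, on the full columns and on the signatures.
for a, b in combinations(docs, 2):
    col = jaccard(shingle_sets[a], shingle_sets[b])
    sig = sig_similarity(signatures[a], signatures[b])
    print(f"{a}-{b}: Col/Col = {col:.3f}, Sig/Sig = {sig:.3f}")
```

With only 4 permutations the Sig/Sig values are coarse (multiples of 0.25), so they may differ noticeably from the Col/Col values; different permutations give different signatures.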
Question 3 (30 points): Evaluate the S-curve 1 - (1 - t^r)^b for similarity values t = 0.1, 0.2, 0.3,
0.4, 0.5, 0.6, 0.7, 0.8, 0.9, for the following values of r and b:
a. r = 5 and b = 20.
b. r = 10 and b = 30.
c. r = 25 and b = 10.
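The evaluation asked for in Question 3 can be tabulated with a few lines of Python. This sketch assumes the standard LSH S-curve 1 - (1 - t^r)^b, where r is the number of rows per band and b the number of bands; the function name `s_curve` is hypothetical.

```python
def s_curve(t, r, b):
    """Probability that a pair with similarity t agrees in at least one of b bands of r rows."""
    return 1 - (1 - t ** r) ** b


similarities = [i / 10 for i in range(1, 10)]  # t = 0.1, 0.2, ..., 0.9
pairs = [(5, 20), (10, 30), (25, 10)]          # the (r, b) pairs a, b, c

table = {(r, b): [s_curve(t, r, b) for t in similarities] for r, b in pairs}

for (r, b), values in table.items():
    print(f"r = {r}, b = {b}")
    for t, p in zip(similarities, values):
        print(f"  t = {t:.1f}  ->  {p:.6f}")
```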
Question 4 (15 points): For each of the (r, b) pairs in Question 3, compute the threshold, that is,
the value of t for which the value of 1 - (1 - t^r)^b is exactly 1/2.
Question 5 (30 points): A plagiarism detection service uses locality-sensitive hashing (LSH)
to find similar documents. Suppose the database has 100,000 documents that you need to
analyze to find similar documents. You have the memory capacity to compute document
signatures of length 1024, and you set the number of bands to be 25 and the size of each
band to be 16 rows:
A. What is the probability that two documents that are 75% similar are identical in
one particular band?
B. What is the probability that two documents that are 50% similar are non-identical
in all the bands?
C. What is the probability that two documents that are 80% similar get assigned to
the same bucket?
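The three probabilities in Question 5 follow from the banding formulas, using only the stated r = 16 rows per band and b = 25 bands (the stated signature length and document count do not enter these formulas). The sketch below assumes that "assigned to the same bucket" in part C means the two documents hash to the same bucket in at least one band; the function names are hypothetical.

```python
R = 16  # rows per band, as stated in the question
B = 25  # number of bands, as stated in the question


def band_match_prob(t, r=R):
    """Part A: all r rows of one particular band agree, t^r."""
    return t ** r


def no_band_match_prob(t, r=R, b=B):
    """Part B: no band is identical, (1 - t^r)^b."""
    return (1 - t ** r) ** b


def candidate_prob(t, r=R, b=B):
    """Part C (assumed reading): at least one band is identical, 1 - (1 - t^r)^b."""
    return 1 - no_band_match_prob(t, r, b)


p_a = band_match_prob(0.75)
p_b = no_band_match_prob(0.50)
p_c = candidate_prob(0.80)

print(f"A. P(identical in one particular band | t = 0.75) = {p_a:.6f}")
print(f"B. P(non-identical in all bands       | t = 0.50) = {p_b:.6f}")
print(f"C. P(same bucket in some band         | t = 0.80) = {p_c:.6f}")
```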
Question 6 (15 points): Explain the 3 essential steps to find similarities among documents
using O(N) complexity.
Question 7 (20 points): How would you pick k when using the k-Means algorithm? Explain
your reasoning.
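One common heuristic for Question 7 is the elbow method: run k-means for several values of k, record the within-cluster sum of squared distances (inertia), and pick the k after which the inertia stops dropping sharply. The sketch below is a small pure-Python illustration under stated assumptions: the data set is synthetic (three hand-made clusters), initialization is a deterministic farthest-first rule rather than the usual random start, and all function names are hypothetical. Other criteria, such as the silhouette score, are equally reasonable answers.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_first_init(points, k):
    """Deterministic init: start at the first point, then repeatedly add the
    point farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def kmeans_inertia(points, k, max_iter=100):
    """Run Lloyd's algorithm and return the final within-cluster sum of squares."""
    centroids = farthest_first_init(points, k)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[j].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = [
            tuple(sum(c) / len(members) for c in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


# Synthetic data (an assumption for illustration): three tight clusters.
centers = [(0, 0), (10, 10), (20, 0)]
offsets = [(-0.5, 0), (0.5, 0), (0, -0.5), (0, 0.5), (0, 0)]
points = [(cx + dx, cy + dy) for cx, cy in centers for dx, dy in offsets]

inertias = {k: kmeans_inertia(points, k) for k in range(1, 7)}
for k, inertia in inertias.items():
    print(f"k = {k}: inertia = {inertia:.3f}")
```

On this toy data the inertia should drop sharply up to k = 3 and change comparatively little after that, which is the "elbow" the heuristic looks for. Real data rarely shows such a clean bend, so the choice usually also depends on domain knowledge about how many groups are meaningful.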