Question: Expected Behavior Write a Python function seq_sim(seq1, seq2, k) that takes as arguments two strings seq1 and seq2 and an integer k , and returns

Expected Behavior

Write a Python function seq_sim(seq1, seq2, k) that takes as arguments two strings seq1 and seq2 and an integer k, and returns a floating point value between 0 and 1 (inclusive) giving the similarity between seq1 and seq2. The similarity value should be computed as the Jaccard index applied to the sets of k-grams of seq1 and seq2 (where k is the third argument to the function). See Algorithm A1 on the phylogenetic problem spec for details of the algorithm.

Algorithm A1. Similarity between two genome sequences

The similarity between two sequences is a number between 0 and 1: a similarity of 1 indicates that the sequences are identical, while a similarity of 0 indicates that the sequences have nothing in common. The intuition behind our algorithm for computing the similarity between two sequences is that if we chop each sequence up into small pieces, then similar sequences will have a lot of pieces in common while dissimilar sequences will not.

This intuition is formalized using the Jaccard index. Given two sequences A and B, let the corresponding sets of N-grams be ngrams(A) and ngrams(B) respectively. Then, the similarity between A and B is given by

similarity(A, B) = | ngrams(A) ? ngrams(B) |
| ngrams(A) ? ngrams(B) |

where |S| denotes the size of the set S. Note that different values of N for computing N-grams can produce different distance values for the same sequences, and thereby give rise to different phylogenetic trees.

For the purposes of this assignment, you should implement your N-grams as Python sets. The reason is to simplify the computation of unions and intersections:

to create an empty set: set()

the set corresponding to a list L is given by: set(L)

the union of sets S1 and S2 is given by: S1.union(S2)

the intersection of sets S1 and S2 is given by: S1.intersection(S2)

Note: Treating the N-grams for genome sequences as sets means that the presence or absence of duplicates in the sequences will not affect their similarity measure. This may not always seem intuitive.

You can use the code from the previous short problem as a helper function for this problem.

You can assume that both seq1 and seq2 have length at least k.

Step by Step Solution

There are 3 Steps involved in it

1 Expert Approved Answer
Step: 1 Unlock blur-text-image
Question Has Been Solved by an Expert!

Get step-by-step solutions from verified subject matter experts

Step: 2 Unlock
Step: 3 Unlock

Students Have Also Explored These Related Databases Questions!