Question: Expected Behavior Write a Python function seq_sim(seq1, seq2, k) that takes as arguments two strings seq1 and seq2 and an integer k , and returns
Expected Behavior
Write a Python function seq_sim(seq1, seq2, k) that takes as arguments two strings seq1 and seq2 and an integer k, and returns a floating point value between 0 and 1 (inclusive) giving the similarity between seq1 and seq2. The similarity value should be computed as the Jaccard index applied to the sets of k-grams of seq1 and seq2 (where k is the third argument to the function). See Algorithm A1 on the phylogenetic problem spec for details of the algorithm.
Algorithm A1. Similarity between two genome sequences
The similarity between two sequences is a number between 0 and 1: a similarity of 1 indicates that the sequences are identical, while a similarity of 0 indicates that the sequences have nothing in common. The intuition behind our algorithm for computing the similarity between two sequences is that if we chop each sequence up into small pieces, then similar sequences will have a lot of pieces in common while dissimilar sequences will not.
This intuition is formalized using the Jaccard index. Given two sequences A and B, let the corresponding sets of N-grams be ngrams(A) and ngrams(B) respectively. Then, the similarity between A and B is given by
| similarity(A, B) = | | ngrams(A) ? ngrams(B) | |
| | ngrams(A) ? ngrams(B) | |
where |S| denotes the size of the set S. Note that different values of N for computing N-grams can produce different distance values for the same sequences, and thereby give rise to different phylogenetic trees.
For the purposes of this assignment, you should implement your N-grams as Python sets. The reason is to simplify the computation of unions and intersections:
to create an empty set: set()
the set corresponding to a list L is given by: set(L)
the union of sets S1 and S2 is given by: S1.union(S2)
the intersection of sets S1 and S2 is given by: S1.intersection(S2)
Note: Treating the N-grams for genome sequences as sets means that the presence or absence of duplicates in the sequences will not affect their similarity measure. This may not always seem intuitive.
You can use the code from the previous short problem as a helper function for this problem.
You can assume that both seq1 and seq2 have length at least k.
Step by Step Solution
There are 3 Steps involved in it
Get step-by-step solutions from verified subject matter experts
