Question: Expected Behavior Write a Python function seq_sim(seq1, seq2, k) that takes as arguments two strings seq1 and seq2 and an integer k , and returns

Expected Behavior

Write a Python function seq_sim(seq1, seq2, k) that takes as arguments two strings seq1 and seq2 and an integer k, and returns a floating point value between 0 and 1 (inclusive) giving the similarity between seq1 and seq2. The similarity value should be computed as the Jaccard index applied to the sets of k-grams of seq1 and seq2 (where k is the third argument to the function). See Algorithm A1 on the phylogenetic problem spec for details of the algorithm.

Algorithm A1. Similarity between two genome sequences

The similarity between two sequences is a number between 0 and 1: a similarity of 1 indicates that the sequences are identical, while a similarity of 0 indicates that the sequences have nothing in common. The intuition behind our algorithm for computing the similarity between two sequences is that if we chop each sequence up into small pieces, then similar sequences will have a lot of pieces in common while dissimilar sequences will not.

This intuition is formalized using the Jaccard index. Given two sequences A and B, let the corresponding sets of N-grams be ngrams(A) and ngrams(B) respectively. Then, the similarity between A and B is given by

similarity(A, B) =	\| ngrams(A) ? ngrams(B) \|
	\| ngrams(A) ? ngrams(B) \|

where |S| denotes the size of the set S. Note that different values of N for computing N-grams can produce different distance values for the same sequences, and thereby give rise to different phylogenetic trees.

For the purposes of this assignment, you should implement your N-grams as Python sets. The reason is to simplify the computation of unions and intersections:

to create an empty set: set()

the set corresponding to a list L is given by: set(L)

the union of sets S1 and S2 is given by: S1.union(S2)

the intersection of sets S1 and S2 is given by: S1.intersection(S2)

Note: Treating the N-grams for genome sequences as sets means that the presence or absence of duplicates in the sequences will not affect their similarity measure. This may not always seem intuitive.

You can use the code from the previous short problem as a helper function for this problem.

You can assume that both seq1 and seq2 have length at least k.

Step by Step Solution

There are 3 Steps involved in it

1 Expert Approved Answer

Step: 1 Unlock blur-text-image

Question Has Been Solved by an Expert!

Get step-by-step solutions from verified subject matter experts

Step: 2 Unlock

Step: 3 Unlock

Students Have Also Explored These Related Databases Questions!

Expected Behavior Write a Python function seq_sim(seq1, seq2, k) that takes as arguments two strings seq1 and seq2 and an integer k , and returns a floating point value between 0 and 1 (inclusive)...

Previous problems's code : CSc 120: Sequence-set Similarity Expected Behavior Write a Python function seq set_sim(seq set1, seq set2, k) that takes as arguments two sets of strings seq set1 and seq...

Expected Behavior Write a Python function seq_set_sim(seq_set1, seq_set2, k) that takes as arguments two sets of strings seq_set1 and seq_set2 and an integer k , and returns a floating point value...

Write a Python function seq_sim(seq1, seq2, k) that takes as arguments two strings seq1 and seq2 and an integer k , and returns a floating point value between 0 and 1 (inclusive) giving the...

Write a Python function seq_set_sim(seq_set1, seq_set2, k) that takes as arguments two sets of strings seq_set1 and seq_set2 and an integer k , and returns a floating point value between 0 and 1...

use python do the work in every space where it says #YOUR CODE HERE A Sock Drawer Let us model a sock drawer. How does a sock drawer work? Well, when you do the washing you end up with a bunch of...

Python 3 please solve as much as possible Exercise 4. Hailstone Sequence Write a function hailstone(n) that takes an integer parameter n (assume it is always > 0), and that displays the hailstone...

Lab Exercises: Question #1 Write the code for a Python function called loadDictionary that accepts a dictionary, a list, and a letter (with a default value of 'X') and stores the list in the...

1. List some disadvantages of trying to sell a home yourself. 2. List one advantage and one disadvantage of using a real estate broker to sell a home. 3. Describe two costs associated with selling a...

The Holtz Corporation acquired 80 percent of the 100,000 outstanding voting shares of Devine, Inc., for $6.90 per share on January 1, 2017. The remaining 20 percent of Devine's shares also traded...

2. Suppose you own a video game store. List some of the fixed inputs and variable inputs you would use in operating the store.

use Python 1. Carry out the following in IDLE: 7.8+0.2+8.9-2.1. What is the answer from Python? What is the exact answer? To get to the bottom of this, compute the machine epsilon: EM Defn: em, is...

Resolves conflicts through constructive dialogue to create solutions from which all benefit.

Establishes clear accountabilities for self and the team.

Demonstrates focus on the objectives and end results by effective timely delivery.