Project 4(Tier 3 option): Median String Motif Search on AWS EMR
Problem: Implement the median-string-based Motif search algorithm (as explained in class and summarized in the attached slides) using MapReduce on AWS.
Input: Motif length l = 8 and the attached sequence data file, "promoters_data_clean.txt", which contains 106 sequences (one per line).
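The slides referenced above are not reproduced here, so the following is only a minimal single-machine sketch of the usual median-string definition, assuming the matching distance is the Hamming distance and the alphabet is lowercase `acgt`. The helper names (`hamming`, `min_distance`, `total_distance`, `median_string`) and the toy sequences are illustrative assumptions, not part of the assignment.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def min_distance(candidate, seq):
    """Smallest Hamming distance between candidate and any window of seq.

    Assumes len(seq) >= len(candidate).
    """
    l = len(candidate)
    return min(hamming(candidate, seq[i:i + l]) for i in range(len(seq) - l + 1))


def total_distance(candidate, seqs):
    """Sum of the per-sequence minimum distances."""
    return sum(min_distance(candidate, s) for s in seqs)


def median_string(seqs, l, alphabet="acgt"):
    """Brute force over all len(alphabet)**l candidates.

    Ties are broken alphabetically (an assumption of this sketch).
    """
    candidates = ("".join(p) for p in product(alphabet, repeat=l))
    best = min(candidates, key=lambda c: (total_distance(c, seqs), c))
    return best, total_distance(best, seqs)


# Toy data (hypothetical): "acgt" occurs exactly in every sequence.
toy = ["ttacgtaa", "ggacgtcc", "acgaacgt"]
motif, dist = median_string(toy, 4)
```

With l = 8 the same loop would visit 4^8 = 65536 candidates, which is the work the assignment asks you to distribute with MapReduce.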
Output: The output of your program must include the following items (each in a separate column):
the motif, i.e., the consensus string found (also called the median string), which is the candidate with the minimum total matching distance (this is the same for all input sequences),
the best match of the motif found in each input sequence,
the sequence's id (i.e., the line number of the sequence in the input file),
the local matching distance (i.e., the distance between the motif and the best match found in each sequence),
the position index of the best match found in each input sequence (note: the index starts from position 1),
the minimum total matching distance (this is the same for all input sequences).
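The six output columns listed above can be sketched as follows, once a motif is known. This is an illustration under assumptions: Hamming distance as the matching distance, the leftmost window winning ties, and hypothetical helper names `best_match` and `report_rows`.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def best_match(motif, seq):
    """Return (matched substring, distance, 1-based position) for one sequence.

    min() over (distance, index) tuples picks the leftmost window on ties.
    Assumes len(seq) >= len(motif).
    """
    l = len(motif)
    dist, start = min(
        (hamming(motif, seq[i:i + l]), i) for i in range(len(seq) - l + 1)
    )
    return seq[start:start + l], dist, start + 1


def report_rows(motif, seqs):
    """Build rows: (motif, best match, sequence id, distance, position, total)."""
    per_seq = [
        (seq_id, *best_match(motif, s)) for seq_id, s in enumerate(seqs, start=1)
    ]
    total = sum(dist for _, _, dist, _ in per_seq)
    return [
        (motif, match, seq_id, dist, pos, total)
        for seq_id, match, dist, pos in per_seq
    ]


# Toy data (hypothetical), not taken from promoters_data_clean.txt.
rows = report_rows("acgt", ["ttacgtaa", "ggacctcc"])
for row in rows:
    print("\t".join(str(field) for field in row))
```

The sequence id comes from `enumerate(..., start=1)`, matching the requirement that ids are line numbers and that positions are 1-based.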
A sample output is shown below:
Results
Motif (consensus string found)  MatchMotif  SeqID  Dis  Position Index  totalDis
aacgcttt                        atcgcttt    97     1    49              264
aacgcttt                        aacgggtc    104    3    20              264
aacgcttt                        aacaagat    102    4    8               264
aacgcttt                        aaggcttc    103    2    48              264
aacgcttt                        aatgcttt    100    1    30              264
aacgcttt                        accgctt     98     1    22              264
aacgcttt                        gaggctct    105    3    17              264
aacgcttt                        aacgggtc    101    3    15              264
aacgcttt                        gagggtgt    99     4    1               264
aacgcttt                        agtgctta    69     3    1               264
Hints: The vital first step in applying the MapReduce framework to a real-world problem is to identify what the keys and values are. While there are more advanced approaches, the following hint is only meant to inspire your creativity (don't confine your creativity to the frame of this hint!).
You can use the candidate median strings (4^8 = 65536 in total) as the keys, and the total matching distances of the respective candidates as the values. That means you will not get the keys from the input but generate them (i.e., enumerate the candidate median strings) in your code. Your Map function outputs each median string paired with its total matching distance; your Reduce function reverses each key/value pair, i.e., (k, v) becomes (v, k). The output of Reduce will be a sorted list of the reversed pairs, and the first pair holds the minimum total matching distance and the motif you have found. Then you can start a second round of MapReduce to produce the required final
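The first round described in the hint can be simulated in plain Python to check the key/value design before moving to a cluster. This sketch makes several assumptions: Hamming distance, every mapper can read all sequences, and candidates are generated from integer indices so the 4^l keys can be split across mappers. All function names are hypothetical, and no Hadoop or EMR API is used.

```python
ALPHABET = "acgt"


def index_to_kmer(index, l):
    """Decode an integer in [0, 4**l) into a length-l string (base-4 digits)."""
    chars = []
    for _ in range(l):
        index, r = divmod(index, 4)
        chars.append(ALPHABET[r])
    return "".join(reversed(chars))


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def total_distance(candidate, seqs):
    """Sum over sequences of the minimum window distance."""
    l = len(candidate)
    return sum(
        min(hamming(candidate, s[i:i + l]) for i in range(len(s) - l + 1))
        for s in seqs
    )


def map_candidates(index_range, seqs, l):
    """Map: emit (candidate, total distance) for one split of candidate indices."""
    for idx in index_range:
        cand = index_to_kmer(idx, l)
        yield cand, total_distance(cand, seqs)


def reduce_swap(pairs):
    """Reduce: reverse each pair, (k, v) -> (v, k)."""
    for k, v in pairs:
        yield v, k


def run_round_one(seqs, l, num_splits=4):
    """Simulate round one; sorted() stands in for the framework's key sort."""
    n = 4 ** l
    step = -(-n // num_splits)  # ceiling division
    splits = [range(s, min(s + step, n)) for s in range(0, n, step)]
    mapped = [pair for split in splits for pair in map_candidates(split, seqs, l)]
    return sorted(reduce_swap(mapped))


# Toy data (hypothetical); the real job would use l = 8 and the 106 sequences.
toy = ["ttacgtaa", "ggacgtcc", "acgaacgt"]
result = run_round_one(toy, 4)
best_distance, best_motif = result[0]
```

Here the sort compares distances as integers. On a real cluster the key ordering depends on how the job is configured, so it is worth checking that distances are not compared as text (where "100" would sort before "99").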