Question: Project 4 (Tier 3 option): Median String Motif Search on AWS EMR
Problem: Implement the median-string-based motif search algorithm, as explained in class and summarized in the attached slides, using MapReduce on AWS.
Input: The motif length and a sequence data file named "promotersdataclean.txt" (attached), which contains one sequence per line.
Output: The output of your program must include the following items, each in a separate column:
the motif, i.e., the consensus string found (also called the median string), which is the candidate with the minimum total matching distance (this is the same for all input sequences)
the best match of the motif found in each input sequence
the sequence's ID, i.e., the line number of the sequence in the input file
the local matching distance, i.e., the distance between the motif and the best match found in each sequence
the position index of the best match found in each input sequence (note: the index starts from position [...])
the minimum total matching distance (this is the same for all input sequences)
A sample output is shown below:
Results (consensus string found)
Columns: Motif, MatchMotif, SeqID, Dis, PositionIndex, totalDis (only the Motif and MatchMotif values are recoverable here)
aacgcttt, atcgcttt
aacgcttt, aacgggtc
aacgcttt, aacaagat
aacgcttt, aaggcttc
aacgcttt, aatgcttt
aacgcttt, accgctt
aacgcttt, gaggctct
aacgcttt, aacgggtc
aacgcttt, gagggtgt
aacgcttt, agtgctta
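The per-sequence columns described above (best match, sequence ID, local distance, position index, total distance) could be computed roughly as in the following sketch. It assumes that the matching distance is the Hamming distance between the motif and a same-length window of the sequence, that ties go to the leftmost window, and that positions are counted from 0; adjust these if the assignment specifies otherwise. The function names and the toy sequences are made up for illustration and are not taken from the data file.

```python
def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_match(motif, sequence):
    """Return (distance, position, window) for the window closest to motif.

    Assumptions: ties go to the leftmost window, positions are 0-based,
    and the sequence is at least as long as the motif.
    """
    k = len(motif)
    best = None
    for pos in range(len(sequence) - k + 1):
        window = sequence[pos:pos + k]
        d = hamming(motif, window)
        if best is None or d < best[0]:
            best = (d, pos, window)
    return best


def report_rows(motif, sequences):
    """Build one row per sequence:
    (motif, best match, sequence id, local distance, position, total distance).
    """
    rows = []
    for seq_id, seq in enumerate(sequences, start=1):  # id = line number
        d, pos, window = best_match(motif, seq)
        rows.append((motif, window, seq_id, d, pos))
    total = sum(row[3] for row in rows)  # same value repeated on every row
    return [row + (total,) for row in rows]


if __name__ == "__main__":
    # Toy sequences invented for this example only.
    toy = ["ttaacgctttgg", "ccatcgctttaa", "gaacgggtccat"]
    for row in report_rows("aacgcttt", toy):
        print("\t".join(map(str, row)))
```

Each row repeats the motif and the total distance, which matches the requirement that those two columns are the same for all input sequences.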
Hints: The vital first step in applying the MapReduce framework to real-world problems is to identify what the keys and values are. While there are more advanced approaches, the following hint is only meant to inspire you; do not confine your creativity to it.
You can use the candidate median strings (a total of [...]) as the keys, and the total matching distances of the respective candidates as the values. That means you will not get the keys from the input but will generate them, i.e., enumerate the candidate median strings in your code. Your Map function outputs each median string paired with its total matching distance; your Reduce function reverses each key-value pair, so that (median string, total distance) becomes (total distance, median string). The output of Reduce will be a sorted list of the reversed pairs, and the first pair holds the minimum total matching distance and the motif you have found. Then you can start a second round to produce the required final output.
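The first round described in the hint can be tried out locally before moving to EMR. The sketch below is a plain-Python simulation, not Hadoop code: `map_phase`, `reduce_phase`, and `find_motif` are hypothetical names, the alphabet is assumed to be a, c, g, t, the distance is assumed to be the Hamming distance, and `sorted()` stands in for the sort that the framework would perform between phases. It also assumes every candidate evaluation can read all sequences, each of which is at least as long as the motif.

```python
from itertools import product

ALPHABET = "acgt"  # assumption: lowercase DNA alphabet


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def total_distance(candidate, sequences):
    """Sum, over all sequences, of the distance to the closest window."""
    k = len(candidate)
    return sum(
        min(hamming(candidate, s[i:i + k]) for i in range(len(s) - k + 1))
        for s in sequences
    )


def map_phase(motif_length, sequences):
    """Generate the keys: emit (candidate, total distance) for every
    string of the given length over the alphabet."""
    for letters in product(ALPHABET, repeat=motif_length):
        candidate = "".join(letters)
        yield candidate, total_distance(candidate, sequences)


def reduce_phase(pairs):
    """Reverse each pair so the distance becomes the sort key."""
    for candidate, distance in pairs:
        yield distance, candidate


def find_motif(motif_length, sequences):
    """Return (minimum total distance, motif); ties break alphabetically."""
    reversed_pairs = sorted(reduce_phase(map_phase(motif_length, sequences)))
    return reversed_pairs[0]


if __name__ == "__main__":
    # Toy sequences invented for this example only.
    toy = ["ttacgtt", "gacgaaa", "ccacgcc"]
    print(find_motif(3, toy))
```

The number of candidates grows as 4 to the power of the motif length, so on a cluster the enumeration would normally be split across mappers. If distances are emitted as text keys, they may sort lexicographically rather than numerically unless they are zero-padded or a numeric comparator is configured.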
