Question: Homework 5: HMMs And Models (100 points total)

Homework 5: HMMs And Models
(100 points total)
Assignment guidelines
Submit your assignment files on Canvas under module 11: COVID and Ancient Genomes
Please submit your code in file(s) called [name].py. Your code should be easy to open in
a text editor so that someone can download and use the function you write.
Please submit a pdf with the answers to the questions at the bottom of the assignment
(and your visualization(s))
Please submit a text file with the output of your code
Complete the class assignment: Nucleotide Composition HMM (50 points)
Train the model on the files labeled as "training". Test the model on the files labeled as pathogen
and Spacii. In your output, include a classification and a score for each sequence in both files.
Also, include a plot showing the distributions of scores for the pathogen and Spacii
sequences.
Complete the functions getLogLike and trainModel (30 points)
a. Code meets specifications - each function exists, takes the correct input, and produces the
correct output
i. Code includes a getLogLike() function which computes the correct
log-likelihood. (15 points)
ii. Code includes a trainModel() function which takes in the sequences of
pathogen and Spacii and measures model parameters (15 points)
Draw (by hand is fine) the HMM that represents our pathogen model. Label all the
transition states and indicate the emissions for each state (10 points)
Imagine that, due to incidental overlaps, when you assembled the Spacii genome you
combined a bit of the parasite genome and your Spacii genome into a single contig.
Describe how you would combine your models into a single HMM and use dynamic
programming to identify a likely merge point between the Spacii and parasite sequences.
Dynamic programming can be used to identify the most likely merge point.
This involves finding the best path through the combined HMM that
separates the parasite and Spacii sequences.
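
The merge-point idea above can be sketched as a two-state HMM decoded with the Viterbi algorithm. This is only an illustration under stated assumptions, not the course's reference solution: the function name viterbi_merge_point, the switching probability p_switch, and the uniform start are all hypothetical choices, each hidden state is assumed to emit the next base using that organism's trained 4 x 4 matrix, and the sequence is assumed to contain only A, C, G, and T.

```python
import math

BASE_IDX = {"A": 0, "C": 1, "G": 2, "T": 3}


def viterbi_merge_point(seq, models, p_switch=1e-3):
    """Return (state_path, merge_index) for a two-state combined HMM.

    models[s][prev][curr] is P(curr | prev) under hidden state s
    (for example s=0 for Spacii and s=1 for the pathogen).
    p_switch is an assumed, small probability of changing state.
    merge_index is the first position whose state differs from the
    previous position, or None if the best path never switches.
    """
    n_states = len(models)
    log_stay = math.log(1.0 - p_switch)
    log_switch = math.log(p_switch)

    # Assumption: uniform start, and the first base carries no emission term.
    score = [math.log(1.0 / n_states)] * n_states
    back = []  # back[t][s] = best previous state for state s at position t + 1

    for prev, curr in zip(seq, seq[1:]):
        i, j = BASE_IDX[prev], BASE_IDX[curr]
        new_score, ptr = [], []
        for s in range(n_states):
            cands = [
                score[r] + (log_stay if r == s else log_switch)
                for r in range(n_states)
            ]
            best = max(range(n_states), key=lambda r: cands[r])
            ptr.append(best)
            new_score.append(cands[best] + math.log(models[s][i][j]))
        score = new_score
        back.append(ptr)

    # Trace back from the best final state to recover the full path.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()

    merge = next(
        (k for k in range(1, len(path)) if path[k] != path[k - 1]), None
    )
    return path, merge
```

A call such as viterbi_merge_point(contig, [spaciiTrainModel, pathTrainModel]) would return the most likely state for every base, and the first index where the state changes is a candidate merge point. Lowering p_switch makes the path less willing to switch, which reduces spurious merge points in short, noisy stretches.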
#Basecount.py:
import math
import matplotlib.pyplot as plt  # if you don't have matplotlib installed (this line gives you an error),
# you can comment out the lines that start with plt. and plot a histogram from the data output instead

baseIDx = {"A": 0, "C": 1, "G": 2, "T": 3}

def main():
    spaciiFA = "MSpacii.fa"
    pathogenFA = "pathogen.fa"
    spaciiFA_T = "MSpacii_training.fa"
    pathogenFA_T = "pathogen_training.fa"
    spaciiID2seq = getSeq(spaciiFA)
    pathogenID2seq = getSeq(pathogenFA)
    spaciiTrainModel = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
    pathTrainModel = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
    spaciiTrainModel = trainModel(spaciiTrainModel, spaciiFA_T)
    pathTrainModel = trainModel(pathTrainModel, pathogenFA_T)
    markovScoresSpacii = []
    markovScoresPath = []
    for ID in spaciiID2seq.keys():
        markovScoresSpacii.append(getLogLike(spaciiTrainModel, pathTrainModel, spaciiID2seq[ID]))
    for ID in pathogenID2seq.keys():
        markovScoresPath.append(getLogLike(spaciiTrainModel, pathTrainModel, pathogenID2seq[ID]))
    ####----------------------output-------------------------
    plt.hist([markovScoresPath, markovScoresSpacii], bins=20, label=['pathogen', 'spacii'], rwidth=1, density=True)
    scoresOutputText(markovScoresSpacii, markovScoresPath)
    ####----------------------output-------------------------

def scoresOutputText(markovScoresSpacii, markovScoresPath):
    f = open("results.tab", "w")
    f.write("SpaciiScores\tpathogenScores\n")
    for i in range(len(markovScoresSpacii)):
        f.write(str(markovScoresSpacii[i]) + "\t" + str(markovScoresPath[i]) + "\n")
    f.close()

def getLogLike(model1, model2, seq):  # takes in the two trained models and the sequence that needs to be scored
    Pmod1 = 1
    Pmod2 = 1
    # Please complete this function. This should return the log-likelihood of the two models
    # with Pmod1 and Pmod2 as the probabilities of the two models.
    return score

def trainModel(model, data):
    # Please complete this function. This should look at all the training data and calculate how many
    # dinucleotides precede each base, similar to what was outlined on the slides.
    # The output of the function should be a 4 x 4 matrix model where each row represents the probability
    # of seeing base x given the previous base was y (each row should sum to 1).
    print(model)
    return model

def getSeq(filename):
    f = open(filename)
    id2seq = {}
    currkey = ""
    for line in f:
        if line.find(">") == 0:
            currkey = line.rstrip()[1:]
            id2seq[currkey] = ""
        else:
            id2seq[currkey] = id2seq[currkey] + line.rstrip()
    return id2seq

main()
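
One possible way to fill in the two stubs is sketched below. It is a sketch under assumptions rather than the expected solution: the names train_from_sequences and log_likelihood_ratio are hypothetical, the training function takes a list of sequences instead of the filename that trainModel receives, a pseudocount of 1 is assumed so that no transition has zero probability, and the score is taken to be the log of the ratio of the two model likelihoods (positive values favor the first model). Characters other than A, C, G, and T are skipped.

```python
import math

BASE_IDX = {"A": 0, "C": 1, "G": 2, "T": 3}


def train_from_sequences(seqs, pseudocount=1.0):
    """Return a 4x4 matrix where entry [y][x] estimates P(x | previous base y)."""
    counts = [[pseudocount] * 4 for _ in range(4)]
    for seq in seqs:
        seq = seq.upper()
        # Count every adjacent pair (previous base, current base).
        for prev, curr in zip(seq, seq[1:]):
            if prev in BASE_IDX and curr in BASE_IDX:
                counts[BASE_IDX[prev]][BASE_IDX[curr]] += 1
    # Normalize each row so it sums to 1.
    return [[c / sum(row) for c in row] for row in counts]


def log_likelihood_ratio(model1, model2, seq):
    """Return log P(seq | model1) - log P(seq | model2), ignoring the first base."""
    score = 0.0
    seq = seq.upper()
    for prev, curr in zip(seq, seq[1:]):
        if prev in BASE_IDX and curr in BASE_IDX:
            i, j = BASE_IDX[prev], BASE_IDX[curr]
            # Summing logs avoids the underflow that multiplying
            # many small probabilities would cause.
            score += math.log(model1[i][j]) - math.log(model2[i][j])
    return score
```

With the models passed in the same order as in main() (Spacii first, pathogen second), a sequence could be classified as Spacii when the score is above 0 and as pathogen otherwise. Working in log space is what keeps long sequences from collapsing to a probability of 0.0 in floating point.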