Question: Your task is to create a Python program that reads DNA sequences from an input file (a sample file below illustrates how the data is organized) and generates the consensus sequence. Additionally, the program creates an output file that stores the nucleotide frequencies for each position, to help determine whether the consensus is an accurate representation of the sequences.
B. DESCRIPTION:
Basic Algorithm:
1. Load data from the input file into appropriate data structures
2. Count the nucleotide frequencies for all positions in all sequences
3. Determine the consensus
4. Process results
   a. Print consensus string to the screen
   b. Write consensus string to output file
   c. Write to output file frequencies of each nucleotide for each column
C. TECHNICAL REQUIREMENTS AND DATA DESCRIPTION:
Input file: DNA strings to be processed are to be read from a file named DNAInput.txt. The files have the following format: description line, sequence line, description line, sequence line, and so on. Here's a sample file:
>biologicaldescription
GATCAGCTAG
>biologicaldescription
AATCCGATCG
>biologicaldescription
AATGCGCTAG
>biologicaldescription
ACTCTGCGTG
and so on
Description lines always start with the '>' character; you may disregard these lines.
Note: These files are usually FASTA files. A FASTA file can be read as a plain text file; the only difference is that the file extension is either .fa or .fasta
(i.e., read the file in the same way you would read a .txt file).
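A minimal sketch of the loading step, assuming description lines begin with `>` (the FASTA convention) and that a plain list of strings is an acceptable data structure for the sequences. Skipping blank lines is an extra safety assumption, not something the task requires.

```python
def loaddata(filename):
    """Return the sequence strings found in a FASTA-style text file."""
    sequences = []
    with open(filename) as infile:
        for line in infile:
            line = line.strip()
            # Skip blank lines and description lines (assumed to start with '>').
            if not line or line.startswith(">"):
                continue
            sequences.append(line)
    return sequences
```

Calling `loaddata("DNAInput.txt")` would then return the sequences in file order, with the description lines dropped.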
Output file: You will store the consensus sequence and the frequencies of the nucleotides in a file called DNAOutput.txt. For the sample input file provided above, here's what the output file would contain:
Consensus: AATCCGCTAG
Pos 1: A:3 G:1 C:0 T:0
Pos 2: A:3 C:1 G:0 T:0
Pos 3: T:4 A:0 C:0 G:0
Pos 4: C:3 G:1 A:0 T:0
Pos 5: C:2 A:1 T:1 G:0
Pos 6: G:4 A:0 C:0 T:0
Pos 7: C:3 A:1 G:0 T:0
Pos 8: T:3 G:1 A:0 C:0
Pos 9: A:2 C:1 T:1 G:0
Pos 10: G:4 A:0 C:0 T:0
Note that the nucleotides listed for each column are in non-increasing order by frequency. In case of a tie (when different nucleotides have the same frequency), it doesn't matter which one comes first in the output. For example, the last line in the previous example could have also been:
Pos 10: G:4 C:0 T:0 A:0
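The ordering rule above can be sketched with a stable sort on the counts. The helper name `format_position` is hypothetical, and the exact spacing of the line (`Pos 1: A:3 ...`) is an assumed layout rather than a confirmed specification.

```python
def format_position(position, counts):
    """Format one frequency line with nucleotides in non-increasing count order."""
    # sorted() is stable, so tied nucleotides keep the order they have in `counts`.
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    body = " ".join(f"{nucleotide}:{count}" for nucleotide, count in ordered)
    return f"Pos {position}: {body}"


print(format_position(1, {"A": 3, "C": 0, "G": 1, "T": 0}))
# prints: Pos 1: A:3 G:1 C:0 T:0
```

Because any tie order is accepted, no secondary sort key is needed.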
You may assume that:
Every description/sequence combination takes up 2 lines (1 line for each).
All sequences in the file have the same length. The exact length is not initially known; you may determine it from any of the sequences.
All nucleotides are in capital letters.
There will be no characters other than A, C, T, and G in the sequences.
There will be no ties for the most frequently occurring nucleotide in any column. This means that, when determining the consensus, there will be a single nucleotide with the highest count.
You may NOT assume that:
The length of the DNA sequences will be as in the example.
The number of DNA sequences will be as in the example.
Your code:
1. Create a function called loaddata.
   a. It takes as argument the name of the file to be used (a string).
   b. It returns a data structure (or more than one) that contains all of the information from the input file.
2. Create a function called countnuclfreq.
   a. It takes as argument the data structures generated by loaddata.
   b. It returns a new data structure (or more than one) that contains the frequencies of the nucleotides for each column of the sequences.
3. Create a function called findconsensus.
   a. It takes as argument the data structures generated by countnuclfreq.
   b. It returns a string: the consensus sequence.
4. Create a function named processresults.
   a. It takes as arguments the data structures created by countnuclfreq and the name of the output file (a string).
   b. It writes the results, in the format previously described, to the output file.
   c. It doesn't return anything.
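One possible shape for the counting, consensus, and output functions is sketched below. It assumes the sequences arrive as a list of equal-length strings and that the frequencies are stored as one dictionary per column; both are illustrative choices, as is the exact layout of the output lines. The constant `NUCLEOTIDES` is a hypothetical name introduced for the example.

```python
NUCLEOTIDES = "ACGT"  # hypothetical constant: the only letters expected in the input


def countnuclfreq(sequences):
    """Return one {nucleotide: count} dict per column of the sequences."""
    # zip(*sequences) yields the columns: all first letters, all second letters, ...
    return [
        {nucleotide: column.count(nucleotide) for nucleotide in NUCLEOTIDES}
        for column in zip(*sequences)
    ]


def findconsensus(frequencies):
    """Return the most frequent nucleotide of each column, joined into a string."""
    # The task guarantees a single most frequent nucleotide per column.
    return "".join(max(counts, key=counts.get) for counts in frequencies)


def processresults(frequencies, filename):
    """Write the consensus line and one frequency line per position; return nothing."""
    with open(filename, "w") as outfile:
        outfile.write(f"Consensus: {findconsensus(frequencies)}\n")
        for position, counts in enumerate(frequencies, start=1):
            # Stable sort in non-increasing order of count (assumed line layout).
            ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
            body = " ".join(f"{n}:{c}" for n, c in ordered)
            outfile.write(f"Pos {position}: {body}\n")
```

Storing every nucleotide in each column dictionary, including those with a count of zero, keeps all four letters available when the frequency lines are written. The main program would print `findconsensus(...)` to the screen and then call `processresults(...)` with `"DNAOutput.txt"`.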
Other important information:
Sample files are provided, but they are for testing purposes only. In other words, the sample DNAOutput.txt provided should be the result of executing your program with the sample file provided (DNAInput.fasta or DNAInput.txt). Your program should be able to work with any FASTA file where all sequences are of the same length.
You should NOT prompt for the file name; you should ALWAYS try to open a file named DNAInput.txt, and your output should ALWAYS be written to a file named DNAOutput.txt.