Question: Your task is to create a Python program that reads DNA sequences from an input file (a file example illustrating how the data is organized appears below) and generates the consensus sequence. Additionally, an output file will be created that stores the nucleotide frequencies for each position, to help determine whether the consensus is indeed an accurate representation of the sequences.

DESCRIPTION:

Basic Algorithm:

1. Load data from the input file into appropriate data structures.
2. Count the nucleotide frequencies for all positions in all sequences.
3. Determine the consensus.
4. Process results:
   a. Print consensus string to the screen.
   b. Write consensus string to output file.
   c. Write to output file frequencies of each nucleotide for each column.

TECHNICAL REQUIREMENTS AND DATA DESCRIPTION:

Input file: DNA strings to be processed are to be read from a file named DNAInput.txt.

The files have the following format: description line, sequence line, description line, sequence line, and so on.

Here's a sample file:

>biological_description_1
GATCAGCTAG
>biological_description_2
AATCCGATCG
>biological_description_3
AATGCGCTAG
>biological_description_4
ACTCTGCGTG
. . . and so on . . .

Description lines always start with the > character; you may disregard these lines.

Note: These files are usually FASTA files, whose extension is either .fa or .fasta, but FASTA files can be read as plain text files (i.e., read the file in the same way you would read a .txt file).
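As a minimal sketch of reading such a file as plain text, the function below skips every line that starts with > and keeps the rest. It uses the load_data name that the assignment specifies later; returning a plain list of sequence strings and discarding the descriptions is an assumption made for illustration, not a requirement.

```python
def load_data(filename):
    """Return the sequence lines of a FASTA-style text file as a list of strings."""
    sequences = []
    with open(filename) as infile:
        for line in infile:
            line = line.strip()
            # Skip blank lines and description lines (those starting with '>').
            if not line or line.startswith(">"):
                continue
            sequences.append(line)
    return sequences
```

Because the file is opened with the ordinary open() call, the same function works whether the file is named DNAInput.txt, DNAInput.fa, or DNAInput.fasta.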

Output file: You will store the consensus sequence and the frequencies of the nucleotides in a file called DNAOutput.txt.

For the sample input file provided above, here's what the output file would contain:

Consensus: AATCCGCTAG
Pos 1: A:3 1 0 0
Pos 2: A:3 1 0 0
Pos 3: T:4 0 0 0
Pos 4: C:3 1 0 0
Pos 5: C:2 1 1 0
Pos 6: G:4 0 0 0
Pos 7: C:3 1 0 0
Pos 8: T:3 1 0 0
Pos 9: A:2 1 1 0
Pos 10: G:4 0 0 0

Note that the nucleotides listed for each column are in non-increasing order by frequency. In case of a tie (when 2 different nucleotides have the same frequency) it doesn't matter which one comes first in the output. For example, the last line in the previous example could have also been:

Pos 10: G:4 0 0 0
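The ordering rule can be sketched with Python's built-in sorted(). The helper name order_by_frequency and the dictionary of counts are hypothetical choices for illustration; the assignment does not prescribe them.

```python
def order_by_frequency(column_counts):
    """Return (nucleotide, count) pairs in non-increasing order of count."""
    # sorted() is stable, even with reverse=True, so tied nucleotides simply
    # keep their dictionary order, which the assignment allows.
    return sorted(column_counts.items(), key=lambda pair: pair[1], reverse=True)


# Counts for position 5 of the sample input (A, C, C, T).
position_5 = {"A": 1, "C": 2, "G": 0, "T": 1}
print(order_by_frequency(position_5))
# [('C', 2), ('A', 1), ('T', 1), ('G', 0)]
```

Here A and T are tied at 1, so either could legitimately appear first.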

You may assume that:

- Every combination of description + sequence takes up 2 lines (1 line for each).
- All sequences in the file have the same length. The exact length is not initially known; you may determine it from any of the sequences.
- All nucleotides are in capital letters.
- There will be no characters other than A, C, T, and G in the sequences.
- There will be no ties for the most frequently occurring nucleotide in any column. This means that, when determining the consensus, there will be a single nucleotide that occurs most often.

You may NOT assume that:

- The length of the DNA sequences will be 10, as in the example.
- The number of DNA sequences will be 4, as in the example.
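One way to respect these constraints is to take the length from the first sequence and loop over however many sequences were loaded. The sketch below uses the function names the assignment specifies; it assumes the sequences arrive as a list of strings and stores one dictionary of counts per column, both of which are illustrative choices.

```python
def count_nucl_freq(sequences):
    """Return one {'A', 'C', 'G', 'T'} count dictionary per column."""
    if not sequences:
        return []
    length = len(sequences[0])  # all sequences are assumed to share this length
    counts = [{"A": 0, "C": 0, "G": 0, "T": 0} for _ in range(length)]
    for seq in sequences:
        for position, nucleotide in enumerate(seq):
            counts[position][nucleotide] += 1
    return counts


def find_consensus(freqs):
    """Join the most frequent nucleotide of every column into one string."""
    # Relies on the stated assumption that no column has a tie for first place.
    return "".join(max(column, key=column.get) for column in freqs)


sample = ["GATCAGCTAG", "AATCCGATCG", "AATGCGCTAG", "ACTCTGCGTG"]
freqs = count_nucl_freq(sample)
print(freqs[0])               # {'A': 3, 'C': 0, 'G': 1, 'T': 0}
print(find_consensus(freqs))  # AATCCGCTAG
```

Nothing in either function depends on there being 4 sequences of length 10.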

Your code:

1. Create a function called load_data.
   - It takes as argument the name of the file to be used (a string).
   - It returns a data structure (or more than one) that contains all of the information from the input file.

2. Create a function called count_nucl_freq.
   - It takes as argument the data structure(s) generated by load_data.
   - It returns a new data structure (or more than one) that contains the frequencies of the nucleotides for each column across all sequences.

3. Create a function called find_consensus.
   - It takes as argument the data structure(s) generated by count_nucl_freq.
   - It returns a string: the consensus sequence.

4. Create a function named process_results.
   - It takes as arguments the data structure(s) created by count_nucl_freq and the name of the output file (a string).
   - It writes the results, in the format previously described, to the output file.
   - It doesn't return anything.
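A possible shape for process_results is sketched below. It assumes the frequencies are stored as a list of per-column dictionaries, and it assumes an output line format in which every count is labeled with its nucleotide, for example "Pos 1: A:3 G:1 C:0 T:0". The labels after the first entry are not legible in the sample output above, so that part of the format is a guess. The consensus is derived inline here; a full program would more likely call find_consensus.

```python
def process_results(freqs, filename):
    """Print the consensus, then write it and the per-column counts to a file."""
    # Most frequent nucleotide per column (no ties, per the assignment).
    consensus = "".join(max(column, key=column.get) for column in freqs)
    print(consensus)
    with open(filename, "w") as outfile:
        outfile.write(f"Consensus: {consensus}\n")
        for position, column in enumerate(freqs, start=1):
            # Non-increasing order by count; ties stay in dictionary order.
            ordered = sorted(column.items(), key=lambda pair: pair[1], reverse=True)
            fields = " ".join(f"{nucleotide}:{count}" for nucleotide, count in ordered)
            outfile.write(f"Pos {position}: {fields}\n")
```

Called with the counts for the sample input and the name DNAOutput.txt, this would print AATCCGCTAG and write eleven lines: the consensus line followed by one line per position.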

Other important information:

1. Sample files are provided, but they are for testing purposes only. In other words, the sample DNAOutput.txt provided should be the result of executing your program with the sample file provided (DNAInput.fasta or DNAInput.txt). Your program should be able to work with any FASTA file where all sequences are of the same length.

2. You should NOT prompt for the file name; you should ALWAYS try to open a file named DNAInput.txt and your output should ALWAYS be to a file named DNAOutput.txt.
