Question: LAB TASK: Implement a program in python that will open a FASTA file, concatenate its multiline sequences into single strings, store them in a dictionary

LAB TASK: Implement a program in python that will open a FASTA file, concatenate its multiline sequences into single strings, store them in a dictionary using the sequence ID from the sequence header (value between the | symbols) as a key, and then print the IDs and sequences as two columns in a new file.

OBJECTIVE(S):

1. Write your code in the block below. Download the file called myoglobin.fasta, and make sure to save it in the same location as your lab task script.

2. Create an empty dictionary to store sequence information.

3. Using the open function, open the FASTA file (myoglobin.fasta).

4. When you find a line beginning with the > character (a header) extract the ID code between the | symbols and start a new dictionary entry using the ID as a key.

5. If a line isn't a header (i.e. it is a sequence), strip off the newline character at the end and append the sequence to the growing string stored as the value of the most recent dictionary key.

6. Close the original file.

7. Open a new file for writing, e.g. myoglobin_processed.txt.

8. Loop through the dictionary and write the ID keys and their corresponding sequences to the new file, separating them with a tab (\t) to generate two columns.

9. Close the new file.

10. Run your script. Upload the script and output (myoglobin_processed.txt) for lab credit. Don't forget comments!
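Steps 4 and 5 hinge on splitting the header at the | characters. A minimal sketch of that one step, assuming every header follows the >db|ID|description layout used in this task; the helper name extract_id is hypothetical and not part of the lab handout:

```python
def extract_id(header_line):
    """Return the ID between the first two | symbols of a FASTA header."""
    # ">sp|P02189|MYG_PIG ..." splits into [">sp", "P02189", "MYG_PIG ..."]
    fields = header_line.rstrip("\n").split("|")
    return fields[1]


header = ">sp|P02189|MYG_PIG Myoglobin OS=Sus scrofa GN=MB PE=1 SV=2\n"
print(extract_id(header))  # P02189
```

Only the middle field is needed as the dictionary key; the rest of the header can be discarded.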

Expected output for two sequences should look like this (note how the sequence now is a single string):

P02189	MGLSDGEWQLVLNVWGKVEADVAGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDLKKHGNTVLTALGGILKKKGHHEAELTPLAQSHATKHKIPVKYLEFISEAIIQVLQSKHPGDFGADAQGAMSKALELFRNDMAAKYKELGFQG
P04247	MGLSDGEWQLVLNVWGKVEADLAGHGQEVLIGLFKTHPETLDKFDKFKNLKSEEDMKGSEDLKKHGCTVLTALGTILKKKGQHAAEIQPLAQSHATKHKIPVKYLEFISEIIIEVLKKRHSGDFGADAQGAMSKALELFRNDIAAKYKELGFQG

The provided myoglobin.fasta file I have to work with contains the following:

>sp|P02192|MYG_BOVIN Myoglobin OS=Bos taurus GN=MB PE=1 SV=3
MGLSDGEWQLVLNAWGKVEADVAGHGQEVLIRLFTGHPETLEKFDKFKHLKTEAEMKASE
DLKKHGNTVLTALGGILKKKGHHEAEVKHLAESHANKHKIPVKYLEFISDAIIHVLHAKH
PSDFGADAQAAMSKALELFRNDMAAQYKVLGFHG
>sp|P02189|MYG_PIG Myoglobin OS=Sus scrofa GN=MB PE=1 SV=2
MGLSDGEWQLVLNVWGKVEADVAGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASE
DLKKHGNTVLTALGGILKKKGHHEAELTPLAQSHATKHKIPVKYLEFISEAIIQVLQSKH
PGDFGADAQGAMSKALELFRNDMAAKYKELGFQG
>sp|P02144|MYG_HUMAN Myoglobin OS=Homo sapiens GN=MB PE=1 SV=2
MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASE
DLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKH
PGDFGADAQGAMNKALELFRKDMASNYKELGFQG
>sp|P68082|MYG_HORSE Myoglobin OS=Equus caballus GN=MB PE=1 SV=2
MGLSDGEWQQVLNVWGKVEADIAGHGQEVLIRLFTGHPETLEKFDKFKHLKTEAEMKASE
DLKKHGTVVLTALGGILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISDAIIHVLHSKH
PGDFGADAQGAMTKALELFRNDIAAKYKELGFQG
>sp|P04247|MYG_MOUSE Myoglobin OS=Mus musculus GN=Mb PE=1 SV=3
MGLSDGEWQLVLNVWGKVEADLAGHGQEVLIGLFKTHPETLDKFDKFKNLKSEEDMKGSE
DLKKHGCTVLTALGTILKKKGQHAAEIQPLAQSHATKHKIPVKYLEFISEIIIEVLKKRH
SGDFGADAQGAMSKALELFRNDIAAKYKELGFQG
>sp|P02197|MYG_CHICK Myoglobin OS=Gallus gallus GN=MB PE=1 SV=4
MGLSDQEWQQVLTIWGKVEADIAGHGHEVLMRLFHDHPETLDRFDKFKGLKTPDQMKGSE
DLKKHGATVLTQLGKILKQKGNHESELKPLAQTHATKHKIPVKYLEFISEVIIKVIAEKH
AADFGADSQAAMKKALELFRNDMASKYKEFGFQG

This is my code so far:

file = open('myoglobin.fasta','r')
l={}
for line in file:
    if line[0]=='>':
        m=line.split("|")
        m[2]=m[2].rstrip()
        l[m[1]]=m[2]
        c=m[1]
    else:
        line=line.rstrip()
        l[c]=l[c]+line
file.close()
out=open('myoglobin_processed.txt','w')
for key in l:
    out.write(key)
    out.write("\t")
    out.write(l[key])
    out.write("\n")
out.close()

This is my current output, but it's not correct. I don't know how to get it to look like the expected output which is described above. Any help would be great!


P02192	MYG_BOVIN Myoglobin OS=Bos taurus GN=MB PE=1 SV=3MGLSDGEWQLVLNAWGKVEADVAGHGQEVLIRLFTGHPETLEKFDKFKHLKTEAEMKASEDLKKHGNTVLTALGGILKKKGHHEAEVKHLAESHANKHKIPVKYLEFISDAIIHVLHAKHPSDFGADAQAAMSKALELFRNDMAAQYKVLGFHG
P02189	MYG_PIG Myoglobin OS=Sus scrofa GN=MB PE=1 SV=2MGLSDGEWQLVLNVWGKVEADVAGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDLKKHGNTVLTALGGILKKKGHHEAELTPLAQSHATKHKIPVKYLEFISEAIIQVLQSKHPGDFGADAQGAMSKALELFRNDMAAKYKELGFQG
P02144	MYG_HUMAN Myoglobin OS=Homo sapiens GN=MB PE=1 SV=2MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKHPGDFGADAQGAMNKALELFRKDMASNYKELGFQG
P68082	MYG_HORSE Myoglobin OS=Equus caballus GN=MB PE=1 SV=2MGLSDGEWQQVLNVWGKVEADIAGHGQEVLIRLFTGHPETLEKFDKFKHLKTEAEMKASEDLKKHGTVVLTALGGILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISDAIIHVLHSKHPGDFGADAQGAMTKALELFRNDIAAKYKELGFQG
P04247	MYG_MOUSE Myoglobin OS=Mus musculus GN=Mb PE=1 SV=3MGLSDGEWQLVLNVWGKVEADLAGHGQEVLIGLFKTHPETLDKFDKFKNLKSEEDMKGSEDLKKHGCTVLTALGTILKKKGQHAAEIQPLAQSHATKHKIPVKYLEFISEIIIEVLKKRHSGDFGADAQGAMSKALELFRNDIAAKYKELGFQG
P02197	MYG_CHICK Myoglobin OS=Gallus gallus GN=MB PE=1 SV=4MGLSDQEWQQVLTIWGKVEADIAGHGHEVLMRLFHDHPETLDRFDKFKGLKTPDQMKGSEDLKKHGATVLTQLGKILKQKGNHESELKPLAQTHATKHKIPVKYLEFISEVIIKVIAEKHAADFGADSQAAMKKALELFRNDMASKYKEFGFQG
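The header description ends up in front of each sequence because the line l[m[1]]=m[2] seeds the new dictionary entry with the rest of the header; starting each entry as an empty string avoids that. A minimal sketch of one possible fix, not necessarily the intended lab solution: the function names read_fasta and write_two_columns are hypothetical, and it assumes every header has the form >db|ID|description.

```python
def read_fasta(path):
    """Read a FASTA file into a dict mapping sequence ID -> full sequence."""
    sequences = {}
    current_id = None
    with open(path) as handle:
        for line in handle:
            line = line.rstrip()  # drop the trailing newline
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # Header: the ID is the field between the two | symbols
                current_id = line.split("|")[1]
                sequences[current_id] = ""  # start empty, not with header text
            elif current_id is not None:
                # Sequence line: append to the most recent entry
                sequences[current_id] += line
    return sequences


def write_two_columns(sequences, path):
    """Write one 'ID<tab>sequence' row per dictionary entry."""
    with open(path, "w") as out:
        for seq_id, seq in sequences.items():
            out.write(f"{seq_id}\t{seq}\n")
```

Calling write_two_columns(read_fasta("myoglobin.fasta"), "myoglobin_processed.txt") would produce the two-column file. The with blocks close each file automatically; if the lab expects explicit close() calls as in steps 6 and 9, the same logic works with open() and close() written out.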
