Question: This is a c program. Thanks ----- Superstring Assembly Suppose that multiple copies of a long string are cut up into much smaller pieces. If

This is a c program.

Thanks

-----

Superstring Assembly

Suppose that multiple copies of a long string are cut up into much smaller pieces. If you are given the available fragments, is it possible to rebuild the larger string? For example, if the original string was "algorithmsarefunfunfun", and you got given a list of fragments like "refun" and "unfun" "rith" and "lgo" and so on, plus lots more small fragments, could you rebuild the original string? Probably you could, right? But what about if the original string was millions of characters long, like War and Peace? Or billions of characters long, like your genetic sequence?

Lets ask a slightly different question now: if you got given a set of small string fragments each from a larger string that you didnt know anything about, could you at least find the shortest string in which every supplied fragment occurs at least once? This problem is known as the shortest superstring problem, and can be thought of as having parameters n, the number of string fragments provided, and m, the total length of those string fragments. The shortest superstring problem is very important in bioinformatics, where genome reassembly from short fragments called reads helps contribute to our understanding of the behavior of organisms.

Unfortunately, finding optimal solutions to the shortest superstring problem has been shown to be very challenging, and falls into a family of intractable problems that well talk about later in the semester. In this project you will implement some approximation techniques that generate non-optimal superstrings from a collection of string fragments.

Input Data

Input to your program will (always) come from stdin, and will (always) consist of strictly lower-case alphabetic strings, one per input line, each line terminated by a single newline ' ' character. Be sure to read the message on the FAQ page about newlines. For example, the following file test0.txt containing six fragments is a valid input:

accgtcgatg

gcctag

gtacctacta

cgatgcc

tcgatgccgca

atgagaccgtc

As you develop your program according to the stages listed below, the output will evolve. Full output examples for this and two other test files are linked from the FAQ page. You should also check your program against other inputs too; and can easily generate your own tests using a text editor. Testing and debugging is your responsibility.

1

In terms of developing your program for this project, you may assume that no string fragment will be longer than 20 characters, and that there will be at most 1;000 such fragments supplied. You may also use brute-force programming rather than try and use more elegant data structures and algorithms this activity is primarily about C programming rather than algorithm design.

Stage 0 Reading Strings (0/15 marks)

Before tackling the more interesting parts of the project, your very first program should read the set of input fragments into an array of strings, and then print them out again so that you can be sure you have read them correctly. You can do that as part of your DEBUG mode code, see the suggestion below. Each of the following stages will then require a function that carries out the required processing, in each case working through the strings that were read, and then building (and selectively writing) the required superstring.

Stage 1 Overlapping Suffixes (8/15 marks)

In this first stage you are to build a superstring according to the following process: start with the first of the input fragments and use it to initialize a partial superstring; then process every other fragment in input order, checking first to see if it is already wholly present in the partially built superstring, and if it is not, checking to see if there is any overlap between the head of the new fragment and the tail of the superstring. If the fragment appears already within the superstring, no further action is needed. Or, if there are overlapping characters, the required part of the fragment is appended to the end of the partial superstring. If there is no tail overlap, the whole of the fragment is appended to the partial superstring. For example, on the test0.txt file, the six input fragments result in this sequence of superstrings being formed:

Stage 0 Output

--------------

6 fragments read, 55 characters in total

Stage 1 Output

--------------

frg= 0, slen= 10 Accgtcgatg

frg= 1, slen= 15 AccgtcgatGcctag

frg= 2, slen= 24 AccgtcgatGcctaGtacctacta

frg= 3, slen= 24 AccgtCgatGcctaGtacctacta

frg= 4, slen= 35 AccgtCgatGcctaGtacctactaTcgatgccgca

frg= 5, slen= 45 AccgtCgatGcctaGtacctactaTcgatgccgcAtgagaccgtc

where the three numbers shown are the operation number, the number of the fragment that is being added, and the new length of that partial superstring. Note how the letters that are at the start of each fragment have been capitalized in the output so that they can be identified in the superstring, and how one of the fragments was found within the partial superstring. At the end of the process there are 45 characters in the superstring, a saving of 10 compared to the original input size of m = 55 characters.

Step by Step Solution

There are 3 Steps involved in it

1 Expert Approved Answer
Step: 1 Unlock blur-text-image
Question Has Been Solved by an Expert!

Get step-by-step solutions from verified subject matter experts

Step: 2 Unlock
Step: 3 Unlock

Students Have Also Explored These Related Databases Questions!