Question: (a) DNA sequences have four possible bases: C, G, A, T. Below is an example of part of a DNA sequence.

(a) DNA sequences have four possible bases: C, G, A, T. Below is an
example of part of a DNA sequence.
GATCCTCCAT ATACAACGGT ATCTCCACCT
CAGGTTTAGA TCTCAACAAC GGAACCATTG
(i) Consider an ASCII encoding of the example above (ignoring the
spaces between groups of bases). How many bits are used to store
the example in memory?
(ii) Give a better-suited encoding, considering that DNA sequences
contain only four possible bases.
(iii) What is the compression ratio obtained by using the encoding you
chose in (ii) instead of the ASCII encoding?
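
A minimal Python sketch of the arithmetic in part (a). It assumes 8 bits per ASCII character (7-bit ASCII is normally stored in a full byte) and a fixed 2-bit code per base. The bit patterns in `CODE` and the helper name `pack` are illustrative choices, not given in the question.

```python
# The example from part (a), with the spaces between groups removed.
SEQ = ("GATCCTCCAT ATACAACGGT ATCTCCACCT "
       "CAGGTTTAGA TCTCAACAAC GGAACCATTG").replace(" ", "")

# Assumed 2-bit code per base (any one-to-one assignment would work).
CODE = {"C": 0b00, "A": 0b01, "G": 0b10, "T": 0b11}

def pack(seq):
    """Pack a base sequence into one integer, 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value, 2 * len(seq)  # packed value and its size in bits

ascii_bits = 8 * len(SEQ)          # assumption: one byte per character
_, packed_bits = pack(SEQ)
ratio = ascii_bits / packed_bits

print(len(SEQ), ascii_bits, packed_bits, ratio)  # 60 480 120 4.0
```

Four symbols need log2(4) = 2 bits each, so under these assumptions the 2-bit code is four times smaller than the 8-bit ASCII representation, whatever the sequence length.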
(b)(i) Consider the following shorter example:
ATATCGCATC
Perform LZW compression on this short example using the
following initial dictionary:
0,1,2,3
C,A,G,T
Show the dictionary constructed during compression and the
compressed data.
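
The compression in (b)(i) can be checked with a short Python sketch of textbook LZW. The function name `lzw_compress` is hypothetical, and the sketch assumes new dictionary entries are numbered consecutively from 4 and that every input symbol appears in the initial dictionary.

```python
def lzw_compress(text, alphabet):
    """Textbook LZW: return the emitted codes and the final dictionary."""
    # Initial dictionary: each symbol maps to its position in `alphabet`.
    table = {symbol: code for code, symbol in enumerate(alphabet)}
    codes = []
    current = ""
    for symbol in text:
        candidate = current + symbol
        if candidate in table:
            # Keep extending the current phrase while it is known.
            current = candidate
        else:
            # Emit the longest known phrase, then learn the new one.
            codes.append(table[current])
            table[candidate] = len(table)
            current = symbol
    if current:
        codes.append(table[current])  # flush the last phrase
    return codes, table

codes, table = lzw_compress("ATATCGCATC", "CAGT")
print(codes)  # [1, 3, 4, 0, 2, 0, 6]
for phrase, code in table.items():
    if code >= 4:
        print(code, phrase)  # entries added during compression
```

Under these assumptions the added entries are 4=AT, 5=TA, 6=ATC, 7=CG, 8=GC, 9=CA, and the ten bases compress to seven codes.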
(ii) Using the same dictionary, expand the following compressed
data:
0,2,1,4,4,5,2
Show the dictionary after uncompressing each code and the
uncompressed sequence.
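
The expansion in (b)(ii) can be sketched the same way. The function name `lzw_expand` is hypothetical. The sketch assumes the decoder adds one entry per code read after the first, formed from the previous phrase plus the first symbol of the current one, which mirrors a textbook LZW encoder.

```python
def lzw_expand(codes, alphabet):
    """Textbook LZW expansion: return the text and the rebuilt dictionary."""
    table = {code: symbol for code, symbol in enumerate(alphabet)}
    previous = table[codes[0]]
    pieces = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):
            # Special case: the code names the entry about to be created.
            entry = previous + previous[0]
        else:
            raise ValueError(f"invalid LZW code: {code}")
        # New entry: previous phrase plus first symbol of the current one.
        table[len(table)] = previous + entry[0]
        pieces.append(entry)
        previous = entry
    return "".join(pieces), table

sequence, table = lzw_expand([0, 2, 1, 4, 4, 5, 2], "CAGT")
print(sequence)  # CGACGCGGAG
for code in range(4, len(table)):
    print(code, table[code])  # entries rebuilt during expansion
```

Under these assumptions the rebuilt entries are 4=CG, 5=GA, 6=AC, 7=CGC, 8=CGG, 9=GAG, and the seven codes expand to ten bases.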
(iii) What is the compression ratio obtained on these small examples
relative to the ASCII encoding, assuming that 8-bit integers are
used to encode the compressed string? Why is this result different
from the average ratio of 4 obtained on a typical human genome
(which is a DNA sequence of about 3,000 megabytes of
uncompressed data)?
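
A small Python sketch of the ratio asked for in (b)(iii). It assumes each small example has 10 bases and compresses to 7 LZW codes (the code list in (b)(ii) has 7 entries; the count for (b)(i) should be confirmed by working through the compression by hand). The function name `compression_ratio` is hypothetical.

```python
def compression_ratio(n_bases, n_codes, bits_per_char=8, bits_per_code=8):
    """Uncompressed size divided by compressed size, both in bits."""
    return (n_bases * bits_per_char) / (n_codes * bits_per_code)

# Assumed counts: 10 bases in, 7 codes out, 8 bits for each.
ratio = compression_ratio(n_bases=10, n_codes=7)
print(round(ratio, 2))  # 1.43
```

With these counts the ratio is 80 / 56, well below 4. One plausible reading is that a ten-base input is too short for the LZW dictionary to learn long repeated phrases, so most codes still stand for one or two bases, and an 8-bit code is wasteful for such a small dictionary. A very long sequence gives the dictionary time to fill with long phrases. Note also that a plain 2-bit-per-base code already achieves a ratio of 4 against 8-bit ASCII.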