Information Theory of DNA Shotgun Sequencing

Varun Narayanan
Anand Deo
Friday, 14 Oct 2016, 16:00 to 17:30
A-201 (STCS Seminar Room)
DNA sequencing is the basic workhorse of modern-day biology and medicine. Shotgun sequencing is the dominant technique used: many randomly located short fragments called reads are extracted from the DNA sequence, and these reads are assembled to reconstruct the original sequence. A basic question is: given a sequencing technology and the statistics of the DNA sequence, what is the minimum number of reads required for reliable reconstruction? This number provides a fundamental limit on the performance of any assembly algorithm. In this seminar, we will discuss a paper with the above title by David Tse from 2013. For a simple statistical model of the DNA sequence and the read process, the paper shows that the answer exhibits a critical phenomenon in the asymptotic limit of long DNA sequences: if the read length is below a threshold, reconstruction is impossible no matter how many reads are observed, and if the read length is above the threshold, having enough reads to cover the DNA sequence is sufficient to reconstruct it. The threshold is computed in terms of the Rényi entropy rate of the DNA sequence.
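
To give a feel for the scale of the threshold, here is a minimal sketch in Python. It rests on assumptions that the abstract above does not state: the bases are drawn i.i.d. from a distribution p, the relevant quantity is the Rényi entropy of order 2, H2(p) = -ln(sum of p_i^2), and the critical read length is 2 ln(G) / H2(p) for a sequence of length G. The function names are illustrative only.

```python
import math

def renyi2_entropy(p):
    """Renyi entropy of order 2, in nats: -ln(sum_i p_i^2)."""
    if abs(sum(p) - 1.0) > 1e-9:
        raise ValueError("base probabilities must sum to 1")
    return -math.log(sum(x * x for x in p))

def critical_read_length(genome_length, p):
    """Assumed threshold for an i.i.d. sequence model: 2 ln(G) / H2(p).

    Below this read length, reconstruction is impossible in the
    asymptotic model; above it, covering the sequence suffices.
    """
    return 2.0 * math.log(genome_length) / renyi2_entropy(p)

if __name__ == "__main__":
    uniform = [0.25, 0.25, 0.25, 0.25]   # equiprobable A, C, G, T (assumption)
    skewed = [0.4, 0.1, 0.1, 0.4]        # hypothetical AT-rich composition
    G = 3e9                              # roughly human-genome scale
    print(critical_read_length(G, uniform))
    print(critical_read_length(G, skewed))
```

Under these assumptions, a uniform base distribution at G = 3e9 gives a threshold of a little over 31 bases, and a skewed distribution (lower entropy) pushes it higher, since repeats become more likely.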