Considering the large volume of sequence data that many computational biology applications must deal with, proper organization of the data to facilitate fast access is important to achieve desirable run-times. From this perspective, string data structures serve the same purpose for biological sequence data as binary search trees serve for ordered numeric data, and quadtrees serve for spatial data.
String data structures are ideal for uncovering exact matching patterns in sequences. They are also useful when performing approximate matches where only a small number of differences are permitted.
We will try to provide a detailed introduction to the three most frequently used string data structures in computational molecular biology — lookup tables, suffix trees and suffix arrays. The focus will be on algorithms for constructing these data structures. We will also explore the relationships between these data structures and provide several illustrations of biological applications where suffix trees play a central role.