Performance Modeling of Genomics Workflows

Mr. Subho Sankar Banerjee
R K Shyamasundar
Friday, 10 Jan 2014, 15:00 to 16:00
B-333 (DBS Seminar Room)
Abstract: Technological advances in sequencing, mapping, and analyzing genomes are proceeding at an extremely rapid pace, and the resulting explosion of genomics data is becoming difficult to manage. Sequencing human genomes would quickly add up to hundreds of petabytes of data, and the data created by analysis of gene interactions multiplies that further. Recently, the University of Illinois received a grant from the NSF to build a platform (CompGen) that will allow us to keep ahead of this growth of data. The CompGen initiative seeks to leverage Illinois’ strengths in genomic research and in building large-scale parallel systems to develop new technology that allows us to analyze this data more accurately and in less time.

Our approach to this problem is manifold: defining new algorithmic techniques, constructing better metrics for defining accuracy, extracting parallelism by defining primitives common to these algorithms, using accelerators to achieve faster processing, and studying the effect of architectural changes on the performance of these applications. To unify all of these perspectives and define quantitatively how each of them affects the system in terms of accuracy and performance, we construct models that describe the applications and the system so that we can extrapolate the performance metrics to yet-to-be-built systems. We believe that analyzing such models will give us an understanding of which factors are critical in the design of such a system. The problem of designing the CompGen system is one of optimization, and the information derived from the model will reduce the complexity of this problem by narrowing the design space of the system.