Analyzing Bio-molecules Using Graphs, Geometric Invariants and Machine Learning

Ashish V. Tendulkar, Indian Institute of Technology, Department of Computer Science and Engineering, Madras 6000
Tuesday, 2 Nov 2010 (all day)
A-212 (STCS Seminar Room)
The talk will focus on the analysis of protein structures using geometric, machine learning and graph-theoretic techniques, and on its implications for functional site prediction. Proteins are versatile bio-molecules made up of amino acids and are involved in many cellular functions. They function through an arrangement, known as a functional site, of a small number of spatially proximal amino acid residues (typically three to six) in the structure. Given a new protein, biologists are interested in determining its functional site using suitable experimental techniques. In doing so, they face an enormous number of choices due to combinatorial explosion, and it is not feasible to evaluate these choices exhaustively because of the excessive time and resources required. To overcome this problem, we developed a method that provides biologists with a short list of the most likely functional sites, which serves as a useful guide when designing their experiments.

In our scheme, we represent each protein structure as an unweighted, undirected graph with amino acid residues as the nodes. Two nodes are connected by an edge if the corresponding amino acid residues are spatially proximal in the structure. Each functional site is represented as a clique in this setup. We extract candidate functional sites from each protein using the Bron-Kerbosch clique-finding algorithm. The objective is then to determine the most likely functional sites from a large number of candidate sites. The work is founded on the well-characterized biological knowledge that functionally important substructures are conserved and recur in functionally related proteins. We represent the candidate functional sites using geometric invariants, which remain unchanged under transformations such as rotation and translation. The candidate sites are grouped, using machine learning techniques, based on their similarity in the space spanned by the geometric invariants. The recurring candidate sites are analyzed to provide a ranked list of possible functional sites to experimental biologists. Finally, I will present a few examples of successful applications of this method to novel proteins.
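
As a rough illustration of the pipeline described above, the following minimal Python sketch builds a residue contact graph, enumerates maximal cliques with the basic Bron-Kerbosch recursion (no pivoting), and describes each clique by a simple rotation- and translation-invariant signature. Several pieces are assumptions made only for illustration and are not taken from the talk: the distance cutoff, the use of one 3D point per residue, the toy coordinates, the helper names (contact_graph, bron_kerbosch, distance_invariant), and the choice of sorted pairwise distances as the invariant. The grouping and ranking steps are not shown.

```python
import math
from itertools import combinations


def contact_graph(coords, cutoff):
    """Adjacency sets: residues i and j are linked if their distance <= cutoff.

    The cutoff value and the one-point-per-residue model are assumptions.
    """
    adj = {i: set() for i in range(len(coords))}
    for i, j in combinations(range(len(coords)), 2):
        if math.dist(coords[i], coords[j]) <= cutoff:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def bron_kerbosch(adj):
    """Return all maximal cliques of the graph (basic version, no pivoting)."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates to extend it, x: already explored
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p.remove(v)
            x.add(v)

    expand(set(), set(adj), set())
    return cliques


def distance_invariant(coords, site):
    """Sorted pairwise distances of a site (a hypothetical choice of invariant).

    Distances do not change under rotation or translation of the structure.
    """
    pairs = combinations(sorted(site), 2)
    return tuple(sorted(round(math.dist(coords[a], coords[b]), 3) for a, b in pairs))


# Toy coordinates (made up): residues 0-2 form a tight triangle, 3-4 a pair.
coords = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (5, 5, 5), (5, 6, 5)]
graph = contact_graph(coords, cutoff=1.5)
for site in bron_kerbosch(graph):
    print(sorted(site), distance_invariant(coords, site))
```

In a sketch like this, candidate sites from many proteins would each be mapped to such an invariant vector, and sites whose vectors lie close together could then be grouped by a clustering method of choice. Since functional sites typically involve three to six residues, cliques outside that size range could be filtered out before grouping.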