Towards detailed and consistent function prediction from protein family databases

Investigator: Patricia C. Babbitt, PhD
Sponsor: University of Southern California

Location(s): United States

Description

Significance. 
Thanks to continuing developments in DNA sequencing technology, we now know the exact genetic makeup ("genome") of thousands of different organisms, encoding millions of different proteins. But simply knowing the chemical specification (the "sequence") of these proteins is only a first step. The ultimate goal is to discover how genes and proteins function to support the diversity of life, and how some of them can be used for commercial and biotechnology applications. This research project will expand the capability of scientists and their students to advance their analyses from sequences to functions by bringing together multiple state-of-the-art approaches. Each of these approaches draws on both computational expertise (necessary to address a problem of this magnitude) and broad biological expertise.

Approach. 
The general approach in this project is to classify proteins into families of related proteins and, wherever possible, describe how each family relates to function. The ultimate goal is to assign the same specific function to all of the proteins in a family, or to subsets of the family if more than one function is represented within it. These relations may be very complex, and scientific accuracy will require the application of multiple, diverse methods. To accomplish this aim, the project will expand InterPro, a widely used resource that already contains (though with limited data integration mechanisms) eleven different databases, three of which are involved in this project: PANTHER, Pfam and TIGRFAM. A fourth classification resource, the Structure-Function Linkage Database (SFLD), will also be incorporated into InterPro. These four databases use complementary methodologies to represent and describe protein relationships, and these methodologies will be integrated to address the problem of protein function classification with unprecedented accuracy, precision and ease of use. As proteins do not generally work in isolation, additional structured annotations relating to pathways and complexes will be added to sets of families to define the functional characteristics present in a genome. The products of this work will be used to enhance sequence analysis tools used by the scientific community and to provide enhanced educational materials, and will be broadly accessible over the web at https://www.ebi.ac.uk/interpro/.

Technical Summary

Scientists urgently need effective methods to decode, organize, and more fully exploit the still rapidly increasing volume of sequencing data. Classification of proteins into hierarchical families that deliver meaningful functional assignments offers one primary solution. InterPro, one of the most widely used resources for protein family annotation, combines 11 different databases, including Pfam, TIGRFAM and PANTHER, into a single resource that provides value-added annotations. A fourth classification resource, the Structure-Function Linkage Database (SFLD), will be incorporated into InterPro as part of the work proposed here. These databases approach the problem of functional annotation using complementary methodologies, which we propose to combine to address this problem with unprecedented accuracy, precision and ease of use.