Computational Approaches to Protein Science

Genome sequencing projects have enormous potential to benefit human endeavours. However, just as acquiring a language's vocabulary does not enable one to speak it, databases that list the amino acid sequences of proteins do not directly tell us much about these proteins' higher-level structure and function. The most productive way to exploit these databases indirectly has been to start with the small number of proteins that are fully characterised and to assume that other "similar" proteins will have a related structure and function. Proteins with very similar amino acid sequences are the easy cases, but the real test, which our group largely focuses on, is to detect the "essential" similarity in proteins whose non-critical sections have undergone random rearrangements during evolution. In such cases functionally similar proteins may share less than 25% sequence identity. To enable more complete tracing of protein family trees, we have developed and improved upon a wide range of computational methods: some can be applied to all proteins, while others exploit characteristic features of specific protein types (e.g. the strong influence of disulphide bonds on the structure of extracellular proteins). These methods have been adapted into a number of widely used, publicly accessible web resources (e.g. DIAL, iMOT, MODIP, FMALIGN). Applying these and other techniques, we have also carried out surveys, both within and across genomes, of the full membership of various protein families and superfamilies. Finally, we have been able to use our improved understanding of the functionally significant regions of proteins for the theoretical prediction of protein function.
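
To make the 25% figure above concrete, the following is a minimal sketch of how sequence similarity between two already-aligned proteins is often expressed as percent identity. It is an illustration only and not one of the group's methods: the function name `percent_identity`, the use of `-` as the gap character, and the choice to count only columns where both sequences have a residue are all assumptions (other conventions for the denominator exist), and the example sequences are invented.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of identical residues over columns where neither sequence has a gap.

    Assumes the two strings come from the same pairwise alignment, so they
    must have equal length. The denominator convention is an assumption.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    compared = 0  # columns with a residue in both sequences
    matches = 0   # columns where those residues are identical
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip gapped columns
        compared += 1
        if a == b:
            matches += 1

    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Invented toy alignment: 8 ungapped columns, 2 identical residues.
seq_a = "MKT-LVAGC"
seq_b = "MRSQIWAEH"
identity = percent_identity(seq_a, seq_b)
```

A pair scoring below roughly 25% on a measure like this can still belong to the same family, which is why methods that go beyond plain sequence comparison are needed for such cases.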