• Statistical Analysis of Biological Interactions from Homologous Proteins

      Obradovic, Zoran; Dunbrack, Roland L.; Vucetic, Slobodan; Latecki, Longin; Coico, Richard (Temple University. Libraries, 2008)
      Information fusion aims to develop intelligent approaches to integrating information from complementary sources, so that a more comprehensive basis is obtained for data analysis and knowledge discovery. Our Protein Biological Unit (ProtBuD) database is the first database to integrate biological unit information from the Protein Data Bank (PDB), the Protein Quaternary Structure (PQS) server, and the Protein Interfaces, Surfaces and Assemblies (PISA) server, and to compare the three biological units side by side. Statistical analyses show significant inconsistency both within and between these databases. To resolve this inconsistency, we studied interfaces across different PDB entries in a protein family, on the assumption that interfaces shared by different crystal forms are likely to be biologically relevant. A novel computational method is proposed to achieve this goal. First, redundant data were removed by clustering similar crystal structures, and a representative entry was used for each cluster. Then a modified k-d tree algorithm was applied to speed up the identification of interfaces in crystals. The interface similarity functions were derived from Gaussian distributions fit to the data. Hierarchical clustering was used to group interfaces, and likely biological interfaces were defined by the number of crystal forms in a cluster. Benchmark data sets were used to determine whether the presence or absence of interfaces across multiple crystal forms can be used to predict whether a protein is an oligomer. The probability that a common interface is biological is given. An interface shared in two different crystal forms by divergent proteins is very likely to be biologically important. The interface data not only provide new interaction templates for computational modeling, but also provide more accurate training and test sets for data-mining research on predicting protein-protein interactions.
In summary, we developed a framework based on databases that integrate biological unit information from different sources and store the new interface data. To make the data accessible to users in the biology community, we provide a stand-alone software program, a web site with a user-friendly graphical interface, and a web service.
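
The step of clustering interfaces and ranking clusters by the number of crystal forms they span can be illustrated with a minimal sketch. This is not the dissertation's implementation: the linkage rule (single linkage cut at a fixed threshold), the function names `cluster_interfaces` and `crystal_form_support`, the threshold value, and the similarity scores are all assumptions made for illustration. In the actual method the similarity functions are derived from Gaussian distributions fit to the data.

```python
from collections import defaultdict


def cluster_interfaces(interfaces, similarity, threshold):
    """Single-linkage clustering cut at `threshold` (an assumed linkage rule).

    interfaces: list of (interface_id, crystal_form_id) tuples.
    similarity: dict mapping (id_a, id_b) -> similarity score.
    Returns a list of clusters, each a list of (interface_id, crystal_form_id).
    """
    # Union-find: cutting a single-linkage dendrogram at a threshold yields
    # the connected components of the "similarity >= threshold" graph.
    parent = {iid: iid for iid, _ in interfaces}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for (a, b), score in similarity.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for iid, form in interfaces:
        clusters[find(iid)].append((iid, form))
    return list(clusters.values())


def crystal_form_support(cluster):
    """Number of distinct crystal forms in which the clustered interface appears."""
    return len({form for _, form in cluster})


# Hypothetical data: four interfaces observed in three crystal forms.
interfaces = [("I1", "CF1"), ("I2", "CF2"), ("I3", "CF3"), ("I4", "CF1")]
similarity = {("I1", "I2"): 0.9, ("I2", "I3"): 0.8, ("I1", "I4"): 0.1}

clusters = cluster_interfaces(interfaces, similarity, threshold=0.5)
# Rank clusters so that interfaces seen in the most crystal forms come first.
ranked = sorted(clusters, key=crystal_form_support, reverse=True)
for cluster in ranked:
    print(sorted(iid for iid, _ in cluster), crystal_form_support(cluster))
```

Under the abstract's assumption, a cluster supported by several crystal forms (here the one containing I1, I2 and I3) would be a stronger candidate for a biological interface than one seen in a single form.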