Humboldt-Universität zu Berlin - Mathematisch-Naturwissenschaftliche Fakultät - Wissensmanagement in der Bioinformatik

Poster for ECCB 2004 - The usefulness of integrated databases

The usefulness of integrated databases: a case study of COLUMBA


Stefan Günther (1), Kristian Rother (2), Silke Trissl
(1) stefanxguenther@web.de, Institute of Biochemistry, Charité, Berlin, Germany;
(2) kristian.rother@charite.de, Institute of Biochemistry, Charité, Berlin, Germany

Short abstract:

In our poster we show possible applications of COLUMBA by highlighting several aspects of the linkage quality in biological databases. We also used the database to generate sets of PDB entries containing Protein-DNA complexes and their homologues in order to study motifs of DNA-binding sites in proteins.


Long abstract:

We have used COLUMBA, a database of annotated protein structures (Rother et al., 2004), to highlight several aspects of the linkage quality in biological databases. In addition, we used COLUMBA to generate sets of PDB entries containing Protein-DNA complexes and their homologues in order to study motifs of DNA-binding sites in proteins.

Coverage of PDB with external information

We investigated to what extent PDB entries are covered by annotation from secondary databases. For each secondary database that is based on methods involving manual steps, we found an 'annotation gap' for structures less than seven years old. The examined databases overlap each other well, dividing the PDB into two well-annotated thirds and one poorly annotated third. Poorly annotated structures are either very new or contain molecules that are often not desired in datasets, such as mere DNA, small molecules, or low-resolution structures.

Analysis of DNA-binding regions

From the COLUMBA database we retrieved a dataset of 650 Protein-DNA complexes and 555 highly homologous DNA-binding proteins (> 90 % identity) that were crystallized without nucleic acid. Compared to web resources, COLUMBA has the advantage that more complex search criteria can be expressed in the SQL query language. As a result, the dataset contained no unwanted protein structures except for two immunoglobulins and one thrombin.
We excluded entries in which the nucleic acid consists of single nucleotides or of RNA. In addition, we removed entries containing just a short peptide. In the remaining set we identified the DNA chains and the polypeptide chains. The dataset was cross-checked against other publicly available Protein-DNA complex resources from PDB, NDB, and IMB. We found that, according to our criteria, those sets contain several wrongly identified complexes. Finally, all entries were validated manually by visual inspection.
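The kind of compound selection described above can be sketched as a single SQL query. The following is only an illustration: the table `chain`, its columns, the entry identifiers, and the length thresholds are hypothetical and do not reflect the actual COLUMBA schema or the cutoffs used in this study.

```python
import sqlite3

# Hypothetical toy schema: one row per chain of a PDB entry.
# This is NOT the COLUMBA schema; all names and values are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE chain (pdb_id TEXT, chain_id TEXT, mol_type TEXT, length INTEGER);
INSERT INTO chain VALUES
  ('X001', 'A', 'protein', 120), ('X001', 'B', 'DNA', 12),
  ('X002', 'A', 'protein', 200), ('X002', 'B', 'RNA', 30),
  ('X003', 'A', 'protein', 8),   ('X003', 'B', 'DNA', 10),
  ('X004', 'A', 'protein', 150), ('X004', 'B', 'DNA', 1),
  ('X005', 'A', 'protein', 90),
  ('X006', 'A', 'protein', 100), ('X006', 'B', 'DNA', 14),
  ('X006', 'C', 'RNA', 20);
""")

MIN_PEPTIDE_LENGTH = 20  # assumed threshold for "short peptide"
MIN_DNA_LENGTH = 2       # excludes single nucleotides

# Keep entries with a sufficiently long protein chain and a DNA chain,
# and drop every entry that contains an RNA chain.
query = """
SELECT DISTINCT p.pdb_id
FROM chain p
JOIN chain d ON d.pdb_id = p.pdb_id
WHERE p.mol_type = 'protein' AND p.length >= ?
  AND d.mol_type = 'DNA' AND d.length >= ?
  AND NOT EXISTS (
      SELECT 1 FROM chain r
      WHERE r.pdb_id = p.pdb_id AND r.mol_type = 'RNA')
ORDER BY p.pdb_id
"""
complexes = [row[0] for row in
             conn.execute(query, (MIN_PEPTIDE_LENGTH, MIN_DNA_LENGTH))]
print(complexes)
```

Combining several conditions in one statement is what a simple keyword search on a web interface typically cannot express.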
The final dataset was divided into 218 protein families according to their sequence identity. The binding region of a polypeptide chain in a complex was defined as all amino acids located less than 5 Å from a nucleotide chain.
We then compared the three-dimensional structures of the binding regions with the Needle-Haystack search algorithm (Hoppe & Frömmel 2003) to find interesting properties.
We found that within a family the local conformation of DNA-binding regions can be highly diverse as long as no DNA is bound to that region. As soon as DNA is bound to the protein, the members of a family share a common subfold for that binding pocket. In addition, we used the binding pockets of helix-turn-helix proteins, which represent a prominent DNA-binding motif, for a similarity screening against the entire dataset. We found closely related proteins; however, the specificity of the matches is not sufficient to reliably identify distant relatives.

References

Rother, K., Müller, H., Trissl, S., Koch, I., Steinke, T., Preissner, R., Frömmel, C., Leser, U. 2004. COLUMBA: Multidimensional Data Integration of Protein Annotations. E. Rahm (Ed.): DILS 2004, LNBI 2994, 156-171.
Hoppe, A. and Frömmel, C. 2003. NeedleHaystack: a program for the rapid recognition of local structures in large sets of atomic coordinates. J. Appl. Cryst. 36(4):1090-1097.