Poster for ECCB 2004 - The usefulness of integrated databases
The usefulness of integrated databases: a case study of COLUMBA
Stefan Günther1, Kristian Rother2, Silke Trissl
1stefanxguenther@web.de, Institute of Biochemistry, Charité, Berlin, Germany;
2kristian.rother@charite.de, Institute of Biochemistry, Charité, Berlin, Germany
Short abstract:
In our poster we show possible applications of COLUMBA by highlighting several aspects of linkage quality in biological databases. In addition, we used the database to generate sets of PDB entries containing Protein-DNA complexes and their homologues in order to study motifs of DNA-binding sites in proteins.
Long abstract:
Coverage of PDB with external information
We investigated to what extent PDB entries are covered by second-party annotation. For each secondary database that is based on methods involving manual steps, we found an 'annotation gap' for structures less than seven years old. The examined databases overlap each other well, dividing the PDB into two well-annotated thirds and one poorly annotated third. Poorly annotated structures are either very new or are of a kind often not desired in datasets, such as entries containing only DNA or small molecules, or low-resolution structures.
Analysis of DNA-binding regions
From the COLUMBA database we retrieved a dataset of 650 Protein-DNA
complexes and 555 highly homologous DNA-binding proteins (> 90 %
identity) that were crystallized without nucleic acid. The advantage of
using COLUMBA over web resources was the possibility to enter a more
complex search phrase in the SQL query language. As a result, the
dataset contained no unwanted protein structures except for two
immunoglobulins and one thrombin.
We excluded entries in which the nucleic acid consists only of single
nucleotides or of RNA. In addition, we removed entries containing only
a short peptide. In the remaining set we identified the DNA and
polypeptide chains. The dataset was cross-checked with
other publicly available Protein-DNA complex resources from PDB, NDB,
and IMB. We found that, according to our criteria, those sets contain
several wrongly identified complexes. Finally, all entries were
validated manually by visual inspection.
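
A complex search phrase of the kind described above could look roughly like the following sketch. It uses Python's built-in sqlite3 module and a hypothetical single-table schema (a table `chain` with the columns `pdb_id`, `chain_id`, `mol_type`, and `length`). Neither this table layout nor the 20-residue cutoff for a "short peptide" comes from COLUMBA or from this abstract; both are assumptions made only for illustration, and the entry names are invented.

```python
import sqlite3

# Assumed cutoff for a "short peptide"; the abstract does not give a number.
MIN_PROTEIN_LENGTH = 20

# Hypothetical schema (not COLUMBA's): one row per chain of a PDB entry.
SCHEMA = """
CREATE TABLE chain (
    pdb_id   TEXT,
    chain_id TEXT,
    mol_type TEXT,     -- 'protein', 'DNA' or 'RNA'
    length   INTEGER   -- number of residues or nucleotides
)
"""

# Keep entries that have a DNA chain longer than one nucleotide,
# no RNA chain, and at least one protein chain above the length cutoff.
QUERY = """
SELECT pdb_id
FROM chain
GROUP BY pdb_id
HAVING SUM(CASE WHEN mol_type = 'DNA' AND length > 1 THEN 1 ELSE 0 END) > 0
   AND SUM(CASE WHEN mol_type = 'RNA' THEN 1 ELSE 0 END) = 0
   AND SUM(CASE WHEN mol_type = 'protein' AND length >= ? THEN 1 ELSE 0 END) > 0
ORDER BY pdb_id
"""


def build_demo_db():
    """Create an in-memory database with a few invented entries."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    rows = [
        ("demoA", "A", "protein", 150), ("demoA", "B", "DNA", 12),  # kept
        ("demoB", "A", "protein", 200), ("demoB", "B", "RNA", 30),  # RNA
        ("demoC", "A", "protein", 8), ("demoC", "B", "DNA", 10),    # short peptide
        ("demoD", "A", "protein", 120), ("demoD", "B", "DNA", 1),   # single nucleotide
        ("demoE", "A", "protein", 300),                             # no nucleic acid
    ]
    conn.executemany("INSERT INTO chain VALUES (?, ?, ?, ?)", rows)
    return conn


def protein_dna_entries(conn, min_len=MIN_PROTEIN_LENGTH):
    """Return the IDs of entries that pass all three filters."""
    return [row[0] for row in conn.execute(QUERY, (min_len,))]


if __name__ == "__main__":
    print(protein_dna_entries(build_demo_db()))
```

Expressing all three filters in one grouped query is what a keyword-based web search typically cannot do, which is the advantage the text attributes to SQL access.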
The final dataset was divided into 218 protein families according to
their sequence identity. An amino acid of a polypeptide chain in a
complex was assigned to the binding region if it lies less than 5 Å
from a nucleotide chain.
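
The 5 Å criterion can be sketched as a simple distance test. The code below is a minimal illustration, not the procedure used for the poster: it assumes that the distance is measured between any protein atom and any nucleotide atom (the abstract does not say which atoms were compared), and the function name and input layout are hypothetical.

```python
import math

CUTOFF = 5.0  # distance threshold in Å, as stated in the text


def binding_residues(protein_atoms, dna_atoms, cutoff=CUTOFF):
    """Return the sorted IDs of residues with an atom closer than `cutoff` to DNA.

    protein_atoms: iterable of (residue_id, (x, y, z)) tuples, one per atom
    dna_atoms:     iterable of (x, y, z) tuples, one per nucleotide atom
    """
    dna = list(dna_atoms)
    binding = set()
    for residue_id, coord in protein_atoms:
        if residue_id in binding:
            continue  # residue already assigned to the binding region
        # Brute-force check against every nucleotide atom.
        if any(math.dist(coord, d) < cutoff for d in dna):
            binding.add(residue_id)
    return sorted(binding)


if __name__ == "__main__":
    protein = [
        (10, (0.0, 0.0, 0.0)),
        (11, (20.0, 0.0, 0.0)),
        (12, (3.0, 0.0, 4.0)),
        (13, (0.0, 0.0, 9.0)),
    ]
    dna = [(0.0, 0.0, 4.0)]
    print(binding_residues(protein, dna))
```

The brute-force loop is quadratic in the number of atoms. For a dataset of several hundred complexes a spatial index such as a k-d tree would be the usual optimization, but the selection rule stays the same.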
We then compared the three-dimensional structures of the binding
regions with the Needle-Haystack search algorithm (Hoppe & Frömmel
2003) to search for common structural features.
We found that within a family the local conformation of DNA-binding
regions can differ strongly as long as no DNA is bound to that region.
As soon as DNA is complexed with the protein, the family shares a
common subfold for that binding pocket. In addition, we used the
binding pockets of helix-turn-helix proteins, which represent a
prominent DNA-binding motif, for a similarity screening against the
entire dataset. We found closely related proteins; however, the
specificity of the matches does not allow distant relatives to be
identified reliably.
References
Hoppe, A. and Frömmel, C. 2003. NeedleHaystack: a program for the rapid recognition of local structures in large sets of atomic coordinates. J. Appl. Cryst. 36(4):1090-1097.