Poster for ECCB 2004 - COLUMBA - A database of annotated protein structures
COLUMBA - A database of annotated protein structures
Silke Trissl (1), Kristian Rother (2), Ulf Leser
(1) trissl@informatik.hu-berlin.de, Institute of Informatics,
Humboldt-Universität zu Berlin, Germany
(2) kristian.rother@charite.de, Institute of Biochemistry,
Charité, Berlin, Germany
Short abstract:
We present COLUMBA, a database of information on protein structures that integrates data from twelve different biological databases, including ENZYME, KEGG, SCOP, CATH, DSSP, and SwissProt. COLUMBA allows for the quick computation of sets of protein structures that share interesting properties according to the different data sources.
Long abstract:
The number of protein structures deposited in the Protein Data Bank
(PDB; Berman et al. 2000) is increasing rapidly. This allows researchers
in the life sciences to study complex relationships between
macromolecular structures and their properties, such as biological
function, folding classification, or secondary structure. Such studies
require not only the three-dimensional (3D) structures but also the
folding classification and several other properties of each protein.
Gathering such information from web resources by following hyperlinks
is a tedious and time-consuming task.
We have created COLUMBA (Rother et al. 2004), a database of
information on protein structures that physically integrates
information from twelve different data sources into a single relational
data warehouse. We enrich the protein structures from the PDB with
- structure classifications from SCOP and CATH,
- computed secondary structures from the DSSP program,
- functional annotation from ENZYME and GO,
- participation in metabolic pathways from KEGG and the Boehringer map,
- taxonomic information from the NCBI Taxonomy,
- and further information from SwissProt.
In addition, each chain is assigned to a cluster of similar sequences by PISCES and SYSTERS.
We have created a user-friendly web interface, which is available at
http://www.columba-db.de. It allows a full-text search as well as data
source specific queries. The interface uses a "query refinement"
paradigm to return a set of PDB entries that fulfill the stated
conditions. A query is defined by entering restriction conditions in
the form for the data source specific annotation. The user can combine
queries from different data sources, which act as filters, to obtain
the desired subset of PDB entries. The interface supports interactive
and exploratory usage by making it straightforward to add, delete,
tighten, or relax conditions. A header called the "filter chain" guides
the user by stating the number of PDB entries remaining after each
filter step.
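The query refinement idea can be illustrated with a minimal Python sketch. It is not the COLUMBA implementation: the `Entry` record, the `run_filter_chain` function, and all entry identifiers and values are invented for illustration. It only shows how filters applied in sequence yield a per-step count like the one the "filter chain" header reports.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Entry:
    """Toy stand-in for an annotated PDB entry (all values are invented)."""
    pdb_id: str
    scop_fold: str
    resolution: float  # in Angstrom


# A filter is a label (for the header) plus a predicate on an entry.
Filter = Tuple[str, Callable[[Entry], bool]]


def run_filter_chain(entries: List[Entry], filters: List[Filter]):
    """Apply the filters in order.

    Returns the surviving entries and a list of (label, count) pairs,
    one per step, starting with the unfiltered total.
    """
    counts = [("all entries", len(entries))]
    current = list(entries)
    for label, predicate in filters:
        current = [e for e in current if predicate(e)]
        counts.append((label, len(current)))
    return current, counts


# Hypothetical data: the identifiers are placeholders, not real PDB codes.
ENTRIES = [
    Entry("x001", "TIM barrel", 1.8),
    Entry("x002", "TIM barrel", 2.5),
    Entry("x003", "Rossmann fold", 1.5),
    Entry("x004", "TIM barrel", 1.9),
]

FILTERS = [
    ("SCOP fold = TIM barrel", lambda e: e.scop_fold == "TIM barrel"),
    ("resolution < 2.0", lambda e: e.resolution < 2.0),
]

result, chain = run_filter_chain(ENTRIES, FILTERS)
for label, count in chain:
    print(f"{label}: {count}")
```

Removing or relaxing a condition in this sketch amounts to editing the `FILTERS` list and running the chain again, which mirrors the exploratory usage described above.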
The result set gives basic information on each entry returned. For any
single entry, the user can view the full scope of COLUMBA, that is, all
annotated information available for that entry.
Through the web interface, it is fairly simple to answer the following two questions:
- Which structures contain chains with a TIM-barrel fold and have a resolution better than 2.0 Å?
- Which proteins in the citric acid cycle have a resolved structure?
The first query is answered by entering the phrase 'TIM barrel' in one
of the Protein Fold forms (as fold in SCOP or as keyword in CATH) and
then entering '2.0' for the resolution condition in the PDB Structure
form. This currently yields the desired set of 370 entries for CATH and
381 for SCOP.
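Because the data is held in a relational warehouse, the first question corresponds to a join between structure data and fold classification. The sketch below uses Python's `sqlite3` module with an in-memory database. The table names `pdb_entry` and `scop_chain`, their columns, and all rows are assumptions made for illustration. They do not reproduce the actual COLUMBA schema or its data.

```python
import sqlite3


def build_toy_warehouse() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE pdb_entry (
            pdb_id     TEXT PRIMARY KEY,
            resolution REAL               -- in Angstrom
        );
        CREATE TABLE scop_chain (
            pdb_id   TEXT REFERENCES pdb_entry(pdb_id),
            chain_id TEXT,
            fold     TEXT
        );
        """
    )
    # Invented rows: the identifiers are placeholders, not real PDB codes.
    conn.executemany(
        "INSERT INTO pdb_entry VALUES (?, ?)",
        [("x001", 1.8), ("x002", 2.5), ("x003", 1.5), ("x004", 1.9)],
    )
    conn.executemany(
        "INSERT INTO scop_chain VALUES (?, ?, ?)",
        [
            ("x001", "A", "TIM barrel"),
            ("x001", "B", "TIM barrel"),
            ("x002", "A", "TIM barrel"),
            ("x003", "A", "Rossmann fold"),
            ("x004", "A", "TIM barrel"),
            ("x004", "B", "Ferredoxin-like"),
        ],
    )
    return conn


def structures_with_fold(conn, fold: str, max_resolution: float):
    """Return IDs of entries with at least one chain of the given fold
    and a resolution value below max_resolution."""
    rows = conn.execute(
        """
        SELECT DISTINCT e.pdb_id
        FROM pdb_entry AS e
        JOIN scop_chain AS s ON s.pdb_id = e.pdb_id
        WHERE s.fold LIKE ? AND e.resolution < ?
        ORDER BY e.pdb_id
        """,
        (f"%{fold}%", max_resolution),
    ).fetchall()
    return [row[0] for row in rows]


conn = build_toy_warehouse()
print(structures_with_fold(conn, "TIM barrel", 2.0))
```

`DISTINCT` is used because the classification is stored per chain here, so an entry with several matching chains would otherwise be listed more than once.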
The second query can be answered by using the Metabolism form of
COLUMBA. The option 'path coverage' not only shows the enzymes
participating in the selected pathway, but also the number of
structures known for each enzyme.
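One plausible reading of 'path coverage' is a per-enzyme count of annotated structures, in which enzymes without any known structure are reported with a count of zero. The Python sketch below follows that reading. The function `path_coverage`, the enzyme labels, and the entry identifiers are all hypothetical and are not taken from KEGG or COLUMBA.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def path_coverage(
    pathway_enzymes: List[str],
    structure_annotations: Iterable[Tuple[str, str]],
) -> Dict[str, int]:
    """Count structures per enzyme of a pathway.

    pathway_enzymes: enzyme labels that take part in the pathway.
    structure_annotations: (entry_id, enzyme_label) pairs.
    Enzymes without any structure are kept in the result with a count of 0.
    """
    wanted = set(pathway_enzymes)
    counts = Counter(
        enzyme for _, enzyme in structure_annotations if enzyme in wanted
    )
    return {enzyme: counts[enzyme] for enzyme in pathway_enzymes}


# Invented example data: placeholder enzyme labels and entry identifiers.
PATHWAY = ["EC-A", "EC-B", "EC-C"]
ANNOTATIONS = [
    ("x001", "EC-A"),
    ("x002", "EC-A"),
    ("x003", "EC-B"),
    ("x004", "EC-Z"),  # enzyme outside the pathway, ignored
]

coverage = path_coverage(PATHWAY, ANNOTATIONS)
for enzyme, n_structures in coverage.items():
    print(f"{enzyme}: {n_structures} structure(s)")
```

Keeping the zero counts makes the gaps visible, which is what the second question asks about: which enzymes of the pathway have a resolved structure and which do not.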
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N.,
Weissig, H., Shindyalov, I.N., and Bourne, P.E. 2000. The Protein Data
Bank. Nucleic Acids Research 28: 235-242.
Rother, K., Müller, H., Trissl, S., Koch, I., Steinke, T.,
Preissner, R., Frömmel, C., and Leser, U. 2004. COLUMBA: Multidimensional
Data Integration of Protein Annotations. In E. Rahm (Ed.): DILS 2004,
LNBI 2994, 156-171.