
Fighting the Data Acquisition Bottleneck

Sebastian Schmeier1, Axel Kowald1, Jörg Hakenberg2*, Ulf Leser2, and Edda Klipp1

1 Max-Planck-Institute for Molecular Genetics, Kinetic Modeling Group, 14195 Berlin, Germany;
2 Humboldt-Universität zu Berlin, Dept. Computer Science, Knowledge Management in Bioinformatics, 12489 Berlin, Germany;
* Corresponding author. Current affiliation: Knowledge Management in Bioinformatics, Dept. Computer Science, Humboldt-Universität zu Berlin, Rudower Chaussee 25, 12489 Berlin, Germany. Phone: +49.30.2093.3903, Email: hakenberg(a)informatik.hu-berlin.de

Abstract

Quantitative modeling of complex biological systems requires a large number of specific stoichiometric coefficients and kinetic constants. Only a small fraction of these data is available in databases [1]. Most of it is buried in a rapidly growing body of scientific publications and is therefore not available for computation. Researchers cannot keep up with this flood of data without computational support.

We are developing a text mining system that supports systems biology research by finding relevant publications [2]. We use a corpus of 800 manually annotated full-text documents to train a support vector machine to recognize "interesting" publications, i.e., those that contain facts from wet lab experiments and not only theoretical insights. New documents are classified based on a linguistic analysis of their content. Using 5-fold cross-validation, we estimate that our classifier reaches 60% precision at 50% recall. Compared with a pure keyword-style search procedure, our approach achieves five times the precision.
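To make the evaluation terms concrete, the following is a minimal sketch of how a 5-fold split of an 800-document corpus and the precision and recall figures could be computed. It is not the authors' implementation: the round-robin fold assignment, the function names, and the toy labels are assumptions chosen for illustration, and the support vector machine itself is left out.

```python
from typing import List, Sequence, Tuple


def k_fold_indices(n_items: int, k: int = 5) -> List[Tuple[List[int], List[int]]]:
    """Return k (train, test) index pairs with disjoint test folds.

    Assumption: documents are assigned to folds round-robin.
    """
    folds = [list(range(i, n_items, k)) for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


def precision_recall(gold: Sequence[bool], predicted: Sequence[bool]) -> Tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp = sum(g and p for g, p in zip(gold, predicted))
    fp = sum((not g) and p for g, p in zip(gold, predicted))
    fn = sum(g and (not p) for g, p in zip(gold, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# 800 documents, 5 folds: each round trains on 640 and tests on 160.
# A classifier would be trained on `train` and applied to `test` here.
splits = k_fold_indices(800, k=5)

# Hypothetical toy fold of 20 documents, 6 of them truly "interesting".
gold = [True] * 6 + [False] * 14
# Hypothetical predictions: 3 hits, 3 misses, 2 false alarms.
predicted = [True] * 3 + [False] * 3 + [True] * 2 + [False] * 12

precision, recall = precision_recall(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy fold, three of the five flagged documents are truly interesting (60% precision), and three of the six interesting documents are found (50% recall).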

This shows that text mining can play an important role in supporting research in systems biology. We are extending the system so that it not only finds interesting publications but also extracts the relevant data in a semi-automatic process and stores them in a newly created database.


References

[1] SCHOMBURG, I., CHANG, A., and SCHOMBURG, D. (2002).
BRENDA, Enzyme Data and Metabolic Information. Nucleic Acids Research, 30:47-49.
[2] HAKENBERG, J., SCHMEIER, S., KOWALD, A., KLIPP, E., and LESER, U. (2004).
Finding Kinetic Parameters Using Text Mining. OMICS, Special Issue: Data Mining meets Integrative Biology - A Symbiosis in the Making. Invited submission, to appear.

Published in
Int Conf Sys Biol (ICSB) 2004. Heidelberg, Germany. Accepted.