Optimizing Syntax-Patterns for Discovering Protein-Protein-Interactions
Conrad Plake1, Jörg Hakenberg1*, and Ulf Leser1
1 Humboldt-Universität zu Berlin, Department of Computer Science, Knowledge Management Group
* Corresponding author. Current affiliation: Knowledge Management in Bioinformatics, Dept. Computer Science, Humboldt-Universität zu Berlin, Rudower Chaussee 25, 12489 Berlin, Germany. Phone: +49.30.2093.3903, eMail: hakenberg(a)informatik.hu-berlin.de
Abstract
We propose a method for the automated extraction of protein-protein interactions from scientific text. Our system matches sentences against syntax patterns that typically describe protein interactions. We define a set of 22 patterns, each a regular expression consisting of anchor positions and parameterizable constraints. This small set is then refined and optimized using a genetic algorithm on a training set. No heuristic definitions are necessary, and the final pattern set can be generated entirely without manual curation. Our method can be applied to any syntax-pattern-based protein-protein interaction system and thus complements related work on building comprehensive sets of such patterns. Applying different fitness functions during evolution provides an easy way to tune the system toward precision, recall, or F-measure. We evaluate our system on two samples, one derived from the BioCreAtIvE corpus, the other from references in the DIP. The automatic refinement of patterns adds up to 16% to the precision and 5% to the recall of our system. We additionally study the impact of proper protein name recognition, which could improve precision by about 17% and recall by 14%.
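To illustrate how the choice of fitness function can steer optimization toward precision, recall, or F-measure, the following is a minimal sketch and not the authors' implementation. It assumes that extracted and gold-standard interactions are represented as sets of unordered protein pairs; the function names, the `FITNESS` table, and the protein names in the example data are hypothetical and chosen only for illustration.

```python
def precision_recall_f(predicted, gold):
    """Return (precision, recall, F-measure) for two sets of protein pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        f_measure = 0.0
    else:
        # Balanced F-measure: harmonic mean of precision and recall.
        f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical lookup table: each objective picks one of the three scores.
FITNESS = {
    "precision": lambda p, r, f: p,
    "recall": lambda p, r, f: r,
    "f-measure": lambda p, r, f: f,
}


def fitness(predicted, gold, objective="f-measure"):
    """Score one candidate pattern set by the chosen objective."""
    return FITNESS[objective](*precision_recall_f(predicted, gold))


# Illustrative data only: unordered pairs modeled as frozensets.
gold = {frozenset(("RAD51", "BRCA2")), frozenset(("TP53", "MDM2"))}
predicted = {frozenset(("RAD51", "BRCA2")), frozenset(("TP53", "EGFR"))}

for objective in ("precision", "recall", "f-measure"):
    print(objective, fitness(predicted, gold, objective))
```

In a genetic algorithm, a score of this kind would be computed for each candidate pattern set on the training data, so swapping the objective changes which candidates survive selection.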
Published in
Proceedings of the ACM Symposium on Applied Computing, SAC 2005, Bioinformatics Track. Volume 1, pp. 195-201. Santa Fe, USA, March 2005.
@InProceedings{Plake:2005a,
  author    = {Conrad Plake and J\"org Hakenberg and Ulf Leser},
  title     = {Optimizing Syntax-Patterns for Discovering Protein-Protein-Interactions},
  booktitle = {Proc ACM Symposium on Applied Computing, SAC, Bioinformatics Track},
  volume    = 1,
  pages     = {195-201},
  address   = {Santa Fe, USA},
  month     = {March},
  year      = 2005
}