A Support Vector Machine Classifier for Gene Name Recognition

Steffen Bickel1, Ulf Brefeld1, Lukas Faulstich2, Jörg Hakenberg2,*, Ulf Leser2, Conrad Plake2, and Tobias Scheffer1

1 Humboldt-Universität zu Berlin, Department of Computer Science, Knowledge Management Group
2 Humboldt-Universität zu Berlin, Department of Computer Science, Knowledge Management in Bioinformatics
* Corresponding author. Current affiliation: Knowledge Management in Bioinformatics, Dept. Computer Science, Humboldt-Universität zu Berlin, Rudower Chaussee 25, 12489 Berlin, Germany. Phone: +49.30.2093.3903, eMail: hakenberg(a)informatik.hu-berlin.de


Abstract

This summary describes our solution for task 1A of the BioCreAtIvE Challenge Cup 2003. We reduce the entity recognition problem to classifying single words with a Support Vector Machine, followed by a term expansion step. Our research question is therefore which types of features yield the highest precision and recall. We implemented and evaluated different features and combinations of features, such as n-grams, the neighborhood defined by a sliding window, classification results of preceding words, the presence of special characters or digits, and the presence of the word in a dictionary. Multi-word entity names are assembled in a context-sensitive post-processing step. For the closed division, our best set of features on the training set yields a precision of 71.4% and a recall of 72.8%, corresponding to an F-measure of 72.1%.
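
The feature types listed in the abstract can be illustrated with a minimal sketch of per-word feature extraction. This is not the authors' implementation. The function names (char_ngrams, token_features), the n-gram length of 3, the window size, the feature key format, and the label "O" are all assumptions made for illustration. The resulting sparse feature dictionaries would then be vectorized and passed to an SVM, a step this sketch leaves out.

```python
import re


def char_ngrams(word, n=3):
    """Return character n-grams of a word, padded with boundary markers.

    The n-gram length and the '^'/'$' padding are illustrative assumptions.
    """
    padded = "^" + word.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def token_features(tokens, i, dictionary, window=1, prev_label=None):
    """Build a sparse binary feature dict for tokens[i] (hypothetical layout)."""
    word = tokens[i]
    feats = {}

    # Character n-grams of the word itself.
    for gram in char_ngrams(word):
        feats["ngram=" + gram] = 1

    # Surface cues: digits and special (non-alphanumeric ASCII) characters.
    if any(c.isdigit() for c in word):
        feats["has_digit"] = 1
    if re.search(r"[^A-Za-z0-9]", word):
        feats["has_special"] = 1

    # Dictionary lookup (the dictionary is a plain set of lowercase names here).
    if word.lower() in dictionary:
        feats["in_dictionary"] = 1

    # Neighborhood: words inside a sliding window around position i.
    for offset in range(-window, window + 1):
        j = i + offset
        if offset != 0 and 0 <= j < len(tokens):
            feats["w[%+d]=%s" % (offset, tokens[j].lower())] = 1

    # Classification result of the preceding word, if already known.
    if prev_label is not None:
        feats["prev_label=" + prev_label] = 1

    return feats


if __name__ == "__main__":
    sentence = ["the", "p53", "gene"]
    print(token_features(sentence, 1, {"p53"}, window=1, prev_label="O"))
```

Keeping each feature type behind its own key prefix makes it straightforward to switch individual types on or off when comparing feature combinations.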


Published in
EMBO Workshop: A critical assessment of text mining methods in molecular biology. Granada, Spain, March 2004