Tuning Text Classification for Hereditary Diseases with Section Weighting

Jörg Hakenberg1*, Juliane Rutsch2, and Ulf Leser1

1 Humboldt-Universität zu Berlin, Department of Computer Science, Knowledge Management Group, Unter den Linden 6, 10099 Berlin, Germany.
2 School of Electrical Engineering and Computer Science, FH Stralsund, Zur Schwedenschanze 15, 18435 Stralsund, Germany.
* Corresponding author. Current affiliation: Knowledge Management in Bioinformatics, Dept. Computer Science, Humboldt-Universität zu Berlin, Rudower Chaussee 25, 12489 Berlin, Germany. Phone: +49.30.2093.3903, eMail: hakenberg(a)informatik.hu-berlin.de


Abstract

Motivation: Information in life science publications is distributed unevenly across sections. Depending on the research question, different sections cover more or less of the data needed to answer it. Our approach, called section weighting, seeks to exploit the information coverage and density found in typical life science publications. We study the impact of section weighting on text classification according to hereditary diseases.
Results: Our results indicate that weighting sections can improve text classification. Our system gains 7% in F1-measure when we add section weighting. Proper composition of features is equally crucial, improving our results by 11%. Combining both techniques, the system yields a performance 18% higher than the baseline classifier. For our research question, favoring the sections Abstract, Introduction, and Materials and Methods yields the best results.
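Illustration: The abstract does not specify how section weights enter the feature representation, so the following is only a minimal sketch of one plausible reading, in which each term occurrence is scaled by a weight attached to the section it appears in. The function names, the tokenizer, and the weight values are all assumptions made for illustration; they are not taken from the paper.

```python
import re
from collections import Counter

# Hypothetical weights: illustrative values only, not the ones used in the paper.
# Sections favored in the abstract's results get a higher weight here.
SECTION_WEIGHTS = {
    "abstract": 2.0,
    "introduction": 2.0,
    "materials and methods": 2.0,
}
DEFAULT_WEIGHT = 1.0  # assumed weight for every other section


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def section_weighted_counts(sections, weights=SECTION_WEIGHTS, default=DEFAULT_WEIGHT):
    """Return term counts in which each occurrence is scaled by its section's weight."""
    counts = Counter()
    for name, text in sections.items():
        weight = weights.get(name.lower(), default)
        for token in tokenize(text):
            counts[token] += weight
    return dict(counts)


# Toy document with two sections (made-up text).
paper = {
    "Abstract": "Mutation in the CFTR gene",
    "Results": "The mutation was frequent",
}
features = section_weighted_counts(paper)
```

In this sketch, a term that occurs once in the Abstract and once in the Results ends up with a larger value than a term that occurs twice in the Results. The resulting dictionary could then be fed to any standard bag-of-words classifier.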


Published in
Proceedings of the First International Symposium on Semantic Mining in Biomedicine (SMBM), pp. 34-37. Hinxton, UK, April 2005.
[PDF] - [SMBM 2005]

@InProceedings{Hakenberg:2005a,
  author = {J\"org Hakenberg and Juliane Rutsch and Ulf Leser},
  title = {Tuning Text Classification for Hereditary Diseases with Section Weighting},
  booktitle = {Proc International Symposium on Semantic Mining in Biomedicine, SMBM},
  address = {Hinxton, UK},
  pages = {34--37},
  month = {April},
  year = 2005
}