ChemSpot
ChemSpot 2.0 is a set of tools for named entity recognition and classification of chemicals in natural language texts, including trivial names, abbreviations, molecular formulas and IUPAC entities. Because these classes of entities follow rather different naming conventions, ChemSpot combines a Conditional Random Field, a dictionary, pattern-based recognition, a classifier model and several methods for consolidating the resulting annotations. ChemSpot also performs named entity normalization by assigning identifiers from numerous chemical databases. It achieves an F1 measure of 79.0% on the SCAI corpus.
ChemSpot is released under the Common Public License 1.0.
The warning message "Couldn't open cc.mallet.util.MalletLogger resources/logging.properties file." can be ignored.
Downloads
- Archive including runnable jar-file, dictionary and models
  - ChemSpot 2.0 (ca. 411 MB)
  - ChemSpot 1.6 (ca. 223 MB)
- Source code on GitHub or as zip (ca. 23 MB)
  git clone git@github.com:rockt/ChemSpot.git
- Detailed evaluation on the SCAI corpus and 5 other corpora
Changelog
Version 2.0
- Updated dictionary file and ids list (ChEBI version 2014-01-02)
- New multi-class label tagger and classifier model
- Support for classification of chemicals into: SYSTEMATIC, IDENTIFIER, FORMULA, TRIVIAL, ABBREVIATION, FAMILY and MULTIPLE
- New automated dictionary and id file updater
Version 1.6
- Updated dictionary file and ids list (ChEBI version 2013-12-03)
- Some improvements in the annotation merger component
- Fixed bug where ChemSpot would not process some sentences if they weren't annotated properly
- Various minor improvements and bugfixes
Running ChemSpot from the Command-Line
- Extract chemspot.zip into a directory
  unzip chemspot.zip
- To tag a sample text file, run
  java -Xmx16G -jar chemspot.jar -t sample.txt -o predict.txt
Dictionary Update
- To update the dictionary, run
  java -Xmx5G -jar chemspot.jar -u
Memory Usage
- If you would like to reduce memory consumption and do not need ChemSpot to assign identifiers to chemicals, you can run it without the ids file. Note, however, that this completely disables named entity normalization.
  java -Xmx12G -jar chemspot.jar -t sample.txt -o predict.txt -i ""
- To reduce the memory footprint further, you can also run ChemSpot without the dictionary (-d "") or without the multi-class model (-M ""). Note, however, that this results in worse NER performance.
  java -Xmx7G -jar chemspot.jar -t sample.txt -o predict.txt -i "" -d ""
  java -Xmx9G -jar chemspot.jar -t sample.txt -o predict.txt -i "" -M ""
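The memory-saving variants above differ only in the heap size and in which resources are switched off with an empty argument. As a minimal sketch, the command line could be assembled from Java as follows. The class `ChemSpotCommand` and its `build` method are hypothetical helpers written for this illustration and are not part of ChemSpot; only the flags `-t`, `-o`, `-i`, `-d` and `-M` come from this page.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of ChemSpot) that assembles the commands shown above. */
final class ChemSpotCommand {

    /**
     * Builds the argument list for tagging a single text file.
     * heap is the value passed to -Xmx (e.g. "16G"); each boolean that is
     * false adds the corresponding flag with an empty path, as in the
     * Memory Usage examples.
     */
    static List<String> build(String heap, String input, String output,
                              boolean useIds, boolean useDictionary, boolean useMultiClass) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-Xmx" + heap);
        cmd.add("-jar");
        cmd.add("chemspot.jar");
        cmd.add("-t");
        cmd.add(input);
        cmd.add("-o");
        cmd.add(output);
        if (!useIds)        { cmd.add("-i"); cmd.add(""); }  // disables normalization
        if (!useDictionary) { cmd.add("-d"); cmd.add(""); }  // lowers NER performance
        if (!useMultiClass) { cmd.add("-M"); cmd.add(""); }  // lowers NER performance
        return cmd;
    }

    public static void main(String[] args) {
        // Equivalent of: java -Xmx7G -jar chemspot.jar -t sample.txt -o predict.txt -i "" -d ""
        List<String> cmd = build("7G", "sample.txt", "predict.txt", false, false, true);
        System.out.println(cmd);
        // To launch it for real, something like
        //   new ProcessBuilder(cmd).inheritIO().start().waitFor();
        // should work, assuming chemspot.jar is in the working directory.
    }
}
```

The heap sizes are the ones suggested on this page; the sketch does not check that a given heap size is sufficient for the selected resources.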
Parameters
- arguments:
  -m  path to a CRF model file (internal default model file will be used if not provided)
  -s  path to an OpenNLP sentence model file (internal default model file will be used if not provided)
  -d  path to a zipped set of brics dictionary automata (parameter defaults to 'dict.zip' if not provided)
  -i  path to a zipped tab-separated text file representing a map of terms to ids (parameter defaults to 'ids.zip' if not provided)
  -M  path to a multi-class model file (parameter defaults to 'multiclass.bin' if not provided)
- flags:
  -e  if this flag is set, the performance of ChemSpot on an IOB gold-standard corpus (cf. -c) is evaluated
  -u  if this flag is set, ChemSpot will update the dictionary and ids file
  -T  number of threads to create when processing a document collection
- input control:
  -c  path to a directory containing corpora in IOB format
  -g  path to a directory containing gzipped text files
  -t  path to a text file
  -f  path to a directory of text files
- output control:
  -o  path to output file
  -I  if this flag is set, the output will be converted into the IOB format
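For readers unfamiliar with the IOB scheme mentioned for -c and -I, the following sketch shows the general idea: each token is labelled B (begins a mention), I (inside a mention) or O (outside). It is a generic illustration and not ChemSpot's converter; the whitespace tokenization, the end-exclusive character spans, the tab-separated layout and the label name `CHEMICAL` are all assumptions made for this example, and ChemSpot's actual IOB files may differ.

```java
import java.util.ArrayList;
import java.util.List;

/** Generic IOB labelling sketch; not ChemSpot's own converter. */
final class IobSketch {

    /**
     * Splits text on whitespace and labels every token.
     * mentions holds character spans {start, end} with end exclusive
     * (an assumption of this sketch), each starting at a token boundary.
     */
    static List<String> label(String text, int[][] mentions, String type) {
        List<String> lines = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            // Skip whitespace between tokens.
            while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) pos++;
            if (pos >= text.length()) break;
            int start = pos;
            while (pos < text.length() && !Character.isWhitespace(text.charAt(pos))) pos++;
            int end = pos;

            String tag = "O";
            for (int[] m : mentions) {
                if (start < m[1] && end > m[0]) {            // token overlaps the mention
                    tag = (start <= m[0] ? "B-" : "I-") + type;
                    break;
                }
            }
            lines.add(text.substring(start, end) + "\t" + tag);
        }
        return lines;
    }

    public static void main(String[] args) {
        String text = "acetylsalicylic acid inhibits COX";
        int[][] mentions = { {0, 20} };                      // "acetylsalicylic acid"
        for (String line : label(text, mentions, "CHEMICAL")) {
            System.out.println(line);
        }
    }
}
```

In this example the two tokens of "acetylsalicylic acid" receive B-CHEMICAL and I-CHEMICAL, and the remaining tokens receive O.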
Using ChemSpot in your Code
ChemSpot tagger = ChemSpotFactory.createChemSpot("./dict.zip", "./ids.zip", "./multiclass.bin.gz");
String text = "The abilities of LHRH and a potent LHRH agonist ([D-Ser-(But),6, " +
    "des-Gly-NH210]LHRH ethylamide) inhibit FSH responses by rat " +
    "granulosa cells and Sertoli cells in vitro have been compared.";
for (Mention mention : tagger.tag(text)) {
    System.out.printf("%d\t%d\t%s\t%s\t%s,\t%s%n",
        mention.getStart(), mention.getEnd(), mention.getText(),
        mention.getCHID(), mention.getSource(), mention.getType().toString());
}
Database identifiers as provided by Jochem and OPSIN
Method (in class Mention) | Database |
getCHID() | ChemIDplus |
getCHEB() | ChEBI |
getCAS() | CAS registry number |
getPUBC() | PubChem compound |
getPUBS() | PubChem substance |
getINCH() | InChI |
getDRUG() | DrugBank |
getHMBD() | Human Metabolome Database |
getKEGG() | KEGG compound |
getKEGD() | KEGG drug |
getMESH() | MeSH |
Reproducing our Results
- Download the SCAI corpus (chemicals-test-corpus-27-04-2009-v3.iob.gz) and put it in the same directory as chemspot.jar
- To reproduce our results, run
  java -Xmx16G -jar chemspot.jar -c chemicals-test-corpus-27-04-2009-v3.iob.gz -o predict.txt -e
Evaluation on the SCAI corpus
For a more detailed evaluation on the SCAI corpus and 5 other corpora, see the detailed evaluation.
Tool | Precision (%) | Recall (%) | F1 Measure (%) |
ChemSpot 2.0 | 74.6 | 83.9 | 79.0 |
ChemSpot 1.6 | 77.0 | 73.5 | 75.2 |
OSCAR4.1 * | 66.0 | 75.3 | 70.3 |
ChemSpot (as published) | 67.3 | 68.9 | 68.1 |
OSCAR3 (Kolárik et al.) | 52 | 72 | 60 |
OSCAR3 (Hettne et al.) | 45 | 82 | 58 |
OSCAR4 | 45.7 | 76.5 | 57.3 |
OSCAR3 | 41.4 | 81.6 | 54.9 |
* This evaluation of OSCAR differs from former evaluations in that the better-performing PubMed recognizer model was used.
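The F1 measure in the table is the harmonic mean of precision and recall, F1 = 2PR / (P + R). As a small worked check, the sketch below recomputes the value for two rows of the table; the class name `F1Score` is only an illustrative helper and not part of ChemSpot.

```java
/** Illustrative helper: F1 as the harmonic mean of precision and recall. */
final class F1Score {

    /** Returns 2PR / (P + R), or 0 if both inputs are 0. Inputs may be in percent. */
    static double f1(double precision, double recall) {
        if (precision + recall == 0.0) {
            return 0.0;
        }
        return 2.0 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // Precision and recall taken from the table above.
        System.out.printf("ChemSpot 2.0: %.1f%n", f1(74.6, 83.9));
        System.out.printf("ChemSpot 1.6: %.1f%n", f1(77.0, 73.5));
    }
}
```

For ChemSpot 2.0 this gives 2 · 74.6 · 83.9 / 158.5 ≈ 78.98, which rounds to the reported 79.0.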
Links
ChemSpot largely builds upon the work of
- Klinger, R., Kolárik, C., Fluck, J., Hofmann-Apitius, M., and Friedrich, C. M. (2008). Detection of IUPAC and IUPAC-like Chemical Names. Bioinformatics, 24(13), 268-276. In Proceedings of the International Conference Intelligent Systems for Molecular Biology (ISMB).
- BANNER: Leaman, R. and Gonzalez, G. (2008). BANNER: An executable survey of advances in biomedical named entity recognition. In Proceedings of the Pacific Symposium on Biocomputing, volume 13, pages 652-663.
- Jochem: Hettne, K., Stierum, R., Schuemie, M., Hendriksen, P., Schijvenaars, B., Mulligen, E., Kleinjans, J., and Kors, J. (2009). A dictionary to identify small molecules and drugs in free text. Bioinformatics, 25(22).
- OPSIN: Lowe, D.M., Corbett, P.T., Murray-Rust, P. and Glen, R.C. (2011). Chemical Name to Structure: OPSIN, an Open Source Solution. J. Chem. Inf. Model. 51 (3), pages 739–753.
- ABBREV: Ariel Schwartz and Marti Hearst (2003). A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text. In the proceedings of the Pacific Symposium on Biocomputing (PSB 2003) Kauai, Jan 2003.
- brics
Citing
Rocktäschel, T., Weidlich, M., and Leser, U. (2012). ChemSpot: A Hybrid System for Chemical Named Entity Recognition. Bioinformatics 28 (12): 1633-1640.
Acknowledgements
We would like to thank Daniel Lowe and Philippe Thomas for many valuable suggestions.
Development
ChemSpot was developed by Tim Rocktäschel, Torsten Huber and Michael Weidlich.
Contact
For further questions, remarks or bug reports, please contact:
thuber [at] informatik [dot] hu-berlin [dot] de