Methods and Statistics
Named entity recognition
NER is based on several word lists, each representing known instances of an entity class, that is, any type of biologically meaningful object (protein, gene, enzyme, drug). These lists were collected from the following sources:
- cells: 3,034 terms from the MeSH tree "A11"
- compounds: 25,708 terms from KEGG
- diseases: 34,528 terms from the MeSH tree "C"
- drugs: 65,759 terms from MeSH trees "D03-D06" and DrugBank
- enzymes: 28,631 terms from KEGG
- proteins and genes: 710,309 terms from UniProt/SwissProt, fields DE and GN
- species: 694,629 terms from the NCBI taxonomy
- tissues: 1,123 terms from the MeSH tree "A10"
Sometimes, Ali Baba predicts the wrong entity type for a word or a multi-word term. In this case, please use the feedback mode to submit a correction for that occurrence.
Note that our set of drugs also includes many generic compounds, such as alcohol and caffeine, as well as general names such as hormones.
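To illustrate how word lists can drive entity recognition, here is a minimal sketch of a dictionary lookup. It is not Ali Baba's actual implementation: the tiny `LEXICON`, the function names, the whitespace-style tokenization, and the longest-match rule are all assumptions made for illustration.

```python
import re

# Hypothetical miniature word lists; the real lists hold thousands of terms.
LEXICON = {
    "drug": ["caffeine", "aspirin"],
    "disease": ["breast cancer", "asthma"],
    "tissue": ["liver"],
}

def build_index(lexicon):
    """Map each lower-cased term (as a token tuple) to its entity class."""
    index = {}
    for entity_class, terms in lexicon.items():
        for term in terms:
            index[tuple(term.lower().split())] = entity_class
    return index

def tag_entities(text, index):
    """Return (matched text, entity class) pairs, preferring the longest match."""
    tokens = re.findall(r"\w+", text)
    lowered = [t.lower() for t in tokens]
    max_len = max(len(key) for key in index)
    hits, i = [], 0
    while i < len(tokens):
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(lowered[i:i + n])
            if key in index:
                hits.append((" ".join(tokens[i:i + n]), index[key]))
                i += n
                break
        else:
            i += 1  # no term starts at this token
    return hits

index = build_index(LEXICON)
hits = tag_entities("Caffeine intake and breast cancer risk in liver tissue.", index)
for surface, entity_class in hits:
    print(surface, "->", entity_class)
```

A pure lookup like this cannot tell a drug name from an identical common word, which is why the disambiguation step described next is needed.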
Word sense disambiguation
Many words that refer to entities recognized by Ali Baba are ambiguous in their meaning. The name of the drug 'Duration' can also be a common English word, as can the protein 'lamp'. 'Hippocampus' can refer to the brain area or to a seahorse. Currently, Ali Baba disambiguates 304 such words, with an average accuracy of 89.7%. We collected a set of texts for each meaning of each word. On these texts, we trained support vector machine models that help to decide on the meaning of a new occurrence. The corpus we created for training and testing is available here. It consists of names and, for each meaning of a name, a set of example texts.
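The following sketch shows the general idea of training one classifier per ambiguous word. It is a toy linear support vector machine trained by subgradient descent on the hinge loss, written without external libraries. The bag-of-words features, the hyperparameters, and the four example texts for 'lamp' are assumptions for illustration and do not reflect the actual features or corpus used by Ali Baba.

```python
from collections import Counter

def featurize(text):
    """Bag-of-words counts over lower-cased, whitespace-separated tokens."""
    return Counter(text.lower().split())

def train_linear_svm(examples, epochs=50, lam=0.01, lr=0.1):
    """Train a linear SVM by subgradient descent on the regularized hinge loss.

    examples: list of (text, label) pairs with label +1 or -1.
    """
    weights, bias = Counter(), 0.0
    for _ in range(epochs):
        for text, label in examples:
            feats = featurize(text)
            score = sum(weights[w] * c for w, c in feats.items()) + bias
            margin = label * score
            # L2 regularization shrinks every weight a little.
            for w in list(weights):
                weights[w] *= (1 - lr * lam)
            # The hinge loss contributes only when the margin is below 1.
            if margin < 1:
                for w, c in feats.items():
                    weights[w] += lr * label * c
                bias += lr * label
    return weights, bias

def predict(model, text):
    """Return +1 or -1 depending on the side of the separating hyperplane."""
    weights, bias = model
    score = sum(weights[w] * c for w, c in featurize(text).items()) + bias
    return 1 if score >= 0 else -1

# Hypothetical training texts for 'lamp': +1 = protein sense, -1 = common word.
examples = [
    ("lamp protein expression in lysosomal membrane", 1),
    ("lamp gene knockout mice showed lysosomal defects", 1),
    ("the desk lamp was switched on for light", -1),
    ("slit lamp examination of the eye", -1),
]
model = train_linear_svm(examples)
protein_sense = predict(model, "lysosomal lamp expression in mice")
common_sense = predict(model, "a lamp provides light on the desk")
```

In this setup the ambiguous word itself carries almost no weight, because it appears in every training text; the decision rests on the surrounding context words.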
Relation mining
For many relations, Ali Baba searches for simple co-occurrences in the same sentence. For protein-protein interactions and cellular locations of proteins, a more sophisticated, pattern-based strategy is used in addition, which also finds meaningful relations and source/target dependencies. The latter system achieves a precision of 75% at 50% recall, as evaluated on the Spies corpus for protein-protein interactions (see Reference [2]). On the LLL challenge corpus, our system scored best on one sub-task of interaction extraction, with an f-measure above 50% (see Reference [6]). An external evaluation was done on the BioCreAtIvE II IPS corpus (see Reference [4]). Among 16 systems, Ali Baba ranked 4th (f-measure: 21%, best system: 26%) and had the highest recall for identified protein names (69%) among all systems.
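Sentence-level co-occurrence, the simpler of the two strategies above, can be sketched as follows. The naive sentence splitter, the function names, and the example text are assumptions for illustration; the pattern-based strategy for protein-protein interactions is not shown.

```python
import re
from itertools import combinations

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def cooccurrences(text, entities):
    """Return sorted entity pairs that appear together in at least one sentence."""
    pairs = set()
    for sentence in split_sentences(text):
        lowered = sentence.lower()
        # Whole-word, case-insensitive match for each known entity name.
        found = sorted(
            e for e in entities
            if re.search(r"\b" + re.escape(e.lower()) + r"\b", lowered)
        )
        pairs.update(combinations(found, 2))
    return sorted(pairs)

abstract = ("Aspirin reduces the risk of stroke. "
            "Caffeine was not associated with asthma or stroke.")
pairs = cooccurrences(abstract, ["aspirin", "caffeine", "asthma", "stroke"])
for a, b in pairs:
    print(a, "<->", b)
```

Note that the second sentence negates the association, yet plain co-occurrence still reports the pairs. This is one reason a pattern-based strategy yields more meaningful relations.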
Time performance
Ali Baba runs on a Linux server with 2x Intel P4/Xeon CPUs and 8 GB RAM. Currently, Ali Baba parses 100 Medline abstracts in 30-45 seconds, depending on the number of relations. Finding protein-protein interactions using patterns (see above) takes the longest.
Related projects
Several other applications perform tasks similar to Ali Baba:
iHOP
iHOP uses genes and proteins as hyperlinks between PubMed abstracts. It offers access to the underlying literature by means of a network of concurring genes and proteins. Users access the information by searching for gene names. "The network [..] contains half a million sentences and 30,000 different genes from humans, mice, D. melanogaster, C. elegans, zebrafish, Arabidopsis thaliana, yeast and Escherichia coli."
Available at: http://www.ihop-net.org/UniPub/iHOP/
EBIMed
EBIMed provides a quick overview of co-occurrences of a variety of entities: proteins, species, drugs, and Gene Ontology (GO) terms. It searches all PubMed abstracts that fit an arbitrary user query and presents the resulting associations in tabular form.
Available at: http://www.ebi.ac.uk/Rebholz-srv/ebimed/
GoPubMed
GoPubMed searches GO terms in PubMed abstracts and links them to the GO hierarchy, which can then be used to navigate the result set.
Available at: http://www.gopubmed.org/
BioIE
BioIE extracts informative sentences from PubMed results that refer to structure, function, diseases and therapeutic compounds, localization, or familial relationships of biological entities.
Available at: http://umber.sbs.man.ac.uk/dbbrowser/bioie/
Other biological NLP tools
For an exhaustive collection of tools for biological natural language processing in general (ranging from retrieval to relation mining), please see here. Thanks to Martin Krallinger at CNB.