Research Seminar
Research Group Knowledge Management in Bioinformatics
New Developments in Databases and Bioinformatics
- When/where? See the list of talks
The members of the research group use this seminar as a forum for discussion and exchange. Students and guests are cordially invited.
The following talks are scheduled so far:
Date & Location | Topic | Speaker |
---|---|---|
Friday, 18 October 2019, 10:00 c.t., RUD 25, 4.410 | Neural Biomedical Named Entity Normalization | Christopher Schiefer |
Friday, 29 November 2019, 11:00 c.t., RUD 25, 4.410 | A framework for subword-level dictionary-based time series classification | Leonard Clauß |
Friday, 7 February 2020, 10:00 c.t., RUD 25, 4.410 | The Flair Framework and Research Challenges in Natural Language Processing | Alan Akbik |
Friday, 14 February 2020, 10:00 c.t., RUD 25, 4.410 | Haut.App - An ingredient database for skin tolerance prediction | Nijuscha Gruhn |
Thursday, 19 March 2020, 10:00 c.t., RUD 25, 4.410 | Visualization of Characteristic Features for Time Series Classification (bachelor's thesis) | Nicolai Schneider |
Abstracts
Neural Biomedical Named Entity Normalization (Christopher Schiefer)
Methods for automatic information extraction from vast amounts of unstructured text are becoming increasingly necessary due to the rapid growth of the biomedical literature. Identifying biomedical entities in text documents automatically is essential for tasks such as searching for specific entities, providing background information on documents, and linking similar documents. The process of linking a text mention to a specific entity identifier is called named entity normalization (NEN) or entity disambiguation. Previous approaches in the biomedical domain have been based on sets of manual rules, large dictionaries, and pre-defined features that are supposed to capture the knowledge of experts. In other domains, however, deep learning approaches have significantly outperformed these traditional approaches. We present a novel approach that leverages word embeddings as well as the latest improvements in biomedical Named Entity Recognition (NER) for the normalization of genes, and compare it to state-of-the-art baseline approaches.
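To illustrate the general idea of normalizing a mention with embeddings, the following minimal sketch links a mention vector to the entity identifier whose vector is most similar by cosine similarity. This is not the approach presented in the talk: the function names, the identifiers `ID:1` and `ID:2`, and the two-dimensional toy vectors are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def normalize_mention(mention_vec, entity_vecs):
    """Return the identifier whose embedding is closest to the mention embedding.

    entity_vecs maps a (hypothetical) entity identifier to its embedding.
    """
    return max(entity_vecs, key=lambda eid: cosine(mention_vec, entity_vecs[eid]))


# Toy data: identifiers and vectors are made up for this example.
entities = {"ID:1": [1.0, 0.0], "ID:2": [0.0, 1.0]}
best = normalize_mention([0.9, 0.1], entities)
```

In practice the vectors would come from trained word embeddings and the candidate set from a gene database; both are outside the scope of this sketch.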
A framework for subword-level dictionary-based time series classification (Leonard Clauß)
The problem of time series classification appears in many important applications, for example detecting myocardial infarctions. The current state-of-the-art classifier WEASEL uses a dictionary-based approach. Specifically, the algorithm predicts the class of a time series by sliding windows of different sizes over it, discretizing each subsequence to a word and classifying based on the number of occurrences of these words. However, it does not consider subwords, i.e., subsequences of these words.
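The dictionary-based idea described above (slide windows over the series, turn each window into a word, count the words) can be sketched as follows. This is a simplified illustration, not WEASEL itself: the discretization here bins segment means into equal-width intervals, which is an assumption chosen for brevity, and the names `to_word`, `bag_of_words`, and `ALPHABET` are hypothetical.

```python
from collections import Counter

ALPHABET = "abc"  # assumed alphabet; one symbol per value bin


def to_word(window, n_segments, lo, hi):
    """Discretize a window: one symbol per segment, chosen by the segment mean."""
    seg_len = len(window) // n_segments  # assumes len(window) >= n_segments
    word = []
    for s in range(n_segments):
        segment = window[s * seg_len:(s + 1) * seg_len]
        mean = sum(segment) / len(segment)
        # Equal-width binning of the mean into len(ALPHABET) bins over [lo, hi].
        pos = (mean - lo) / (hi - lo) if hi > lo else 0.0
        idx = min(int(pos * len(ALPHABET)), len(ALPHABET) - 1)
        word.append(ALPHABET[idx])
    return "".join(word)


def bag_of_words(series, window_sizes=(4,), n_segments=2):
    """Count the words produced by sliding windows of several sizes over a series."""
    lo, hi = min(series), max(series)
    counts = Counter()
    for w in window_sizes:
        for start in range(len(series) - w + 1):
            word = to_word(series[start:start + w], n_segments, lo, hi)
            counts[(w, word)] += 1  # keep the window size as part of the feature
    return counts


bag = bag_of_words([0, 0, 0, 0, 9, 9, 9, 9])
```

The resulting counts form the feature vector on which a classifier would then be trained.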
In this work, we evaluated different methods to select discriminative subwords: counting character n-grams, finding long consecutive subsequences using Byte Pair Encoding, and finding subwords with gaps using Apriori. To analyze their impact on classification, we integrated them into the WEASEL pipeline and ran them on the UCR archive with 85 benchmark datasets. Our results show that, compared to WEASEL, the classification accuracy does not differ significantly for any approach. Due to the increased number of features, runtime and memory consumption both increase significantly. Thus, the examined methods do not provide a benefit for dictionary-based time series classification.
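As an illustration of the simplest of these methods, the sketch below counts character n-grams over a bag of words. It only shows how subword features can be derived from word counts; the function names are hypothetical, and the selection of discriminative subwords and the integration into WEASEL are not covered.

```python
from collections import Counter


def char_ngrams(word, n):
    """All consecutive substrings of length n in a word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def subword_counts(word_counts, n=2):
    """Derive n-gram counts from word counts: each n-gram inherits the word's frequency."""
    counts = Counter()
    for word, freq in word_counts.items():
        for gram in char_ngrams(word, n):
            counts[gram] += freq
    return counts


features = subword_counts({"abca": 2, "bca": 1}, n=2)
```

Adding such counts to the word counts enlarges the feature space, which is consistent with the increased runtime and memory consumption reported above.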
The Flair Framework and Research Challenges in Natural Language Processing (Alan Akbik)
Flair NLP is a widely used open source framework for experimenting with different word embeddings in downstream NLP tasks such as named entity recognition (NER), text classification and similarity learning. In this talk, I give an overview of the framework with a focus on current research challenges. In particular, I present the idea of NLP models that never stop learning: such models can acquire new knowledge even at prediction time (i.e. after the training phase is completed) and so continuously improve. Applying this approach to NER, I show how we reach new state-of-the-art results across a range of evaluation tasks. I will also give a live demo of such a model in action to illustrate how it continues to learn during prediction. Time permitting, we will also discuss some general research ideas and open questions for future work.
Haut.App - An ingredient database for skin tolerance prediction (Nijuscha Gruhn)
What if you could identify a product that helps care for your skin without doing harm through ingredients you do not tolerate? People suffering from skin diseases should care for their skin diligently, because the state of the skin is correlated with the severity of the acute phase of their illness. We are building an ingredient database for individual skin tolerance prediction of personal care products. We draw data from authorities, databases and literature and combine it with personal data. We have integrated about 50,000 ingredient names and a basic rating for the substances listed with the European authorities, and are currently working on adding more product data. Our minimal functional prototype identifies ingredients in personal care products that you added to a watch-list or that were found in products you did not tolerate in the past. Further development will enrich the information about the ingredients by analyzing abstracts and full-text articles in the medical literature in order to refine the tolerance prediction.
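The prototype behaviour described above can be pictured as a simple set lookup. The sketch below is an assumption about how such matching might work, not Haut.App's implementation: the function name `flag_ingredients`, the case-insensitive name matching, and the example ingredient lists are all illustrative.

```python
def flag_ingredients(product, watch_list, untolerated_products):
    """Return the ingredients of a product that are on the watch-list
    or occurred in a product that was not tolerated in the past."""

    def norm(name):
        # Assumed normalization: ignore case and surrounding whitespace.
        return name.strip().lower()

    suspicious = {norm(name) for name in watch_list}
    for ingredients in untolerated_products:
        suspicious.update(norm(name) for name in ingredients)
    # Keep the product's own spelling and order in the result.
    return [name for name in product if norm(name) in suspicious]


# Illustrative data only.
flags = flag_ingredients(
    product=["Aqua", "Glycerin", "Parfum", "Limonene"],
    watch_list=["parfum"],
    untolerated_products=[["Aqua ", "Limonene"]],
)
```

Note that this naive rule also flags harmless ingredients shared with an untolerated product (here `Aqua`), which is one reason a more specific tolerance prediction is needed.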
Visualization of Characteristic Features for Time Series Classification (Nicolai Schneider)
The aim of this defense is to take a closer look at the topic of the bachelor's thesis “Visualization of Characteristic Features for Time Series Classification” and to analyze the results obtained.
The presentation will give an introduction to time series classification and to various visualization techniques. The defense will focus on the concept and the implementation of the resulting framework TECI. The final evaluation of the results of the bachelor's thesis and of the TECI software will be carried out using live examples, based on test datasets from the UCR Time Series Classification Archive.
Contact: Patrick Schäfer; patrick.schaefer(at)hu-berlin.de