Research Seminar
Knowledge Management in Bioinformatics research group
New Developments in Databases and Bioinformatics
- When/where? See the list of talks
The members of the research group use this seminar as a forum for discussion and exchange. Students and guests are cordially invited.
The following talks are scheduled so far:
Date & Place | Topic | Speaker |
---|---|---|
Friday, 28.05.21, 10 s.t. (online) | Architecture Concepts for Data Management in Data Lakes | Corinna Gabler (IPVS Stuttgart) |
Friday, 18.06.21, 14:30 s.t. (online) | Latent Motif Discovery using Maximum Clique algorithms (Occasion: student project) | Leonard Clauß |
Friday, 06.08.21, 10 s.t. (online) | Modern Multidimensional Main-Memory Index Structures | Quentin Kniep |
Friday, 13.08.21, 10 am (online) | Clinical classification of human neoplasms based on a transcriptomic deconvolution model trained on single-cell RNA-sequencing samples from healthy donors (Occasion: final presentation, research internship) | Melanie Fattohi |
Friday, 20.08.21, 10 am (online) | Lessons Learned from the Time Series Anomaly Detection Challenge | Arik Ermshaus |
Friday, 17.09.21, 10 am (online) | Machine learning from materials similarity | Martin Kuban |
Abstracts
Architecture Concepts for Data Management in Data Lakes (Corinna Gabler (IPVS Stuttgart))
Latent Motif Discovery using Maximum Clique algorithms
Occasion: student project (Leonard Clauß)
A time series is a sequence of real-valued numbers ordered in time.
Latent motif discovery is the problem of finding frequently occurring
patterns in time series, where the pattern does not need to occur
exactly. This problem finds application in many domains, such as
medicine and robotics. Our definition of the top latent motif in a time
series is the largest set of subsequences that are pairwise similar and
non-overlapping. In the literature, there exists no exact method that solves
this problem. Thus, we propose a novel algorithm named CliqueMotif. It
first creates the so-called distance graph that contains a node for each
subsequence of the given length and an edge between two nodes if their
respective subsequences are within a specified radius. Then, the maximum
clique is found, which corresponds to the top latent motif. Our
evaluation shows that the algorithm performs well on problem instances
with short time series and low motif radii but does not scale well.
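
The two steps described above (build the distance graph, then find a maximum clique) can be illustrated with a minimal sketch. It is not the CliqueMotif implementation from the talk. The function names are hypothetical, plain Euclidean distance stands in for whatever distance measure the authors use, and the non-overlap requirement is encoded by only connecting subsequences whose start positions are at least one window length apart, which is an assumption about how the graph is built. The clique search is a generic Bron-Kerbosch search with pivoting; it is exact but exponential in the worst case, in line with the scaling limits mentioned in the abstract.

```python
from math import dist


def distance_graph(series, length, radius):
    """Build adjacency sets; node i stands for series[i:i + length]."""
    n = len(series) - length + 1
    subs = [series[i:i + length] for i in range(n)]
    adj = {i: set() for i in range(n)}
    for i in range(n):
        # Start at i + length so that connected subsequences never overlap
        # (assumption: overlap is excluded at graph-construction time).
        for j in range(i + length, n):
            # Assumption: plain Euclidean distance between subsequences.
            if dist(subs[i], subs[j]) <= radius:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def maximum_clique(adj):
    """Exact maximum clique via Bron-Kerbosch with pivoting."""
    best = []

    def expand(clique, candidates, excluded):
        nonlocal best
        if not candidates and not excluded:
            # The clique is maximal; keep it if it is the largest so far.
            if len(clique) > len(best):
                best = sorted(clique)
            return
        if len(clique) + len(candidates) <= len(best):
            return  # this branch cannot beat the best clique found so far
        pivot = max(candidates | excluded,
                    key=lambda v: len(adj[v] & candidates))
        for v in list(candidates - adj[pivot]):
            expand(clique | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adj), set())
    return best


def clique_motif(series, length, radius):
    """Return start indices of a largest set of pairwise similar,
    non-overlapping subsequences (hypothetical helper name)."""
    return maximum_clique(distance_graph(series, length, radius))


# Toy series in which the pattern [0, 1, 0] occurs three times.
series = [0, 1, 0, 5, 5, 0, 1, 0, 9, 9, 0, 1, 0]
top_motif = clique_motif(series, length=3, radius=0.1)
print(top_motif)  # → [0, 5, 10]
```

The distance graph has one node per subsequence, so even this small sketch needs a quadratic number of distance computations before the clique search starts.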
Clinical classification of human neoplasms based on a transcriptomic deconvolution model trained on single-cell RNA-sequencing samples from healthy donors (Melanie Fattohi)
Pancreatic neuroendocrine neoplasms (panNENs) are a rare type of cancer that presents heterogeneously in patients. Since insufficient data is available for research on all subtypes of panNENs, clinical characterization of neoplastic samples by means of machine learning (ML) is hindered. The current gold standard for classifying panNENs is the staining level of the Ki-67 protein. However, as the grading system in use lacks clarity, [Otto et al., 2021] developed a data augmentation strategy and a deconvolution-based ML approach to support the gold standard in clinically characterizing panNENs.
In this research internship, we reproduced the study of [Otto et al., 2021], with the difference that we used the new deconvolution method SCDC by [Dong et al., 2020]. We performed transcriptomic deconvolution of panNEN bulk RNA-sequencing (RNA-seq) samples based on single-cell RNA-seq data of healthy pancreatic tissue, thereby addressing the lack of panNEN data. Moreover, we trained an ML model on the predicted cell type proportions to classify panNEN samples.
We found that the predicted ductal cell type proportions correlated statistically significantly with both the grading levels of the panNEN bulk RNA-seq samples and the measured MKi-67 expression levels. Furthermore, the predictive performance of the ML models trained on predicted cell type proportions was comparable to that of an ML model trained on measured MKi-67 expression levels. The predicted ductal cell type proportions were among the most informative features of the trained ML models, even though ductal cells are generally not considered a possible cell type of origin for endocrine cancer.
The findings of this research internship show that cell type proportions of panNENs predicted via deconvolution based on healthy pancreatic single cells complement the gold standard in clinically classifying panNEN samples. Thus, the data augmentation strategy and ML framework developed by [Otto et al., 2021], as well as their biologically relevant findings, could be reproduced. This is important because, to the best of our knowledge, no other published research has replicated these findings on panNENs so far.
Machine learning from materials similarity (Martin Kuban)
The recent development of large public databases has paved the way for data-driven analysis in materials science. In this talk, I will give a brief introduction to the challenges that are specific to materials, present different data sources, and show how access to those sources can be simplified using a framework for analysing materials data. Finally, I will showcase applications of similarity measures for materials to unsupervised and supervised machine learning tasks.
Contact: Patrick Schäfer; patrick.schaefer(at)hu-berlin.de