Research Seminar

Research Group Knowledge Management in Bioinformatics (Wissensmanagement in der Bioinformatik)

New Developments in Databases and Bioinformatics

When/where? See the list of talks

The members of the research group use this seminar as a forum for discussion and exchange. Students and guests are cordially invited.

The following talks are currently scheduled:

Date & Location	Topic	Speaker
Friday, 23.10.2020, 10:00 c.t., online	Benchmarking State-Of-The-Art Time Series Motif Discovery Algorithms (Master Thesis)	Rafael Moczalla
Friday, 23.10.2020, 11:00 c.t., online	Interpreting Decisions of Deep Neural Networks to Identify Binding Preferences of Transcription Factors	Sebastian Proft
Wednesday, 28.10.2020, 11:00 c.t., online	Extracting Pathways from Text by Generating Graphs	Leon Weber
Friday, 13.11.2020, 11:00 c.t., online	Exploring Classification Score Profiles for Change Point Detection	Arik Ermshaus
Friday, 20.11.2020, 13:00 c.t., online	Classifying Astronomical Time Series from the Zwicky Transient Facility (ZTF) Survey	Nicolas Miranda
Friday, 04.12.2020, 11:00 c.t., online	Empirical comparison of support vector regression and random forest regression in content-based filtering	Evelyn Ens
Wednesday, 09.12.2020, 13:30, online	Computational deconvolution for patient stratification in the context of non-small cell lung cancer	Melanie Fattohi
Friday, 05.02.2021, 11:00 s.t., online	Transcriptomic deconvolution of neuroendocrine neoplasms predicts clinically relevant characteristics	Raik Otto

Abstracts

Benchmarking State-Of-The-Art Time Series Motif Discovery Algorithms (Rafael Moczalla)

Motif discovery, i.e. the search for very similar repetitive patterns in data, has become very important for the analysis of large amounts of data in recent years, for example for recognizing unknown DNA sequences that can be assigned to a biological function or for recognizing specific brain states in EEG data. Many state-of-the-art algorithms for motif discovery have been established to date. Generating data with ground-truth annotated motifs, however, is not a trivial task. In this presentation we propose a generator that produces such data, and we present a benchmark consisting of data with ground-truth annotated motifs. Finally, we use this benchmark for an objective comparison of the runtimes and accuracies of the state-of-the-art motif discovery algorithms MK, SCRIMP, Scan MK, Cluster MK Set Finder, GrammarViz, EMMA and Learn Motifs.
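To make the idea of ground-truth annotated data concrete, the following is a minimal sketch, not the generator proposed in the talk. It assumes a random-walk background and a sine-shaped motif; the function name `generate_series_with_motifs` and all parameters are hypothetical and chosen for illustration only.

```python
import math
import random


def generate_series_with_motifs(length, motif_length, n_occurrences, noise=0.05, seed=0):
    """Return (series, positions): a random walk with planted motif instances.

    Hypothetical helper for illustration; the shapes and parameters are
    assumptions, not the generator from the talk.
    """
    rng = random.Random(seed)

    # Background signal: a simple Gaussian random walk.
    series = [0.0]
    for _ in range(length - 1):
        series.append(series[-1] + rng.gauss(0.0, 1.0))

    # Motif template: one period of a sine wave.
    template = [5.0 * math.sin(2.0 * math.pi * i / motif_length)
                for i in range(motif_length)]

    # Pick distinct slots of width motif_length so instances never overlap.
    n_slots = length // motif_length
    slots = sorted(rng.sample(range(n_slots), n_occurrences))
    positions = [slot * motif_length for slot in slots]

    # Overwrite the background with slightly noisy copies of the template.
    for start in positions:
        for i, value in enumerate(template):
            series[start + i] = value + rng.gauss(0.0, noise)

    # The start positions are the ground-truth annotation.
    return series, positions


series, positions = generate_series_with_motifs(200, 20, 3, seed=1)
```

Because the start positions are returned together with the series, the output of any motif discovery algorithm can be scored against them.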

Interpreting Decisions of Deep Neural Networks to Identify Binding Preferences of Transcription Factors (Sebastian Proft)

Experiments that can identify transcription factor binding sites are costly, time-intensive, and tissue-specific. A way to identify these binding sites without running a sequencing experiment for every single combination of cell type and transcription factor is highly sought after. Several machine learning methods try to tackle this problem, but most rely on extensive feature selection. Artificial neural networks, in contrast, can use sequencing data directly. They can employ a method known as convolution to learn the short DNA segments that contain these binding sites. Such convolutional neural networks have been shown to classify sequences containing transcription factor binding sites successfully, but how they do so is still not well understood. Extracting the information from these “black boxes” is difficult, as usually only the output is observed and used to determine their performance. In this work, we train convolutional neural networks on increasingly difficult datasets and apply methods such as maximum activation, input optimization, and layer-wise relevance propagation (LRP). These methods have already been applied to neural networks used in computer vision and allow us to better understand which parts of the input are most important for the decision-making process. We will apply them to DNA sequences to extract the DNA segments that correspond to the relevant transcription factor binding sites.
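As a rough illustration of how a convolutional filter responds to a short DNA segment, the sketch below one-hot encodes a sequence, slides a single filter over it, and reports the position of maximum activation. It is an assumption-laden toy: the filter is set by hand to the pattern "TATA" instead of being learned, and the helper names `one_hot`, `convolve` and `max_activation` are hypothetical.

```python
BASES = "ACGT"


def one_hot(sequence):
    """Encode a DNA string as a list of 4-element indicator vectors."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in sequence]


def convolve(encoded, kernel):
    """Slide the kernel over the encoded sequence (no padding, stride 1)."""
    width = len(kernel)
    return [
        sum(encoded[i + j][c] * kernel[j][c]
            for j in range(width) for c in range(len(BASES)))
        for i in range(len(encoded) - width + 1)
    ]


def max_activation(sequence, kernel):
    """Return the start index and segment where the filter responds most."""
    activations = convolve(one_hot(sequence), kernel)
    best = max(range(len(activations)), key=activations.__getitem__)
    return best, sequence[best:best + len(kernel)]


# Hand-set filter standing in for a learned one (assumption for illustration).
kernel = one_hot("TATA")
position, segment = max_activation("GGCGTATAGCC", kernel)
# position == 4, segment == "TATA"
```

In a trained network the filter weights are real-valued and learned from data; inspecting the segments that maximally activate a filter is one simple way to read out what it has learned.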

Extracting Pathways from Text by Generating Graphs (Leon Weber)

Biological pathways consist of multiple biochemical reactions that frequently interact in a complex manner. However, existing techniques for extracting pathway information from the literature either reduce the complexity by modelling all pathways as binary interactions between participants or require richly annotated gold-standard corpora for complex event structures, which are scarce. We present a novel approach to pathway extraction, based on generative graph models, that requires only weakly labeled text and can still capture the full complexity of biochemical pathways.

Exploring Classification Score Profiles for Change Point Detection (Arik Ermshaus)

In recent years, the amount of unlabelled sensor data has grown significantly due to the increase in computational power and the omnipresence of sensors, for example in smart devices. The literature, however, contains a great selection of time series classification algorithms, which in turn require labelled datasets for training. In this talk, we explore how supervised learning can assist in solving unsupervised time series problems. We propose a novel self-supervised methodology that identifies self-similar time series regions by attaching labels to the left and right regions of hypothetical split points and evaluating binary classification problems to create a classification score profile. This profile illustrates to which degree a time series can be split into self-similar regions at the split points. We explore classification score profiles for single change point detection, assess our framework on a benchmark dataset and compare it to rival methods.
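The following is a simplified sketch of the split-point idea, not the method presented in the talk. It assumes non-overlapping windows, a leave-one-out 1-nearest-neighbour classifier and balanced accuracy as the score; all of these choices and the function names are assumptions made for illustration.

```python
import random


def nearest_neighbours(windows):
    """For each window, the index of the closest other window."""
    result = []
    for i, w in enumerate(windows):
        best, best_dist = None, float("inf")
        for j, v in enumerate(windows):
            if i == j:
                continue
            dist = sum((a - b) ** 2 for a, b in zip(w, v))  # squared Euclidean
            if dist < best_dist:
                best, best_dist = j, dist
        result.append(best)
    return result


def classification_score_profile(series, window_size):
    """Map each candidate split (in time steps) to a balanced accuracy.

    Windows left of the split get label 0, windows right of it label 1.
    A window is classified correctly if its nearest neighbour carries the
    same label (leave-one-out 1-NN).
    """
    starts = range(0, len(series) - window_size + 1, window_size)
    windows = [series[s:s + window_size] for s in starts]
    nn = nearest_neighbours(windows)
    n = len(windows)
    profile = {}
    for split in range(1, n):
        left_hits = sum(1 for i in range(split) if nn[i] < split)
        right_hits = sum(1 for i in range(split, n) if nn[i] >= split)
        score = 0.5 * (left_hits / split + right_hits / (n - split))
        profile[split * window_size] = score
    return profile


# Toy series: the mean jumps from 0 to 10 at time step 100.
rng = random.Random(0)
series = ([rng.gauss(0.0, 1.0) for _ in range(100)]
          + [rng.gauss(10.0, 1.0) for _ in range(100)])
profile = classification_score_profile(series, window_size=10)
best_split = max(profile, key=profile.get)
```

In this toy series the profile reaches its maximum at the true boundary, because there every window's nearest neighbour lies on its own side of the split. Splits near the edges score poorly, since the few windows in the small class have their nearest neighbours on the other side.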

Classifying Astronomical Time Series from the Zwicky Transient Facility (ZTF) Survey (Nicolas Miranda)

A new generation of astronomical observational projects holds the potential to revolutionize our understanding of both the origin of the Universe and the violent events it contains. Surveys that cover larger areas of the sky and revisit them with much higher regularity than previous ones will shed light on the variability and physical models of transient phenomena. The Zwicky Transient Facility (ZTF) is currently the most ambitious such project, but it is also a stepping stone for future projects with even higher observing capabilities.

In this work we aim to aid the detection and characterization of astronomical sources whose behavior varies over time. One of the key scientific use cases is the early classification of astronomical time series. Using interpolation methods, statistical features and Machine Learning models such as Boosted Trees and Recurrent Neural Networks, we search for ways to identify relevant transient and variable objects with high confidence and low false positive rates from only a few measurements. This allows for an efficient follow-up of astronomical objects that offer a high scientific value to the community. Challenges for classification are small, imbalanced training sets and irregular time series sampling rates.

Empirical comparison of support vector regression and random forest regression in content-based filtering (Evelyn Ens)

The thesis evaluates a given data set of historical data and compares two regression methods in the context of a content-based filtering recommendation system: random forest regression and support vector regression. The two are measured against each other with the regression error metrics mean absolute error (MAE) and root mean squared error (RMSE).
The thesis aims to determine whether the results differ when measured with these metrics. The data is processed and prepared in the context of content-based filtering. The experiment shows whether one regression or the other is better suited for building a recommender system with the given dataset. The data is analyzed with respect to aspects such as brand loyalty and the seasonality of purchases of products and product groups. The data is then used as input for the random forest regression and the support vector regression. A random customer is chosen from the dataset, and the regression is performed on the historical data. The target value is customized based on the preceding data analysis: it is composed of the amount of a product bought by the selected customer, adjusted by a factor F based on the season in which the recommendations are generated. The features are selected, the regression is performed with both methods, and the outputs are compared.
The differences between the two regressions in MAE and RMSE are minimal. The MAE of the random forest regression is 0.3408, while that of the support vector regression is 0.3184. The RMSE is 0.5885 and 0.5875, respectively. Both regressions perform poorly on the test set, with high error rates. The respective differences between the RMSE and the MAE can give clues as to the magnitude of the errors. Since the RMSE gives larger errors more weight, MAE < RMSE means that some generalization errors are immense and some are smaller; however, the greater share of the errors is > 1. The vast majority of predictions are therefore far off the actual target value, which causes the generalization errors to result in MAE < RMSE.
In conclusion, there is no significant difference between the results of the random forest regression and the support vector regression in the given experimental setup. In general, the errors of both models do not differ much, and the predictions of both models are almost identical. Future experiments could use a different setup, consider all customers rather than only one, and select the features differently.
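The relationship between the two error metrics can be illustrated with a small worked example. The sketch below uses made-up numbers, not data from the thesis, and computes the mean absolute error and the root mean squared error (commonly abbreviated RMSE) for two sets of predictions with the same MAE.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute deviations."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the average squared deviation."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


# Illustrative values only (assumption, not thesis data).
y_true = [0.0, 0.0, 0.0, 0.0]
evenly_wrong = [0.5, 0.5, 0.5, 0.5]   # four equal errors of 0.5
one_outlier = [0.0, 0.0, 0.0, 2.0]    # a single error of 2.0

mean_absolute_error(y_true, evenly_wrong)      # 0.5
root_mean_squared_error(y_true, evenly_wrong)  # 0.5
mean_absolute_error(y_true, one_outlier)       # 0.5
root_mean_squared_error(y_true, one_outlier)   # 1.0
```

Both prediction sets have the same MAE, but the RMSE doubles when the error is concentrated in a single large deviation. The two metrics coincide only when all absolute errors are equal; otherwise the RMSE is strictly larger.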

Computational deconvolution for patient stratification in the context of non-small cell lung cancer (Melanie Fattohi)

Tumor-infiltrating immune cells are known to affect tumor progression and are associated with patient outcome. Identifying the underlying immune cell (sub)types and estimating their proportions in the heterogeneous tissue of a tumor is therefore essential for developing personalized therapy strategies that optimize treatment outcome and for estimating patient survival.
Deconvolution is a method for the identification and quantification of immune cells from bulk RNA-seq data of heterogeneous tumor tissue.
In this short presentation, factors affecting the deconvolution efficiency as well as different existing deconvolution methods are presented. Furthermore, an idea to potentially improve deconvolution efficiency in the context of quantifying tumor-infiltrating immune cells is shown in addition to a possible further use of the estimated cell type proportions.
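A common way to formalize deconvolution is to model the bulk expression vector as a non-negative mixture of cell-type signature profiles and to solve for the mixture weights. The sketch below is a generic illustration under that assumption, not one of the methods discussed in the talk. It uses a simple multiplicative-update scheme for non-negative least squares, and the signature matrix, the proportions and the function name `deconvolve` are invented for the example.

```python
def deconvolve(signature, bulk, n_iter=200):
    """Estimate proportions p >= 0 such that signature * p approximates bulk.

    signature: list of rows (genes x cell types), non-negative entries.
    bulk: list of bulk expression values, one per gene.
    Uses multiplicative updates for non-negative least squares, then
    rescales the result to sum to 1.
    """
    n_genes = len(signature)
    n_types = len(signature[0])

    # Precompute S^T b and S^T S.
    stb = [sum(signature[g][k] * bulk[g] for g in range(n_genes))
           for k in range(n_types)]
    sts = [[sum(signature[g][k] * signature[g][m] for g in range(n_genes))
            for m in range(n_types)]
           for k in range(n_types)]

    p = [1.0 / n_types] * n_types
    for _ in range(n_iter):
        denom = [sum(sts[k][m] * p[m] for m in range(n_types))
                 for k in range(n_types)]
        p = [p[k] * stb[k] / denom[k] if denom[k] > 0 else 0.0
             for k in range(n_types)]

    total = sum(p)
    return [x / total for x in p]


# Invented toy data: 4 marker genes, 2 cell types, true mix 30% / 70%.
signature = [[10.0, 1.0], [8.0, 0.0], [1.0, 9.0], [0.0, 7.0]]
true_proportions = [0.3, 0.7]
bulk = [sum(s * p for s, p in zip(row, true_proportions)) for row in signature]
estimate = deconvolve(signature, bulk)
```

With noise-free toy data the estimate recovers the planted proportions up to numerical precision. Real bulk RNA-seq data is noisy, and the quality of the signature matrix strongly influences the result.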

Transcriptomic deconvolution of neuroendocrine neoplasms predicts clinically relevant characteristics (Raik Otto)

Therapeutic decisions in oncology depend on the precise pathological characterization of individual neoplasms. Consequently, the Machine-Learning aided classification of neoplasms is a focus of current research. However, the comprehensive training of Machine-Learning models requires sufficiently large amounts of training data, which are frequently not available for cancer types that are both rare and diverse. Pancreatic neuroendocrine neoplasms (panNENs) are rare and remarkably diverse with respect to clinical course and patient prognosis. Complementary support of clinicians with Machine-Learning models is therefore indicated for panNENs but difficult to achieve due to the scarcity of training data.

We report on a novel data-augmentation technique for supporting the clinical characterization of panNENs via a specific substitution of neoplastic training data with data of healthy origin. We apply a transcriptomic deconvolution algorithm trained on healthy samples to predict neoplastic cell-type proportions along with an interpretable reconstruction error. The output of the deconvolution is subsequently utilized as training data for Machine-Learning models, which in turn predict clinical characteristics of panNENs.

Benchmarks revealed that deconvolution-trained models efficiently predict neoplastic grading and disease-related patient survival and can differentiate between neuroendocrine tumor and carcinoma subtypes. Our deconvolution-derived model achieves the same prediction accuracy as a baseline model trained on neoplastic expression data and the Ki-67 gold-standard biomarker classified panNENs.

Our approach supports the clinical characterization of rare and diverse cancer types through a new Machine-Learning model based on data augmentation, yielding clinically interpretable results. These are important steps towards the application of Machine-Learning based data analysis to panNENs and to rare cancer types in general.

Contact: Patrick Schäfer; patrick.schaefer(at)hu-berlin.de