Research Seminar WBI - DBIS
Databases and Information Systems Group | Knowledge Management in Bioinformatics Group
New Developments in Databases and Bioinformatics
- When? Mondays, 15:00 - 17:00 c.t.
- Where? RUD 26, 0'313
Members of the research group use this seminar as a forum for discussion and exchange. Students and guests are cordially invited.
The following dates and talks are scheduled so far:
Abstracts
EquatorNLP: Pattern-based Information Extraction for Disaster Response (Lars Döhling)
One of the most severe problems in early phases of disaster response is the lack of information about the current situation. Such information is indispensable for planning and monitoring rescue operations, but hardly available due to the breakdown of information channels and normal message routes. However, during recent disasters in developed countries, such as the flooding of New Orleans or the earthquake in New Zealand, a wealth of detailed information was posted by affected persons on media such as Flickr, Twitter, or personal blogs. Finding and extracting such information may provide valuable clues for organizing aid, but currently requires humans to constantly read and analyze these messages. In this work, we report on a study for extracting such facts automatically by using a combination of deep natural language processing and advanced machine learning. Specifically, we present an approach that learns patterns in dependency representations of sentences to find textually described facts about human fatalities. Our method achieves an F1 measure of 66.7% on a manually annotated corpus of 109 news articles about earthquake effects, demonstrating the general efficacy of our approach.
Challenges in Automatic Diagnosis Extraction from Medical Examination Summaries (Johannes Starlinger)
The automatic extraction of diagnoses from clinical discharge summaries is a field of intensive research. Other types of documents, both in clinical contexts and on the medical web, have not received as much attention so far. Using the concrete example of automatic ICD10-coding of diagnoses in medical examination summaries, we identify the challenges of information extraction from highly specialty-specific short text documents, outline an approach to handle these challenges, and discuss the relevance of our findings in the context of the medical web.
Inference of Inter-Ontology Links (Philipp Hussels)
Entries in biomolecular databases are often annotated with terms from two or more ontologies and thereby establish links between pairs of orthogonal concepts. While many of these links are arbitrary, some may represent meaningful inter-ontology relationships. Given a set of similar database entries - for example a set of Entrez Gene entries identified in a microarray experiment - one would expect that inter-ontology concept pairs that are related in a meaningful manner are linked more frequently than others. However, link frequency also depends on the specificity of concepts. Obviously, specific concepts are less frequently used in annotations and therefore less frequently participate in inter-ontology links than general concepts. Furthermore, slightly different pairs of highly specific concepts - for example pairs composed of sibling concepts in the respective ontologies - may represent the same semantic relationship. The goal of this thesis was to develop a methodology that uses the information encoded in intra-ontology structures to overcome this issue and, given a set of linked concept pairs, finds those that are most likely related in a meaningful way.
The Quality of Protein-Protein Interactions Computationally Extracted from Literature and the Impact on PPI-Applications (Sebastian Arzt)
Information on protein-protein interactions (PPI) plays a crucial role for studies in the field of Systems Biology, as most cellular mechanisms are executed by interacting proteins. Therefore, exploring the set of all interactions that can take place within a cell, known as the interactome, is essential to understand biological processes within organisms. This knowledge is, for instance, fundamental to reveal mechanisms behind diseases. In this work we review the state of the art of Relation Extraction (RE) for extracting protein-protein interactions and discuss how far the quality requirements are already met. To this end, we extracted protein-protein interactions from scientific literature using recent tools. To estimate the extraction quality, we compared the extracted interactions with data sets from manually curated databases. Subsequently, we analyzed how parameters of the relation extraction (i.e., confidence score of the classifier, number of sentences supporting the relation, full text or abstract corpus) and characteristics of the pathways (i.e., size, species) influence the extraction quality. Beyond that, we examined whether computational applications using protein-protein interactions will benefit from the extracted data in practice. We focused on two simple protein function prediction methods, which rely on interaction data.
Optimization of Information Extraction Tasks in Stratosphere (Astrid Rheinländer)
Large scale analytical text processing is important for many real-world scenarios. In drug development, for instance, it is extremely helpful to gather as much information as possible on the drug itself and on other, structurally similar drugs. Such information is contained in various large text collections like patent or scientific publication databases. As a part of the StratoSphere project, we investigate query-based analysis of large quantities of unstructured text and therefore develop a library of information extraction (IE) operators. Our extraction operators are configurable to embrace different IE strategies, either geared towards high throughput, high precision, or high recall. In this talk, we give an overview of the current status of our project and highlight potentials and strategies for optimizing complex IE plans.
Similarity Measures for Scientific Workflows (Johannes Starlinger)
In recent years, scientific workflows have been gaining attention as a tool for scientists to create reproducible in-silico experiments. For design and execution of such workflows, scientific workflow management systems (SWFM) have been developed. They enable the user to visually create pipelines of tasks for data extraction, processing, and analysis, including both local scripts and, especially, web-service calls. To facilitate sharing, reuse and repurposing of workflows, online workflow repositories are emerging. Such repositories, together with the increasing number of workflows uploaded to them, raise several new research questions. My PhD thesis focuses on the question of how to best enable both manual and automatic discovery of the workflows in a repository. More specifically, I study similarity measures for scientific workflows to allow their comparison and clustering, and eventually, the recommendation of workflows to the user. In this talk I will present the current status of my work and provide an outlook on planned future research.
Text mining support for metabolic network reconstruction (Michael Weidlich)
Metabolic and signaling networks representing complex physiological processes play an essential role in systems biology and drug research. The reconstruction and curation of these huge networks, usually directly from scientific publications, is a tedious and time-consuming task. Text mining can accelerate this process by providing an automatic identification of network components like chemicals, proteins, their interactions and related meta-information in natural language texts. Using the example of 'HepatoNet1', a liver-specific metabolic network developed by our scientific partners at the Charité, I will outline our experiences in developing a tailored infrastructure. Moreover, I will present a preliminary evaluation on 'automatic-network-reconstruction' from PubMed abstracts and provide an outlook on future research.
Regulation of alternative splicing in Lymphoma (Karin Zimmermann)
Alternative Splicing (AS) is well known to contribute decisively to the variety of the human transcriptome. While several transcripts are known for a vast majority (up to 94%) of genes, the regulation of alternative splicing is still poorly understood. The importance of understanding this complex mechanism is highlighted by the fact that about one third of the known splicing events are involved in cancer. Splicing factors (SF) play a central role in the regulation of AS. By elucidating the relations between SFs, alternatively spliced exons or even different isoforms, we aim at revealing dependencies between the latter and thereby understanding the similarities as well as the differences in aberrant splicing in several different lymphoma subtypes.
Network-based inference of protein function and disease-gene association (Samira Jaeger)
Protein interaction networks are crucial to many aspects of cellular function. On the one hand, they present direct and robust manifestations of functional relationships. On the other hand, alterations in protein interactions perturb natural cellular processes and contribute to many diseases. Both correlations, the functional and the pathological one, have been exploited in this work to infer novel protein function for uncharacterized proteins as well as to associate yet uncharacterized proteins with disease phenotypes, respectively. In the first part of the thesis, we present a novel approach to predict protein function from protein interaction networks of multiple species. The key to our method is to study proteins within modules defined by evolutionarily conserved processes, combining comparative cross-species genomics and functional linkage within interaction networks. Within conserved subgraphs we infer novel protein functions from orthology relationships across species and along conserved interactions of neighboring proteins within a species. We show that the combination of different sources of evidence for functional similarity between proteins reaches very high prediction precision, especially for multiple species. For instance, for the combination of human, fly and yeast we achieve a precision of 87%, 84% and 87%, respectively. Further, we predict many novel functions for uncharacterized or only weakly characterized proteins. When combining novel predictions from different species combinations, our method produces 27,100 novel annotations for human with an estimated precision of 83%. In the second part, we introduce a region-independent, network-based framework for identifying yet uncharacterized disease-related gene products by integrating protein interaction, protein function, and network centrality analysis. Given a disease, we first extract all genes known to be involved in this disease.
We compile a disease-specific network by integrating directly and indirectly linked gene products using protein interaction and functional information. Proteins in this network are ranked based on their network centrality. In our evaluation we show that predicted functions enhance the ranking of disease-relevant proteins. Utilizing indirect interactions, on the other hand, significantly improves the cross-validation recovery rate up to 20%. However, considering indirect interactions integrates many global "hub" proteins which get high centrality ranks but are mostly not disease-specific. To adjust the ranking for a bias toward hub proteins in disease networks, we introduce a novel normalization procedure which decreases the fraction of highly ranking hub proteins (by 23%) while increasing the fraction of disease proteins up to 22%.
Cross-platform microarray analysis for gene regulatory network reconstruction in T-cells (Stefan Kröger)
A better knowledge of the transcriptional regulation during T cell activation and differentiation is essential to understand the physiology and pathophysiology of the adaptive immune system. Therefore, we want to create a basic gene regulatory network of T cells using publicly available gene expression array data sets. A major challenge is to combine the data sets across different techniques, stimulations and experimental contexts. Furthermore, we want to show that the joint analysis of different microarray data sets in conjunction with additional data from ChIP-seq and theoretical transcription factor binding site motif discovery is a promising basis for reconstructing gene regulatory networks.
Accelerating Betweenness Centrality Computation (André Koschmieder)
Betweenness Centrality is a network analysis measure to determine the importance of a particular vertex in a network. This measure is used to study large networks like the internet or social networks, and is also widely used in biological research. Applications include lethality in biological networks, study of AIDS spreading, gene function prediction, and analyzing protein interaction networks. Calculating Betweenness Centrality requires solving the single-source shortest-path problem for every node in the graph. The fastest currently known algorithm by Brandes requires O(nodes*edges) time and is thus not suitable for large-scale networks. This work shows ways to speed up centrality calculation by using a massively parallel approach as well as approximating centrality by using heuristics.
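For readers unfamiliar with the baseline, the following is a minimal sequential sketch of Brandes' algorithm for unweighted graphs. It is not the parallel or heuristic approach of the talk. The adjacency-list format and the toy path graph are assumptions made for illustration only.

```python
from collections import deque


def betweenness_centrality(graph):
    """Brandes' algorithm for an unweighted graph given as {node: [neighbours]}.

    Scores count ordered (source, target) pairs, so for an undirected graph
    they are twice the value obtained when each unordered pair counts once.
    """
    centrality = {v: 0.0 for v in graph}
    for source in graph:
        # Phase 1: BFS from source, counting shortest paths (sigma).
        order = []                                  # nodes by non-decreasing distance
        predecessors = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}
        dist = {v: -1 for v in graph}
        sigma[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:                     # w discovered for the first time
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:          # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    predecessors[w].append(v)
        # Phase 2: accumulate pair dependencies in reverse BFS order.
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in predecessors[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                centrality[w] += delta[w]
    return centrality


# Hypothetical undirected path graph a - b - c - d (each edge listed in both directions).
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
scores = betweenness_centrality(path)
print(scores)
```

Each outer iteration is one breadth-first search followed by a linear accumulation pass, which is where the O(nodes*edges) bound comes from. The iterations are independent of each other, so they are a natural unit for parallel execution.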
Data acquisition using text mining for the CellFinder project (Mariana Lara Neves)
In the CellFinder project, we aim to develop a virtual environment and data repository for the integration and analysis of the available data on stem cells and their derivatives. Our participation in this project is in the acquisition of new scientific data from the literature, by developing new text mining methods or using some of the tools already available for this purpose. Our data acquisition is composed of three steps: (1) retrieval from Medline of new relevant publications in the stem cell domain; (2) extraction of named entities (e.g., cell lines, cell types, genes, tissues, organs), binary relationships (e.g., cell line part of an organ) and biological events (e.g., gene expression, differentiation); and (3) validation of the data which comes from our text mining methods by the curators. In this talk, I will present the current state of our part in the project, as well as some other parallel small projects I have been developing in our group so far.
Parallelization of Text Mining Workflows in a Cloud (Erik Dießler)
This Diplom thesis investigates whether text mining workflows can be executed in a massively parallel and distributed fashion in a cloud environment using Nephele, UIMA, and U-Compare. The Unstructured Information Management Architecture (UIMA) is a standardized execution and development framework for processing unstructured information. Nephele is a framework for the parallel and distributed processing of partitionable problems in the Elastic Compute Cloud (EC2) of Amazon Web Services; its focus lies, among other things, on the new opportunities and requirements of cloud architectures. A further aspect of the thesis is a concept for restarting text mining workflows, which makes intermediate results reusable after a failure or after workflow components have been exchanged. A prototypical embedding of UIMA into the Nephele architecture, together with a prototype of the restart concept, is used in a case study (extraction of protein-protein interactions) to examine the technical possibilities and problems of the approach and of the software components employed. The prototype is able to execute UIMA text mining workflows created with U-Compare in a distributed and parallel manner on the EC2 architecture via Nephele. The talk presents the approach and the results of the thesis.
Protein-Protein interaction extraction from biomedical texts (Philippe Thomas)
Automated extraction of protein-protein interactions (PPI) from the literature is a key challenge in current biomedical text-mining. It has many applications both in building resources for researchers, especially by supporting database curators, and in original research, for instance in function prediction or network analysis. Most state-of-the-art methods for PPI extraction tackle the classification problem by using machine learning techniques. In this talk I will present the current status of my work, focussing on semi-supervised machine learning techniques for relationship extraction.
Cost-based Optimization of Graph Queries in Relational Database Management Systems (Silke Trißl)
Graphs occur in many areas of life. We are interested in graphs in biology, where nodes are chemical compounds, enzymes, reactions, or interactions, which are connected by either directed or undirected edges. Efficiently querying these graphs is a challenging task. In this thesis we present GRIcano, a system that allows graph queries to be executed efficiently. For GRIcano we assume that graphs are stored and queried using relational database management systems (RDBMS). We use an extended version of the pathway query language PQL to express graph queries, for which we describe the syntax and semantics in this work. We employ ideas from RDBMS to improve the performance of query execution. Thus, the core of GRIcano is a cost-based query optimizer that is created using the Volcano optimizer generator. This thesis makes contributions to all three required inputs of the optimizer: the relational algebra, the implementations, and the cost model. Relational algebra operators alone are not sufficient to express graph queries. Thus, we first present new operators to rewrite PQL queries to algebra expressions. We propose the reachability φ, distance Φ, path length ψ, and path operator Ψ. In addition, we provide rewrite rules for the newly proposed operators in combination with standard relational algebra operators, so that expressions can be rewritten and operators exchanged. Secondly, we provide algorithms for each proposed operator. The main contribution here is GRIPP, an index structure that allows reachability queries to be executed efficiently on very large graphs containing directed edges. The advantage of GRIPP over other existing index structures, which we review in this work, is that GRIPP answers reachability queries for a given pair of nodes in constant time, while the size of the created index is on the order of the size of the graph. We also show that we can employ GRIPP and the recursive query strategy, which we also present, to provide implementations for all four proposed operators.
The third input for Volcano is the cost model, which requires cardinality estimates for the proposed operators and cost models for the used algorithms. Based on extensive experimental evaluation of the proposed algorithms on generated graphs, we present functions to estimate the cardinality of the φ, Φ, ψ, and Ψ operators. In addition, we deduce functions for the presented algorithms to estimate the cost of execution. The novel approach is that these functions only use key figures of the graph, which are the number of nodes and edges, the degree of the node with the highest outdegree, and the number of nodes with zero outdegree. Finally, we demonstrate the effectiveness of GRIcano using exemplary graph queries on real biological networks.
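To make the notions of reachability, distance, and graph key figures concrete, here is a small sketch that uses plain breadth-first search. It is not GRIPP, not PQL, and not the thesis' cost functions. The function names, the adjacency-list format, and the toy graph are hypothetical, and the exact operator semantics in the thesis may differ.

```python
from collections import deque


def distance(graph, start, target):
    """Length of a shortest directed path from start to target, or None."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == target:
            return seen[node]
        for succ in graph.get(node, ()):
            if succ not in seen:
                seen[succ] = seen[node] + 1
                queue.append(succ)
    return None


def reachable(graph, start, target):
    """True if some directed path leads from start to target."""
    return distance(graph, start, target) is not None


def key_figures(graph):
    """The four graph statistics named in the abstract."""
    nodes = set(graph)
    for successors in graph.values():
        nodes.update(successors)        # include nodes that only occur as targets
    out_degrees = {n: len(graph.get(n, ())) for n in nodes}
    return {
        "nodes": len(nodes),
        "edges": sum(out_degrees.values()),
        "max_outdegree": max(out_degrees.values(), default=0),
        "zero_outdegree_nodes": sum(1 for d in out_degrees.values() if d == 0),
    }


# Hypothetical directed toy graph: A -> B, B -> C, B -> D.
toy = {"A": ["B"], "B": ["C", "D"], "C": []}
print(reachable(toy, "A", "D"), distance(toy, "A", "D"))
print(key_figures(toy))
```

A search like this visits up to the whole graph per query, which is the cost an index structure is meant to avoid. The key figures, by contrast, can be computed once per graph and then fed into estimation functions.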
Relation Extraction with Massive Seed and Large Corpora (Sebastian Krause)
Relation extraction (RE) is concerned with the detection of relationships between objects or concepts in natural-language texts. Many approaches to the creation of RE systems follow machine-learning principles and thus need training examples with relation mentions. This talk outlines an RE system which follows the recent distant-supervision paradigm. Here, a massive amount of pre-existing knowledge is utilized to find mention samples in large plain-text corpora, thereby generating examples for the RE training phase. The work presented uses a rule-based approach to RE, combined with relatively deep linguistic analysis. For my Diplom thesis, the knowledge database Freebase was accessed to retrieve initial facts for targeted semantic relations. These facts were then used to query the Bing search engine for web pages mentioning the target relations. After several natural-language-processing steps, extraction rules were learned from the downloaded web pages. To reduce the amount of noisy rules, several extraction-rule filters were defined, in part processing learned rules from different relations at the same time to boost the accuracy of the filtering. After giving an overview of the system implementation, this talk will present the performed evaluation of the learned rules, including an error analysis.
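The core idea of distant supervision can be sketched in a few lines: known facts are matched against raw sentences, and every sentence that mentions both arguments of a fact becomes a candidate training example. The sketch below uses simple substring matching on made-up data. It does not call Freebase or Bing and involves none of the linguistic analysis or rule learning described in the talk. All names and sentences are hypothetical.

```python
def find_mentions(seed_facts, sentences):
    """Pair each seed fact with the sentences that mention both of its arguments."""
    examples = []
    for relation, subject, obj in seed_facts:
        for sentence in sentences:
            lowered = sentence.lower()
            if subject.lower() in lowered and obj.lower() in lowered:
                examples.append((relation, subject, obj, sentence))
    return examples


# Hypothetical seed fact and corpus, invented for illustration.
seed_facts = [("marriage", "Alice Example", "Bob Example")]
sentences = [
    "Alice Example married Bob Example in 2001.",
    "Alice Example gave a talk in Berlin.",
    "Bob Example and Alice Example attended the same conference.",
]

examples = find_mentions(seed_facts, sentences)
for relation, subject, obj, sentence in examples:
    print(relation, "|", sentence)
```

The third sentence is matched although it does not express the marriage relation. Such false positives are one source of the noisy rules mentioned above and show why filtering is needed after matching.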
Building a liver knowledgebase: representation of biochemical pathways and of evidence in OWL (Christian Bölling)
Computational approaches to study biochemistry require machine-accessible representations of biochemical knowledge. Based on the requirements for assembling, managing and analysing genome-scale constraint-based models of liver metabolism, we developed a semantic data model for representation of liver biochemical knowledge. Adopting a consistent perspective on biochemical processes as molecular events, our OWL2-based representation is capable of representing biochemical processes of varying complexity in terms of their parts and participants and the various roles they play. We further outline an information model that employs sidecar evidence ontologies to represent the evidence for the asserted properties of and among biochemical entities in great detail and conforming to OWL semantics. Our approach overcomes previous limitations in the modeling of pathways and evidence. We discuss how these elements can be incorporated in resources for representation and analysis of liver biochemistry.
Contact: Astrid Rheinländer; rheinlae(at)informatik.hu-berlin.de