Humboldt-Universität zu Berlin - Mathematisch-Naturwissenschaftliche Fakultät - Wissensmanagement in der Bioinformatik

Research Seminar WBI - DBIS

Databases and Information Systems research group | Knowledge Management in Bioinformatics research group

New Developments in Databases and Bioinformatics

Prof. Johann-Christoph Freytag and Prof. Ulf Leser

  • When? Tuesdays, 15.00 - 17.00 c.t.
  • Where? Rud 26, 0'313

The members of both research groups use this seminar as a forum for discussion and exchange. Students and guests are cordially invited.

The following dates and talks are scheduled so far:


Date   Topic   Speaker

19.04.2011, 15.00 c.t., RUD 26, 0'313
Design and Implementation of an ETL Process and Data Warehouse for Literature mining of genetic Mutations in Cell Lines and statistical analysis of their occurrence
Martin Schenck

02.05.2011, 11.00 c.t., RUD 25, HU-Kabinett
Text Mining for the Reconstruction of Protein-Protein Interaction Networks
Quang Long Nguyen

27.06.2011, 13.00 c.t., RUD 25, ?
Cost-based Optimization of Graph Queries in Relational Database Management Systems
Silke Trißl

28.06.2011, 15.00 c.t., RUD 26, 0'313
Text Mining für Outbreak Database
Marco Eckstein

04.07.2011, 10.00 c.t., RUD 25, 3.113
Query Interface Extraktion und Integration
Thomas Kabisch

05.07.2011, 15.00 c.t., RUD 26, 0'313
Guest talk: Developments in Integrative Network Biology
Karl Kugler

25.07.2011, 10.00 c.t., RUD 25, 4.112
BioGraph: Knowledge Discovery and Exploration in the Biomedical domain
Jeroen De Knijf

26.07.2011, 15.00 c.t., RUD 25, 4.112
Semantic Web + Life Sciences: History and Mystery
Sebastian Wandelt

28.07.2011, 10.00 c.t., RUD 25, 4.112
Effective Multimodal Information Fusion by Structure Learning
Jana Kludas

02.08.2011, 10.00 c.t., RUD 25, 4.112
Aufbau eines Ko-Expressionsnetzwerkes zur funktionellen Analyse von Transkriptomen
Ulrike Haase

02.08.2011, 13.00 c.t., RUD 25, 4.112
Guest talk: Bridging the Vocabulary Gap between Questions and Answers Sentences in Question Answering Systems
Saeedeh Momtazi

05.08.2011, 13.00 c.t., RUD 25, 4.112
Single-Step Extraction of Protein-Protein Interactions with Support Vector Machines
Tim Rocktäschel

12.08.2011, 09.30 c.t., RUD 25, 4.112
Qualitätsmerkmale von Linked Data-veröffentlichenden Datenquellen
Annika Flemming

23.08.2011, 13.30 c.t., RUD 25, Humboldt-Kabinett
Network-based inference of protein function and disease-gene association
Samira Jaeger

26.08.2011, 16.00 c.t., RUD 25, 4.113
KLAS - A novel alternative splicing detection method based on Kullback-Leibler divergence
Marcel Jentsch

29.08.2011, 11.00 c.t., RUD 25, 4.113
Relation Extraction for Drug-Drug Interactions using Ensemble Learning
Mariana Lara Neves

30.09.2011, 14.00 c.t., RUD 25, 4.112
PiPa - Custom Integration of Protein Interactions and Pathways
Sebastian Arzt

05.10.2011, 13.00 c.t., RUD 25, 4.112
Information extraction from specialized domain
Anne-Lyse Minard

10.10.2011, 15.00 c.t., RUD 25, 4.112
Analysis of biological high-throughput data: Correlation between activation of intracellular signaling and proliferation
Richard Schäfer

10.10.2011, 17.00 c.t., RUD 25, 4.112
Text Mining for the Reconstruction of Protein-Protein Interaction Networks
Quang Long Nguyen

18.10.2011, 15.00 c.t., RUD 26, 0'307
EquatorNLP: Pattern-based Information Extraction for Disaster Response
Lars Döhling

Abstracts

Design and Implementation of an ETL Process and Data Warehouse for Literature mining of genetic Mutations in Cell Lines and statistical analysis of their occurrence (Martin Schenck)

The goal of this work was to combine different text-mining methods to enable the automatic extraction of mutations in disease tissues. The field of biomedical text mining already comprises various tools developed for different purposes. The main effort therefore went into a workflow that identifies, extracts and stores these data as effectively as possible. The final workflow combines the best-performing tools and focuses on extracting results with high precision. All results are stored in a data warehouse, which is the foundation of the tool. This data warehouse contains information on genes, diseases and gene expression in cell lines. Results are normalised by linking them to synonymous entries stored in the warehouse. For validation, calculated and normalised results are compared to curated information extracted from other sources, such as the Catalogue of Somatic Mutations in Cancer (COSMIC), and stored in the warehouse. The workflow receives documents as input on the basis of PubMed IDs. It first runs MutationFinder, a tool that extracts mutations from plain-text abstracts. To find genes, either GNAT or the NCIBI name tagger can be chosen; their performance is also compared in the course of this work. For relevant diseases and cell lines, the tool takes all cancer-related disease and cell line aliases from the data warehouse and searches the documents with a set of regular expressions. For evaluation, the automatically generated results were compared to a manually annotated text corpus derived from COSMIC. Using GNAT, the tool achieved a precision of 86.84% and an overall F1-score of 44.90%. The tool developed in the course of this work, named gemuline, enables researchers to scan scientific literature for mutations not yet listed in public databases. Thus, this thesis offers a good starting point for improving productivity in the drug discovery process. Currently, the tool stores information extracted from approximately 127,000 PubMed abstracts.
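
The alias search with regular expressions mentioned above could look roughly like the following sketch. It is not taken from gemuline: the alias dictionary, the function names and the word-boundary handling are assumptions made for illustration, and the real tool reads its aliases from the data warehouse.

```python
import re

# Hypothetical alias list for illustration; the actual aliases live in the data warehouse.
ALIASES = {
    "HeLa": ["HeLa", "HeLa cells"],
    "MCF-7": ["MCF-7", "MCF7"],
}

def compile_alias_patterns(aliases):
    """Build one case-insensitive pattern per canonical name."""
    patterns = {}
    for canonical, names in aliases.items():
        # Longest alias first, so that "HeLa cells" is preferred over "HeLa".
        ordered = sorted(names, key=len, reverse=True)
        alternatives = "|".join(re.escape(name) for name in ordered)
        # Lookarounds prevent matches inside longer tokens such as "MCF7x".
        patterns[canonical] = re.compile(
            rf"(?<!\w)(?:{alternatives})(?!\w)", re.IGNORECASE
        )
    return patterns

def find_mentions(text, patterns):
    """Return (canonical name, matched text, offset) tuples in text order."""
    hits = []
    for canonical, pattern in patterns.items():
        for match in pattern.finditer(text):
            hits.append((canonical, match.group(0), match.start()))
    return sorted(hits, key=lambda hit: hit[2])
```

Mapping every match back to a canonical name mirrors the normalisation step described in the abstract, where results are linked to synonymous warehouse entries.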

Cost-based Optimization of Graph Queries in Relational Database Management Systems (Silke Trißl)

Graphs occur in many areas of life. We are interested in graphs in biology, where nodes are chemical compounds, enzymes, reactions, or interactions, which are connected by either directed or undirected edges. Efficiently querying these graphs is a challenging task. In this thesis we present GRIcano, a system for the efficient execution of graph queries. For GRIcano we assume that graphs are stored and queried using relational database management systems (RDBMS). We use an extended version of the pathway query language PQL to express graph queries, and we describe its syntax and semantics in this work. We employ ideas from RDBMS to improve the performance of query execution. Thus, the core of GRIcano is a cost-based query optimizer that is created using the Volcano optimizer generator. This thesis makes contributions to all three required inputs of the optimizer: the relational algebra, the implementations, and the cost model. Relational algebra operators alone cannot express graph queries. Thus, we first present new operators for rewriting PQL queries into algebra expressions. We propose the reachability operator φ, the distance operator Φ, the path length operator ψ, and the path operator Ψ. In addition, we provide rewrite rules for the newly proposed operators in combination with standard relational algebra operators, so that expressions can be rewritten and operators exchanged. Secondly, we provide algorithms for each proposed operator. The main contribution here is GRIPP, an index structure for efficiently executing reachability queries on very large graphs with directed edges. The advantage of GRIPP over other existing index structures, which we review in this work, is that GRIPP answers reachability queries for a given pair of nodes in constant time, while the size of the index is in the order of the size of the graph. We also show that we can employ GRIPP and the recursive query strategy, which we also present, to provide implementations for all four proposed operators.
The third input for Volcano is the cost model, which requires cardinality estimates for the proposed operators and cost models for the algorithms used. Based on extensive experimental evaluation of the proposed algorithms on generated graphs, we present functions to estimate the cardinality of the φ, Φ, ψ, and Ψ operators. In addition, we derive functions that estimate the execution cost of the presented algorithms. The novel aspect is that these functions only use key figures of the graph, namely the number of nodes and edges, the degree of the node with the highest outdegree, and the number of nodes with zero outdegree. We finally demonstrate the effectiveness of GRIcano using exemplary graph queries on real biological networks.
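
The four key figures named above are cheap to collect from an edge list. The sketch below only computes these statistics; the function name and the edge-list representation are assumptions, and the cardinality and cost functions themselves are not given in the abstract and are not reproduced here.

```python
from collections import defaultdict

def graph_key_figures(edges):
    """Compute the key figures of a directed graph given as a list of (source, target) pairs."""
    nodes = set()
    outdegree = defaultdict(int)
    for source, target in edges:
        nodes.update((source, target))
        outdegree[source] += 1  # parallel edges are counted individually
    return {
        "nodes": len(nodes),
        "edges": len(edges),
        "max_outdegree": max((outdegree[n] for n in nodes), default=0),
        "zero_outdegree_nodes": sum(1 for n in nodes if outdegree[n] == 0),
    }
```

For the edge list `[("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]` this yields 4 nodes, 4 edges, a maximum outdegree of 2 and one node with zero outdegree.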

Text Mining für Outbreak Database (Marco Eckstein)

Outbreak Database (www.outbreak-database.com), maintained at the Charité, stores reports on disease outbreaks in medical facilities. In contrast to the underlying articles (usually from PubMed), the reports are stored in structured form. The project aims to let medical staff quickly find effective countermeasures during outbreaks. Outbreak Database is also a useful tool for research and continuing education. Manual maintenance of the database by experts is laborious. Relevant articles are found through a PubMed search followed by manual filtering. The information required to fill in the fields of the reports in Outbreak Database is then extracted from these articles, again manually. This diploma thesis investigated several approaches to automating the subtasks of classification and information extraction. For classification, machine learning methods were used and, for comparison, the PubMed search queries used so far. For information extraction, term identification tools (MetaMap and LINNAEUS) and metadata from PubMed (in particular MeSH headings) were used. Because the various data sets and tools use mutually incompatible controlled vocabularies, the UMLS Metathesaurus was used and extended to map them onto one another.

Query Interface Extraktion und Integration (Thomas Kabisch)

Web databases contain large amounts of high-quality structured content. Many popular applications, such as product comparison systems, require methods for programmatic database access and for integrating the underlying content. Unlike relational databases, for example, web databases usually do not support programmatic access to their content through suitable interfaces. Approaches that provide automated access to web databases can only use the query interfaces designed for human interaction and are therefore very demanding to implement. The talk presents new solutions to three central problems in the integration of web databases: (1) extraction of query interfaces, (2) matching of query interfaces, and (3) classification of unknown query interfaces with respect to their application domain.

Developments in Integrative Network Biology (Karl Kugler)

Over recent years, the application of networks has become a major topic in biomedical research. While classical approaches focus on inspecting and analyzing single, isolated features, network biology allows investigating sets of features that form complex systems. It is widely believed that network-based methods can accurately capture the dynamics that lie within biologically complex systems, e.g. a cell. In the present work we take network biology a step beyond single networks, as we not only analyze biological data on a systems (network) level, but also combine information from a set of networks. Thereby, we can infer topological information and properties that are common to the networks that represent a certain condition. The typical approach for such an integrative analysis is to compare the properties of edges that are common to the networks. We recently demonstrated such a majority vote count method for prostate cancer networks. However, for these methods to be applied successfully, it is necessary to map the vertex labels of the different networks so that common edges can be identified. Recently, a method for selecting a structural prototype for a set of networks was introduced. This alternative Graph Prototyping approach overcomes the problems of mapping to common identifiers and finding common edges, as it is based on measuring the distance between the networks. In related work we illustrated how this method can be employed on a set of co-expression networks for prostate cancer. Here, we present selected topics that demonstrate how integrative analysis of networks can be used for biological applications.
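
A majority vote count over edges can be sketched as follows. This is a generic illustration, not the speaker's implementation: it assumes undirected edges, a hypothetical `threshold` parameter, and vertex labels that have already been mapped to common identifiers, which the abstract names as a precondition.

```python
from collections import Counter

def majority_vote_edges(networks, threshold=0.5):
    """Keep the undirected edges that occur in more than `threshold` of the networks.

    `networks` is a list of edge lists; each edge is a pair of vertex labels.
    """
    counts = Counter()
    for edges in networks:
        # frozenset treats (a, b) and (b, a) as the same edge; the set
        # comprehension counts each edge at most once per network.
        counts.update({frozenset(edge) for edge in edges})
    needed = threshold * len(networks)
    return {edge for edge, count in counts.items() if count > needed}
```

With three networks and the default threshold, an edge has to appear in at least two of them to be kept.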

BioGraph: Knowledge Discovery and Exploration in the Biomedical domain (Jeroen De Knijf)

In this talk I will present a data integration and mining platform for knowledge discovery in the biomedical domain. BioGraph allows for the automated formulation of functional hypotheses, relating concepts to targets. A typical setting in which BioGraph can assist is disease-gene prioritization. In the talk I will discuss the data modeling and integration, address specific properties of the unified graph and explain the ranking and hypothesis generation method in detail. Next, I will walk through some practical examples (mostly related to schizophrenia) and explain the testbed that we used to validate our results.

Semantic Web + Life Sciences: History and Mystery (Sebastian Wandelt)

There has been a lot of discussion in and outside the Semantic Web community about the usefulness of logics as a basis for describing data; especially in the life sciences there are several well-founded criticisms. In this talk, we provide a short overview of the last decade of Semantic Web foundations, logics, and technologies. We relate these developments to the demands of life science (scientific) communities and analyze to what extent these demands have been satisfied in the past. Furthermore, we discuss possible strategies to overcome common problems in the future, e.g. scalability issues and the inability to model unspecific/soft constraints.

Effective Multimodal Information Fusion by Structure Learning (Jana Kludas)

The joint processing of multimodal data has received a lot of attention in the last decade due to the increased availability of multimedia data and the under-performance of content-based approaches. This work shows that effective information fusion for multimedia document processing can be achieved by learning the data structure that underlies the fusion task and adapting the fusion strategy accordingly. To achieve this, a feature selection and construction (FS/FC) algorithm based on attribute interaction detection is implemented. The effectiveness of the approach is shown on a boolean concept and a multimedia document classification task. The results also show that an inappropriate fusion strategy leads to performance losses.

Aufbau eines Ko-Expressionsnetzwerkes zur funktionellen Analyse von Transkriptomen (Ulrike Haase)

TBA

Single-Step Extraction of Protein-Protein Interactions with Support Vector Machines (Tim Rocktäschel)

Recent research in the automatic extraction of Protein-Protein Interactions (PPIs) has yielded continuous improvement. However, the assessment of PPI methods is in many cases not realistic, as protein names are expected to be known and errors that would propagate from previous Named Entity Recognition steps are neglected. Previous work showed that, when those errors are considered, the F-Score of PPI extractors can drop by up to 22.7 percentage points. One approach to address the problem of error propagation is applying joint inference. We build and evaluate a single-step extraction system for PPIs, i.e., a system that jointly predicts PPIs and their candidate entities, based on Support Vector Machines (SVMs). In doing so, we encounter the problem of training an SVM on strongly imbalanced datasets, leading to a performance decrease of between 0.9 and 21.0 percentage points F-Score compared to the classic pipeline architecture. Moreover, this first approach is limited to single-token entities.

Bridging the Vocabulary Gap between Questions and Answers Sentences in Question Answering Systems (Saeedeh Momtazi)

Sentence retrieval plays an important role in question answering systems. It aims to find small segments of text that contain an exact answer to users' questions rather than overwhelm them with a large number of retrieved documents which they must sort through to find the desired answer. As the search in sentence retrieval is conducted over smaller segments of data than in a document retrieval task, the problems of data sparsity and exact matching become more critical than in document retrieval. In this talk, we propose two different language modeling techniques to overcome the vocabulary mismatch problem by capturing term relationships. The first method, the class-based language model, uses a word clustering algorithm to capture term relationships to deal with the data sparsity and vocabulary mismatch problems. In this model, we assume there is a relation between the terms that belong to the same cluster; as a result, they can be substituted for one another when searching for relevant sentences. The second method, the trained trigger language model, finds pairs of trigger and target words when trained on a large corpus. If a trigger word appears in the question and a sentence contains the corresponding target word, the model considers a relation between the question and the sentence. The experimental results show that both models significantly improve sentence retrieval performance.

Qualitätsmerkmale von Linked Data-veröffentlichenden Datenquellen (Annika Flemming)

The Web of Data is an evolution of the World Wide Web that enables software to process published data. The publishing principles it uses allow the author of a data source to describe things by means of freely definable properties and to link their own data with that of other sources. Anyone can publish data, regardless of qualification and motivation. For a consumer of the data it therefore becomes increasingly difficult to select, from the resulting mass of data sources, those that suit their purposes and are of high quality. The goal of this diploma thesis was therefore to design a system for the quality assessment of data sources that publish Linked Data and to implement it as a prototype. To this end, criteria describing quality-related properties of a data source were first developed. Since studies of this kind are still rare in the Web of Data literature, the criteria were borrowed from publications in related fields and examined for their relevance to the Web of Data. To determine the quality of a data source with respect to the criteria found relevant, measurable aspects of the criteria, called indicators, were then defined. For each indicator, a calculation method was also defined by which the quality score of a data source with respect to that indicator can be computed. By specifying suitable aggregation and weighting functions, a system emerged with which a consumer can assess the quality of a data source and adapt the assessment to subjective preferences.
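
One simple way to aggregate indicator scores under consumer-defined weights is a weighted mean, sketched below. The thesis's actual aggregation and weighting functions are not given in the abstract; the function name, the indicator names in the example and the assumption that scores lie in [0, 1] are illustrative only.

```python
def aggregate_quality(indicator_scores, weights):
    """Weighted mean of indicator scores; indicators without a weight are ignored.

    Both arguments are dicts keyed by indicator name. The weights express the
    consumer's subjective preferences and are assumed to be non-negative.
    """
    total_weight = sum(weights.get(name, 0.0) for name in indicator_scores)
    if total_weight <= 0:
        raise ValueError("at least one scored indicator needs a positive weight")
    weighted_sum = sum(
        score * weights.get(name, 0.0) for name, score in indicator_scores.items()
    )
    return weighted_sum / total_weight
```

For example, scores `{"availability": 1.0, "licensing": 0.5}` with weights `{"availability": 1, "licensing": 3}` give (1.0 + 1.5) / 4 = 0.625.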

Network-based inference of protein function and disease-gene association (Samira Jaeger)

Protein interaction networks are crucial to many aspects of cellular function. On the one hand, they present direct and robust manifestations of functional relationships. On the other hand, alterations in protein interactions perturb natural cellular processes and contribute to many diseases. Both correlations, the functional and the pathological one, have been exploited in this work to infer novel protein function for uncharacterized proteins and to associate as yet uncharacterized proteins with disease phenotypes, respectively. In the first part of the thesis, we present a novel approach to predict protein function from protein interaction networks of multiple species. The key to our method is to study proteins within modules defined by evolutionarily conserved processes, combining comparative cross-species genomics and functional linkage within interaction networks. Within conserved subgraphs we infer novel protein functions from orthology relationships across species and along conserved interactions of neighboring proteins within a species. We show that the combination of different sources of evidence for functional similarity between proteins reaches very high prediction precision, especially for multiple species. For instance, for the combination of human, fly and yeast we achieve a precision of 87%, 84% and 87%, respectively. Further, we predict many novel functions for uncharacterized or only weakly characterized proteins. When combining novel predictions from different species combinations, our method produces 27,100 novel annotations for human with an estimated precision of 83%. In the second part, we introduce a region-independent, network-based framework for identifying as yet uncharacterized disease-related gene products by integrating protein interaction, protein function, and network centrality analysis. Given a disease, we first extract all genes known to be involved in this disease.
We compile a disease-specific network by integrating directly and indirectly linked gene products using protein interaction and functional information. Proteins in this network are ranked based on their network centrality. In our evaluation we show that predicted functions enhance the ranking of disease-relevant proteins. Utilizing indirect interactions, on the other hand, significantly improves the cross-validation recovery rate by up to 20%. However, considering indirect interactions integrates many global "hub" proteins which get high centrality ranks but are mostly not disease-specific. To adjust the ranking for a bias toward hub proteins in disease networks, we introduce a novel normalization procedure which decreases the fraction of highly ranked hub proteins (by 23%) while increasing the fraction of disease proteins by up to 22%. Finally, we apply our framework successfully to identify novel surface membrane factors that contribute to HIV-1 infection.

KLAS - A novel alternative splicing detection method based on Kullback-Leibler divergence (Marcel Jentsch)

It is now accepted that most human genes are alternatively spliced. The deregulation of alternative splicing can lead to the onset of serious diseases, e.g. several types of cancer. The analysis of alternative splicing requires high-throughput data generation techniques such as exon arrays. The latter allow for the evaluation of the expression level of a transcript by querying each exon component. We developed a new method called KLAS, based on the Kullback-Leibler divergence, to identify genome-wide alternative splicing events (ASEs). We benchmarked KLAS and a variety of splicing prediction methods using artificial data to compare their performance in different scenarios. Moreover, the performance of the methods was evaluated on a real dataset that comprises 8 different tissues with literature-confirmed events extracted from the AEdb. Additionally, the non-parametric statistical method Rank Product is introduced to the field of alternative splicing detection. The predictions from a set of methods are processed by Rank Product, yielding a new level of prediction.

Relation Extraction for Drug-Drug Interactions using Ensemble Learning (Mariana Lara Neves)

We describe our approach for the extraction of drug-drug interactions from literature. The proposed method builds majority voting ensembles of contrasting machine learning methods, which exploit different linguistic feature spaces. We evaluated our approach in the context of the DDI Extraction 2011 challenge, where, using document-wise cross-validation, the best single classifier achieved an F1 of 57.3 % and the best ensemble achieved 60.6 %. On the held-out test set, our best run achieved an F1 of 65.7 %.
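
For orientation, the Kullback-Leibler divergence between two discrete distributions can be computed as in the sketch below. How KLAS derives its distributions from exon array data is not described in the abstract; treating per-exon expression values of one gene in two conditions as the inputs, and smoothing them with a small pseudocount, are assumptions made here.

```python
import math

def kl_divergence(p, q, pseudocount=1e-9):
    """D_KL(P || Q) in bits for two non-negative vectors of equal length.

    Both vectors are normalized to sum to 1; the pseudocount avoids
    division by zero and log(0) when an entry is 0.
    """
    if len(p) != len(q):
        raise ValueError("both profiles need the same number of entries")
    p = [value + pseudocount for value in p]
    q = [value + pseudocount for value in q]
    p_total, q_total = sum(p), sum(q)
    return sum(
        (a / p_total) * math.log2((a / p_total) / (b / q_total))
        for a, b in zip(p, q)
    )
```

Identical profiles give a divergence of zero, while a profile in which one exon is missing relative to the other gives a clearly positive value. Note that the measure is not symmetric in its arguments.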
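
The voting step of such an ensemble can be sketched in a few lines. The base classifiers and their feature spaces are not reproduced here; the function names and the tie-breaking rule (falling back to a default label) are assumptions for illustration.

```python
from collections import Counter

def majority_vote(predictions, tie_label=False):
    """Combine the labels that several classifiers assign to one candidate pair."""
    counts = Counter(predictions).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie_label  # the top labels are tied, fall back to the default
    return counts[0][0]

def ensemble_predict(classifier_outputs, tie_label=False):
    """`classifier_outputs` holds one list of labels per classifier, aligned by instance."""
    return [majority_vote(votes, tie_label) for votes in zip(*classifier_outputs)]
```

Using an odd number of binary classifiers avoids ties altogether, so the fallback label only matters for even-sized ensembles.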

PiPa - Custom Integration of Protein Interactions and Pathways (Sebastian Arzt)

TBA

Information extraction from specialized domain (Anne-Lyse Minard)

In this talk I will present the research I did during the first two years of my PhD. I work on information extraction, and more precisely on relation extraction in the biomedical domain. To help experts populate a database on renal physiology, I developed an interface which shows the extracted information and allows an expert to delete or modify this information. The interface uses a tool based on information extraction techniques, which annotates information about experiments in scientific papers (the numerical result of an experiment, the parameter studied, the species, ...). I will also present the work I did on relation extraction from clinical reports for the i2b2 2010 challenge. Eight relations are defined, so I modeled this task as a multi-class classification based on an SVM. I will briefly present the features used and an evaluation of the contribution of the syntactic structure of the sentence to the extraction process. The last point of my presentation will be sentence simplification to improve the extraction of relations.

Analysis of biological high-throughput data: Correlation between activation of intracellular signaling and proliferation (Richard Schäfer)

The thesis is part of the ColoNet project, which aims at modelling the most important signaling processes in colorectal carcinoma. The main objective is to gain a profound understanding of the dynamics of signal transduction and the effects of targeted therapies. For this purpose, high-throughput, real-time growth assays and phosphoproteome measurements are conducted using colorectal cancer cell lines treated with signaling activators and inhibitors, respectively. These experiments are intended to produce detailed information on phosphorylation states and cell proliferation, and especially on their relationship. In my thesis, I plan to establish a platform which integrates the measured phosphorylation levels of signaling proteins with cell proliferation. Since the biological and biochemical assays involve inter-experimental and intra-experimental variations, several normalization methods will be implemented and evaluated. For data interpretation, it is imperative that patterns in the phosphorylation and cell proliferation measurements can be recognized and reasonably understood.

Text Mining for the Reconstruction of Protein-Protein Interaction Networks (Quang Long Nguyen)

A wealth of information is available only in web pages, patents, publications etc. Extracting information from such sources is the main purpose of text mining. This thesis focuses on the application, evaluation and improvement of pattern-based approaches applied to relation extraction from biomedical documents. It presents techniques that improve a given baseline system, the Ali Baba algorithm for pattern-based relationship extraction, in all relevant aspects, i.e., recall, precision, and extraction speed. The thesis first reviews various information extraction approaches for the discovery of complex events among genes and proteins. Next, we introduce techniques for filtering sets of automatically generated patterns and analyze their effectiveness for different information extraction tasks. Essentially, our approach provides a solution to the problem of finding a "good" set of patterns in pattern-based relation extraction. We show that our techniques yield large improvements in all tasks we analyzed. For instance, they double the F-score for the task of gene expression extraction compared to the use of the original Ali Baba system. As a second major contribution, we present a simple yet effective filtering technique aimed at increasing the speed of relation extraction. The technique evaluates patterns by their potential to result in high-scoring matches for a given sentence. We show that this idea leads to considerable speed-ups while incurring only a negligible penalty in effectiveness. For instance, it can yield a 100-fold increase in extraction speed at an F-score that is only 3% worse. Finally, we present results of a large-scale application of our system to all available MEDLINE abstracts. The extraction results are evaluated by comparing them to manually curated information from biological databases.
Overall, we used patterns that are automatically generated from 5 million citations to extract protein-protein interactions from another 15 million citations. Our system achieved precision, recall and F-score of 7.2%, 28.3% and 11.5%, respectively. We show that our system generates patterns automatically without using any gold standard corpus and thus avoids the tedious and time-consuming task of manually curating training data. In addition, our system is very fast due to the simplicity and effectiveness of the pattern filtering algorithm.
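
The three reported figures are consistent with the usual definition of the F-score as the harmonic mean of precision and recall. The short check below assumes the balanced F1 variant, which the abstract does not state explicitly; the function name is illustrative.

```python
def f_score(precision, recall, beta=1.0):
    """F-beta score; beta = 1 gives the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_squared = beta * beta
    return (1 + beta_squared) * precision * recall / (beta_squared * precision + recall)

# Values reported in the abstract: precision 7.2%, recall 28.3%.
print(round(100 * f_score(0.072, 0.283), 1))  # prints 11.5
```

The harmonic mean stays close to the smaller of the two values, which is why the low precision dominates the overall score despite the higher recall.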

EquatorNLP: Pattern-based Information Extraction for Disaster Response (Lars Döhling)

One of the most severe problems in early phases of disaster response is the lack of information about the current situation. Such information is indispensable for planning and monitoring rescue operations, but hardly available due to the breakdown of information channels and normal message routes. However, during recent disasters in developed countries, such as the flooding of New Orleans or the earthquake in New Zealand, a wealth of detailed information was posted by affected persons in media such as Flickr, Twitter, or personal blogs. Finding and extracting such information may provide valuable clues for organizing aid, but currently requires humans to constantly read and analyze these messages. In this work, we report on a study for extracting such facts automatically by using a combination of deep natural language processing and advanced machine learning. Specifically, we present an approach that learns patterns in dependency representations of sentences to find textually described facts about human fatalities. Our method achieves an F1 measure of 66.7% on a manually annotated corpus of 109 news articles about earthquake effects, demonstrating the general efficacy of our approach.

Contact: Astrid Rheinländer; rheinlae(at)informatik.hu-berlin.de