Text Analytics

Wissensmanagement in der Bioinformatik | Text Analytics

Halbkurs (half course) in the winter semester 2010/2011
Professor Ulf Leser

The Halbkurs "Text Analytics" covers methods for the computer-aided analysis of texts. Topics range from information retrieval (search engines, query languages, indexing, PageRank) through statistical language processing / computational linguistics (collocations, language models, part-of-speech tagging, disambiguation) to text mining (document classification and clustering, information extraction, plagiarism detection). Both algorithmic foundations and concrete applications are covered.

The Halbkurs is accompanied by a practical lab (Praktikum), which deepens the methods taught in the course through hands-on implementation. Working in groups, students solve various text mining problems, often using existing frameworks.

Prerequisites

Attending the course requires basic knowledge of algorithms and good knowledge of Java.

Examinations

Examinations are oral.

Credit

The course (lecture + practical lab) can be credited towards

Diplominformatik (Diplom program in computer science), Halbkurs, 8 SP

Literature for the lecture

Manning / Schütze: "Foundations of Statistical Natural Language Processing", MIT Press, 1999. (Available on Google Books)
Baeza-Yates / Ribeiro-Neto: "Modern Information Retrieval", Addison-Wesley, 1999.
Further literature and links

Topics and dates in detail

(Slides are available here as PDF before each lecture. Subject to change. All slides are in English, but the course will be held in German.)

Introduction and overview
Introduction to Information Retrieval
Evaluation of IR Systems; document normalization
IR Models I: Boolean, Vector Space, Relevance Feedback
IR Models II: Probabilistic Retrieval, Latent Semantic Indexing
(Corrected version, 20.5.2008)
Exact online substring search: Z-Box and Boyer-Moore
Searching multiple patterns: Keyword Trees, Aho-Corasick, and PETER
Indexing terms: Inverted files and signature files
Searching the web: Crawling, PageRank and HITS
Guest lecture by Dr. Torsten Andreas: An Introduction to Linguistics
Christmas break
Language models
Part-of-Speech (POS) tagging
Collocations and domain-specific terms
Text classification
Guest lecture by Prof. Felix Naumann: Linked Open Data
Text clustering
Named Entity Recognition
Word Sense Disambiguation
Relationship Extraction
Wrap-up

Further materials

Text REtrieval Conference: TREC Homepage
BioCreative: Homepage (parts 1 and 2)
The OpenNLP site
IBM's Unstructured Information Management Architecture: UIMA
Snowball is a small string processing language designed for creating stemming algorithms for use in Information Retrieval
Lists of stop words
The NLTK toolkit - a library for natural language processing in Python
A nice tutorial on SVD and latent semantic indexing by E. Garcia
POS tagset of the UPenn Treebank
Google's n-gram viewer

Supplementary literature

Feldman, Sanger: "The Text Mining Handbook", Cambridge University Press, 2007
Grossman, Frieder: "Information Retrieval", Springer, 2004 (partly available online).
Online textbook "Information Retrieval 1 (Grundlagen, Modelle und Anwendungen)", Prof. Henrich, Universität Bamberg.
