Research Seminar, Summer Semester 2004
Research Seminar
"New Developments in Bioinformatics and Information Integration"
- Friday, June 4, 2004, 11:15 a.m., RUD 25, Room IV.111 -
Duplicate Detection in XML Documents
- Melanie Weis
- Information Integration Research Group, HU Berlin
Detecting and purging duplicates, that is, multiple entities that describe the same real-world object, is an important data cleansing task and is necessary to improve data quality. For data stored in a flat relation, numerous solutions to this problem exist. As XML becomes increasingly popular for data representation, algorithms that detect duplicates in nested XML documents are required.
In this talk, I present a domain-independent algorithm that effectively identifies duplicates in an XML document. The solution traverses the XML tree top-down and identifies duplicate elements on each level: two elements are considered duplicates if their similarity, computed by a similarity measure, exceeds a threshold. We address efficiency by reducing the number of pairwise string and element comparisons.
To demonstrate the effectiveness of our approach, I also present initial experimental results.
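
The idea of a top-down traversal with a thresholded similarity measure can be illustrated with a minimal Python sketch. It is not the algorithm presented in the talk: the function names (`text_of`, `similarity`, `find_duplicates`), the threshold of 0.8, the sample document, and the use of `difflib.SequenceMatcher` as a stand-in similarity measure are all assumptions made for illustration. Comparing only sibling elements that share a tag is likewise just one simple, assumed way of limiting the number of pairwise comparisons.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher
from itertools import combinations


def text_of(elem):
    """Collect all text inside an element, whitespace-normalised and lower-cased."""
    return " ".join(" ".join(elem.itertext()).split()).lower()


def similarity(a, b):
    """Stand-in similarity measure (assumption): character-based ratio in [0, 1]."""
    return SequenceMatcher(None, text_of(a), text_of(b)).ratio()


def find_duplicates(parent, threshold=0.8, path=""):
    """Compare same-tag children of `parent`, then descend one level (top-down)."""
    found = []
    here = f"{path}/{parent.tag}"

    # Group children by tag so that only comparable elements are paired.
    by_tag = {}
    for child in parent:
        by_tag.setdefault(child.tag, []).append(child)

    for tag, group in by_tag.items():
        for a, b in combinations(group, 2):
            score = similarity(a, b)
            if score >= threshold:
                found.append((f"{here}/{tag}", text_of(a), text_of(b), score))

    # Continue with the next level of the tree.
    for child in parent:
        found.extend(find_duplicates(child, threshold, here))
    return found


# Hypothetical sample document with one near-duplicate pair of <book> elements.
SAMPLE = """
<library>
  <book><title>Data Cleansing</title><author>J. Smith</author></book>
  <book><title>Data Cleansing</title><author>John Smith</author></book>
  <book><title>XML in Practice</title><author>A. Jones</author></book>
</library>
"""

root = ET.fromstring(SAMPLE)
dups = find_duplicates(root)
for location, first, second, score in dups:
    print(f"{location}: '{first}' ~ '{second}' (similarity {score:.2f})")
```

In this sketch the first two `book` elements are reported as a duplicate pair, while the third is not. A real implementation would additionally have to decide how duplicates found on one level influence the comparisons on the levels below it, which the sketch leaves out.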