Research Seminar, Summer Semester 2005
"New Developments in Bioinformatics and Information Integration"
Melanie Weis
DogmatiX Tracks down Duplicates in XML
Duplicate detection is the problem of detecting different entries in
a data source representing the same real-world entity. While research
abounds on duplicate detection in relational data, there is as yet
little work on duplicates in other, more complex data models,
such as XML. In this paper, we present a generalized framework for
duplicate detection, dividing the problem into three components:
candidate definition, which defines which objects are to be compared;
duplicate definition, which defines when two duplicate candidates are in
fact duplicates; and duplicate detection, which specifies how to
efficiently find those duplicates.
Using this framework, we propose an XML duplicate detection method,
DogmatiX, which compares XML elements based not only on their direct
data values, but also on the similarity of their parents, children,
structure, etc. We propose heuristics to determine which of these to
choose, as well as a similarity measure specifically geared towards the
XML data model. An evaluation of our algorithm using several heuristics
validates our approach.
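
To make the three components of the framework concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the DogmatiX implementation: the function names, the sample document, the threshold of 0.5, and the token-based Jaccard similarity are all hypothetical stand-ins. In particular, the similarity measure and heuristics proposed in the paper are not reproduced here, and the pairwise loop is a naive substitute for an efficient detection strategy.

```python
import itertools
import xml.etree.ElementTree as ET


def candidates(root, tag):
    """Candidate definition: which elements are compared (here, all with one tag)."""
    return list(root.iter(tag))


def description(elem):
    """Tokens from the element's own text and from the text of its descendants."""
    tokens = set()
    for text in elem.itertext():
        tokens.update(text.lower().split())
    return tokens


def similarity(a, b):
    """Jaccard overlap of the two token sets (a stand-in measure, assumed here)."""
    da, db = description(a), description(b)
    if not da and not db:
        return 0.0
    return len(da & db) / len(da | db)


def is_duplicate(a, b, threshold=0.5):
    """Duplicate definition: two candidates are duplicates above a threshold."""
    return similarity(a, b) >= threshold


def detect(root, tag, threshold=0.5):
    """Duplicate detection: naive comparison of every pair of candidates."""
    cands = candidates(root, tag)
    return [
        (a, b)
        for a, b in itertools.combinations(cands, 2)
        if is_duplicate(a, b, threshold)
    ]


# Hypothetical sample data, invented for this illustration.
doc = ET.fromstring("""
<movies>
  <movie id="1"><title>The Matrix</title><actor>Keanu Reeves</actor></movie>
  <movie id="2"><title>Matrix</title><actor>Keanu Reeves</actor></movie>
  <movie id="3"><title>Signs</title><actor>Mel Gibson</actor></movie>
</movies>
""")

pairs = [(a.get("id"), b.get("id")) for a, b in detect(doc, "movie")]
print(pairs)
```

In this sketch, the child elements (title and actor) contribute to the description of each movie, which loosely mirrors the idea of comparing elements on more than their direct data values. Parents and structure are ignored for brevity.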