We gratefully acknowledge support from
the Simons Foundation and member institutions.
Full-text links:

Download:

Current browse context:

cs.DB

Change to browse by:

cs

References & Citations

DBLP - CS Bibliography

Bookmark

(what is this?)
CiteULike logo BibSonomy logo Mendeley logo del.icio.us logo Digg logo Reddit logo

Computer Science > Databases

Title: Cleaning data with Swipe

Abstract: The repair problem for functional dependencies is the problem where an input database needs to be modified such that all functional dependencies are satisfied and the difference with the original database is minimal. The output database is then called an optimal repair. If the allowed modifications are value updates, finding an optimal repair is NP-hard. A well-known approach to find approximations of optimal repairs builds a Chase tree in which each internal node resolves violations of one functional dependency and leaf nodes represent repairs. A key property of this approach is that controlling the branching factor of the Chase tree allows to control the trade-off between repair quality and computational efficiency. In this paper, we explore an extreme variant of this idea in which the Chase tree has only one path. To construct this path, we first create a partition of attributes such that classes can be repaired sequentially. We repair each class only once and do so by fixing the order in which dependencies are repaired. This principle is called priority repairing and we provide a simple heuristic to determine priority. The techniques for attribute partitioning and priority repair are combined in the Swipe algorithm. An empirical study on four real-life data sets shows that Swipe is one to three orders of magnitude faster than multi-sequence Chase-based approaches, whereas the quality of repairs is comparable or better. Moreover, a scalability analysis of the Swipe algorithm shows that Swipe scales well in terms of an increasing number of tuples.
Subjects: Databases (cs.DB)
Cite as: arXiv:2403.19378 [cs.DB]
  (or arXiv:2403.19378v2 [cs.DB] for this version)

Submission history

From: Antoon Bronselaer [view email]
[v1] Thu, 28 Mar 2024 12:38:48 GMT (180kb,D)
[v2] Wed, 17 Apr 2024 10:11:29 GMT (182kb,D)

Link back to: arXiv, form interface, contact.