Title: Reproducible data science over data lakes: replayable data pipelines with Bauplan and Nessie

Abstract: As the Lakehouse architecture becomes more widespread, ensuring the reproducibility of data workloads over data lakes emerges as a crucial concern for data engineers. However, achieving reproducibility remains challenging. The size of data pipelines contributes to slow testing and iterations, while the intertwining of business logic and data management complicates debugging and increases error susceptibility. In this paper, we highlight recent advancements made at Bauplan in addressing this challenge. We introduce a system designed to decouple compute from data management, by leveraging a cloud runtime alongside Nessie, an open-source catalog with Git semantics. Demonstrating the system's capabilities, we showcase its ability to offer time-travel and branching semantics on top of object storage, and offer full pipeline reproducibility with a few CLI commands.
Comments: Pre-print of paper accepted at SIGMOD (DEEM2024)
Subjects: Databases (cs.DB); Machine Learning (cs.LG)
Cite as: arXiv:2404.13682 [cs.DB]
  (or arXiv:2404.13682v1 [cs.DB] for this version)

Submission history

From: Jacopo Tagliabue
[v1] Sun, 21 Apr 2024 14:53:33 GMT (434kb,D)
