Reproducible data science over data lakes: replayable data pipelines with Bauplan and Nessie
- URL: http://arxiv.org/abs/2404.13682v1
- Date: Sun, 21 Apr 2024 14:53:33 GMT
- Title: Reproducible data science over data lakes: replayable data pipelines with Bauplan and Nessie
- Authors: Jacopo Tagliabue, Ciro Greco
- Abstract summary: We introduce a system designed to decouple compute from data management, by leveraging a cloud runtime alongside Nessie.
We demonstrate its ability to offer time-travel and branching semantics on top of object storage, and offer full pipeline reproducibility with a few CLI commands.
- Score: 5.259526087073711
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract:   As the Lakehouse architecture becomes more widespread, ensuring the reproducibility of data workloads over data lakes emerges as a crucial concern for data engineers. However, achieving reproducibility remains challenging. The size of data pipelines contributes to slow testing and iterations, while the intertwining of business logic and data management complicates debugging and increases error susceptibility. In this paper, we highlight recent advancements made at Bauplan in addressing this challenge. We introduce a system designed to decouple compute from data management, by leveraging a cloud runtime alongside Nessie, an open-source catalog with Git semantics. Demonstrating the system's capabilities, we showcase its ability to offer time-travel and branching semantics on top of object storage, and offer full pipeline reproducibility with a few CLI commands. 
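
The branching and time-travel semantics mentioned in the abstract can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: `ToyCatalog`, `Commit`, and the `s3://` paths are hypothetical names invented for illustration, and the code does not use or reproduce the Nessie or Bauplan APIs. The idea it shows is that a branch is just a pointer to an immutable commit, and a commit maps table names to data snapshots on object storage.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Union


@dataclass(frozen=True)
class Commit:
    """Immutable catalog state: table name -> snapshot location (hypothetical)."""
    tables: Dict[str, str]
    parent: Optional["Commit"] = None


class ToyCatalog:
    """Toy Git-like catalog; NOT the Nessie API, just an illustration."""

    def __init__(self) -> None:
        # Every catalog starts with an empty "main" branch.
        self.branches: Dict[str, Commit] = {"main": Commit(tables={})}

    def create_branch(self, name: str, from_branch: str = "main") -> None:
        # Branching copies a pointer, not the data (zero-copy).
        self.branches[name] = self.branches[from_branch]

    def write_table(self, branch: str, table: str, snapshot: str) -> Commit:
        # Writing creates a new commit; older commits stay untouched.
        head = self.branches[branch]
        new_head = Commit({**head.tables, table: snapshot}, parent=head)
        self.branches[branch] = new_head
        return new_head

    def read_table(self, ref: Union[str, Commit], table: str) -> str:
        # A ref is either a branch name or a past commit (time travel).
        commit = self.branches[ref] if isinstance(ref, str) else ref
        return commit.tables[table]

    def merge(self, source: str, target: str) -> None:
        # Simplified merge: the source branch's tables win on conflict.
        src, tgt = self.branches[source], self.branches[target]
        self.branches[target] = Commit({**tgt.tables, **src.tables}, parent=tgt)


catalog = ToyCatalog()
v1 = catalog.write_table("main", "trips", "s3://bucket/trips/v1.parquet")
catalog.create_branch("dev")
catalog.write_table("dev", "trips", "s3://bucket/trips/v2.parquet")
main_before_merge = catalog.read_table("main", "trips")  # still v1
catalog.merge("dev", "main")
main_after_merge = catalog.read_table("main", "trips")   # now v2
replayed = catalog.read_table(v1, "trips")               # time travel to v1
```

Because commits are never mutated, re-reading through the saved `v1` commit returns the same snapshot even after the merge, which is the property a replayable pipeline would rely on.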
 
      
Related papers
- KramaBench: A Benchmark for AI Systems on Data-to-Insight Pipelines over Data Lakes [20.75018548918123]
 We introduce KRAMABENCH: a benchmark composed of 104 manually-curated real-world data science pipelines.
We show that these pipelines test the end-to-end capabilities of AI systems on data processing.
Our results show that, although the models are sufficiently capable of solving well-specified data science code generation tasks, existing out-of-box models fall short.
 arXiv  Detail & Related papers  (2025-06-06T21:18:45Z)
- Bauplan: zero-copy, scale-up FaaS for data pipelines [4.6797109107617105]
 bauplan is a novel FaaS programming model and serverless runtime designed for data practitioners.
bauplan enables users to declaratively define functional Directed Acyclic Graphs (DAGs) along with their runtime environments.
We show that bauplan achieves both better performance and a superior developer experience for data workloads by making the trade-off of reducing generality in favor of data-awareness.
 arXiv  Detail & Related papers  (2024-10-22T22:49:01Z)
- BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data [61.936320820180875]
 Large language models (LLMs) have become increasingly pivotal across various domains.
BabelBench is an innovative benchmark framework that evaluates the proficiency of LLMs in managing multimodal multistructured data with code execution.
Our experimental findings on BabelBench indicate that even cutting-edge models like ChatGPT 4 exhibit substantial room for improvement.
 arXiv  Detail & Related papers  (2024-10-01T15:11:24Z)
- In-depth Analysis On Parallel Processing Patterns for High-Performance Dataframes [0.0]
 We present a set of parallel processing patterns for distributed dataframe operators and the reference runtime implementation, Cylon.
In this paper, we are expanding on the initial concept by introducing a cost model for evaluating the said patterns.
We evaluate the performance of Cylon on the ORNL Summit supercomputer.
 arXiv  Detail & Related papers  (2023-07-03T23:11:03Z)
- Data-Copilot: Bridging Billions of Data and Humans with Autonomous Workflow [49.724842920942024]
 Industries such as finance, meteorology, and energy generate vast amounts of data daily.
We propose Data-Copilot, a data analysis agent that autonomously performs querying, processing, and visualization of massive data tailored to diverse human requests.
 arXiv  Detail & Related papers  (2023-06-12T16:12:56Z)
- Deep Lake: a Lakehouse for Deep Learning [0.0]
 This paper presents Deep Lake, an open-source lakehouse for deep learning applications developed at Activeloop.
 arXiv  Detail & Related papers  (2022-09-22T05:04:09Z)
- TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
 This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
 arXiv  Detail & Related papers  (2022-08-16T20:46:08Z)
- Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines [77.45213180689952]
 Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy.
We introduce a new perspective on efficiently preparing datasets for end-to-end deep learning pipelines.
We obtain an increased throughput of 3x to 13x compared to an untuned system.
 arXiv  Detail & Related papers  (2022-02-17T14:31:58Z)
- SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
 In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor framework and script language engines.
This approach however does not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production grade systems.
 arXiv  Detail & Related papers  (2021-12-22T14:45:37Z)
- Omnidata: A Scalable Pipeline for Making Multi-Task Mid-Level Vision Datasets from 3D Scans [103.92680099373567]
 This paper introduces a pipeline to parametrically sample and render multi-task vision datasets from comprehensive 3D scans from the real world.
Changing the sampling parameters allows one to "steer" the generated datasets to emphasize specific information.
Common architectures trained on a generated starter dataset reached state-of-the-art performance on multiple common vision tasks and benchmarks.
 arXiv  Detail & Related papers  (2021-10-11T04:21:46Z)
- MLCask: Efficient Management of Component Evolution in Collaborative Data Analytics Pipelines [29.999324319722508]
 We address two main challenges that arise during the deployment of machine learning pipelines through the design of versioning for MLCask, an end-to-end analytics system.
We define and accelerate the metric-driven merge operation by pruning the pipeline search tree using reusable history records and pipeline compatibility information.
The effectiveness of MLCask is evaluated through an extensive study over several real-world deployment cases.
 arXiv  Detail & Related papers  (2020-10-17T13:34:48Z)
- A Big Data Lake for Multilevel Streaming Analytics [0.4640835690336652]
 This paper focuses on storing high volume, velocity and variety data in the raw formats in a data storage architecture called a data lake.
We discuss and compare different open source and commercial platforms that can be used to develop a data lake.
We present a real-world data lake development use case for data stream ingestion, staging, and multilevel streaming analytics.
 arXiv  Detail & Related papers  (2020-09-25T19:57:21Z)
- TODS: An Automated Time Series Outlier Detection System [70.88663649631857]
 TODS is a highly modular system that supports easy pipeline construction.
TODS supports 70 primitives, including data processing, time series processing, feature analysis, detection algorithms, and a reinforcement module.
 arXiv  Detail & Related papers  (2020-09-18T15:36:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     