A Survey of Pipeline Tools for Data Engineering
- URL: http://arxiv.org/abs/2406.08335v1
- Date: Wed, 12 Jun 2024 15:41:06 GMT
- Title: A Survey of Pipeline Tools for Data Engineering
- Authors: Anthony Mbata, Yaji Sripada, Mingjun Zhong
- Abstract summary: A variety of pipeline tools are available for use in data engineering.
This survey examines the broad categories and examples of pipeline tools based on their design and data engineering intentions.
Case studies are presented indicating the usage of pipeline tools for data engineering.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Currently, a variety of pipeline tools are available for use in data engineering. Data scientists can use these tools to resolve data wrangling issues and accomplish data engineering tasks from data ingestion through data preparation to utilization as input for machine learning (ML). Some of these tools have essential built-in components or can be combined with other tools to perform desired data engineering operations. While some tools are wholly or partly commercial, several open-source tools are available to perform expert-level data engineering tasks. This survey examines the broad categories and examples of pipeline tools based on their design and data engineering intentions. These categories are Extract Transform Load/Extract Load Transform (ETL/ELT) pipelines; pipelines for Data Integration, Ingestion, and Transformation; Data Pipeline Orchestration and Workflow Management; and Machine Learning Pipelines. The survey also outlines their utilization with examples within these broad groups, and finally presents a discussion with case studies indicating the usage of pipeline tools for data engineering. The studies present some first-user application experiences with sample data, some complexities of the applied pipelines, and a summary of approaches to using these tools to prepare data for machine learning.
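As a minimal illustration of the ETL pattern the survey categorizes, the sketch below wires an extract, transform, and load step together in plain Python. The toy records and the `warehouse` list are hypothetical stand-ins, not examples from the paper; real pipeline tools add scheduling, retries, and connectors around this same core flow.

```python
def extract():
    # Extract: read raw records (here, an in-memory toy source
    # standing in for a file, API, or database).
    return [{"name": " Ada ", "score": "91"}, {"name": "Lin", "score": "78"}]

def transform(records):
    # Transform: clean strings and cast types so the data is ML-ready.
    return [{"name": r["name"].strip(), "score": int(r["score"])} for r in records]

def load(records, store):
    # Load: write the cleaned records into the destination store.
    store.extend(records)
    return store

warehouse = []
load(transform(extract()), warehouse)
print(warehouse)  # cleaned, typed records
```

An ELT variant would simply call `load` before `transform`, deferring the cleaning step to the destination system.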
Related papers
- FlowETL: An Autonomous Example-Driven Pipeline for Data Engineering [1.3599496385950987]
FlowETL is an example-based autonomous pipeline architecture designed to automatically standardise and prepare input datasets.
A Planning Engine uses a paired input-output dataset sample to construct a transformation plan, which is then applied by a worker to the source.
The results show promising generalisation capabilities across 14 datasets of various domains, file structures, and file sizes.
arXiv Detail & Related papers (2025-07-30T21:46:22Z) - Data Agent: A Holistic Architecture for Orchestrating Data+AI Ecosystems [8.816332263275305]
Traditional Data+AI systems rely heavily on human experts to orchestrate system pipelines.
Existing Data+AI systems have limited capabilities in semantic understanding, reasoning, and planning.
We propose the concept of a 'Data Agent' - a comprehensive architecture designed to orchestrate Data+AI ecosystems.
arXiv Detail & Related papers (2025-07-02T11:04:49Z) - Procedural Environment Generation for Tool-Use Agents [55.417058694785325]
We introduce RandomWorld, a pipeline for the procedural generation of interactive tools and compositional tool-use data.
We show that models tuned via SFT and RL on synthetic RandomWorld data improve on a range of tool-use benchmarks.
arXiv Detail & Related papers (2025-05-21T14:10:06Z) - Text embedding models can be great data engineers [0.0]
We propose ADEPT, an automated data engineering pipeline via text embeddings.
We show that ADEPT outperforms the best existing benchmarks in a diverse set of datasets.
arXiv Detail & Related papers (2025-05-20T18:12:19Z) - Datasheets for AI and medical datasets (DAIMS): a data validation and documentation framework before machine learning analysis in medical research [0.0]
We extend the framework to "Datasheets for AI and medical datasets - DAIMS"
Our publicly available solution, DAIMS, provides a checklist including data standardization requirements.
The checklist consists of 24 common data standardization requirements, where the tool checks and validates a subset of them.
arXiv Detail & Related papers (2025-01-23T21:02:56Z) - Capturing and Anticipating User Intents in Data Analytics via Knowledge Graphs [0.061446808540639365]
This work explores the usage of Knowledge Graphs (KG) as a basic framework for capturing complex analytics in a human-centered manner.
The data stored in the generated KG can then be exploited to provide assistance (e.g., recommendations) to the users interacting with these systems.
arXiv Detail & Related papers (2024-11-01T20:45:23Z) - Enabling Advanced Land Cover Analytics: An Integrated Data Extraction Pipeline for Predictive Modeling with the Dynamic World Dataset [1.3757956340051605]
We present a flexible and efficient end-to-end pipeline for working with the Dynamic World dataset.
This includes a pre-processing and representation framework which tackles noise removal, efficient extraction of large amounts of data, and re-representation of LULC data.
To demonstrate the power of our pipeline, we use it to extract data for an urbanization prediction problem and build a suite of machine learning models with excellent performance.
arXiv Detail & Related papers (2024-10-11T16:13:01Z) - Instrumentation and Analysis of Native ML Pipelines via Logical Query Plans [3.2362171533623054]
We envision highly-automated software platforms to assist data scientists with developing, validating, monitoring, and analysing their Machine Learning pipelines.
We extract "logical query plans" from ML pipeline code relying on popular libraries.
Based on these plans, we automatically infer pipeline semantics and instrument and rewrite the ML pipelines to enable diverse use cases without requiring data scientists to manually annotate or rewrite their code.
arXiv Detail & Related papers (2024-07-10T11:35:02Z) - Towards Completeness-Oriented Tool Retrieval for Large Language Models [60.733557487886635]
Real-world systems often incorporate a wide array of tools, making it impractical to input all tools into Large Language Models.
Existing tool retrieval methods primarily focus on semantic matching between user queries and tool descriptions.
We propose a novel model-agnostic COllaborative Learning-based Tool Retrieval approach, COLT, which captures not only the semantic similarities between user queries and tool descriptions but also takes into account the collaborative information of tools.
arXiv Detail & Related papers (2024-05-25T06:41:23Z) - Data Pipeline Training: Integrating AutoML to Optimize the Data Flow of Machine Learning Models [17.091169031023714]
Data Pipeline plays an indispensable role in tasks such as modeling machine learning and developing data products.
This paper focuses on exploring how to optimize data flow through automated machine learning methods.
We will discuss how to leverage AutoML technology to enhance the intelligence of Data Pipeline.
arXiv Detail & Related papers (2024-02-20T11:06:42Z) - EASYTOOL: Enhancing LLM-based Agents with Concise Tool Instruction [56.02100384015907]
EasyTool is a framework that transforms diverse and lengthy tool documentation into unified and concise tool instructions.
It can significantly reduce token consumption and improve the performance of tool utilization in real-world scenarios.
arXiv Detail & Related papers (2024-01-11T15:45:11Z) - Trusted Provenance of Automated, Collaborative and Adaptive Data Processing Pipelines [2.186901738997927]
We provide a solution architecture and a proof-of-concept implementation of a service called Provenance Holder.
Provenance Holder enables provenance of collaborative, adaptive data processing pipelines in a trusted manner.
arXiv Detail & Related papers (2023-10-17T17:52:27Z) - Demonstration of InsightPilot: An LLM-Empowered Automated Data Exploration System [48.62158108517576]
We introduce InsightPilot, an automated data exploration system designed to simplify the data exploration process.
InsightPilot automatically selects appropriate analysis intents, such as understanding, summarizing, and explaining.
In brief, an IQuery is an abstraction and automation of data analysis operations, which mimics the approach of data analysts.
arXiv Detail & Related papers (2023-04-02T07:27:49Z) - TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z) - Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines [77.45213180689952]
Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy.
We introduce a new perspective on efficiently preparing datasets for end-to-end deep learning pipelines.
We obtain an increased throughput of 3x to 13x compared to an untuned system.
arXiv Detail & Related papers (2022-02-17T14:31:58Z) - Omnidata: A Scalable Pipeline for Making Multi-Task Mid-Level Vision Datasets from 3D Scans [103.92680099373567]
This paper introduces a pipeline to parametrically sample and render multi-task vision datasets from comprehensive 3D scans from the real world.
Changing the sampling parameters allows one to "steer" the generated datasets to emphasize specific information.
Common architectures trained on a generated starter dataset reached state-of-the-art performance on multiple common vision tasks and benchmarks.
arXiv Detail & Related papers (2021-10-11T04:21:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.