Dataforge: A Data Agent Platform for Autonomous Data Engineering
- URL: http://arxiv.org/abs/2511.06185v1
- Date: Sun, 09 Nov 2025 01:58:13 GMT
- Title: Dataforge: A Data Agent Platform for Autonomous Data Engineering
- Authors: Xinyuan Wang, Yanjie Fu,
- Abstract summary: Data Agent is a fully autonomous system specialized for tabular data.<n>It automatically performs data cleaning, hierarchical routing, and feature-level optimization through dual feedback loops.<n>It embodies three core principles: automatic, safe, and non-expert friendly, which ensure end-to-end reliability without human supervision.
- Score: 22.691284342164334
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The growing demand for AI applications in fields such as materials discovery, molecular modeling, and climate science has made data preparation an important but labor-intensive step. Raw data from diverse sources must be cleaned, normalized, and transformed to become AI-ready, while effective feature transformation and selection are essential for efficient training and inference. To address the challenges of scalability and expertise dependence, we present Data Agent, a fully autonomous system specialized for tabular data. Leveraging large language model (LLM) reasoning and grounded validation, Data Agent automatically performs data cleaning, hierarchical routing, and feature-level optimization through dual feedback loops. It embodies three core principles: automatic, safe, and non-expert friendly, which ensure end-to-end reliability without human supervision. This demo showcases the first practical realization of an autonomous Data Agent, illustrating how raw data can be transformed "From Data to Better Data."
Related papers
- Data Science and Technology Towards AGI Part I: Tiered Data Management [53.64581824953229]
We argue that the development of artificial intelligence is entering a new phase of data-model co-evolution.<n>We introduce an L0-L4 tiered data management framework, ranging from raw uncurated resources to organized and verifiable knowledge.<n>We validate the effectiveness of the proposed framework through empirical studies.
arXiv Detail & Related papers (2026-02-09T18:47:51Z) - What's the next frontier for Data-centric AI? Data Savvy Agents [71.76058707995398]
We argue that data-savvy capabilities should be a top priority in the design of agentic systems.<n>We propose four key capabilities to realize this vision: Proactive data acquisition, Sophisticated data processing, Interactive test data synthesis, and Continual adaptation.
arXiv Detail & Related papers (2025-11-02T17:09:29Z) - A Survey of Data Agents: Emerging Paradigm or Overstated Hype? [66.1526688475023]
"Data agent" currently suffers from terminological ambiguity and inconsistent adoption.<n>This survey introduces the first systematic hierarchical taxonomy for data agents.<n>We conclude with a forward-looking roadmap, envisioning the advent of proactive, generative data agents.
arXiv Detail & Related papers (2025-10-27T17:54:07Z) - Scaling Generalist Data-Analytic Agents [95.05161133349242]
DataMind is a scalable data synthesis and agent training recipe designed to build generalist data-analytic agents.<n>DataMind tackles three key challenges in building open-source data-analytic agents.
arXiv Detail & Related papers (2025-09-29T17:23:08Z) - Autonomous Data Agents: A New Opportunity for Smart Data [50.02229219403014]
Report argues that DataAgents represent a paradigm shift toward autonomous data-to-knowledge systems.<n>DataAgents transform complex and unstructured data into coherent and actionable knowledge.<n>We first examine why the convergence of agentic AI and data-to-knowledge systems has emerged as a critical trend.
arXiv Detail & Related papers (2025-09-23T06:46:41Z) - Meta-Learning and Synthetic Data for Automated Pretraining and Finetuning [2.657867981416885]
Growing number of pretrained models in Machine Learning (ML) presents significant challenges for practitioners.<n>As models grow in scale, the increasing reliance on real-world data poses a bottleneck for training and requires leveraging data more effectively.<n>This dissertation adopts meta-learning to extend automated machine learning to the deep learning domain.
arXiv Detail & Related papers (2025-06-11T12:48:45Z) - Open-sourced Data Ecosystem in Autonomous Driving: the Present and Future [130.87142103774752]
This review systematically assesses over seventy open-source autonomous driving datasets.
It offers insights into various aspects, such as the principles underlying the creation of high-quality datasets.
It also delves into the scientific and technical challenges that warrant resolution.
arXiv Detail & Related papers (2023-12-06T10:46:53Z) - Data-Copilot: Bridging Billions of Data and Humans with Autonomous Workflow [49.28944613907541]
Industries such as finance, meteorology, and energy generate vast amounts of data daily.<n>We propose Data-Copilot, a data analysis agent that autonomously performs querying, processing, and visualization of massive data tailored to diverse human requests.
arXiv Detail & Related papers (2023-06-12T16:12:56Z) - Scalable Modular Synthetic Data Generation for Advancing Aerial Autonomy [2.9005223064604078]
We introduce a scalable Aerial Synthetic Data Augmentation (ASDA) framework tailored to aerial autonomy applications.
ASDA extends a central data collection engine with two scriptable pipelines that automatically perform scene and data augmentations.
We demonstrate the effectiveness of our method in automatically generating diverse datasets.
arXiv Detail & Related papers (2022-11-10T04:37:41Z) - From Data to Actions in Intelligent Transportation Systems: a
Prescription of Functional Requirements for Model Actionability [10.27718355111707]
This work aims to describe how data, coming from diverse ITS sources, can be used to learn and adapt data-driven models for efficiently operating ITS assets, systems and processes.
Grounded in this described data modeling pipeline for ITS, wedefine the characteristics, engineering requisites and intrinsic challenges to its three compounding stages, namely, data fusion, adaptive learning and model evaluation.
arXiv Detail & Related papers (2020-02-06T12:02:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.