Data Agent: A Holistic Architecture for Orchestrating Data+AI Ecosystems
- URL: http://arxiv.org/abs/2507.01599v1
- Date: Wed, 02 Jul 2025 11:04:49 GMT
- Title: Data Agent: A Holistic Architecture for Orchestrating Data+AI Ecosystems
- Authors: Zhaoyan Sun, Jiayi Wang, Xinyang Zhao, Jiachi Wang, Guoliang Li
- Abstract summary: Traditional Data+AI systems rely heavily on human experts to orchestrate system pipelines. Existing Data+AI systems have limited capabilities in semantic understanding, reasoning, and planning. We propose the concept of a 'Data Agent' - a comprehensive architecture designed to orchestrate Data+AI ecosystems.
- Score: 8.816332263275305
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditional Data+AI systems utilize data-driven techniques to optimize performance, but they rely heavily on human experts to orchestrate system pipelines, enabling them to adapt to changes in data, queries, tasks, and environments. For instance, while there are numerous data science tools available, developing a pipeline planning system to coordinate these tools remains challenging. This difficulty arises because existing Data+AI systems have limited capabilities in semantic understanding, reasoning, and planning. Fortunately, we have witnessed the success of large language models (LLMs) in enhancing semantic understanding, reasoning, and planning abilities. It is crucial to incorporate LLM techniques to revolutionize data systems for orchestrating Data+AI applications effectively. To achieve this, we propose the concept of a 'Data Agent' - a comprehensive architecture designed to orchestrate Data+AI ecosystems, which focuses on tackling data-related tasks by integrating knowledge comprehension, reasoning, and planning capabilities. We delve into the challenges involved in designing data agents, such as understanding data/queries/environments/tools, orchestrating pipelines/workflows, optimizing and executing pipelines, and fostering pipeline self-reflection. Furthermore, we present examples of data agent systems, including a data science agent, data analytics agents (such as unstructured data analytics agent, semantic structured data analytics agent, data lake analytics agent, and multi-modal data analytics agent), and a database administrator (DBA) agent. We also outline several open challenges associated with designing data agent systems.
Related papers
- AgenticData: An Agentic Data Analytics System for Heterogeneous Data [12.67277567222908]
AgenticData is an agentic data analytics system that allows users to pose natural language (NL) questions while autonomously analyzing data sources across multiple domains. We propose a multi-agent collaboration strategy by utilizing a data profiling agent for discovering relevant data, a semantic cross-validation agent for iterative optimization based on feedback, and a smart memory agent for maintaining short-term context.
arXiv Detail & Related papers (2025-08-07T03:33:59Z)
- WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization [68.46693401421923]
WebShaper systematically formalizes IS tasks through set theory. WebShaper achieves state-of-the-art performance among open-sourced IS agents on GAIA and WebWalkerQA benchmarks.
arXiv Detail & Related papers (2025-07-20T17:53:37Z)
- KramaBench: A Benchmark for AI Systems on Data-to-Insight Pipelines over Data Lakes [20.75018548918123]
We introduce KRAMABENCH: a benchmark composed of 104 manually-curated real-world data science pipelines. We show that these pipelines test the end-to-end capabilities of AI systems on data processing. Our results show that, although the models are sufficiently capable of solving well-specified data science code generation tasks, existing out-of-box models fall short.
arXiv Detail & Related papers (2025-06-06T21:18:45Z)
- Toward Data Systems That Are Business Semantic Centric and AI Agents Assisted [0.0]
This work presents the Business Semantics Centric, AI Agents Assisted Data System (BSDS). BSDS redefines data systems as dynamic enablers of business success. The system includes curated data linked to business entities, a knowledge base for context-aware AI agents, and efficient data pipelines.
arXiv Detail & Related papers (2025-06-05T19:06:06Z)
- Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for and with Foundation Models [64.28420991770382]
Data-Juicer 2.0 is a data processing system backed by data processing operators spanning text, image, video, and audio modalities. It supports more critical tasks including data analysis, annotation, and foundation model post-training. It has been widely adopted in diverse research fields and real-world products such as Alibaba Cloud PAI.
arXiv Detail & Related papers (2024-12-23T08:29:57Z)
- Building Multi-Agent Copilot towards Autonomous Agricultural Data Management and Analysis [2.763670421921841]
We build a proof-of-concept multi-agent system called ADMA Copilot, which can understand a user's intent.
ADMA Copilot accomplishes tasks automatically through the collaboration of three agents: an LLM-based controller, an input formatter, and an output formatter.
arXiv Detail & Related papers (2024-10-31T20:15:14Z)
- CMDBench: A Benchmark for Coarse-to-fine Multimodal Data Discovery in Compound AI Systems [10.71630696651595]
Compound AI systems (CASs) that employ LLMs as agents to accomplish knowledge-intensive tasks have garnered significant interest within database and AI communities.
Silos of multimodal data sources make it difficult to identify appropriate data sources for accomplishing the task at hand.
We propose CMDBench, a benchmark modeling the complexity of enterprise data platforms.
arXiv Detail & Related papers (2024-06-02T01:10:41Z)
- Data-Copilot: Bridging Billions of Data and Humans with Autonomous Workflow [49.28944613907541]
Industries such as finance, meteorology, and energy generate vast amounts of data daily. We propose Data-Copilot, a data analysis agent that autonomously performs querying, processing, and visualization of massive data tailored to diverse human requests.
arXiv Detail & Related papers (2023-06-12T16:12:56Z)
- Demonstration of InsightPilot: An LLM-Empowered Automated Data Exploration System [48.62158108517576]
We introduce InsightPilot, an automated data exploration system designed to simplify the data exploration process.
InsightPilot automatically selects appropriate analysis intents, such as understanding, summarizing, and explaining.
In brief, an IQuery is an abstraction and automation of data analysis operations, which mimics the approach of data analysts.
arXiv Detail & Related papers (2023-04-02T07:27:49Z)
- TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z)
- Nemo: Guiding and Contextualizing Weak Supervision for Interactive Data Programming [77.38174112525168]
We present Nemo, an end-to-end interactive supervision system that improves the overall productivity of the WS learning pipeline by an average of 20% (and up to 47% in one task) compared to the prevailing WS supervision approach.
arXiv Detail & Related papers (2022-03-02T19:57:32Z)
- SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor framework and script language engines.
This approach however does not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production grade systems.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.