AgenticData: An Agentic Data Analytics System for Heterogeneous Data
- URL: http://arxiv.org/abs/2508.05002v1
- Date: Thu, 07 Aug 2025 03:33:59 GMT
- Title: AgenticData: An Agentic Data Analytics System for Heterogeneous Data
- Authors: Ji Sun, Guoliang Li, Peiyao Zhou, Yihui Ma, Jingzhe Xu, Yuan Li,
- Abstract summary: AgenticData is an agentic data analytics system that allows users to pose natural language (NL) questions while autonomously analyzing data sources across multiple domains.<n>We propose a multi-agent collaboration strategy by utilizing a data profiling agent for discovering relevant data, a semantic cross-validation agent for iterative optimization based on feedback, and a smart memory agent for maintaining short-term context.
- Score: 12.67277567222908
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing unstructured data analytics systems rely on experts to write code and manage complex analysis workflows, making them both expensive and time-consuming. To address these challenges, we introduce AgenticData, an innovative agentic data analytics system that allows users to simply pose natural language (NL) questions while autonomously analyzing data sources across multiple domains, including both unstructured and structured data. First, AgenticData employs a feedback-driven planning technique that automatically converts an NL query into a semantic plan composed of relational and semantic operators. We propose a multi-agent collaboration strategy by utilizing a data profiling agent for discovering relevant data, a semantic cross-validation agent for iterative optimization based on feedback, and a smart memory agent for maintaining short-term context and long-term knowledge. Second, we propose a semantic optimization model to refine and execute semantic plans effectively. Our system, AgenticData, has been tested using three benchmarks. Experimental results showed that AgenticData achieved superior accuracy on both easy and difficult tasks, significantly outperforming state-of-the-art methods.
Related papers
- Can LLMs Clean Up Your Mess? A Survey of Application-Ready Data Preparation with LLMs [66.63911043019294]
Data preparation aims to denoise raw datasets, uncover cross-dataset relationships, and extract valuable insights from them.<n>This paper focuses on the use of LLM techniques to prepare data for diverse downstream tasks.<n>We introduce a task-centric taxonomy that organizes the field into three major tasks: data cleaning, standardization, error processing, imputation, data integration, and data enrichment.
arXiv Detail & Related papers (2026-01-22T12:02:45Z) - A Survey of Data Agents: Emerging Paradigm or Overstated Hype? [66.1526688475023]
"Data agent" currently suffers from terminological ambiguity and inconsistent adoption.<n>This survey introduces the first systematic hierarchical taxonomy for data agents.<n>We conclude with a forward-looking roadmap, envisioning the advent of proactive, generative data agents.
arXiv Detail & Related papers (2025-10-27T17:54:07Z) - InferA: A Smart Assistant for Cosmological Ensemble Data [0.5130440339897478]
InferA is a multi-agent system that enables scalable and efficient scientific data analysis.<n>At the core of the architecture is a supervisor agent that orchestrates a team of specialized agents responsible for distinct phases of the data retrieval and analysis.<n>To demonstrate the framework's usability, we evaluate the system using ensemble runs from the HACC cosmology simulation which comprises several terabytes.
arXiv Detail & Related papers (2025-10-14T18:47:22Z) - Scaling Generalist Data-Analytic Agents [95.05161133349242]
DataMind is a scalable data synthesis and agent training recipe designed to build generalist data-analytic agents.<n>DataMind tackles three key challenges in building open-source data-analytic agents.
arXiv Detail & Related papers (2025-09-29T17:23:08Z) - LLM/Agent-as-Data-Analyst: A Survey [54.08761322298559]
Large language models (LLMs) and agent techniques have brought a fundamental shift in the functionality and development paradigm of data analysis tasks.<n>LLMs enable complex data understanding, natural language, semantic analysis functions, and autonomous pipeline orchestration.
arXiv Detail & Related papers (2025-09-28T17:31:38Z) - Autonomous Data Agents: A New Opportunity for Smart Data [50.02229219403014]
Report argues that DataAgents represent a paradigm shift toward autonomous data-to-knowledge systems.<n>DataAgents transform complex and unstructured data into coherent and actionable knowledge.<n>We first examine why the convergence of agentic AI and data-to-knowledge systems has emerged as a critical trend.
arXiv Detail & Related papers (2025-09-23T06:46:41Z) - Supporting Our AI Overlords: Redesigning Data Systems to be Agent-First [72.85721148326138]
Large Language Model (LLM) agents are likely to become the dominant workload for data systems in the future.<n>Agentic speculation can pose challenges for present-day data systems.<n>We outline a number of new research opportunities for a new agent-first data systems architecture.
arXiv Detail & Related papers (2025-08-31T21:19:40Z) - I2I-STRADA -- Information to Insights via Structured Reasoning Agent for Data Analysis [0.0]
Real-world data analysis requires a consistent cognitive workflow.<n>We introduce I2I-STRADA, an agentic architecture designed to formalize this reasoning process.
arXiv Detail & Related papers (2025-07-23T18:58:42Z) - Data Agent: A Holistic Architecture for Orchestrating Data+AI Ecosystems [8.816332263275305]
Traditional Data+AI systems rely heavily on human experts to orchestrate system pipelines.<n>Existing Data+AI systems have limited capabilities in semantic understanding, reasoning, and planning.<n>We propose the concept of a 'Data Agent' - a comprehensive architecture designed to orchestrate Data+AI ecosystems.
arXiv Detail & Related papers (2025-07-02T11:04:49Z) - Deep Research Agents: A Systematic Examination And Roadmap [79.04813794804377]
Deep Research (DR) agents are designed to tackle complex, multi-turn informational research tasks.<n>In this paper, we conduct a detailed analysis of the foundational technologies and architectural components that constitute DR agents.
arXiv Detail & Related papers (2025-06-22T16:52:48Z) - Structuring the Unstructured: A Multi-Agent System for Extracting and Querying Financial KPIs and Guidance [54.25184684077833]
We propose an efficient and scalable method for extracting quantitative insights from unstructured financial documents.<n>Our proposed system consists of two specialized agents: the emphExtraction Agent and the emphText-to-Agent
arXiv Detail & Related papers (2025-05-25T15:45:46Z) - TAIJI: MCP-based Multi-Modal Data Analytics on Data Lakes [25.05627023905607]
We envision a new multi-modal data analytics system based on the Model Context Protocol (MCP)<n>First, we define a semantic operator hierarchy tailored for querying multi-modal data in data lakes.<n>Next, we introduce an MCP-based execution framework, in which each MCP server hosts specialized foundation models optimized for specific data modalities.
arXiv Detail & Related papers (2025-05-16T14:03:30Z) - DatawiseAgent: A Notebook-Centric LLM Agent Framework for Automated Data Science [4.1431677219677185]
DatawiseAgent is a notebook-centric agent framework that unifies interactions among user, agent and the computational environment.<n>It orchestrates four stages, including DSF-like planning, incremental execution, self-ging, and post-filtering.<n>It consistently outperforms or matches state-of-the-art methods across multiple model settings.
arXiv Detail & Related papers (2025-03-10T08:32:33Z) - InsightBench: Evaluating Business Analytics Agents Through Multi-Step Insight Generation [79.09622602860703]
We introduce InsightBench, a benchmark dataset with three key features.<n>It consists of 100 datasets representing diverse business use cases such as finance and incident management.<n>Unlike existing benchmarks focusing on answering single queries, InsightBench evaluates agents based on their ability to perform end-to-end data analytics.
arXiv Detail & Related papers (2024-07-08T22:06:09Z) - CMDBench: A Benchmark for Coarse-to-fine Multimodal Data Discovery in Compound AI Systems [10.71630696651595]
Compound AI systems (CASs) that employ LLMs as agents to accomplish knowledge-intensive tasks have garnered significant interest within database and AI communities.
silos of multimodal data sources make it difficult to identify appropriate data sources for accomplishing the task at hand.
We propose CMDBench, a benchmark modeling the complexity of enterprise data platforms.
arXiv Detail & Related papers (2024-06-02T01:10:41Z) - DACO: Towards Application-Driven and Comprehensive Data Analysis via Code Generation [83.30006900263744]
Data analysis is a crucial analytical process to generate in-depth studies and conclusive insights.
We propose to automatically generate high-quality answer annotations leveraging the code-generation capabilities of LLMs.
Our DACO-RL algorithm is evaluated by human annotators to produce more helpful answers than SFT model in 57.72% cases.
arXiv Detail & Related papers (2024-03-04T22:47:58Z) - Can Large Language Models Serve as Data Analysts? A Multi-Agent Assisted Approach for Qualitative Data Analysis [4.539569292151314]
Large Language Models (LLMs) enable human-bot collaboration in Software Engineering (SE)<n>This study is to design and develop an LLM-based multi-agent system that synergizes human decision support with AI to automate various qualitative data analysis approaches.
arXiv Detail & Related papers (2024-02-02T13:10:46Z) - InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks [84.7788065721689]
In this paper, we introduce InfiAgent-DABench, the first benchmark specifically designed to evaluate LLM-based agents on data analysis tasks.
This benchmark contains DAEval, a dataset consisting of 257 data analysis questions derived from 52 CSV files.
Building on top of our agent framework, we develop a specialized agent, DAAgent, which surpasses GPT-3.5 by 3.9% on DABench.
arXiv Detail & Related papers (2024-01-10T19:04:00Z) - Unsupervised Domain Adaptive Learning via Synthetic Data for Person
Re-identification [101.1886788396803]
Person re-identification (re-ID) has gained more and more attention due to its widespread applications in video surveillance.
Unfortunately, the mainstream deep learning methods still need a large quantity of labeled data to train models.
In this paper, we develop a data collector to automatically generate synthetic re-ID samples in a computer game, and construct a data labeler to simultaneously annotate them.
arXiv Detail & Related papers (2021-09-12T15:51:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.