LLM/Agent-as-Data-Analyst: A Survey
- URL: http://arxiv.org/abs/2509.23988v2
- Date: Wed, 15 Oct 2025 03:55:52 GMT
- Title: LLM/Agent-as-Data-Analyst: A Survey
- Authors: Zirui Tang, Weizheng Wang, Zihang Zhou, Yang Jiao, Bangrui Xu, Boyu Niu, Xuanhe Zhou, Guoliang Li, Yeye He, Wei Zhou, Yitong Song, Cheng Tan, Xue Yang, Bin Wang, Conghui He, Xiaoyang Wang, Fan Wu
- Abstract summary: Large language model (LLM) and agent techniques for data analysis have demonstrated substantial impact in both academia and industry. The technical evolution further distills five key design goals for intelligent data analysis agents, namely semantic-aware design, modality-hybrid integration, autonomous pipelines, tool-augmented workflows, and support for open-world tasks.
- Score: 54.01326293336748
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language model (LLM) and agent techniques for data analysis (a.k.a. LLM/Agent-as-Data-Analyst) have demonstrated substantial impact in both academia and industry. In comparison with traditional rule-based or small-model-based approaches, (agentic) LLMs enable complex data understanding, natural language interfaces, semantic analysis functions, and autonomous pipeline orchestration. The technical evolution further distills five key design goals for intelligent data analysis agents, namely semantic-aware design, modality-hybrid integration, autonomous pipelines, tool-augmented workflows, and support for open-world tasks. From a modality perspective, we review LLM-based techniques for (i) structured data (e.g., table question answering for relational data and NL2GQL for graph data), (ii) semi-structured data (e.g., markup language understanding and semi-structured table modeling), (iii) unstructured data (e.g., chart understanding, document understanding, and programming-language vulnerability detection), and (iv) heterogeneous data (e.g., data retrieval and modality alignment for data lakes). Finally, we outline the remaining challenges and propose several insights and practical directions for advancing LLM/Agent-powered data analysis.
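As a minimal illustration of the table question answering modality named in the abstract, the sketch below builds an NL2SQL-style prompt and runs the resulting query against an in-memory SQLite table. The prompt template, the schema, and the hard-coded SQL string (standing in for a real LLM call) are assumptions for illustration, not the survey's method.

```python
import sqlite3

def build_nl2sql_prompt(schema: str, question: str) -> str:
    """Assemble a prompt asking an LLM to translate a question into SQL.
    The template here is illustrative, not taken from the survey."""
    return (
        "You are a data analyst. Given the table schema below, "
        "write a single SQL query that answers the question.\n"
        f"Schema: {schema}\nQuestion: {question}\nSQL:"
    )

def answer(question: str, conn: sqlite3.Connection) -> float:
    schema = "sales(region TEXT, revenue REAL)"
    _prompt = build_nl2sql_prompt(schema, question)  # would be sent to an LLM
    # Stand-in for the model's response; a real agent would call an LLM here.
    generated_sql = "SELECT SUM(revenue) FROM sales WHERE region = 'EU'"
    return conn.execute(generated_sql).fetchone()[0]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, revenue REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("EU", 10.0), ("EU", 5.0), ("US", 7.0)])
print(answer("What is total EU revenue?", conn))  # 15.0
```

In a full agent, the generated SQL would also be validated and possibly repaired before execution, which is where the semantic-aware and autonomous-pipeline design goals come in.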
Related papers
- Can LLMs Clean Up Your Mess? A Survey of Application-Ready Data Preparation with LLMs [66.63911043019294]
Data preparation aims to denoise raw datasets, uncover cross-dataset relationships, and extract valuable insights from them. This paper focuses on the use of LLM techniques to prepare data for diverse downstream tasks. We introduce a task-centric taxonomy that organizes the field into major tasks including data cleaning, standardization, error processing, imputation, data integration, and data enrichment.
arXiv Detail & Related papers (2026-01-22T12:02:45Z)
- LLM Agents for Interactive Workflow Provenance: Reference Architecture and Evaluation Methodology [3.470217255779291]
We introduce an evaluation methodology, reference architecture, and open-source implementation that leverage interactive Large Language Model (LLM) agents for runtime data analysis. Our approach uses a lightweight, metadata-driven design that translates natural language into structured provenance queries. Evaluations across LLaMA, GPT, Gemini, and Claude, covering diverse query classes and a real-world chemistry workflow, show that modular design, prompt tuning, and Retrieval-Augmented Generation (RAG) enable accurate and insightful agent responses.
arXiv Detail & Related papers (2025-09-17T13:51:29Z)
- From Parameters to Performance: A Data-Driven Study on LLM Structure and Development [73.67759647072519]
Large language models (LLMs) have achieved remarkable success across various domains. Despite the rapid growth in model scale and capability, systematic, data-driven research on how structural configurations affect performance remains scarce. We present a large-scale dataset encompassing diverse open-source LLM structures and their performance across multiple benchmarks.
arXiv Detail & Related papers (2025-09-14T12:20:39Z)
- Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study [55.09905978813599]
Large Language Models (LLMs) hold promise in automating data analysis tasks, yet open-source models face significant limitations in these kinds of reasoning-intensive scenarios. In this work, we investigate strategies to enhance the data analysis capabilities of open-source LLMs.
arXiv Detail & Related papers (2025-06-24T17:04:23Z)
- A Survey of LLM $\times$ DATA [71.96808497574658]
The integration of large language models (LLMs) and data management is rapidly redefining both domains. On the one hand, data management for LLMs (DATA4LLM) feeds LLMs with the high-quality, diverse, and timely data required for stages like pre-training, post-training, retrieval-augmented generation, and agentic applications. On the other hand, LLMs are emerging as general-purpose engines for data management.
arXiv Detail & Related papers (2025-05-24T01:57:12Z)
- IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis [60.32962597618861]
IDA-Bench is a novel benchmark evaluating large language models in multi-round interactive scenarios. Agent performance is judged by comparing its final numerical output to the human-derived baseline. Even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on only 50% of the tasks, highlighting limitations not evident in single-turn tests.
arXiv Detail & Related papers (2025-05-23T09:37:52Z)
- TAIJI: MCP-based Multi-Modal Data Analytics on Data Lakes [25.05627023905607]
We envision a new multi-modal data analytics system based on the Model Context Protocol (MCP). First, we define a semantic operator hierarchy tailored for querying multi-modal data in data lakes. Next, we introduce an MCP-based execution framework, in which each MCP server hosts specialized foundation models optimized for specific data modalities.
arXiv Detail & Related papers (2025-05-16T14:03:30Z)
- CoddLLM: Empowering Large Language Models for Data Analytics [38.23203246023766]
Large Language Models (LLMs) have the potential to revolutionize data analytics. We unveil a new data recipe for post-training synthesis. We post-train a new foundation model, named CoddLLM, based on MistralNeMo-12B.
arXiv Detail & Related papers (2025-02-01T06:03:55Z)
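The natural-language-to-provenance-query translation with retrieval described in the workflow-provenance entry above could look roughly like the following sketch. The catalog schema, the field names, and the token-overlap retrieval step are illustrative assumptions, not the paper's design.

```python
def retrieve_metadata(question: str, catalog: list, k: int = 2) -> list:
    """Rank catalog entries by naive token overlap with the question --
    a stand-in for a real retrieval-augmented (RAG) step."""
    q_tokens = set(question.lower().split())
    return sorted(
        catalog,
        key=lambda e: len(q_tokens & set(e["description"].lower().split())),
        reverse=True,
    )[:k]

def to_provenance_query(question: str, catalog: list) -> dict:
    """Translate a natural-language question into a structured provenance
    query; the output keys here are hypothetical."""
    hits = retrieve_metadata(question, catalog)
    return {"select": [h["field"] for h in hits], "question": question}

catalog = [
    {"field": "task_runtime", "description": "runtime of each workflow task"},
    {"field": "input_hash", "description": "hash of input files"},
    {"field": "cpu_usage", "description": "cpu usage per task"},
]
query = to_provenance_query("which task had the longest runtime", catalog)
print(query["select"])  # ['task_runtime', 'cpu_usage']
```

A production system would replace the token-overlap scoring with embedding-based retrieval and pass the retrieved metadata to an LLM as context, as the paper's RAG component does.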
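IDA-Bench's judging rule described above, comparing an agent's final numerical output to a human-derived baseline, can be approximated as below; the relative tolerance value is an assumed detail, not taken from the benchmark.

```python
import math

def judge_run(agent_output: float, baseline: float,
              rel_tol: float = 1e-3) -> bool:
    """Success iff the agent's final numeric answer matches the
    human-derived baseline within a relative tolerance (an assumption)."""
    return math.isclose(agent_output, baseline, rel_tol=rel_tol)

def pass_rate(results: list) -> float:
    """Aggregate (agent_output, baseline) pairs into a task pass rate."""
    return sum(judge_run(a, b) for a, b in results) / len(results)

print(pass_rate([(1.0, 1.0), (2.0, 3.0)]))  # 0.5
```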
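The TAIJI entry above describes an MCP-based framework in which each server hosts models for one modality. A toy dispatcher capturing that routing idea is sketched below; the registry API and handler names are hypothetical, and real MCP servers would be separate processes speaking the protocol rather than in-process callables.

```python
class ModalityRouter:
    """Route analytics requests to per-modality handlers, loosely mirroring
    an MCP setup where each server is specialized for one data modality."""

    def __init__(self):
        self._handlers = {}

    def register(self, modality: str, handler) -> None:
        self._handlers[modality] = handler

    def dispatch(self, modality: str, payload: str):
        if modality not in self._handlers:
            raise KeyError(f"no server registered for modality {modality!r}")
        return self._handlers[modality](payload)

router = ModalityRouter()
router.register("table", lambda q: f"table-server answered: {q}")
router.register("chart", lambda q: f"chart-server answered: {q}")
print(router.dispatch("table", "sum revenue by region"))
```

The semantic operator hierarchy the paper defines would sit above this layer, decomposing a multi-modal query into per-modality sub-requests before dispatch.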
This list is automatically generated from the titles and abstracts of the papers in this site.