InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks
- URL: http://arxiv.org/abs/2401.05507v3
- Date: Mon, 11 Mar 2024 07:57:59 GMT
- Title: InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks
- Authors: Xueyu Hu, Ziyu Zhao, Shuang Wei, Ziwei Chai, Qianli Ma, Guoyin Wang,
Xuwu Wang, Jing Su, Jingjing Xu, Ming Zhu, Yao Cheng, Jianbo Yuan, Jiwei Li,
Kun Kuang, Yang Yang, Hongxia Yang, Fei Wu
- Abstract summary: In this paper, we introduce InfiAgent-DABench, the first benchmark specifically designed to evaluate LLM-based agents on data analysis tasks.
This benchmark contains DAEval, a dataset consisting of 257 data analysis questions derived from 52 CSV files.
Building on top of our agent framework, we develop a specialized agent, DAAgent, which surpasses GPT-3.5 by 3.9% on DABench.
- Score: 84.7788065721689
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we introduce InfiAgent-DABench, the first benchmark
specifically designed to evaluate LLM-based agents on data analysis tasks.
These tasks require agents to solve complex tasks end-to-end by interacting
with an execution environment. This benchmark contains DAEval, a dataset
consisting of 257 data analysis questions derived from 52 CSV files, and an
agent framework which incorporates LLMs to serve as data analysis agents for
both serving and evaluation. Since data analysis questions are often open-ended
and hard to evaluate without human supervision, we adopt a format-prompting
technique to convert each question into a closed-form format so that it can
be automatically evaluated. Our extensive benchmarking of 34 LLMs uncovers the
current challenges encountered in data analysis tasks. In addition, building on
top of our agent framework, we develop a specialized agent, DAAgent, which
surpasses GPT-3.5 by 3.9% on DABench. Evaluation datasets and toolkits for
InfiAgent-DABench are released at https://github.com/InfiAgent/InfiAgent .
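The closed-form evaluation idea described in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that the agent is prompted to end its response with tagged fields of the form `@name[value]`. The tag syntax, the function names (`parse_answers`, `values_match`, `is_correct`), and the numeric tolerance are hypothetical and are not taken from the paper or its toolkit.

```python
import re
from typing import Dict

# Hypothetical closed-form answer syntax: "@name[value]" (an assumption for this sketch).
ANSWER_PATTERN = re.compile(r"@(\w+)\[([^\]]*)\]")


def parse_answers(response: str) -> Dict[str, str]:
    """Extract every @name[value] pair from a free-text response."""
    return {name: value.strip() for name, value in ANSWER_PATTERN.findall(response)}


def values_match(predicted: str, expected: str, tol: float = 1e-6) -> bool:
    """Compare numerically when both sides parse as floats, else as case-insensitive strings."""
    try:
        return abs(float(predicted) - float(expected)) <= tol
    except ValueError:
        return predicted.strip().lower() == expected.strip().lower()


def is_correct(response: str, expected: Dict[str, str]) -> bool:
    """A response counts as correct only if every expected field is present and matches."""
    predicted = parse_answers(response)
    return all(
        name in predicted and values_match(predicted[name], value)
        for name, value in expected.items()
    )


if __name__ == "__main__":
    reply = "The column mean is roughly 3.14. @mean_value[3.14] @is_normal[Yes]"
    print(is_correct(reply, {"mean_value": "3.140", "is_normal": "yes"}))
```

Because the answer is constrained to named fields, an open-ended question reduces to exact or tolerance-based matching, which is what makes scoring possible without human supervision.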
Related papers
- InsightBench: Evaluating Business Analytics Agents Through Multi-Step Insight Generation [81.4242018694792]
We introduce InsightBench, a benchmark dataset with three key features.
It consists of 31 datasets representing diverse business use cases such as finance and incident management.
Unlike existing benchmarks focusing on answering single queries, InsightBench evaluates agents based on their ability to perform end-to-end data analytics.
arXiv Detail & Related papers (2024-07-08T22:06:09Z)
- DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z)
- Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents [46.81304373693033]
Large language models (LLMs) have become a research hotspot in human-computer interaction.
Mobile-Bench is a novel benchmark for evaluating the capabilities of LLM-based mobile agents.
arXiv Detail & Related papers (2024-07-01T06:10:01Z)
- ProductAgent: Benchmarking Conversational Product Search Agent with Asking Clarification Questions [68.81939215223818]
ProductAgent is a conversational information seeking agent equipped with abilities of strategic clarification question generation and dynamic product retrieval.
We develop the agent with strategies for product feature summarization, query generation, and product retrieval.
Experiments show that ProductAgent interacts positively with the user and enhances retrieval performance with increasing dialogue turns.
arXiv Detail & Related papers (2024-07-01T03:50:23Z)
- DCA-Bench: A Benchmark for Dataset Curation Agents [9.60250892491588]
We propose a dataset curation agent benchmark, DCA-Bench, to measure large language models' capability of detecting hidden dataset quality issues.
Specifically, we collect diverse real-world dataset quality issues from eight open dataset platforms as a testbed.
The proposed benchmark can also serve as a testbed for measuring the capability of LLMs in problem discovery rather than just problem-solving.
arXiv Detail & Related papers (2024-06-11T14:02:23Z)
- CMDBench: A Benchmark for Coarse-to-fine Multimodal Data Discovery in Compound AI Systems [10.71630696651595]
Compound AI systems (CASs) that employ LLMs as agents to accomplish knowledge-intensive tasks have garnered significant interest within database and AI communities.
Silos of multimodal data sources make it difficult to identify appropriate data sources for accomplishing the task at hand.
We propose CMDBench, a benchmark modeling the complexity of enterprise data platforms.
arXiv Detail & Related papers (2024-06-02T01:10:41Z)
- DACO: Towards Application-Driven and Comprehensive Data Analysis via Code Generation [86.4326416303723]
Data analysis is a crucial analytical process to generate in-depth studies and conclusive insights.
We propose to automatically generate high-quality answer annotations leveraging the code-generation capabilities of LLMs.
Human annotators judge that our DACO-RL algorithm produces more helpful answers than the SFT model in 57.72% of cases.
arXiv Detail & Related papers (2024-03-04T22:47:58Z)
- MatPlotAgent: Method and Evaluation for LLM-Based Agentic Scientific Data Visualization [86.61052121715689]
MatPlotAgent is a model-agnostic framework designed to automate scientific data visualization tasks.
MatPlotBench is a high-quality benchmark consisting of 100 human-verified test cases.
arXiv Detail & Related papers (2024-02-18T04:28:28Z)