ACEBench: Who Wins the Match Point in Tool Usage?
- URL: http://arxiv.org/abs/2501.12851v3
- Date: Thu, 13 Feb 2025 12:43:59 GMT
- Title: ACEBench: Who Wins the Match Point in Tool Usage?
- Authors: Chen Chen, Xinlong Hao, Weiwen Liu, Xu Huang, Xingshan Zeng, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Yuefeng Huang, Wulong Liu, Xinzhi Wang, Defu Lian, Baoqun Yin, Yasheng Wang, Wu Liu
- Abstract summary: ACEBench is a comprehensive benchmark for assessing tool usage in Large Language Models (LLMs).
It categorizes data into three primary types based on evaluation methodology: Normal, Special, and Agent.
It provides a more granular examination of error causes across different data types.
- Abstract: Large Language Models (LLMs) have demonstrated significant potential in decision-making and reasoning, particularly when integrated with various tools to effectively solve complex problems. However, existing benchmarks for evaluating LLMs' tool usage face several limitations: (1) limited evaluation scenarios, often lacking assessments in real multi-turn dialogue contexts; (2) narrow evaluation dimensions, with insufficient detailed assessments of how LLMs use tools; and (3) reliance on LLMs or real API executions for evaluation, which introduces significant overhead. To address these challenges, we introduce ACEBench, a comprehensive benchmark for assessing tool usage in LLMs. ACEBench categorizes data into three primary types based on evaluation methodology: Normal, Special, and Agent. "Normal" evaluates tool usage in basic scenarios; "Special" evaluates tool usage in situations with ambiguous or incomplete instructions; "Agent" evaluates tool usage through multi-agent interactions to simulate real-world, multi-turn dialogues. We conducted extensive experiments using ACEBench, analyzing various LLMs in-depth and providing a more granular examination of error causes across different data types.
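As a rough illustration of how a benchmark with this three-way split and execution-free scoring could be organized, here is a minimal Python sketch. The `ToolCall` and `BenchmarkSample` classes, their field names, and the exact-match scoring rule are assumptions made for illustration, not ACEBench's actual data schema or evaluation code.

```python
# Illustrative sketch only: the field names and checking logic below are
# assumptions, not the actual ACEBench schema or scoring implementation.
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    arguments: dict


@dataclass
class BenchmarkSample:
    # "normal": basic tool usage; "special": ambiguous/incomplete instructions;
    # "agent": multi-agent, multi-turn dialogue simulation.
    category: str
    query: str
    expected_calls: list = field(default_factory=list)


def score_sample(sample: BenchmarkSample, predicted_calls: list) -> bool:
    """Rule-based check: compare predicted calls against the gold calls directly,
    with no LLM judge and no real API execution (the overhead ACEBench avoids)."""
    if len(predicted_calls) != len(sample.expected_calls):
        return False
    return all(
        p.name == g.name and p.arguments == g.arguments
        for p, g in zip(predicted_calls, sample.expected_calls)
    )


sample = BenchmarkSample(
    category="normal",
    query="What is the weather in Paris tomorrow?",
    expected_calls=[ToolCall("get_weather", {"city": "Paris", "day": "tomorrow"})],
)
print(score_sample(sample, [ToolCall("get_weather", {"city": "Paris", "day": "tomorrow"})]))  # True
```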
Related papers
- Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger [49.81945268343162]
We propose MeCo, an adaptive decision-making strategy for external tool use.
MeCo captures high-level cognitive signals in the representation space, guiding when to invoke tools.
Our experiments show that MeCo accurately detects LLMs' internal cognitive signals and significantly improves tool-use decision-making.
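A hedged sketch of the general idea behind meta-cognition-style tool gating: a lightweight probe over the model's internal representation decides whether a query should be routed to tools. The probe weights, threshold, and feature source are illustrative assumptions, not MeCo's actual method or parameters.

```python
# Toy probe over a hidden state that gates tool invocation; values are random
# stand-ins, not a trained probe.
import numpy as np

rng = np.random.default_rng(0)
HIDDEN_DIM = 16

# Stand-in for a trained linear probe over the LLM's last-layer hidden state.
probe_w = rng.normal(size=HIDDEN_DIM)
probe_b = 0.0


def should_invoke_tool(hidden_state: np.ndarray, threshold: float = 0.5) -> bool:
    """Return True when the probe's 'tool needed' signal exceeds the threshold."""
    logit = float(hidden_state @ probe_w + probe_b)
    prob = 1.0 / (1.0 + np.exp(-logit))
    return prob > threshold


# Example: a random vector standing in for the representation of a user query.
h = rng.normal(size=HIDDEN_DIM)
print("invoke tool:", should_invoke_tool(h))
```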
arXiv Detail & Related papers (2025-02-18T15:45:01Z)
- Learning to Ask: When LLM Agents Meet Unclear Instruction [55.65312637965779]
Large language models (LLMs) can leverage external tools for addressing a range of tasks unattainable through language skills alone.
We evaluate the performance of LLMs tool-use under imperfect instructions, analyze the error patterns, and build a challenging tool-use benchmark called Noisy ToolBench.
We propose a novel framework, Ask-when-Needed (AwN), which prompts LLMs to ask questions to users whenever they encounter obstacles due to unclear instructions.
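A minimal sketch in the spirit of an Ask-when-Needed loop (not the authors' code): before emitting a tool call, check whether every required argument can be filled from the instruction, and return a clarifying question otherwise. The `book_flight` tool and its argument list are hypothetical.

```python
# Hypothetical required-argument table for an illustrative tool.
REQUIRED_ARGS = {"book_flight": ["origin", "destination", "date"]}


def act(tool_name: str, extracted_args: dict):
    """Return a tool call if the instruction is complete, else a question."""
    missing = [a for a in REQUIRED_ARGS[tool_name] if a not in extracted_args]
    if missing:
        # Ask the user rather than guessing or hallucinating arguments.
        return {"type": "question", "text": f"Could you provide: {', '.join(missing)}?"}
    return {"type": "tool_call", "name": tool_name, "arguments": extracted_args}


print(act("book_flight", {"origin": "NYC", "date": "2024-09-01"}))
# -> asks for the missing 'destination' instead of emitting an incomplete call
```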
arXiv Detail & Related papers (2024-08-31T23:06:12Z)
- ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities [30.030101957186595]
ToolSandbox is an evaluation framework for large language models (LLMs).
ToolSandbox includes stateful tool execution, implicit state dependencies between tools, and a built-in user simulator supporting on-policy conversational evaluation.
We show that open-source and proprietary models have a significant performance gap, and that complex tasks like State Dependency, Canonicalization, and Insufficient Information defined in ToolSandbox are challenging even for the most capable SOTA LLMs.
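A minimal sketch of stateful tool execution with an implicit state dependency, in the spirit of what ToolSandbox describes; the tools and state keys here are invented for illustration and are not ToolSandbox's actual API.

```python
# Shared world state that tools read and mutate across turns.
world_state = {"wifi_enabled": False}


def enable_wifi() -> str:
    world_state["wifi_enabled"] = True
    return "wifi on"


def web_search(query: str) -> str:
    # Implicit dependency: searching only works once wifi has been enabled.
    if not world_state["wifi_enabled"]:
        return "error: no network connection"
    return f"results for '{query}'"


print(web_search("weather in Paris"))  # fails: dependency not satisfied
print(enable_wifi())
print(web_search("weather in Paris"))  # succeeds after the state change
```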
arXiv Detail & Related papers (2024-08-08T05:45:42Z)
- GTA: A Benchmark for General Tool Agents [32.443456248222695]
We design 229 real-world tasks and executable tool chains to evaluate mainstream large language models (LLMs).
Our findings show that real-world user queries are challenging for existing LLMs, with GPT-4 completing less than 50% of the tasks and most LLMs achieving below 25%.
This evaluation reveals the bottlenecks in the tool-use capabilities of current LLMs in real-world scenarios, which provides future direction for advancing general-purpose tool agents.
arXiv Detail & Related papers (2024-07-11T17:50:09Z)
- Can Tool-augmented Large Language Models be Aware of Incomplete Conditions? [33.74511128798095]
This study examines whether large language models can identify incomplete conditions and appropriately determine when to refrain from using tools.
Our experiments show that LLMs often struggle to identify the absence of information required to utilize specific tools.
Our research can contribute to advancing reliable LLMs by addressing common scenarios during interactions between humans and LLMs.
arXiv Detail & Related papers (2024-06-18T06:28:06Z)
- Towards Completeness-Oriented Tool Retrieval for Large Language Models [60.733557487886635]
Real-world systems often incorporate a wide array of tools, making it impractical to input all tools into Large Language Models.
Existing tool retrieval methods primarily focus on semantic matching between user queries and tool descriptions.
We propose a novel model-agnostic COllaborative Learning-based Tool Retrieval approach, COLT, which captures not only the semantic similarities between user queries and tool descriptions but also the collaborative information among tools.
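A hedged sketch of the general idea of collaboration-aware tool retrieval: blend query-tool semantic similarity with a signal for how often tools are used together. The weighting scheme, co-usage matrix, and toy embeddings are assumptions, not COLT's actual training objective or architecture.

```python
# Toy retrieval that mixes semantic match with tool co-usage; all numbers are
# invented stand-ins for illustration.
import numpy as np

tools = ["search_flights", "book_hotel", "currency_convert"]
tool_emb = np.eye(3)                       # stand-in tool embeddings
co_usage = np.array([[0.0, 0.8, 0.5],      # how often tool i co-occurs with tool j
                     [0.8, 0.0, 0.4],
                     [0.5, 0.4, 0.0]])


def retrieve(query_emb: np.ndarray, seed_tool: int, alpha: float = 0.6, k: int = 2):
    """Rank tools by semantic match plus collaboration with an already-selected tool."""
    sem = tool_emb @ query_emb / (np.linalg.norm(tool_emb, axis=1) * np.linalg.norm(query_emb) + 1e-9)
    score = alpha * sem + (1 - alpha) * co_usage[seed_tool]
    return [tools[i] for i in np.argsort(-score)[:k]]


query = np.array([1.0, 0.2, 0.1])          # toy query embedding
print(retrieve(query, seed_tool=0))        # 'book_hotel' gets a collaborative boost
```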
arXiv Detail & Related papers (2024-05-25T06:41:23Z)
- Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios [93.68764280953624]
UltraTool is a novel benchmark designed to improve and evaluate Large Language Models' ability in tool utilization.
It emphasizes real-world complexities, demanding accurate, multi-step planning for effective problem-solving.
A key feature of UltraTool is its independent evaluation of planning with natural language, which happens before tool usage.
arXiv Detail & Related papers (2024-01-30T16:52:56Z)
- MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback [78.60644407028022]
We introduce MINT, a benchmark that evaluates large language models' ability to solve tasks with multi-turn interactions.
LLMs generally benefit from tools and language feedback, with performance gains of 1-8% for each turn of tool use.
Among the LLMs evaluated, supervised instruction-finetuning (SIFT) and reinforcement learning from human feedback (RLHF) generally hurt multi-turn capabilities.
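A toy sketch of a MINT-style multi-turn evaluation loop (invented structure, not the benchmark's actual harness): each turn the model proposes an action, the environment executes it, and textual feedback is appended so the model can refine its answer on the next turn.

```python
# Minimal multi-turn loop with tool execution and language feedback;
# the task dictionary and stand-in model are hypothetical.
def run_episode(model, task, max_turns: int = 5) -> bool:
    history = [task["instruction"]]
    for _ in range(max_turns):
        action = model(history)                        # model proposes an action
        observation = task["execute"](action)          # tool/code execution result
        if task["check"](action):                      # solved -> success
            return True
        history.append(f"Observation: {observation}")  # language feedback for next turn
    return False


# Example with trivial stand-ins for the model and task.
task = {
    "instruction": "Compute 2 + 2.",
    "execute": lambda a: f"result = {a}",
    "check": lambda a: a == "4",
}
print(run_episode(lambda h: "4", task))  # True on the first turn
```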
arXiv Detail & Related papers (2023-09-19T15:25:42Z)