Related papers: RoTBench: A Multi-Level Benchmark for Evaluating the Robustness of Large Language Models in Tool Learning

RoTBench: A Multi-Level Benchmark for Evaluating the Robustness of Large Language Models in Tool Learning

URL: http://arxiv.org/abs/2401.08326v3
Date: Sat, 21 Sep 2024 08:03:33 GMT
Title: RoTBench: A Multi-Level Benchmark for Evaluating the Robustness of Large Language Models in Tool Learning
Authors: Junjie Ye, Yilong Wu, Songyang Gao, Caishuang Huang, Sixian Li, Guanyu Li, Xiaoran Fan, Qi Zhang, Tao Gui, Xuanjing Huang,
Abstract summary: We introduce RoTBench, a benchmark for evaluating the robustness of Large Language Models in tool learning. Experiments involving six widely-used models underscore the urgent necessity for enhancing the robustness of LLMs in tool learning. We propose RoTTuning, a strategy that enriches the diversity of training environments to bolster the robustness of LLMs in tool learning.
Score: 45.39445027132887
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Tool learning has generated widespread interest as a vital means of interaction between Large Language Models (LLMs) and the physical world. Current research predominantly emphasizes LLMs' capacity to utilize tools in well-structured environments while overlooking their stability when confronted with the inevitable noise of the real world. To bridge this gap, we introduce RoTBench, a multi-level benchmark for evaluating the robustness of LLMs in tool learning. Specifically, we establish five external environments, each featuring varying levels of noise (i.e., Clean, Slight, Medium, Heavy, and Union), providing an in-depth analysis of the model's resilience across three critical phases: tool selection, parameter identification, and content filling. Experiments involving six widely-used models underscore the urgent necessity for enhancing the robustness of LLMs in tool learning. For instance, the performance of GPT-4 even drops significantly from 80.00 to 58.10 when there is no substantial change in manual accuracy. More surprisingly, the noise correction capability inherent in the GPT family paradoxically impedes its adaptability in the face of mild noise. In light of these findings, we propose RoTTuning, a strategy that enriches the diversity of training environments to bolster the robustness of LLMs in tool learning. The code and data are available at https://github.com/Junjie-Ye/RoTBench.

Related papers

RLFactory: A Plug-and-Play Reinforcement Learning Post-Training Framework for LLM Multi-Turn Tool-Use [50.52940111891476]
Large language models excel at basic reasoning but struggle with tasks that require interaction with external tools.<n>We present RLFactory, a plug-and-play reinforcement learning framework for multi-round tool use.
arXiv Detail & Related papers (2025-08-31T16:47:31Z)
When Punctuation Matters: A Large-Scale Comparison of Prompt Robustness Methods for LLMs [55.20230501807337]
We present the first systematic evaluation of 5 methods for improving prompt robustness within a unified experimental framework.<n>We benchmark these techniques on 8 models from Llama, Qwen and Gemma families across 52 tasks from Natural Instructions dataset.
arXiv Detail & Related papers (2025-08-15T10:32:50Z)
CRITICTOOL: Evaluating Self-Critique Capabilities of Large Language Models in Tool-Calling Error Scenarios [30.20881816731553]
The ability of large language models to utilize external tools has enabled them to tackle an increasingly diverse range of tasks.<n>As the tasks become more complex and long-horizon, the intricate tool utilization process may trigger various unexpected errors.<n>How to effectively handle such errors, including identifying, diagnosing, and recovering from them, has emerged as a key research direction for advancing tool learning.
arXiv Detail & Related papers (2025-06-11T17:59:18Z)
FamilyTool: A Multi-hop Personalized Tool Use Benchmark [94.1158032740113]
We introduce FamilyTool, a novel benchmark grounded in a family-based knowledge graph (KG) FamilyTool challenges Large Language Models with queries spanning 1 to 3 relational hops. Experiments reveal significant performance gaps in state-of-the-art LLMs.
arXiv Detail & Related papers (2025-04-09T10:42:36Z)
Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger [49.81945268343162]
We propose MeCo, an adaptive decision-making strategy for external tool use. MeCo captures high-level cognitive signals in the representation space, guiding when to invoke tools. Our experiments show that MeCo accurately detects LLMs' internal cognitive signals and significantly improves tool-use decision-making.
arXiv Detail & Related papers (2025-02-18T15:45:01Z)
On Adversarial Robustness of Language Models in Transfer Learning [13.363850350446869]
We show that transfer learning, while improving standard performance metrics, often leads to increased vulnerability to adversarial attacks. Our findings demonstrate that larger models exhibit greater resilience to this phenomenon, suggesting a complex interplay between model size, architecture, and adaptation methods.
arXiv Detail & Related papers (2024-12-29T15:55:35Z)
TL-Training: A Task-Feature-Based Framework for Training Large Language Models in Tool Use [72.32614703504122]
Large language models (LLMs) achieve remarkable advancements by leveraging tools to interact with environments.<n>Standard supervised fine-tuning approach, which relies on large-scale datasets, often overlooks task-specific characteristics in tool use.<n>We propose TL-Training, a task-feature-based framework that mitigates the effects of suboptimal training data.
arXiv Detail & Related papers (2024-12-20T02:21:36Z)
From Exploration to Mastery: Enabling LLMs to Master Tools via Self-Driven Interactions [60.733557487886635]
This paper focuses on bridging the comprehension gap between Large Language Models and external tools. We propose a novel framework, DRAFT, aimed at Dynamically refining tool documentation. Extensive experiments on multiple datasets demonstrate that DRAFT's iterative, feedback-based refinement significantly ameliorates documentation quality.
arXiv Detail & Related papers (2024-10-10T17:58:44Z)
Learning Evolving Tools for Large Language Models [44.25796648300785]
We propose ToolEVO to enhance the adaptive and reflective capabilities of large language models (LLMs) against tool variability. By leveraging Monte Carlo Tree Search, ToolEVO facilitates active exploration and interaction of LLMs within dynamic environments. We also introduce ToolQA-D, a benchmark specifically designed to evaluate the impact of tool variability.
arXiv Detail & Related papers (2024-10-09T07:14:45Z)
What Affects the Stability of Tool Learning? An Empirical Study on the Robustness of Tool Learning Frameworks [33.51887014808227]
This paper explores the impact of both internal and external factors on the performance of tool learning frameworks. We find several insightful conclusions for future work, including the observation that LLMs can benefit significantly from increased trial and exploration.
arXiv Detail & Related papers (2024-07-03T11:06:05Z)
Can large language models explore in-context? [87.49311128190143]
We deploy Large Language Models as agents in simple multi-armed bandit environments. We find that the models do not robustly engage in exploration without substantial interventions.
arXiv Detail & Related papers (2024-03-22T17:50:43Z)
LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error [54.954211216847135]
Existing large language models (LLMs) only reach a correctness rate in the range of 30% to 60%. We propose a biologically inspired method for tool-augmented LLMs, simulated trial and error (STE) STE orchestrates three key mechanisms for successful tool use behaviors in the biological system: trial and error, imagination, and memory.
arXiv Detail & Related papers (2024-03-07T18:50:51Z)
ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios [48.38419686697733]
We propose ToolEyes, a fine-grained system tailored for the evaluation of large language models' tool learning capabilities in authentic scenarios. The system meticulously examines seven real-world scenarios, analyzing five dimensions crucial to LLMs in tool learning. ToolEyes incorporates a tool library boasting approximately 600 tools, serving as an intermediary between LLMs and the physical world.
arXiv Detail & Related papers (2024-01-01T12:49:36Z)
Revisit Input Perturbation Problems for LLMs: A Unified Robustness Evaluation Framework for Noisy Slot Filling Task [18.623619585980688]
We propose a unified robustness evaluation framework based on the slot-filling task to evaluate the dialogue understanding capability of large language models. Specifically, we construct a input perturbation evaluation dataset, Noise-LLM, which contains five types of single perturbation and four types of mixed perturbation data. Our aim is to assess how well various robustness methods of LLMs perform in real-world noisy scenarios.
arXiv Detail & Related papers (2023-10-10T10:22:05Z)
Large Language Models Are Latent Variable Models: Explaining and Finding Good Demonstrations for In-Context Learning [104.58874584354787]
In recent years, pre-trained large language models (LLMs) have demonstrated remarkable efficiency in achieving an inference-time few-shot learning capability known as in-context learning. This study aims to examine the in-context learning phenomenon through a Bayesian lens, viewing real-world LLMs as latent variable models.
arXiv Detail & Related papers (2023-01-27T18:59:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.