AgentNoiseBench: Benchmarking Robustness of Tool-Using LLM Agents Under Noisy Conditions
- URL: http://arxiv.org/abs/2602.11348v2
- Date: Wed, 18 Feb 2026 01:36:58 GMT
- Title: AgentNoiseBench: Benchmarking Robustness of Tool-Using LLM Agents Under Noisy Conditions
- Authors: Ruipeng Wang, Yuxin Chen, Yukai Wang, Chang Wu, Junfeng Fang, Xiaodong Cai, Qi Gu, Hui Su, An Zhang, Xiang Wang, Xunliang Cai, Tat-Seng Chua
- Abstract summary: We introduce AgentNoiseBench, a framework for evaluating the robustness of agentic models under noisy environments. We first conduct an in-depth analysis of biases and uncertainties in real-world scenarios. We then categorize environmental noise into two primary types: user-noise and tool-noise. Building on this analysis, we develop an automated pipeline that injects controllable noise into existing agent-centric benchmarks.
- Score: 72.24180896265192
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in large language models have enabled LLM-based agents to achieve strong performance on a variety of benchmarks. However, their performance in real-world deployments often falls short of that observed in benchmark settings, especially in complex and imperfect environments. This discrepancy largely arises because prevailing training and evaluation paradigms are typically built on idealized assumptions, overlooking the inherent stochasticity and noise present in real-world interactions. To bridge this gap, we introduce AgentNoiseBench, a framework for systematically evaluating the robustness of agentic models under noisy environments. We first conduct an in-depth analysis of biases and uncertainties in real-world scenarios and categorize environmental noise into two primary types: user-noise and tool-noise. Building on this analysis, we develop an automated pipeline that injects controllable noise into existing agent-centric benchmarks while preserving task solvability. Leveraging this pipeline, we perform extensive evaluations across a wide range of models with diverse architectures and parameter scales. Our results reveal consistent performance variations under different noise conditions, highlighting the sensitivity of current agentic models to realistic environmental perturbations.
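The abstract describes an automated pipeline that injects controllable user-noise and tool-noise into existing benchmarks while preserving task solvability, but does not expose its API here. The following is a minimal illustrative sketch only: every name (NoisyTask, inject_user_noise, inject_tool_noise, the semantics of the noise level) is an assumption, not AgentNoiseBench's actual interface.

```python
import random
from dataclasses import dataclass, field

@dataclass
class NoisyTask:
    # A benchmark episode: the user instruction plus canned tool responses.
    user_query: str
    tool_outputs: dict[str, str] = field(default_factory=dict)

def inject_user_noise(query: str, level: float, rng: random.Random) -> str:
    """User-side noise: here, randomly dropping words to mimic terse or
    under-specified instructions. Real user-noise could also be typos,
    ambiguity, or mid-task corrections."""
    words = query.split()
    kept = [w for w in words if rng.random() > 0.2 * level]
    return " ".join(kept) if kept else query

def inject_tool_noise(outputs: dict[str, str], level: float,
                      rng: random.Random) -> dict[str, str]:
    """Tool-side noise: here, appending an irrelevant log fragment to a
    tool response. Real tool-noise could also be timeouts, stale data,
    or partially malformed payloads."""
    noisy = {}
    for name, payload in outputs.items():
        if rng.random() < level:
            payload += " | WARN: unrelated retry, cache miss (id=42)"
        noisy[name] = payload
    return noisy

def build_noisy_benchmark(tasks: list[NoisyTask], level: float,
                          seed: int = 0) -> list[NoisyTask]:
    """Apply both noise channels at a controllable level in [0, 1].
    This toy version only dilutes information; a real pipeline would
    verify after injection that each task remains solvable."""
    rng = random.Random(seed)
    return [NoisyTask(inject_user_noise(t.user_query, level, rng),
                      inject_tool_noise(t.tool_outputs, level, rng))
            for t in tasks]
```

Sweeping `level` from 0 to 1 over the same task set gives a controlled family of noise conditions, with `level=0.0` reproducing the clean benchmark, which matches the kind of sweep under which the abstract reports consistent performance variations.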
Related papers
- Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning [62.499592503950026]
Large language models (LLMs) have empowered autonomous agents to perform complex tasks that require multi-turn interactions with tools and environments. We propose Agent World Model (AWM), a fully synthetic environment generation pipeline. We scale to 1,000 environments covering everyday scenarios, in which agents can interact with rich toolsets.
arXiv Detail & Related papers (2026-02-10T18:55:41Z)
- Lost in the Noise: How Reasoning Models Fail with Contextual Distractors [57.31788955167306]
Recent advances in reasoning models and agentic AI systems have led to an increased reliance on diverse external information. We introduce NoisyBench, a comprehensive benchmark that systematically evaluates model robustness across 11 datasets in RAG, reasoning, alignment, and tool-use tasks. Our evaluation reveals a catastrophic performance drop of up to 80% in state-of-the-art models when faced with contextual distractors (see the distractor-injection sketch after this list).
arXiv Detail & Related papers (2026-01-12T05:43:51Z)
- Grounded Test-Time Adaptation for LLM Agents [75.62784644919803]
Large language model (LLM)-based agents struggle to generalize to novel and complex environments. We propose two strategies for adapting LLM agents by leveraging environment-specific information available during deployment.
arXiv Detail & Related papers (2025-11-06T22:24:35Z)
- RoHOI: Robustness Benchmark for Human-Object Interaction Detection [84.78366452133514]
Human-Object Interaction (HOI) detection is crucial for robot-human assistance, enabling context-aware support. We introduce the first robustness benchmark for HOI detection, evaluating model resilience under diverse challenges. Our benchmark, RoHOI, includes 20 corruption types based on the HICO-DET and V-COCO datasets and a new robustness-focused metric.
arXiv Detail & Related papers (2025-07-12T01:58:04Z)
- Impact of Noise on LLM-Models Performance in Abstraction and Reasoning Corpus (ARC) Tasks with Model Temperature Considerations [4.39614901077936]
Large Language Models (LLMs) have generated growing interest in their structured reasoning capabilities. The Abstraction and Reasoning Corpus benchmark plays a crucial role in evaluating these capabilities by testing how well AI models generalize to novel problems. This work underscores the need for developing more robust and adaptable AI systems capable of handling the ambiguity and variability inherent in real-world scenarios.
arXiv Detail & Related papers (2025-04-22T13:43:58Z)
- From Perfect to Noisy World Simulation: Customizable Embodied Multi-modal Perturbations for SLAM Robustness Benchmarking [32.52171076424419]
Embodied agents require robust navigation systems to operate in unstructured environments.
We propose a novel, customizable pipeline for noisy data synthesis.
Our analysis uncovers the susceptibilities of both neural (NeRF) and non-neural SLAM models to disturbances.
arXiv Detail & Related papers (2024-06-24T17:57:05Z)
- $\text{R}^2$-Bench: Benchmarking the Robustness of Referring Perception Models under Perturbations [36.74309198908876]
We present a comprehensive taxonomy of perturbations, and then develop a versatile toolbox for synthesizing and evaluating the effects of composite disturbances.
We propose the $\text{R}^2$-Agent, an LLM-based agent that simplifies and automates model evaluation via natural language instructions.
arXiv Detail & Related papers (2024-03-07T22:18:12Z)
- Customizable Perturbation Synthesis for Robust SLAM Benchmarking [33.74471840597803]
We propose a novel, customizable pipeline for noisy data synthesis.
This pipeline incorporates customizable hardware setups, software components, and perturbed environments.
We instantiate the Robust-SLAM benchmark, which includes diverse perturbation types, to evaluate the risk tolerance of existing advanced SLAM models.
arXiv Detail & Related papers (2024-02-12T23:49:40Z)
- Efficient Model-based Multi-agent Reinforcement Learning via Optimistic Equilibrium Computation [93.52573037053449]
H-MARL (Hallucinated Multi-Agent Reinforcement Learning) learns successful equilibrium policies after a few interactions with the environment.
We demonstrate our approach experimentally on an autonomous driving simulation benchmark.
arXiv Detail & Related papers (2022-03-14T17:24:03Z)
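For the NoisyBench entry above, the contextual-distractor condition can be pictured as diluting a retrieval context with irrelevant passages. A minimal sketch follows, assuming a generic mixing function; the names (mix_distractors, distractor_ratio) and the shuffling strategy are illustrative assumptions, not NoisyBench's actual protocol.

```python
import random

def mix_distractors(relevant: list[str], distractors: list[str],
                    distractor_ratio: float, seed: int = 0) -> list[str]:
    """Dilute a retrieval context with irrelevant passages.

    The model still receives every relevant passage, but evidence
    density and position are perturbed, which is the kind of
    contextual distraction the NoisyBench summary describes.
    """
    rng = random.Random(seed)
    n_noise = min(int(len(relevant) * distractor_ratio), len(distractors))
    context = relevant + rng.sample(distractors, n_noise)
    rng.shuffle(context)  # evidence order is perturbed too
    return context
```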