Related papers: Impatient Users Confuse AI Agents: High-fidelity Simulations of Human Traits for Testing Agents

URL: http://arxiv.org/abs/2510.04491v1
Date: Mon, 06 Oct 2025 05:03:57 GMT
Title: Impatient Users Confuse AI Agents: High-fidelity Simulations of Human Traits for Testing Agents
Authors: Muyu He, Anand Kumar, Tsach Mackey, Meghana Rajeev, James Zou, Nazneen Rajani
Abstract summary: TraitBasis is a lightweight, model-agnostic method for systematically stress testing AI agents. TraitBasis learns directions in activation space corresponding to steerable user traits. We observe on average a 2%-30% performance degradation on $\tau$-Trait across frontier models.
Score: 58.00130492861884
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Despite rapid progress in building conversational AI agents, robustness is still largely untested. Small shifts in user behavior, such as being more impatient, incoherent, or skeptical, can cause sharp drops in agent performance, revealing how brittle current AI agents are. Today's benchmarks fail to capture this fragility: agents may perform well under standard evaluations but degrade spectacularly in more realistic and varied settings. We address this robustness testing gap by introducing TraitBasis, a lightweight, model-agnostic method for systematically stress testing AI agents. TraitBasis learns directions in activation space corresponding to steerable user traits (e.g., impatience or incoherence), which can be controlled, scaled, composed, and applied at inference time without any fine-tuning or extra data. Using TraitBasis, we extend $\tau$-Bench to $\tau$-Trait, where user behaviors are altered via controlled trait vectors. We observe on average a 2%-30% performance degradation on $\tau$-Trait across frontier models, highlighting the lack of robustness of current AI agents to variations in user behavior. Together, these results highlight both the critical role of robustness testing and the promise of TraitBasis as a simple, data-efficient, and compositional tool. By powering simulation-driven stress tests and training loops, TraitBasis opens the door to building AI agents that remain reliable in the unpredictable dynamics of real-world human interactions. We have open-sourced $\tau$-Trait across four domains: airline, retail, telecom, and telehealth, so the community can systematically QA their agents under realistic, behaviorally diverse intents and trait scenarios: https://github.com/collinear-ai/tau-trait.

Related papers

TRACER: Trajectory Risk Aggregation for Critical Episodes in Agentic Reasoning [4.928838343487574]
Existing uncertainty proxies focus on single-shot text generation. We introduce TRACER, a trajectory-level uncertainty metric for dual-control Tool-Agent-User interaction.
arXiv Detail & Related papers (2026-02-11T22:23:56Z)
From Features to Actions: Explainability in Traditional and Agentic AI Systems [8.859406164948718]
We bridge the gap between static and agentic explainability by comparing attribution-based explanations with trace-based diagnostics. Our results show that trace-based diagnostics for agentic settings consistently localize behaviour breakdowns.
arXiv Detail & Related papers (2026-02-06T16:34:29Z)
AI IDEs or Autonomous Agents? Measuring the Impact of Coding Agents on Software Development [12.50615284537175]
Large language model (LLM) based coding agents increasingly act as autonomous contributors that generate and merge pull requests. We present a longitudinal causal study of agent adoption in open-source repositories using staggered difference-in-differences with matched controls.
arXiv Detail & Related papers (2026-01-20T04:51:56Z)
ET-Agent: Incentivizing Effective Tool-Integrated Reasoning Agent via Behavior Calibration [68.89572566071575]
ET-Agent is a training framework for calibrating an agent's tool-use behavior. It is designed to progressively calibrate erroneous behavioral patterns to optimal behaviors.
arXiv Detail & Related papers (2026-01-11T11:05:26Z)
Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation [87.47155146067962]
We provide a standardized evaluation harness that orchestrates parallel evaluations across hundreds of tasks. We conduct three-dimensional analysis spanning models, scaffolds, and benchmarks. Our analysis reveals surprising insights, such as higher reasoning effort reducing accuracy in the majority of runs.
arXiv Detail & Related papers (2025-10-13T22:22:28Z)
Alignment Tipping Process: How Self-Evolution Pushes LLM Agents Off the Rails [103.05296856071931]
We identify the Alignment Tipping Process (ATP), a critical post-deployment risk unique to self-evolving Large Language Model (LLM) agents. ATP arises when continual interaction drives agents to abandon alignment constraints established during training in favor of reinforced, self-interested strategies. Our experiments show that alignment benefits erode rapidly under self-evolution, with initially aligned models converging toward unaligned states.
arXiv Detail & Related papers (2025-10-06T14:48:39Z)
$τ^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment [32.345011712015435]
Existing benchmarks for AI agents simulate single-control environments. We introduce $\tau^2$-bench, where both agent and user make use of tools to act in a shared, dynamic environment. In particular, our experiments show significant performance drops when agents shift from no-user to dual-control.
arXiv Detail & Related papers (2025-06-09T17:52:18Z)
Building reliable sim driving agents by scaling self-play [3.3378669626639423]
Simulation agents are essential for designing and testing systems that interact with humans, such as autonomous vehicles (AVs). We propose scaling self-play to thousands of scenarios on the Open Motion dataset under semi-realistic limits on human perception and control. We generalize to unseen test scenes, achieving a 99.8% goal completion rate with less than 0.8% combined collision and off-road incidents.
arXiv Detail & Related papers (2025-02-20T16:30:45Z)
Autonomous Vehicle Controllers From End-to-End Differentiable Simulation [57.278726604424556]
We propose a differentiable simulator and design an analytic policy gradients (APG) approach to training AV controllers. Our proposed framework brings the differentiable simulator into an end-to-end training loop, where gradients of environment dynamics serve as a useful prior to help the agent learn a more grounded policy. We find significant improvements in performance and robustness to noise in the dynamics, as well as overall more intuitive human-like handling.
arXiv Detail & Related papers (2024-09-12T11:50:06Z)
Dissecting Adversarial Robustness of Multimodal LM Agents [70.2077308846307]
We manually create 200 targeted adversarial tasks and evaluation scripts in a realistic threat model on top of VisualWebArena. We find that we can successfully break the latest agents that use black-box frontier LMs, including those that perform reflection and tree search. We also use ARE to rigorously evaluate how the robustness changes as new components are added.
arXiv Detail & Related papers (2024-06-18T17:32:48Z)
How Far Are LLMs from Believable AI? A Benchmark for Evaluating the Believability of Human Behavior Simulation [46.42384207122049]
We design SimulateBench to evaluate the believability of large language models (LLMs) when simulating human behaviors. Based on SimulateBench, we evaluate the performances of 10 widely used LLMs when simulating characters.
arXiv Detail & Related papers (2023-12-28T16:51:11Z)
User Behavior Simulation with Large Language Model based Agents [116.74368915420065]
We propose an LLM-based agent framework and design a sandbox environment to simulate real user behaviors. Based on extensive experiments, we find that the simulated behaviors of our method are very close to the ones of real humans.
arXiv Detail & Related papers (2023-06-05T02:58:35Z)

This list is automatically generated from the titles and abstracts of the papers on this site.