Related papers: From Features to Actions: Explainability in Traditional and Agentic AI Systems

URL: http://arxiv.org/abs/2602.06841v2
Date: Mon, 09 Feb 2026 17:37:05 GMT
Title: From Features to Actions: Explainability in Traditional and Agentic AI Systems
Authors: Sindhuja Chaduvula, Jessee Ho, Kina Kim, Aravind Narayanan, Mahshid Alinoori, Muskan Garg, Dhanesh Ramachandram, Shaina Raza
Abstract summary: We bridge the gap between static and agentic explainability by comparing attribution-based explanations with trace-based diagnostics. Our results show that trace-based diagnostics for agentic settings consistently localize behaviour breakdowns.
Score: 8.859406164948718
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Over the last decade, explainable AI has primarily focused on interpreting individual model predictions, producing post-hoc explanations that relate inputs to outputs under a fixed decision structure. Recent advances in large language models (LLMs) have enabled agentic AI systems whose behaviour unfolds over multi-step trajectories. In these settings, success and failure are determined by sequences of decisions rather than a single output. While post-hoc explanations are useful, it remains unclear how explanation approaches designed for static predictions translate to agentic settings where behaviour emerges over time. In this work, we bridge the gap between static and agentic explainability by comparing attribution-based explanations with trace-based diagnostics across both settings. To make this distinction explicit, we empirically compare attribution-based explanations used in static classification tasks with trace-based diagnostics used in agentic benchmarks (TAU-bench Airline and AssistantBench). Our results show that while attribution methods achieve stable feature rankings in static settings (Spearman $\rho = 0.86$), they cannot be applied reliably to diagnose execution-level failures in agentic trajectories. In contrast, trace-grounded rubric evaluation for agentic settings consistently localizes behaviour breakdowns and reveals that state tracking inconsistency is 2.7$\times$ more prevalent in failed runs and reduces success probability by 49%. These findings motivate a shift towards trajectory-level explainability for agentic systems when evaluating and diagnosing autonomous AI behaviour. Resources: https://github.com/VectorInstitute/unified-xai-evaluation-framework https://vectorinstitute.github.io/unified-xai-evaluation-framework
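Illustration: the abstract reports three kinds of statistics: a Spearman rank correlation between feature rankings, a prevalence ratio of a rubric flag in failed versus successful runs, and a drop in success probability. The following is a minimal sketch of how such quantities could be computed. It is not the authors' code. All function names and the toy data are hypothetical, and treating the reported drop as a relative reduction is an assumption.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(a, b):
    """Spearman rho: the Pearson correlation of the two rank vectors."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = mean(ra), mean(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


def prevalence_ratio(flag_in_failed, flag_in_successful):
    """How many times more often a 0/1 rubric flag appears in failed runs."""
    return mean(flag_in_failed) / mean(flag_in_successful)


def relative_success_drop(success_with_flag, success_without_flag):
    """Relative reduction in success rate when the flag is present (assumed definition)."""
    return 1 - mean(success_with_flag) / mean(success_without_flag)


# Toy data, purely illustrative: attribution scores for five features
# from two hypothetical explanation runs on the same classifier.
scores_run_a = [0.40, 0.25, 0.20, 0.10, 0.05]
scores_run_b = [0.35, 0.30, 0.15, 0.12, 0.08]
rho = spearman(scores_run_a, scores_run_b)

# Toy data: 1 means the "state tracking inconsistency" flag was raised in a run.
ratio = prevalence_ratio([1, 1, 0, 1], [0, 1, 0, 0])
drop = relative_success_drop([1, 0, 0, 0], [1, 1, 0, 0])
```

A rank correlation near 1 indicates that two attribution runs order the features the same way. The ratio and drop functions only summarize per-run rubric flags and outcomes, so they say nothing about where in a trajectory the breakdown occurred.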

Related papers

Towards More Standardized AI Evaluation: From Models to Agents [0.0]
As AI systems evolve from static models to compound, tool-using agents, evaluation becomes a core control function. Most evaluation practices remain anchored in assumptions inherited from the model-centric era. This paper argues that such approaches increasingly obscure rather than illuminate system behavior.
arXiv Detail & Related papers (2026-02-20T06:54:44Z)
Agentic Test-Time Scaling for WebAgents [65.5178428849495]
We present Confidence-Aware Test-Time Scaling (CATTS), which uses vote-derived uncertainty to allocate compute only when decisions are genuinely contentious. CATTS improves performance on WebArena-Lite and GoBrowse by up to 9.1% over React while using up to 2.3x fewer tokens than uniform scaling.
arXiv Detail & Related papers (2026-02-12T18:58:30Z)
AgentRx: Diagnosing AI Agent Failures from Execution Trajectories [9.61742219198197]
We release a benchmark of 115 failed trajectories spanning structured API, incident management, and open-ended web/file tasks. Each trajectory is annotated with a critical failure step and a category from a grounded-theory derived, cross-domain failure taxonomy. We present AGENTRX, an automated domain-agnostic diagnostic framework that pinpoints the critical failure step in a failed agent trajectory.
arXiv Detail & Related papers (2026-02-02T18:54:07Z)
The Bitter Lesson of Diffusion Language Models for Agentic Workflows: A Comprehensive Reality Check [54.08619694620588]
We present a comprehensive evaluation of dLLMs across two distinct agentic paradigms: Embodied Agents and Tool-Calling Agents. Our results on Agentboard and BFCL reveal a "bitter lesson": current dLLMs fail to serve as reliable agentic backbones.
arXiv Detail & Related papers (2026-01-19T11:45:39Z)
ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack [52.17935054046577]
We present ReasAlign, a model-level solution to improve safety alignment against indirect prompt injection attacks. ReasAlign incorporates structured reasoning steps to analyze user queries, detect conflicting instructions, and preserve the continuity of the user's intended tasks.
arXiv Detail & Related papers (2026-01-15T08:23:38Z)
Metacognitive Self-Correction for Multi-Agent System via Prototype-Guided Next-Execution Reconstruction [58.51530390018909]
Large Language Model based multi-agent systems excel at collaborative problem solving but remain brittle to cascading errors. We present MASC, a metacognitive framework that endows MAS with real-time, unsupervised, step-level error detection and self-correction.
arXiv Detail & Related papers (2025-10-16T05:35:37Z)
Impatient Users Confuse AI Agents: High-fidelity Simulations of Human Traits for Testing Agents [58.00130492861884]
TraitBasis is a lightweight, model-agnostic method for systematically stress testing AI agents. TraitBasis learns directions in activation space corresponding to steerable user traits. We observe on average a 2%-30% performance degradation on $\tau$-Trait across frontier models.
arXiv Detail & Related papers (2025-10-06T05:03:57Z)
Where Did It Go Wrong? Attributing Undesirable LLM Behaviors via Representation Gradient Tracing [12.835224376066769]
Large Language Models (LLMs) have demonstrated remarkable capabilities, yet their deployment is frequently undermined by undesirable behaviors. We introduce a novel and efficient framework that diagnoses a range of undesirable LLM behaviors by analyzing representation and its gradients. We systematically evaluate our method for tasks that include tracking harmful content, detecting backdoor poisoning, and identifying knowledge contamination.
arXiv Detail & Related papers (2025-09-26T12:07:47Z)
Robust Root Cause Diagnosis using In-Distribution Interventions [31.19149413954674]
Diagnosing the root cause of an anomaly in a complex interconnected system is a pressing problem in today's cloud services and industrial operations. We propose In-Distribution Interventions (IDI), a novel algorithm that predicts root cause as nodes that meet two criteria.
arXiv Detail & Related papers (2025-05-02T00:19:43Z)
ALMANACS: A Simulatability Benchmark for Language Model Explainability [9.037709044327066]
We present ALMANACS, a language model explainability benchmark. ALMANACS scores explainability methods on simulatability, i.e., how well the explanations improve behavior prediction on new inputs. By using another language model to predict behavior based on the explanations, ALMANACS is a fully automated benchmark.
arXiv Detail & Related papers (2023-12-20T03:44:18Z)
A Simple Solution for Offline Imitation from Observations and Examples with Possibly Incomplete Trajectories [122.11358440078581]
Offline imitation is useful in real-world scenarios where arbitrary interactions are costly and expert actions are unavailable. We propose Trajectory-Aware Learning from Observations (TAILO) to solve MDPs where only task-specific expert states and task-agnostic non-expert state-action pairs are available.
arXiv Detail & Related papers (2023-11-02T15:41:09Z)

This list is automatically generated from the titles and abstracts of the papers on this site.