Spectral Guardrails for Agents in the Wild: Detecting Tool Use Hallucinations via Attention Topology
- URL: http://arxiv.org/abs/2602.08082v1
- Date: Sun, 08 Feb 2026 18:56:16 GMT
- Title: Spectral Guardrails for Agents in the Wild: Detecting Tool Use Hallucinations via Attention Topology
- Authors: Valentin Noël
- Abstract summary: We propose a training-free guardrail based on spectral analysis of attention topology that complements supervised approaches. On Llama 3.1 8B, our method achieves 97.7% recall with multi-feature detection and 86.1% recall with 81.0% precision for balanced deployment. We reveal the "Loud Liar" phenomenon: Llama 3.1 8B's failures are spectrally catastrophic and dramatically easier to detect, while Mistral 7B achieves the best discrimination.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deploying autonomous agents in the wild requires reliable safeguards against tool use failures. We propose a training-free guardrail based on spectral analysis of attention topology that complements supervised approaches. On Llama 3.1 8B, our method achieves 97.7% recall with multi-feature detection and 86.1% recall with 81.0% precision for balanced deployment, without requiring any labeled training data. Most remarkably, we discover that single layer spectral features act as near-perfect hallucination detectors: Llama L26 Smoothness achieves 98.2% recall (213/217 hallucinations caught) with a single threshold, and Mistral L3 Entropy achieves 94.7% recall. This suggests hallucination is not merely a wrong token but a thermodynamic state change: the model's attention becomes noise when it errs. Through controlled cross-model evaluation on matched domains ($N=1000$, $T=0.3$, same General domain, hallucination rates 20--22%), we reveal the "Loud Liar" phenomenon: Llama 3.1 8B's failures are spectrally catastrophic and dramatically easier to detect, while Mistral 7B achieves the best discrimination (AUC 0.900). These findings establish spectral analysis as a principled, efficient framework for agent safety.
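The abstract describes single-layer spectral features of an attention map (a "Smoothness" score at Llama layer 26, an "Entropy" score at Mistral layer 3) compared against a single threshold. The paper's exact feature definitions are not given here, so the sketch below is only an illustrative assumption: it treats one layer's attention matrix as a weighted graph, computes the Laplacian spectrum, and derives an entropy score and a low-frequency "smoothness" proxy, flagging a hallucination when the chosen feature crosses a placeholder threshold.

```python
# Hedged sketch (not the authors' released code): one plausible way to compute
# layer-wise spectral features of an attention map and apply a single-threshold
# guardrail in the spirit of the abstract. The feature definitions, the layer
# choice, and the threshold value are assumptions, not the paper's exact method.
import numpy as np

def spectral_features(attn: np.ndarray) -> dict:
    """attn: (T, T) attention matrix for one layer (e.g. averaged over heads)."""
    # Treat attention as a weighted graph; symmetrize to obtain an adjacency matrix.
    A = 0.5 * (attn + attn.T)
    D = np.diag(A.sum(axis=1))
    L = D - A                             # graph Laplacian
    eigvals = np.clip(np.linalg.eigvalsh(L), 0.0, None)  # ascending, non-negative

    # "Entropy" proxy: Shannon entropy of the normalized Laplacian spectrum.
    p = eigvals / (eigvals.sum() + 1e-12)
    entropy = float(-(p * np.log(p + 1e-12)).sum())

    # "Smoothness" proxy: fraction of spectral energy in the lowest 10% of
    # eigenvalues (low-frequency mass); attention that degenerates to noise
    # would concentrate less energy there.
    k = max(1, len(eigvals) // 10)
    smoothness = float(eigvals[:k].sum() / (eigvals.sum() + 1e-12))
    return {"entropy": entropy, "smoothness": smoothness}

def flag_hallucination(attn: np.ndarray, smoothness_threshold: float = 0.05) -> bool:
    # Single-threshold detector analogous to "L26 Smoothness": flag the step
    # when the layer's spectral smoothness falls below the chosen threshold.
    return spectral_features(attn)["smoothness"] < smoothness_threshold
```

As a usage note, `attn` could plausibly come from a Hugging Face forward pass with `output_attentions=True` (e.g. `outputs.attentions[26].mean(dim=1)[0]` converted to NumPy), with the threshold calibrated on a small held-out set rather than the placeholder value above.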
Related papers
- BadCLIP++: Stealthy and Persistent Backdoors in Multimodal Contrastive Learning [73.46118996284888]
Research on backdoor attacks against multimodal contrastive learning models faces two key challenges: stealthiness and persistence. We propose BadCLIP++, a unified framework that tackles both challenges. For stealthiness, we introduce a semantic-fusion QR micro-trigger that embeds imperceptible patterns near task-relevant regions. For persistence, we stabilize trigger embeddings via radius shrinkage and centroid alignment.
arXiv Detail & Related papers (2026-02-19T08:31:16Z) - Information Fidelity in Tool-Using LLM Agents: A Martingale Analysis of the Model Context Protocol [69.11739400975445]
We introduce the first theoretical framework for analyzing error accumulation in Model Context Protocol (MCP) agents. We show that cumulative distortion exhibits linear growth and high-probability deviations bounded by $O(\sqrt{T})$. Key findings include: semantic weighting reduces distortion by 80%, and periodic re-grounding approximately every 9 steps suffices for error control.
arXiv Detail & Related papers (2026-02-10T21:08:53Z) - Hide and Seek in Embedding Space: Geometry-based Steganography and Detection in Large Language Models [44.41218866933059]
Fine-tuned LLMs can covertly encode prompt secrets into outputs via steganographic channels. We show previous schemes achieve 100% recoverability by replacing arbitrary mappings with embedding-space-derived ones. We argue that detecting fine-tuning-based steganographic attacks requires approaches beyond traditional steganalysis.
arXiv Detail & Related papers (2026-01-30T10:43:43Z) - Geometry of Reason: Spectral Signatures of Valid Mathematical Reasoning [0.0]
We present a training-free method for detecting valid mathematical reasoning in large language models through spectral analysis of attention patterns. The method requires no training data, fine-tuning, or learned classifiers: a single threshold on a spectral metric suffices for high accuracy. These findings establish spectral graph analysis as a principled framework for reasoning verification with immediate applications to hallucination detection and AI safety monitoring.
arXiv Detail & Related papers (2026-01-02T18:49:37Z) - The Semantic Illusion: Certified Limits of Embedding-Based Hallucination Detection in RAG Systems [0.0]
We apply hallucination prediction to RAG detection, transforming scores into decision sets with finite-sample coverage guarantees. We analyze this failure through the lens of distributional tails, showing that while NLI models achieve acceptable AUC (0.81), the "hardest" hallucinations are semantically indistinguishable from faithful responses.
arXiv Detail & Related papers (2025-12-17T04:22:28Z) - HalluDetect: Detecting, Mitigating, and Benchmarking Hallucinations in Conversational Systems in the Legal Domain [28.691566712713808]
Large Language Models (LLMs) are widely used in industry but remain prone to hallucinations, limiting their reliability in critical applications. This work addresses hallucination reduction in consumer grievance chatbots built using LLaMA 3.1 8B Instruct, a compact model frequently used in industry. We develop HalluDetect, an LLM-based hallucination detection system that achieves an F1 score of 68.92%, outperforming baseline detectors by 22.47%.
arXiv Detail & Related papers (2025-09-15T06:23:36Z) - Semantic Energy: Detecting LLM Hallucination Beyond Entropy [106.92072182161712]
Large Language Models (LLMs) are being increasingly deployed in real-world applications, but they remain susceptible to hallucinations. Uncertainty estimation is a feasible approach to detect such hallucinations. We introduce Semantic Energy, a novel uncertainty estimation framework.
arXiv Detail & Related papers (2025-08-20T07:33:50Z) - Uncertainty-Aware Attention Heads: Efficient Unsupervised Uncertainty Quantification for LLMs [129.79394562739705]
Large language models (LLMs) exhibit impressive fluency, but often produce critical errors known as "hallucinations". We propose RAUQ (Recurrent Attention-based Uncertainty Quantification), an unsupervised approach that leverages intrinsic attention patterns in transformers to detect hallucinations efficiently. Experiments across 4 LLMs and 12 question answering, summarization, and translation tasks demonstrate that RAUQ yields excellent results.
arXiv Detail & Related papers (2025-05-26T14:28:37Z) - SelfCheckAgent: Zero-Resource Hallucination Detection in Generative Large Language Models [0.16385815610837165]
SelfCheckAgent is a novel framework integrating three different agents. These agents provide a robust multi-dimensional approach to hallucination detection. The framework also incorporates a triangulation strategy, which reinforces the strengths of SelfCheckAgent.
arXiv Detail & Related papers (2025-02-03T20:42:32Z) - Lazy Layers to Make Fine-Tuned Diffusion Models More Traceable [70.77600345240867]
A novel arbitrary-in-arbitrary-out (AIAO) strategy makes watermarks resilient to fine-tuning-based removal.
Unlike existing methods that design a backdoor for the input/output space of diffusion models, our method embeds the backdoor into the feature space of sampled subpaths.
Our empirical studies on the MS-COCO, AFHQ, LSUN, CUB-200, and DreamBooth datasets confirm the robustness of AIAO.
arXiv Detail & Related papers (2024-05-01T12:03:39Z) - A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation [76.34411067299331]
Large language models often tend to 'hallucinate' which critically hampers their reliability.
We propose an approach that actively detects and mitigates hallucinations during the generation process.
We show that the proposed active detection and mitigation approach successfully reduces the hallucinations of the GPT-3.5 model from 47.5% to 14.5% on average.
arXiv Detail & Related papers (2023-07-08T14:25:57Z)