Related papers: Safety Devolution in AI Agents

Safety Devolution in AI Agents

URL: http://arxiv.org/abs/2505.14215v1
Date: Tue, 20 May 2025 11:21:40 GMT
Title: Safety Devolution in AI Agents
Authors: Cheng Yu, Benedikt Stroebl, Diyi Yang, Orestis Papakyriakopoulos,
Abstract summary: This study investigates how expanding retrieval access affects model reliability, bias propagation, and harmful content generation.<n>Retrieval-augmented agents built on aligned LLMs often behave more unsafely than uncensored models without retrieval.<n>These findings underscore the need for robust mitigation strategies to ensure fairness and reliability in retrieval-augmented and increasingly autonomous AI systems.
Score: 56.482973617087254
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As retrieval-augmented AI agents become more embedded in society, their safety properties and ethical behavior remain insufficiently understood. In particular, the growing integration of LLMs and AI agents raises critical questions about how they engage with and are influenced by their environments. This study investigates how expanding retrieval access, from no external sources to Wikipedia-based retrieval and open web search, affects model reliability, bias propagation, and harmful content generation. Through extensive benchmarking of censored and uncensored LLMs and AI Agents, our findings reveal a consistent degradation in refusal rates, bias sensitivity, and harmfulness safeguards as models gain broader access to external sources, culminating in a phenomenon we term safety devolution. Notably, retrieval-augmented agents built on aligned LLMs often behave more unsafely than uncensored models without retrieval. This effect persists even under strong retrieval accuracy and prompt-based mitigation, suggesting that the mere presence of retrieved content reshapes model behavior in structurally unsafe ways. These findings underscore the need for robust mitigation strategies to ensure fairness and reliability in retrieval-augmented and increasingly autonomous AI systems.

Related papers

Leveraging LLM Parametric Knowledge for Fact Checking without Retrieval [60.25608870901428]
Trustworthiness is a core research challenge for agentic AI systems built on Large Language Models (LLMs)<n>We propose the task of fact-checking without retrieval, focusing on the verification of arbitrary natural language claims, independent of their source robustness.
arXiv Detail & Related papers (2026-03-05T18:42:51Z)
The Synthetic Web: Adversarially-Curated Mini-Internets for Diagnosing Epistemic Weaknesses of Language Agents [0.0]
Language agents increasingly act as web-enabled systems that search, browse, and synthesize information from diverse sources.<n>These sources can include unreliable or adversarial content, and the robustness of agents to adversarial ranking remains poorly understood.<n>We introduce Synthetic Web Benchmark, a procedurally generated environment comprising thousands of hyperlinked articles with ground-truth labels for credibility and factuality.
arXiv Detail & Related papers (2026-02-28T20:27:44Z)
Hallucination Detection and Mitigation in Large Language Models [0.0]
Large Language Models (LLMs) and Large Reasoning Models (LRMs) offer transformative potential for high-stakes domains like finance and law.<n>Their tendency to hallucinate, generating factually incorrect or unsupported content, poses a critical reliability risk.<n>This paper introduces a comprehensive framework for hallucination management, built on a continuous improvement cycle driven by root cause awareness.
arXiv Detail & Related papers (2026-01-14T23:19:37Z)
Breaking the Trade-Off Between Faithfulness and Expressiveness for Large Language Models [14.166203096918247]
Grounding responses in external knowledge represents an effective strategy for mitigating hallucinations in Large Language Models.<n>Current LLMs struggle to seamlessly integrate knowledge while simultaneously maintaining faithfulness and expressiveness.<n>We propose Collaborative Decoding, a novel approach that dynamically integrates output probabilities generated with and without external knowledge.
arXiv Detail & Related papers (2025-08-26T03:48:05Z)
A Survey on Autonomy-Induced Security Risks in Large Model-Based Agents [45.53643260046778]
Recent advances in large language models (LLMs) have catalyzed the rise of autonomous AI agents.<n>These large-model agents mark a paradigm shift from static inference systems to interactive, memory-augmented entities.
arXiv Detail & Related papers (2025-06-30T13:34:34Z)
Expert-in-the-Loop Systems with Cross-Domain and In-Domain Few-Shot Learning for Software Vulnerability Detection [38.083049237330826]
This study explores the use of Large Language Models (LLMs) in software vulnerability assessment by simulating the identification of Python code with known Common Weaknessions (CWEs)<n>Our results indicate that while zero-shot prompting performs poorly, few-shot prompting significantly enhances classification performance.<n> challenges such as model reliability, interpretability, and adversarial robustness remain critical areas for future research.
arXiv Detail & Related papers (2025-06-11T18:43:51Z)
InfoDeepSeek: Benchmarking Agentic Information Seeking for Retrieval-Augmented Generation [63.55258191625131]
InfoDeepSeek is a new benchmark for assessing agentic information seeking in real-world, dynamic web environments.<n>We propose a systematic methodology for constructing challenging queries satisfying the criteria of determinacy, difficulty, and diversity.<n>We develop the first evaluation framework tailored to dynamic agentic information seeking, including fine-grained metrics about the accuracy, utility, and compactness of information seeking outcomes.
arXiv Detail & Related papers (2025-05-21T14:44:40Z)
Cannot See the Forest for the Trees: Invoking Heuristics and Biases to Elicit Irrational Choices of LLMs [83.11815479874447]
We propose a novel jailbreak attack framework, inspired by cognitive decomposition and biases in human cognition.<n>We employ cognitive decomposition to reduce the complexity of malicious prompts and relevance bias to reorganize prompts.<n>We also introduce a ranking-based harmfulness evaluation metric that surpasses the traditional binary success-or-failure paradigm.
arXiv Detail & Related papers (2025-05-03T05:28:11Z)
Information Retrieval in the Age of Generative AI: The RGB Model [77.96475639967431]
This paper presents a novel quantitative approach to shed light on the complex information dynamics arising from the growing use of generative AI tools.<n>We propose a model to characterize the generation, indexing, and dissemination of information in response to new topics.<n>Our findings suggest that the rapid pace of generative AI adoption, combined with increasing user reliance, can outpace human verification, escalating the risk of inaccurate information proliferation.
arXiv Detail & Related papers (2025-04-29T10:21:40Z)
ParamMute: Suppressing Knowledge-Critical FFNs for Faithful Retrieval-Augmented Generation [91.20492150248106]
We investigate the internal mechanisms behind unfaithful generation and identify a subset of mid-to-deep feed-forward networks (FFNs) that are disproportionately activated in such cases.<n>We propose Parametric Knowledge Muting through FFN Suppression (ParamMute), a framework that improves contextual faithfulness by suppressing the activation of unfaithfulness-associated FFNs.<n> Experimental results show that ParamMute significantly enhances faithfulness across both CoFaithfulQA and the established ConFiQA benchmark, achieving substantial reductions in reliance on parametric memory.
arXiv Detail & Related papers (2025-02-21T15:50:41Z)
TrustRAG: Enhancing Robustness and Trustworthiness in RAG [31.231916859341865]
TrustRAG is a framework that systematically filters compromised and irrelevant contents before they are retrieved for generation.<n>TrustRAG delivers substantial improvements in retrieval accuracy, efficiency, and attack resistance compared to existing approaches.
arXiv Detail & Related papers (2025-01-01T15:57:34Z)
Towards More Robust Retrieval-Augmented Generation: Evaluating RAG Under Adversarial Poisoning Attacks [45.07581174558107]
Retrieval-Augmented Generation (RAG) systems have emerged as a promising solution to mitigate hallucinations.<n>RAG systems are vulnerable to adversarial poisoning attacks, where malicious passages injected into retrieval databases can mislead the model into generating factually incorrect outputs.<n>This paper investigates both the retrieval and the generation components of RAG systems to understand how to enhance their robustness against such attacks.
arXiv Detail & Related papers (2024-12-21T17:31:52Z)
Transferable Adversarial Attacks on SAM and Its Downstream Models [87.23908485521439]
This paper explores the feasibility of adversarial attacking various downstream models fine-tuned from the segment anything model (SAM)<n>To enhance the effectiveness of the adversarial attack towards models fine-tuned on unknown datasets, we propose a universal meta-initialization (UMI) algorithm.
arXiv Detail & Related papers (2024-10-26T15:04:04Z)
"Glue pizza and eat rocks" -- Exploiting Vulnerabilities in Retrieval-Augmented Generative Models [74.05368440735468]
Retrieval-Augmented Generative (RAG) models enhance Large Language Models (LLMs) In this paper, we demonstrate a security threat where adversaries can exploit the openness of these knowledge bases.
arXiv Detail & Related papers (2024-06-26T05:36:23Z)
Evaluating Robustness of Generative Search Engine on Adversarial Factual Questions [89.35345649303451]
Generative search engines have the potential to transform how people seek information online. But generated responses from existing large language models (LLMs)-backed generative search engines may not always be accurate. Retrieval-augmented generation exacerbates safety concerns, since adversaries may successfully evade the entire system.
arXiv Detail & Related papers (2024-02-25T11:22:19Z)
Prioritizing Safeguarding Over Autonomy: Risks of LLM Agents for Science [65.77763092833348]
Intelligent agents powered by large language models (LLMs) have demonstrated substantial promise in autonomously conducting experiments and facilitating scientific discoveries across various disciplines. While their capabilities are promising, these agents also introduce novel vulnerabilities that demand careful consideration for safety. This paper conducts a thorough examination of vulnerabilities in LLM-based agents within scientific domains, shedding light on potential risks associated with their misuse and emphasizing the need for safety measures.
arXiv Detail & Related papers (2024-02-06T18:54:07Z)
How Far Are LLMs from Believable AI? A Benchmark for Evaluating the Believability of Human Behavior Simulation [46.42384207122049]
We design SimulateBench to evaluate the believability of large language models (LLMs) when simulating human behaviors. Based on SimulateBench, we evaluate the performances of 10 widely used LLMs when simulating characters.
arXiv Detail & Related papers (2023-12-28T16:51:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.