Related papers: Agentic Reinforcement Learning for Search is Unsafe

URL: http://arxiv.org/abs/2510.17431v1
Date: Mon, 20 Oct 2025 11:19:37 GMT
Title: Agentic Reinforcement Learning for Search is Unsafe
Authors: Yushi Yang, Shreyansh Padarha, Andrew Lee, Adam Mahdi
Abstract summary: We show that RL-trained search models inherit refusal from instruction tuning and often deflect harmful requests by turning them into safe queries. Two simple attacks trigger cascades of harmful searches and answers. As a result, RL search models have vulnerabilities that users can easily exploit.
Score: 3.3562013033694598
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Agentic reinforcement learning (RL) trains large language models to autonomously call tools during reasoning, with search as the most common application. These models excel at multi-step reasoning tasks, but their safety properties are not well understood. In this study, we show that RL-trained search models inherit refusal from instruction tuning and often deflect harmful requests by turning them into safe queries. However, this safety is fragile. Two simple attacks, one that forces the model to begin its response with a search (Search attack) and another that encourages models to search repeatedly (Multi-search attack), trigger cascades of harmful searches and answers. Across two model families (Qwen, Llama) with both local and web search, these attacks lower refusal rates by up to 60.0%, answer safety by 82.5%, and search-query safety by 82.4%. The attacks succeed by triggering models to generate harmful, request-mirroring search queries before they can generate the inherited refusal tokens. This exposes a core weakness of current RL training: it rewards continued generation of effective queries without accounting for their harmfulness. As a result, RL search models have vulnerabilities that users can easily exploit, making it urgent to develop safety-aware agentic RL pipelines optimising for safe search.

Related papers

THINKSAFE: Self-Generated Safety Alignment for Reasoning Models [60.10077024249373]
We propose ThinkSafe, a framework that restores safety alignment without external teachers. Our key insight is that while compliance suppresses safety mechanisms, models often retain latent knowledge to identify harm. Experiments on DeepSeek-R1-Distill and Qwen3 show ThinkSafe significantly improves safety while preserving reasoning proficiency.
arXiv Detail & Related papers (2026-01-30T16:31:02Z)
AdaSearch: Balancing Parametric Knowledge and Search in Large Language Models via Reinforcement Learning [61.974530499621274]
Overreliance on search introduces unnecessary cost and risks exposure to noisy or malicious content. We propose a two-stage, outcome-driven RL framework that disentangles problem solving from the decision of whether to invoke search. AdaSearch substantially improves knowledge-boundary awareness, reduces unnecessary search calls, preserves strong task performance, and offers more transparent, interpretable decision behaviors.
arXiv Detail & Related papers (2025-12-18T18:50:01Z)
When Models Outthink Their Safety: Mitigating Self-Jailbreak in Large Reasoning Models with Chain-of-Guardrails [74.63933201261595]
Large Reasoning Models (LRMs) demonstrate remarkable capabilities on complex reasoning tasks. LRMs remain vulnerable to severe safety risks, including harmful content generation and jailbreak attacks. We propose the Chain-of-Guardrail (CoG), a training framework that recomposes or backtracks unsafe reasoning steps.
arXiv Detail & Related papers (2025-10-24T09:32:25Z)
SafeSearch: Do Not Trade Safety for Utility in LLM Search Agents [14.471045017602428]
Large language model (LLM) based search agents iteratively generate queries, retrieve external information, and answer open-domain questions. While researchers have primarily focused on improving their utility, their safety behaviors remain underexplored. We present SafeSearch, a multi-objective reinforcement learning approach that couples a final-output safety/utility reward with a novel query-level shaping term.
arXiv Detail & Related papers (2025-10-19T21:47:19Z)
Adversarial Reinforcement Learning for Large Language Model Agent Safety [20.704989548285372]
Large Language Model (LLM) agents can leverage tools like Google Search to complete complex tasks. Current defense strategies rely on fine-tuning LLM agents on datasets of known attacks. We propose Adversarial Reinforcement Learning for Agent Safety (ARLAS), a novel framework that leverages adversarial reinforcement learning (RL) by formulating the problem as a two-player zero-sum game.
arXiv Detail & Related papers (2025-10-06T23:09:18Z)
SafeSearch: Automated Red-Teaming for the Safety of LLM-Based Search Agents [63.70653857721785]
We conduct two in-the-wild experiments to demonstrate the prevalence of low-quality search results and their potential to misguide agent behaviors. To counter this threat, we introduce an automated red-teaming framework that is systematic, scalable, and cost-efficient.
arXiv Detail & Related papers (2025-09-28T07:05:17Z)
Self-Evaluation as a Defense Against Adversarial Attacks on LLMs [20.79833694266861]
We introduce a defense against adversarial attacks on LLMs utilizing self-evaluation. Our method requires no model fine-tuning, instead using pre-trained models to evaluate the inputs and outputs of a generator model. We present an analysis of the effectiveness of our method, including attempts to attack the evaluator in various settings.
arXiv Detail & Related papers (2024-07-03T16:03:42Z)
Evaluating Robustness of Generative Search Engine on Adversarial Factual Questions [89.35345649303451]
Generative search engines have the potential to transform how people seek information online. But responses generated by existing large language model (LLM)-backed generative search engines may not always be accurate. Retrieval-augmented generation exacerbates safety concerns, since adversaries may successfully evade the entire system.
arXiv Detail & Related papers (2024-02-25T11:22:19Z)
On Prompt-Driven Safeguarding for Large Language Models [172.13943777203377]
We find that in the representation space, the input queries are typically moved by safety prompts in a "higher-refusal" direction. Inspired by these findings, we propose a method for safety prompt optimization, namely DRO. Treating a safety prompt as continuous, trainable embeddings, DRO learns to move the queries' representations along or opposite the refusal direction, depending on their harmfulness.
arXiv Detail & Related papers (2024-01-31T17:28:24Z)
Navigating the OverKill in Large Language Models [84.62340510027042]
We investigate the factors behind overkill by exploring how models handle and determine the safety of queries. Our findings reveal the presence of shortcuts within models, which lead to over-attention to harmful words like 'kill', and show that prompts emphasizing safety exacerbate overkill. We introduce Self-Contrastive Decoding (Self-CD), a training-free and model-agnostic strategy, to alleviate this phenomenon.
arXiv Detail & Related papers (2024-01-31T07:26:47Z)
Hijacking Attacks against Neural Networks by Analyzing Training Data [21.277867143827812]
CleanSheet is a new model hijacking attack that obtains the high performance of backdoor attacks without requiring the adversary to train the model. CleanSheet exploits vulnerabilities in tampers stemming from the training data. Results show that CleanSheet performs comparably to state-of-the-art backdoor attacks, achieving an average attack success rate (ASR) of 97.5% on CIFAR-100 and 92.4% on GTSRB.
arXiv Detail & Related papers (2024-01-18T05:48:56Z)
Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases [32.2246459413988]
Red-teaming aims to jailbreak a model's safety behavior to make it act as a helpful agent disregarding the harmfulness of the query. We present a new perspective on safety research, i.e., red-teaming through Unalignment. Unalignment tunes the model parameters to break model guardrails that are not deeply rooted in the model's behavior.
arXiv Detail & Related papers (2023-10-22T13:55:46Z)

This list is automatically generated from the titles and abstracts of the papers on this site.