Related papers: SafeSearch: Automated Red-Teaming for the Safety of LLM-Based Search Agents

SafeSearch: Automated Red-Teaming for the Safety of LLM-Based Search Agents

URL: http://arxiv.org/abs/2509.23694v2
Date: Wed, 01 Oct 2025 01:48:11 GMT
Title: SafeSearch: Automated Red-Teaming for the Safety of LLM-Based Search Agents
Authors: Jianshuo Dong, Sheng Guo, Hao Wang, Zhuotao Liu, Tianwei Zhang, Ke Xu, Minlie Huang, Han Qiu,
Abstract summary: We conduct two in-the-wild experiments to demonstrate the prevalence of low-quality search results and their potential to misguide agent behaviors.<n>To counter this threat, we introduce an automated red-teaming framework that is systematic, scalable, and cost-efficient.
Score: 58.24401593597499
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Search agents connect LLMs to the Internet, enabling access to broader and more up-to-date information. However, unreliable search results may also pose safety threats to end users, establishing a new threat surface. In this work, we conduct two in-the-wild experiments to demonstrate both the prevalence of low-quality search results and their potential to misguide agent behaviors. To counter this threat, we introduce an automated red-teaming framework that is systematic, scalable, and cost-efficient, enabling lightweight and harmless safety assessments of search agents. Building on this framework, we construct the SafeSearch benchmark, which includes 300 test cases covering five categories of risks (e.g., misinformation and indirect prompt injection). Using this benchmark, we evaluate three representative search agent scaffolds, covering search workflow, tool-calling, and deep research, across 7 proprietary and 8 open-source backend LLMs. Our results reveal substantial vulnerabilities of LLM-based search agents: when exposed to unreliable websites, the highest ASR reached 90.5% for GPT-4.1-mini under a search workflow setting. Moreover, our analysis highlights the limited effectiveness of common defense practices, such as reminder prompting. This emphasizes the value of our framework in promoting transparency for safer agent development. Our codebase and test cases are publicly available: https://github.com/jianshuod/SafeSearch.

Related papers

The Trojan Knowledge: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search [58.8834056209347]
Large language models (LLMs) remain vulnerable to jailbreak attacks that bypass safety guardrails to elicit harmful outputs.<n>We introduce the Correlated Knowledge Attack Agent (CKA-Agent), a dynamic framework that reframes jailbreaking as an adaptive, tree-structured exploration of the target model's knowledge base.
arXiv Detail & Related papers (2025-12-01T07:05:23Z)
SafeSearch: Do Not Trade Safety for Utility in LLM Search Agents [14.471045017602428]
Large language model (LLM) based search agents iteratively generate queries, retrieve external information, and answer open-domain questions.<n>While researchers have primarily focused on improving their utility, their safety behaviors remain underexplored.<n>We present SafeSearch, a multi-objective reinforcement learning approach that couples a final-output safety/utility reward with a novel query-level shaping term.
arXiv Detail & Related papers (2025-10-19T21:47:19Z)
CREST-Search: Comprehensive Red-teaming for Evaluating Safety Threats in Large Language Models Powered by Web Search [28.45573025341277]
Large Language Models (LLMs) excel at tasks such as dialogue, summarization, and question answering.<n>To overcome this, web search has been integrated into LLMs, allowing real-time access to online content.<n>This connection magnifies safety risks, as adversarial prompts combined with untrusted sources can cause severe vulnerabilities.<n>We present CREST-Search, a framework that systematically exposes risks in such systems.
arXiv Detail & Related papers (2025-10-09T09:44:14Z)
A Survey on Agentic Security: Applications, Threats and Defenses [6.83318476483428]
The rapid shift from passive LLMs to autonomous LLM-agents marks a new paradigm in cybersecurity.<n>While these agents can act as powerful tools for both offensive and defensive operations, the very agentic context introduces a new class of inherent security risks.<n>We provide a comprehensive taxonomy of over 150 papers, explaining how agents are used, the vulnerabilities they possess, and the countermeasures designed to protect them.
arXiv Detail & Related papers (2025-10-07T20:32:20Z)
OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety [58.201189860217724]
We introduce OpenAgentSafety, a comprehensive framework for evaluating agent behavior across eight critical risk categories.<n>Unlike prior work, our framework evaluates agents that interact with real tools, including web browsers, code execution environments, file systems, bash shells, and messaging platforms.<n>It combines rule-based analysis with LLM-as-judge assessments to detect both overt and subtle unsafe behaviors.
arXiv Detail & Related papers (2025-07-08T16:18:54Z)
AgentVigil: Generic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents [54.29555239363013]
We propose a generic black-box fuzzing framework, AgentVigil, to automatically discover and exploit indirect prompt injection vulnerabilities.<n>We evaluate AgentVigil on two public benchmarks, AgentDojo and VWA-adv, where it achieves 71% and 70% success rates against agents based on o3-mini and GPT-4o.<n>We apply our attacks in real-world environments, successfully misleading agents to navigate to arbitrary URLs, including malicious sites.
arXiv Detail & Related papers (2025-05-09T07:40:17Z)
SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents [58.65256663334316]
We present SafeAgentBench -- the first benchmark for safety-aware task planning of embodied LLM agents in interactive simulation environments.<n>SafeAgentBench includes: (1) an executable, diverse, and high-quality dataset of 750 tasks, rigorously curated to cover 10 potential hazards and 3 task types; (2) SafeAgentEnv, a universal embodied environment with a low-level controller, supporting multi-agent execution with 17 high-level actions for 9 state-of-the-art baselines; and (3) reliable evaluation methods from both execution and semantic perspectives.
arXiv Detail & Related papers (2024-12-17T18:55:58Z)
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents [84.96249955105777]
LLM agents may pose a greater risk if misused, but their robustness remains underexplored.<n>We propose a new benchmark called AgentHarm to facilitate research on LLM agent misuse.<n>We find leading LLMs are surprisingly compliant with malicious agent requests without jailbreaking.
arXiv Detail & Related papers (2024-10-11T17:39:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.