Fugu-MT Paper Translation (Abstract): Deep Research Brings Deeper Harm

Paper summary: Deep Research Brings Deeper Harm

arxiv url: http://arxiv.org/abs/2510.11851v1
Date: Mon, 13 Oct 2025 19:05:00 GMT
Status: Translation complete
Last updated in system: 2025-10-15 19:02:32.069987
Title: Deep Research Brings Deeper Harm
Title (reference translation): Deep Research That Brings Deeper Harm
Authors: Shuo Chen, Zonggen Li, Zhen Han, Bailan He, Tong Liu, Haokun Chen, Georg Groh, Philip Torr, Volker Tresp, Jindong Gu
Abstract summary: Deep Research (DR) agents built on Large Language Models (LLMs) can perform complex, multi-step research. This is especially concerning in high-stakes, knowledge-intensive domains such as biosecurity. The paper proposes two novel jailbreak strategies: Plan Injection, which injects malicious sub-goals into the agent's plan, and Intent Hijack, which reframes harmful queries as academic research questions.
Reference score (independently computed attention score): 64.71728362573624
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Deep Research (DR) agents built on Large Language Models (LLMs) can perform complex, multi-step research by decomposing tasks, retrieving online information, and synthesizing detailed reports. However, the misuse of LLMs with such powerful capabilities can lead to even greater risks. This is especially concerning in high-stakes and knowledge-intensive domains such as biosecurity, where DR can generate a professional report containing detailed forbidden knowledge. Unfortunately, we have found such risks in practice: simply submitting a harmful query, which a standalone LLM directly rejects, can elicit a detailed and dangerous report from DR agents. This highlights the elevated risks and underscores the need for a deeper safety analysis. Yet, jailbreak methods designed for LLMs fall short in exposing such unique risks, as they do not target the research ability of DR agents. To address this gap, we propose two novel jailbreak strategies: Plan Injection, which injects malicious sub-goals into the agent's plan; and Intent Hijack, which reframes harmful queries as academic research questions. We conducted extensive experiments across different LLMs and various safety benchmarks, including general and biosecurity forbidden prompts. These experiments reveal three key findings: (1) Alignment of the LLMs often fails in DR agents, where harmful prompts framed in academic terms can hijack agent intent; (2) Multi-step planning and execution weaken the alignment, revealing systemic vulnerabilities that prompt-level safeguards cannot address; (3) DR agents not only bypass refusals but also produce more coherent, professional, and dangerous content, compared with standalone LLMs. These results demonstrate a fundamental misalignment in DR agents and call for better alignment techniques tailored to DR agents. Code and datasets are available at https://chenxshuo.github.io/deeper-harm.
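Abstract (reference translation): Deep Research (DR) agents built on Large Language Models (LLMs) can perform complex, multi-step research by decomposing tasks, retrieving online information, and synthesizing detailed reports. However, misuse of LLMs with such powerful capabilities can lead to even greater risks. This is especially concerning in high-stakes, knowledge-intensive domains such as biosecurity, where DR can generate a professional report containing detailed forbidden knowledge. Simply submitting a harmful query that a standalone LLM directly rejects can elicit a detailed and dangerous report from DR agents. This highlights the elevated risks and underscores the need for a deeper safety analysis. Yet jailbreak methods designed for LLMs fall short in exposing such unique risks, because they do not target the research ability of DR agents. To address this gap, the paper proposes two novel jailbreak strategies: Plan Injection, which injects malicious sub-goals into the agent's plan, and Intent Hijack, which reframes harmful queries as academic research questions. Extensive experiments were conducted across different LLMs and various safety benchmarks, including general and biosecurity forbidden prompts. These experiments reveal three key findings: (1) LLM alignment often fails in DR agents; (2) multi-step planning and execution weaken alignment; (3) DR agents not only bypass refusals but also produce more coherent, professional, and dangerous content than standalone LLMs. These results demonstrate a fundamental misalignment in DR agents and call for alignment techniques tailored to DR agents. Code and datasets are available at https://chenxshuo.github.io/deeper-harm.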

Related papers