Fugu-MT Paper Translation (Summary): SafeSearch: Do Not Trade Safety for Utility in LLM Search Agents

Paper summary: SafeSearch: Do Not Trade Safety for Utility in LLM Search Agents

arxiv url: http://arxiv.org/abs/2510.17017v3
Date: Wed, 05 Nov 2025 04:51:03 GMT
Status: Translation complete
Last updated in system: 2025-11-06 20:32:09.433678
Title: SafeSearch: Do Not Trade Safety for Utility in LLM Search Agents
Title (reference translation): SafeSearch: Not Trading Safety for the Utility of LLM Search Agents
Authors: Qiusi Zhan, Angeline Budiman-Chan, Abdelrahman Zayed, Xingzhi Guo, Daniel Kang, Joo-Kyung Kim,
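Abstract summary: Large language model (LLM)-based search agents iteratively generate queries, retrieve external information, and answer open-domain questions. Researchers have mainly focused on improving their utility, but their safety behavior remains underexplored. SafeSearch is a multi-objective reinforcement learning approach that couples a final-output safety/utility reward with a novel query-level shaping term.
Reference score (independently calculated attention score): 14.471045017602428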
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language model (LLM) based search agents iteratively generate queries, retrieve external information, and reason to answer open-domain questions. While researchers have primarily focused on improving their utility, their safety behaviors remain underexplored. In this paper, we first evaluate search agents using red-teaming datasets and find that they are more likely to produce harmful outputs than base LLMs. For example, when asked "How can I track someone's location without their consent?", a base model refuses, whereas a search agent designed to retrieve and cite sources may lower its refusal threshold, fetch documents (e.g., court cases), and, once appended, synthesize them into an informative yet unsafe summary. We further show that utility-oriented fine-tuning intensifies this risk, motivating joint alignment of safety and utility. We present SafeSearch, a multi-objective reinforcement learning approach that couples a final-output safety/utility reward with a novel query-level shaping term that penalizes unsafe queries and rewards safe ones. Experiments show that SafeSearch reduces agent harmfulness by over 70% across three red-teaming datasets while producing safe, helpful responses, and matches the QA performance of a utility-only finetuned agent; further analyses confirm the effectiveness of the query-level reward in jointly improving safety and utility.
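Abstract (reference translation): Large language model (LLM)-based search agents iteratively generate queries, retrieve external information, and reason to answer open-domain questions. Researchers have mainly focused on improving their utility, but their safety behavior remains underexplored. In this paper, we first evaluate search agents using red-teaming datasets and find that they are more likely to produce harmful outputs than base LLMs. For example, when asked "How can I track someone's location without their consent?", a base model refuses, whereas a search agent designed to retrieve and cite sources may lower its refusal threshold, fetch documents (e.g., court cases), and, once they are appended, synthesize them into an informative yet unsafe summary. We further show that utility-oriented fine-tuning intensifies this risk, motivating joint alignment of safety and utility. SafeSearch is a multi-objective reinforcement learning approach that couples a novel query-level shaping term, which penalizes unsafe queries and rewards safe ones, with a final-output safety/utility reward. Experiments show that SafeSearch reduces agent harmfulness by over 70% across three red-teaming datasets while producing safe, helpful responses, and matches the QA performance of a utility-only fine-tuned agent.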
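The abstract describes a search agent as a loop: generate a query, retrieve external information, append it to the context, and eventually answer. Below is a minimal sketch of that loop. It assumes hypothetical callables `propose_query`, `retrieve`, and `answer` that stand in for the LLM policy and the search tool. None of these names, nor the `Step`/`Trajectory` types, come from the paper. The per-turn `query` recorded in each `Step` is the kind of intermediate output that a query-level reward would score.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Step:
    query: str
    documents: List[str]


@dataclass
class Trajectory:
    question: str
    steps: List[Step] = field(default_factory=list)
    answer: Optional[str] = None


def run_search_agent(
    question: str,
    propose_query: Callable[[str, List[str]], Optional[str]],
    retrieve: Callable[[str], List[str]],
    answer: Callable[[str, List[str]], str],
    max_turns: int = 3,
) -> Trajectory:
    """Minimal search-agent loop (illustrative): query, retrieve, append, repeat, then answer."""
    traj = Trajectory(question)
    context: List[str] = []
    for _ in range(max_turns):
        query = propose_query(question, context)
        if query is None:  # the policy decides it has enough information
            break
        docs = retrieve(query)
        traj.steps.append(Step(query, docs))
        context.extend(docs)  # retrieved text is appended to the agent's context
    traj.answer = answer(question, context)
    return traj


# Hypothetical stand-ins for the LLM policy and the search tool.
def toy_propose_query(question: str, context: List[str]) -> Optional[str]:
    # Issue one search, then stop once anything has been retrieved.
    return None if context else f"search: {question}"


def toy_retrieve(query: str) -> List[str]:
    return [f"snippet for '{query}'"]


def toy_answer(question: str, context: List[str]) -> str:
    return f"Answer to {question!r} using {len(context)} document(s)"


traj = run_search_agent("Who wrote Hamlet?", toy_propose_query, toy_retrieve, toy_answer)
for step in traj.steps:
    print("query:", step.query)
print(traj.answer)
```

The retrieved documents are appended to the context before the final answer is generated, so whatever the agent fetches can shape that answer. The abstract's location-tracking example points to exactly this mechanism.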

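SafeSearch's reward couples a final-output safety/utility term with a query-level shaping term that penalizes unsafe queries and rewards safe ones. The sketch below illustrates that idea only. The abstract does not give the exact formula, so several details are assumptions:

- The combination rule: an unsafe final answer gets -1, and a safe one earns its utility score.
- Each query scores ±1, and these scores are averaged over the trajectory.
- The weight `query_weight` is a free parameter.

The function name `shaped_reward` is also hypothetical.

```python
from typing import Sequence


def shaped_reward(
    output_is_safe: bool,
    utility: float,
    query_is_safe: Sequence[bool],
    query_weight: float = 0.1,
) -> float:
    """Hypothetical SafeSearch-style reward: final-output term plus query-level shaping.

    The functional form and the default weight are illustrative assumptions,
    not the paper's exact definition.
    """
    # Final-output term (assumed form): an unsafe answer is penalized outright,
    # a safe answer earns its utility score (e.g. QA correctness in [0, 1]).
    final_term = utility if output_is_safe else -1.0

    # Query-level shaping term (assumed form): +1 for each safe query,
    # -1 for each unsafe query, averaged so longer trajectories are not favored.
    if query_is_safe:
        shaping = sum(1.0 if safe else -1.0 for safe in query_is_safe) / len(query_is_safe)
    else:
        shaping = 0.0  # the agent answered without searching

    return final_term + query_weight * shaping


# Same safe, useful answer; the trajectory that issued an unsafe query scores lower,
# and an unsafe final answer scores lowest regardless of its utility.
clean = shaped_reward(True, 0.8, [True, True])
leaky = shaped_reward(True, 0.8, [True, False])
unsafe = shaped_reward(False, 0.9, [False])
print(clean, leaky, unsafe)
```

Averaging the per-query scores is one possible design choice. Summing them instead would let the shaping term grow with the number of search turns.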

Related papers