Fugu-MT Paper Translation (Abstract): Exposing Long-Tail Safety Failures in Large Language Models through Efficient Diverse Response Sampling

Paper overview: Exposing Long-Tail Safety Failures in Large Language Models through Efficient Diverse Response Sampling

arxiv url: http://arxiv.org/abs/2603.14355v1
Date: Sun, 15 Mar 2026 12:45:29 GMT
Status: Translation complete
Last updated in system: 2026-03-17 16:19:35.767448
Title: Exposing Long-Tail Safety Failures in Large Language Models through Efficient Diverse Response Sampling
Title (reference translation): Exposing long-tail safety failures in large language models through efficient, diverse response sampling
Authors: Suvadeep Hajra, Palash Nandi, Tanmoy Chakraborty
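Abstract summary: This work shows that safety failures can be systematically exposed through diverse response generation (output-space exploration) for a fixed safety-critical prompt. It proposes Progressive Diverse Population Sampling (PDPS), which combines stochastic token-level sampling with diversity-aware selection. PDPS achieves attack success rates comparable to large-scale IID sampling while using only 8% to 29% of the computational cost.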
Reference score (independently computed attention measure): 16.855507865785345
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Safety tuning through supervised fine-tuning and reinforcement learning from human feedback has substantially improved the robustness of large language models (LLMs). However, it often suppresses rather than eliminates unsafe behaviors, leaving rare but critical failures hidden in the long tail of the output distribution. While most red-teaming work emphasizes adversarial prompt search (input-space optimization), we show that safety failures can also be systematically exposed through diverse response generation (output-space exploration) for a fixed safety-critical prompt, where increasing the number and diversity of sampled responses can drive jailbreak success rates close to unity. To efficiently uncover such failures, we propose Progressive Diverse Population Sampling (PDPS), which combines stochastic token-level sampling with diversity-aware selection to explore a large candidate pool of responses and retain a compact, semantically diverse subset. Across multiple jailbreak benchmarks and open-source LLMs, PDPS achieves attack success rates comparable to large-scale IID sampling while using only 8% to 29% of the computational cost. Under limited-response settings, it improves success rates by 26% to 40% over IID sampling and Diverse Beam Search. Furthermore, responses generated by PDPS exhibit both a higher number and greater diversity of unsafe outputs, demonstrating its effectiveness in uncovering a broader range of failures.
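Abstract (reference translation): Safety tuning via supervised fine-tuning and reinforcement learning from human feedback has greatly improved the robustness of large language models (LLMs). However, it often suppresses unsafe behaviors instead of eliminating them, leaving rare but important failures hidden in the long tail of the output distribution. Most red-teaming work emphasizes adversarial prompt search (input-space optimization), but the authors show that, for a fixed safety-critical prompt, safety failures can also be systematically exposed through diverse response generation (output-space exploration), and that increasing the number and diversity of sampled responses can bring the jailbreak success rate close to unity. To uncover such failures efficiently, they propose Progressive Diverse Population Sampling (PDPS), which combines stochastic token-level sampling with diversity-aware selection to explore a large candidate pool of responses and keep a compact, semantically diverse subset. Across multiple jailbreak benchmarks and open-source LLMs, PDPS achieves attack success rates comparable to large-scale IID sampling while using only 8% to 29% of the computational cost. In limited-response settings, it improves success rates by 26% to 40% over IID sampling and Diverse Beam Search. In addition, responses generated by PDPS contain both more unsafe outputs and a greater diversity of them, showing that the method is effective at revealing a broader range of failures.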
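
The abstract describes retaining a compact, semantically diverse subset from a large candidate pool but does not specify the selection rule. The sketch below illustrates one generic way such a step could look: greedy farthest-point (max-min distance) selection over precomputed response embeddings. It is not the PDPS algorithm from the paper; the function name `greedy_diverse_subset`, the use of Euclidean distance, and the toy 2-D "embeddings" are all assumptions made for illustration.

```python
import math


def greedy_diverse_subset(vectors, k):
    """Return indices of k mutually distant vectors (farthest-point heuristic).

    Hypothetical helper, not taken from the paper: `vectors` stands in for
    embeddings of sampled responses, and Euclidean distance is an assumed
    diversity measure.
    """
    n = len(vectors)
    k = min(k, n)
    if k <= 0:
        return []
    selected = [0]  # arbitrary seed: start from the first candidate
    # min_dist[i] = distance from candidate i to its nearest selected item
    min_dist = [math.dist(v, vectors[0]) for v in vectors]
    while len(selected) < k:
        remaining = [i for i in range(n) if i not in selected]
        # Pick the candidate farthest from everything chosen so far.
        nxt = max(remaining, key=lambda i: min_dist[i])
        selected.append(nxt)
        for i in range(n):
            min_dist[i] = min(min_dist[i], math.dist(vectors[i], vectors[nxt]))
    return selected


# Toy pool: three near-duplicates around the origin, two around (5, 5),
# and one outlier. Real use would pass sentence embeddings instead.
pool = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (-4.0, 3.0)]
chosen = greedy_diverse_subset(pool, 3)
print(chosen)
```

With the toy pool, the heuristic picks one point from each cluster rather than three near-duplicates, which is the intuition behind keeping a small but varied set of responses for evaluation.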

Related papers