Fugu-MT Paper Translation (Summary): RAID: Refusal-Aware and Integrated Decoding for Jailbreaking LLMs

Paper summary: RAID: Refusal-Aware and Integrated Decoding for Jailbreaking LLMs

arxiv url: http://arxiv.org/abs/2510.13901v1
Date: Tue, 14 Oct 2025 19:33:09 GMT
Status: Translation complete
Last updated in system: 2025-10-17 21:15:14.526354
Title: RAID: Refusal-Aware and Integrated Decoding for Jailbreaking LLMs
Authors: Tuan T. Nguyen, John Le, Thai T. Vu, Willy Susilo, Heath Cooper,
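Abstract summary: RAID (Refusal-Aware and Integrated Decoding) is a framework that crafts adversarial suffixes that induce restricted content while preserving fluency. RAID is shown to achieve higher attack success rates with fewer queries and lower computational cost than recent white-box and black-box baselines.
Reference score (independently computed attention score): 17.313975711973374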
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) achieve impressive performance across diverse tasks yet remain vulnerable to jailbreak attacks that bypass safety mechanisms. We present RAID (Refusal-Aware and Integrated Decoding), a framework that systematically probes these weaknesses by crafting adversarial suffixes that induce restricted content while preserving fluency. RAID relaxes discrete tokens into continuous embeddings and optimizes them with a joint objective that (i) encourages restricted responses, (ii) incorporates a refusal-aware regularizer to steer activations away from refusal directions in embedding space, and (iii) applies a coherence term to maintain semantic plausibility and non-redundancy. After optimization, a critic-guided decoding procedure maps embeddings back to tokens by balancing embedding affinity with language-model likelihood. This integration yields suffixes that are both effective in bypassing defenses and natural in form. Experiments on multiple open-source LLMs show that RAID achieves higher attack success rates with fewer queries and lower computational cost than recent white-box and black-box baselines. These findings highlight the importance of embedding-space regularization for understanding and mitigating LLM jailbreak vulnerabilities.

Related papers