Fugu-MT 論文翻訳(概要): Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth

論文の概要: Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth

arxiv url: http://arxiv.org/abs/2510.18081v1
Date: Mon, 20 Oct 2025 20:18:59 GMT
ステータス: 翻訳完了
システム内更新日: 2025-10-25 03:08:12.555372
Title: Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth
Title（参考訳）: LLMの安全アライメントを、どんな奥行きでもアンロックできる「Ed-Depth Alignment」
Authors: Jiawei Zhang, Andrew Estornell, David D. Baek, Bo Li, Xiaojun Xu,
Abstract要約: 提案するAny-Depth Alignment(ADA)は,オーバーヘッドを無視できる効果的な推論時防御法である。 ADAは有害性を再評価し、世代毎に拒絶を回復するモデルを誘導する。数十から数千のトークンにわたる敵のプリフィル攻撃に対して、約100%の拒絶率を確保している。
参考スコア（独自算出の注目度）: 19.670368480802725
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) exhibit strong but shallow alignment: they directly refuse harmful queries when a refusal is expected at the very start of an assistant turn, yet this protection collapses once a harmful continuation is underway (either through the adversarial attacks or via harmful assistant-prefill attacks). This raises a fundamental question: Can the innate shallow alignment in LLMs be unlocked to ensure safety at arbitrary generation depths? To achieve this goal, we propose Any-Depth Alignment (ADA), an effective inference-time defense with negligible overhead. ADA is built based on our observation that alignment is concentrated in the assistant header tokens through repeated use in shallow-refusal training, and these tokens possess the model's strong alignment priors. By reintroducing these tokens mid-stream, ADA induces the model to reassess harmfulness and recover refusals at any point in generation. Across diverse open-source model families (Llama, Gemma, Mistral, Qwen, DeepSeek, and gpt-oss), ADA achieves robust safety performance without requiring any changes to the base model's parameters. It secures a near-100% refusal rate against challenging adversarial prefill attacks ranging from dozens to thousands of tokens. Furthermore, ADA reduces the average success rate of prominent adversarial prompt attacks (such as GCG, AutoDAN, PAIR, and TAP) to below 3%. This is all accomplished while preserving utility on benign tasks with minimal over-refusal. ADA maintains this resilience even after the base model undergoes subsequent instruction tuning (benign or adversarial).
Abstract（参考訳）: 大きな言語モデル(LLM)は、強いが浅いアライメントを示す: アシスタントターンの開始時に拒否が期待されるときに、有害なクエリを直接拒否するが、この保護は有害な継続が進行中である(敵の攻撃または有害なアシスタントプリフィルアタックによって)と崩壊する。 LLMの自然の浅いアライメントは、任意の生成深度で安全性を確保するためにアンロックできるのか? この目的を達成するために、我々は無視可能なオーバーヘッドを持つ効果的な推論時防御であるAny-Depth Alignment (ADA)を提案する。 ADAは、浅層学習において繰り返し使用されることによって、アライメントがアシスタントヘッダトークンに集中しているという我々の観察に基づいて構築され、これらのトークンはモデルの強いアライメント先行を持っている。これらのトークンをストリーム中に再導入することで、ADAはモデルに有害性を再評価し、世代毎に拒絶を回復させる。さまざまなオープンソースモデルファミリ(Llama、Gemma、Mistral、Qwen、DeepSeek、gpt-oss)にわたって、ADAはベースモデルのパラメータを変更することなく堅牢な安全性能を達成する。数十から数千のトークンにわたる敵のプリフィル攻撃に対して、約100%の拒絶率を確保している。さらに、ADAは、GCG、AutoDAN、PAIR、TAPなどの顕著な敵の攻撃の平均成功率を3%以下に下げる。これはすべて、最小限のオーバーリフレクションで良質なタスクでユーティリティを保ちながら達成される。 ADAは、ベースモデルがその後の命令チューニング(良性または逆性)を実行した後でも、このレジリエンスを維持している。

論文の概要: Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth

関連論文リスト