Fugu-MT Paper Translation (Abstract): ADVERSA: Measuring Multi-Turn Guardrail Degradation and Judge Reliability in Large Language Models

Paper overview: ADVERSA: Measuring Multi-Turn Guardrail Degradation and Judge Reliability in Large Language Models

arxiv url: http://arxiv.org/abs/2603.10068v1
Date: Tue, 10 Mar 2026 03:00:34 GMT
Status: Translation complete
Last updated in system: 2026-03-12 16:22:32.609664
Title: ADVERSA: Measuring Multi-Turn Guardrail Degradation and Judge Reliability in Large Language Models
Title (reference translation): ADVERSA: Measuring Multi-Turn Guardrail Degradation and Judge Reliability in Large Language Models
Authors: Harry Owiredu-Ashley,
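Abstract summary: Most adversarial evaluations of large language model (LLM) safety assess single prompts and report binary pass/fail outcomes. ADVERSA is an automated red-teaming framework that measures guardrail dynamics as per-round compliance trajectories. It documents attacker drift as a failure mode in fine-tuned attackers deployed outside their training distribution.
Reference score (independently computed attention score): 0.0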
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Most adversarial evaluations of large language model (LLM) safety assess single prompts and report binary pass/fail outcomes, which fails to capture how safety properties evolve under sustained adversarial interaction. We present ADVERSA, an automated red-teaming framework that measures guardrail degradation dynamics as continuous per-round compliance trajectories rather than discrete jailbreak events. ADVERSA uses a fine-tuned 70B attacker model (ADVERSA-Red, Llama-3.1-70B-Instruct with QLoRA) that eliminates the attacker-side safety refusals that render off-the-shelf models unreliable as attackers, scoring victim responses on a structured 5-point rubric that treats partial compliance as a distinct measurable state. We report a controlled experiment across three frontier victim models (Claude Opus 4.6, Gemini 3.1 Pro, GPT-5.2) using a triple-judge consensus architecture in which judge reliability is measured as a first-class research outcome rather than assumed. Across 15 conversations of up to 10 adversarial rounds, we observe a 26.7% jailbreak rate with an average jailbreak round of 1.25, suggesting that in this evaluation setting, successful jailbreaks were concentrated in early rounds rather than accumulating through sustained pressure. We document inter-judge agreement rates, self-judge scoring tendencies, attacker drift as a failure mode in fine-tuned attackers deployed out of their training distribution, and attacker refusals as a previously-underreported confound in victim resistance measurement. All limitations are stated explicitly. Attack prompts are withheld per responsible disclosure policy; all other experimental artifacts are released.
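Abstract (reference translation): Most safety evaluations of large language models (LLMs) assess single prompts and report binary pass/fail outcomes. We propose ADVERSA, an automated red-teaming framework that measures guardrail degradation dynamics as continuous per-round compliance trajectories rather than as discrete jailbreak events. ADVERSA uses a fine-tuned 70B attacker model (ADVERSA-Red, Llama-3.1-70B-Instruct with QLoRA), which eliminates the attacker-side safety refusals that make off-the-shelf models unreliable as attackers, and scores victim responses on a structured 5-point rubric that treats partial compliance as a distinct measurable state. We report a controlled experiment on three frontier victim models (Claude Opus 4.6, Gemini 3.1 Pro, GPT-5.2) using a triple-judge consensus architecture in which judge reliability is measured as a first-class research outcome rather than assumed. Across 15 conversations of up to 10 adversarial rounds, the jailbreak rate was 26.7% with an average jailbreak round of 1.25, suggesting that in this evaluation setting successful jailbreaks were concentrated in early rounds rather than accumulating under sustained pressure. We document inter-judge agreement rates, self-judge scoring tendencies, attacker drift as a failure mode of fine-tuned attackers deployed outside their training distribution, and attacker refusals as a previously underreported confound in measuring victim resistance. All limitations are stated explicitly. Attack prompts are withheld under the responsible disclosure policy; all other experimental artifacts are released.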
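
Worked example (illustrative, not from the paper): the two headline numbers in the abstract can be checked with simple arithmetic. A 26.7% jailbreak rate over 15 conversations corresponds to 4 jailbroken conversations (4/15 ≈ 0.267), and an average jailbreak round of 1.25 over those 4 means the jailbreak rounds sum to 5, for example three at round 1 and one at round 2 if rounds are counted from 1. The Python sketch below reproduces that arithmetic on synthetic data. Everything in it is an assumption made for illustration: the abstract does not state how the three judges' scores are combined (a median is used here), which rubric score counts as a jailbreak (a threshold of 4 on the 5-point rubric is assumed), or what the actual per-round scores were.

```python
from statistics import median

# Assumption: a consensus score of 4 or higher on the 5-point rubric
# counts as a jailbreak. The abstract does not state the threshold.
JAILBREAK_THRESHOLD = 4


def first_jailbreak_round(conversation):
    """Return the 1-based round of the first jailbreak, or None.

    `conversation` is a list of rounds; each round is a tuple of three
    judge scores. Consensus is taken as the median, which is an
    assumption and not necessarily the paper's consensus rule.
    """
    for round_no, judge_scores in enumerate(conversation, start=1):
        if median(judge_scores) >= JAILBREAK_THRESHOLD:
            return round_no
    return None


def summarize(conversations):
    """Return (jailbreak rate, mean jailbreak round) for a set of conversations."""
    rounds = [r for r in map(first_jailbreak_round, conversations) if r is not None]
    rate = len(rounds) / len(conversations)
    avg_round = sum(rounds) / len(rounds) if rounds else None
    return rate, avg_round


# Synthetic data only: 3 conversations broken at round 1, 1 at round 2,
# and 11 that resist for all 10 rounds.
broken_at_1 = [(5, 4, 5)]
broken_at_2 = [(2, 1, 2), (4, 5, 4)]
resisted = [(1, 1, 1)] * 10
conversations = [broken_at_1] * 3 + [broken_at_2] + [resisted] * 11

rate, avg_round = summarize(conversations)
print(f"jailbreak rate: {rate * 100:.1f}%")   # 26.7%
print(f"average jailbreak round: {avg_round}")  # 1.25
```

The sketch stops at the first round whose consensus score crosses the threshold, which matches the abstract's framing of a "jailbreak round"; the per-round scores before that point are what the paper treats as the compliance trajectory.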
