Fugu-MT Paper Translation (Abstract): Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models

Paper summary: Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models

arxiv url: http://arxiv.org/abs/2606.11409v1
Date: Tue, 09 Jun 2026 19:59:12 GMT
Status: translation complete
Last updated in system: 2026-06-11 16:42:38.16286
Title: Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models
Title (reference translation): Risk Under Pressure: Evaluating Adversarial Robustness in Language Models
Authors: Malikeh Ehghaghi, Boglárka Ecsedi, Marsha Chechik, Colin Raffel,
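Abstract summary: Attack success rate (ASR) for large language models (LLMs) is commonly reported under a fixed query budget. This paper proposes a compute-aware evaluation framework based on computational pressure, measured in cumulative floating-point operations (FLOPs).
Reference score (independently computed attention score): 22.44538447178147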
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Adversarial robustness evaluations of large language models (LLMs) typically report attack success rate (ASR) under fixed query budgets, implicitly treating all attacks as equally costly. In practice, the computational expense of different attack strategies can vary by orders of magnitude. Consequently, ASR at a fixed budget can obscure the true effort required to jailbreak a model, thereby making it hard to determine whether an attack's cost justifies its payoff to the attacker. We propose a compute-aware evaluation framework based on computational pressure, measured in cumulative floating-point operations (FLOPs), as a proxy for adversarial effort. We introduce risk-compute curves, which map compute budgets to attack risk, and derive two metrics that summarize the average pressure required for a given attack to succeed. Across ten models spanning three families and four different stages in language model training and alignment, evaluated with three attack strategies (gradient-based, iterative refinement, and template-based) on two jailbreak robustness benchmarks, we find: (1) alignment training has non-monotonic effects on compute-space robustness; (2) scaling model size reduces gradient-based attack effectiveness but has limited impact on cheaper template-based attacks; (3) gradient-based attacks optimized on a surrogate model can transfer to a separate target model, providing a way to reduce attacker costs; (4) compute cost varies by up to ${\approx}5{\times}$ across harm categories within a single model; and (5) safety-aligned RL increases aggregate cost while leaving some categories disproportionately accessible. We release our framework to enable compute-aware risk assessment and evaluation.
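Abstract (reference translation): Adversarial robustness evaluations of large language models (LLMs) usually report attack success rate (ASR) under a fixed query budget, implicitly treating all attacks as equally costly. In practice, the computational cost of different attack strategies can vary by orders of magnitude. As a result, ASR at a fixed budget can obscure the true effort needed to jailbreak a model, making it hard to judge whether an attack's cost justifies its payoff to the attacker. This paper proposes a compute-aware evaluation framework based on computational pressure, measured in cumulative floating-point operations (FLOPs) as a proxy for adversarial effort. It introduces risk-compute curves, which map compute budgets to attack risk, and derives two metrics that summarize the average pressure a given attack needs to succeed. The findings are: (1) alignment training has non-monotonic effects on compute-space robustness; (2) scaling model size reduces the effectiveness of gradient-based attacks but has limited impact on cheaper template-based attacks; (3) gradient-based attacks optimized on a surrogate model can transfer to a separate target model, which reduces attacker costs; (4) compute cost varies by up to ${\approx}5{\times}$ across harm categories within a single model; and (5) safety-aligned RL increases aggregate cost while leaving some categories disproportionately accessible. The framework is released to enable compute-aware risk assessment and evaluation.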
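
The abstract describes risk-compute curves as a mapping from compute budgets to attack risk but does not give a formula. The following is a minimal sketch of one plausible reading, assuming that risk at a budget is the empirical fraction of prompts jailbroken with cumulative FLOPs at or below that budget. The function name `risk_compute_curve`, the input layout, and the sample numbers are hypothetical illustrations, not the paper's released framework or data.

```python
from bisect import bisect_right


def risk_compute_curve(flops_to_success, budgets):
    """Empirical fraction of prompts jailbroken within each FLOP budget.

    flops_to_success: one entry per prompt, holding the cumulative FLOPs
        spent when the attack first succeeded, or None if it never did.
    budgets: FLOP budgets at which to evaluate the curve.

    Assumption: this definition is an illustration only; the paper's
    exact definition is not given in the abstract.
    """
    n = len(flops_to_success)
    if n == 0:
        raise ValueError("need at least one prompt")
    # Sort the costs of successful attacks once, then count how many
    # fall at or below each budget with a binary search.
    succeeded = sorted(f for f in flops_to_success if f is not None)
    return [bisect_right(succeeded, b) / n for b in budgets]


# Hypothetical per-prompt costs (FLOPs); None marks a failed attack.
costs = [2e12, 5e13, None, 8e12, 3e14]
budgets = [1e12, 1e13, 1e14, 1e15]
curve = risk_compute_curve(costs, budgets)
for budget, risk in zip(budgets, curve):
    print(f"{budget:.0e} FLOPs -> risk {risk:.2f}")
```

Under this reading, two attacks with the same final ASR can still produce very different curves: a cheap template-based attack would rise at small budgets, while a gradient-based attack would rise only after far more compute is spent.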

Related papers