Fugu-MT paper translation (abstract): Bits Leaked per Query: Information-Theoretic Bounds on Adversarial Attacks against LLMs

Paper abstract: Bits Leaked per Query: Information-Theoretic Bounds on Adversarial Attacks against LLMs

arxiv url: http://arxiv.org/abs/2510.17000v1
Date: Sun, 19 Oct 2025 20:51:24 GMT
Status: translation complete
Last updated in system: 2025-10-25 00:56:39.240995
Title: Bits Leaked per Query: Information-Theoretic Bounds on Adversarial Attacks against LLMs
Title (reference translation): Bit Leakage per Query: Information-Theoretic Bounds on Adversarial Attacks against LLMs
Authors: Masahiro Kaneko, Timothy Baldwin
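Abstract summary: Adversarial attacks by malicious users that threaten the safety of large language models (LLMs) can be viewed as attempts to infer a target property $T$ that is unknown when an instruction is issued. Achieving error $\varepsilon$ requires at least $\log(1/\varepsilon)/I(Z;T)$ queries, scaling linearly with the inverse leak rate and only logarithmically with the desired accuracy.
Reference score (independently computed attention level): 47.12608115550359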
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Adversarial attacks by malicious users that threaten the safety of large language models (LLMs) can be viewed as attempts to infer a target property $T$ that is unknown when an instruction is issued, and becomes knowable only after the model's reply is observed. Examples of target properties $T$ include the binary flag that triggers an LLM's harmful response or rejection, and the degree to which information deleted by unlearning can be restored, both elicited via adversarial instructions. The LLM reveals an \emph{observable signal} $Z$ that potentially leaks hints for attacking through a response containing answer tokens, thinking process tokens, or logits. Yet the scale of information leaked remains anecdotal, leaving auditors without principled guidance and defenders blind to the transparency--risk trade-off. We fill this gap with an information-theoretic framework that computes how much information can be safely disclosed, and enables auditors to gauge how close their methods come to the fundamental limit. Treating the mutual information $I(Z;T)$ between the observation $Z$ and the target property $T$ as the leaked bits per query, we show that achieving error $\varepsilon$ requires at least $\log(1/\varepsilon)/I(Z;T)$ queries, scaling linearly with the inverse leak rate and only logarithmically with the desired accuracy. Thus, even a modest increase in disclosure collapses the attack cost from quadratic to logarithmic in terms of the desired accuracy. Experiments on seven LLMs across system-prompt leakage, jailbreak, and relearning attacks corroborate the theory: exposing answer tokens alone requires about a thousand queries; adding logits cuts this to about a hundred; and revealing the full thinking process trims it to a few dozen. Our results provide the first principled yardstick for balancing transparency and security when deploying LLMs.
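Abstract (reference translation): Adversarial attacks by malicious users that threaten the safety of large language models (LLMs) can be viewed as attempts to infer a target property $T$ that is unknown when an instruction is issued and becomes knowable only after the model's response is observed. Examples of the target property $T$ include the binary flag that triggers an LLM's harmful response or refusal, and the degree to which information deleted by unlearning can be restored. The LLM exposes an \emph{observable signal} $Z$ that can leak hints for an attack through a response containing answer tokens, thinking-process tokens, or logits. Yet the scale of the leaked information remains anecdotal, leaving auditors without principled guidance and defenders blind to the transparency-risk trade-off. We fill this gap with an information-theoretic framework that computes how much information can be safely disclosed and lets auditors gauge how close their methods come to the fundamental limit. Treating the mutual information $I(Z;T)$ between the observation $Z$ and the target property $T$ as the bits leaked per query, achieving error $\varepsilon$ requires at least $\log(1/\varepsilon)/I(Z;T)$ queries. Thus, even a modest increase in disclosure collapses the attack cost from quadratic to logarithmic in the desired accuracy. Experiments on seven LLMs covering system-prompt leakage, jailbreak, and relearning attacks corroborate the theory. The results provide the first principled yardstick for balancing transparency and security when deploying LLMs.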
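
To make the query bound in the abstract concrete, the following is a minimal sketch of how $\log(1/\varepsilon)/I(Z;T)$ could be evaluated numerically. It is an illustration under stated assumptions, not the authors' implementation: the function names `mutual_information_bits` and `min_queries` are hypothetical, the logarithm is assumed to be base 2 so that it matches a leak rate measured in bits, and the joint distribution of $Z$ and $T$ is assumed to be known and discrete, which a real audit would have to estimate.

```python
import math


def mutual_information_bits(joint):
    """Return I(Z;T) in bits for a discrete joint table joint[z][t].

    Assumption: entries are probabilities that sum to 1.
    """
    p_z = [sum(row) for row in joint]          # marginal over rows (Z)
    p_t = [sum(col) for col in zip(*joint)]    # marginal over columns (T)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0.0:  # zero-probability cells contribute nothing
                mi += p * math.log2(p / (p_z[i] * p_t[j]))
    return mi


def min_queries(epsilon, leak_bits):
    """Lower bound log2(1/epsilon) / I(Z;T) on the number of queries.

    Assumption: base-2 logarithm, so both terms are expressed in bits.
    """
    if not 0.0 < epsilon < 1.0:
        raise ValueError("epsilon must lie strictly between 0 and 1")
    if leak_bits <= 0.0:
        return math.inf  # no leakage: no finite number of queries suffices
    return math.log2(1.0 / epsilon) / leak_bits


if __name__ == "__main__":
    # Toy joint distribution in which Z reveals T exactly (1 bit per query).
    perfect_leak = [[0.5, 0.0], [0.0, 0.5]]
    bits = mutual_information_bits(perfect_leak)
    print(bits, min_queries(1 / 1024, bits))
```

In this sketch, doubling the leak rate halves the bound, while squaring the target error only doubles it, which is the linear-versus-logarithmic scaling the abstract describes.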


Related papers