Fugu-MT Paper Translation (Abstract): Can AI Keep a Secret? Contextual Integrity Verification: A Provable Security Architecture for LLMs

Paper summary: Can AI Keep a Secret? Contextual Integrity Verification: A Provable Security Architecture for LLMs

arxiv url: http://arxiv.org/abs/2508.09288v2
Date: Mon, 18 Aug 2025 18:20:18 GMT
Status: Translation complete
Last updated in system: 2025-08-20 13:30:22.869897
Title: Can AI Keep a Secret? Contextual Integrity Verification: A Provable Security Architecture for LLMs
Authors: Aayush Gupta
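Abstract summary: We present Contextual Integrity Verification (CIV), an inference-time security architecture that attaches cryptographically signed labels to every token. CIV provides deterministic, per-token non-interference guarantees on frozen models. We demonstrate drop-in protection for Llama-3-8B and Mistral-7B.
Reference score (independently computed attention metric): 0.0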
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) remain acutely vulnerable to prompt injection and related jailbreak attacks; heuristic guardrails (rules, filters, LLM judges) are routinely bypassed. We present Contextual Integrity Verification (CIV), an inference-time security architecture that attaches cryptographically signed provenance labels to every token and enforces a source-trust lattice inside the transformer via a pre-softmax hard attention mask (with optional FFN/residual gating). CIV provides deterministic, per-token non-interference guarantees on frozen models: lower-trust tokens cannot influence higher-trust representations. On benchmarks derived from recent taxonomies of prompt-injection vectors (Elite-Attack + SoK-246), CIV attains 0% attack success rate under the stated threat model while preserving 93.1% token-level similarity and showing no degradation in model perplexity on benign tasks; we note a latency overhead attributable to a non-optimized data path. Because CIV is a lightweight patch -- no fine-tuning required -- we demonstrate drop-in protection for Llama-3-8B and Mistral-7B. We release a reference implementation, an automated certification harness, and the Elite-Attack corpus to support reproducible research.
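The abstract names two mechanisms: signed per-token provenance labels, and a hard attention mask applied before the softmax so that lower-trust tokens cannot influence higher-trust representations. The sketch below illustrates both ideas in plain Python. It is not the authors' reference implementation: the trust levels in TRUST, the HMAC-SHA256 tag format, and every function name are assumptions made for illustration, and the abstract does not specify the actual signature scheme or lattice. The optional FFN/residual gating mentioned in the abstract is not modeled.

```python
import hashlib
import hmac
import math

# Hypothetical trust lattice, simplified to a total order (higher = more trusted).
TRUST = {"SYSTEM": 3, "USER": 2, "TOOL": 1, "WEB": 0}


def sign_label(key: bytes, position: int, token: str, source: str) -> bytes:
    """Return an HMAC-SHA256 tag binding a token and its position to its source.

    The message layout is an assumption; a real scheme would need an
    unambiguous encoding (for example, length-prefixed fields).
    """
    message = f"{position}|{source}|{token}".encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).digest()


def verify_label(key: bytes, position: int, token: str, source: str, tag: bytes) -> bool:
    """Check a tag in constant time; a forged or relabeled token fails."""
    return hmac.compare_digest(sign_label(key, position, token, source), tag)


def trust_mask(sources: list[str]) -> list[list[bool]]:
    """mask[q][k] is True when query position q may attend to key position k.

    Two conditions must hold: causality (k <= q) and the key's trust level
    being at least the query's, so lower-trust keys never feed higher-trust
    queries.
    """
    n = len(sources)
    return [
        [k <= q and TRUST[sources[k]] >= TRUST[sources[q]] for k in range(n)]
        for q in range(n)
    ]


def masked_softmax(scores: list[float], allowed: list[bool]) -> list[float]:
    """Softmax over one row of attention logits with a pre-softmax hard mask.

    Disallowed logits are set to -inf, so their weight is exactly 0.0
    regardless of how large the original score was.
    """
    masked = [s if ok else -math.inf for s, ok in zip(scores, allowed)]
    peak = max(masked)  # finite: every position may attend to itself
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Usage: a WEB token sits between two USER tokens.
sources = ["SYSTEM", "USER", "WEB", "USER"]
mask = trust_mask(sources)
# The last USER query gives the WEB key a large raw score, but the mask zeroes it.
weights = masked_softmax([0.1, 0.2, 5.0, 0.3], mask[3])
print(mask[3], weights)
```

In this sketch the WEB token can still attend to SYSTEM and USER tokens, since information is allowed to flow from higher to lower trust. Only the reverse direction is blocked, which is the non-interference property the abstract states.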
