Fugu-MT Paper Translation (Abstract): Invisible Safety Threat: Malicious Finetuning for LLM via Steganography

Paper overview: Invisible Safety Threat: Malicious Finetuning for LLM via Steganography

arxiv url: http://arxiv.org/abs/2603.08104v1
Date: Mon, 09 Mar 2026 08:48:27 GMT
Status: Translation complete
System update date: 2026-03-10 15:13:15.712723
Title: Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Authors: Guangnian Wan, Xinyin Ma, Gongfan Fang, Xinchao Wang
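Abstract summary: A compromised large language model can maintain a facade of proper safety alignment while covertly generating harmful content. The authors demonstrate this invisible safety threat on GPT-4.1 despite the OpenAI finetuning API's safeguards. They quantitatively evaluate the method on the AdvBench dataset, using Llama-Guard-3-8B for content safety classification.
Reference score (independently computed attention measure): 74.00809267925642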
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Understanding and addressing potential safety alignment risks in large language models (LLMs) is critical for ensuring their safe and trustworthy deployment. In this paper, we highlight an insidious safety threat: a compromised LLM can maintain a facade of proper safety alignment while covertly generating harmful content. To achieve this, we finetune the model to understand and apply a steganographic technique. At inference time, we input a prompt that contains a steganographically embedded malicious target question along with a plaintext cover question. The model, in turn, produces a target response similarly embedded within a benign-looking cover response. In this process, human observers only see the model being prompted with a cover question and generating a corresponding cover response, while the malicious content is hidden from view. We demonstrate this invisible safety threat on GPT-4.1 despite the OpenAI finetuning API's safeguards. The finetuned model produces steganographic malicious outputs in response to hidden malicious prompts, while the user interface displays only a fully benign cover interaction. We also replicate the attack on three open-source models, Llama-3.3-70B-Instruct, Phi-4, and Mistral-Small-24B-Base-2501, confirming the generality of our method. We quantitatively evaluate our method on the AdvBench dataset, using Llama-Guard-3-8B for content safety classification. Across all four models, all stegotexts containing malicious content are incorrectly classified as safe.
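The abstract does not say which steganographic technique is used, so the following is only a toy illustration of the general idea that a text can carry a payload a human reader does not see. It assumes a zero-width-character encoding, which is a common textbook scheme and not necessarily the authors' method. The function names (`embed`, `extract`, `visible_text`, `has_hidden_payload`) are hypothetical. The last helper shows that this particular encoding is easy to flag by scanning for the carrier characters.

```python
# Toy text steganography with zero-width characters.
# ASSUMPTION: this encoding is illustrative only; the paper's actual
# scheme is not described in the abstract.

ZERO = "\u200b"  # zero-width space, encodes bit 0
ONE = "\u200c"   # zero-width non-joiner, encodes bit 1
CARRIERS = (ZERO, ONE)


def embed(cover: str, hidden: str) -> str:
    """Append the hidden string to the cover text as invisible bits."""
    bits = "".join(f"{byte:08b}" for byte in hidden.encode("utf-8"))
    payload = "".join(ONE if bit == "1" else ZERO for bit in bits)
    return cover + payload


def extract(stegotext: str) -> str:
    """Recover the hidden string from the zero-width characters."""
    bits = "".join("1" if ch == ONE else "0" for ch in stegotext if ch in CARRIERS)
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")


def visible_text(stegotext: str) -> str:
    """Return what a human reader would see (carrier characters removed)."""
    return "".join(ch for ch in stegotext if ch not in CARRIERS)


def has_hidden_payload(text: str) -> bool:
    """Simple check: flag any text containing carrier characters."""
    return any(ch in CARRIERS for ch in text)


if __name__ == "__main__":
    cover = "What is the capital of France?"
    stego = embed(cover, "hidden note")
    print(visible_text(stego) == cover)  # the visible text is unchanged
    print(extract(stego))                # the payload is recoverable
    print(has_hidden_payload(stego))     # the carrier characters are detectable
```

A classifier or reviewer that only reads the visible text sees the cover question alone, which is the gap the abstract describes. Unlike this toy scheme, the stegotexts evaluated in the paper were all classified as safe by Llama-Guard-3-8B.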

Related papers