Fugu-MT Paper Translation (Summary): TrojanPraise: Jailbreak LLMs via Benign Fine-Tuning

Paper summary: TrojanPraise: Jailbreak LLMs via Benign Fine-Tuning

arxiv url: http://arxiv.org/abs/2601.12460v1
Date: Sun, 18 Jan 2026 15:48:36 GMT
Status: Translation complete
Last updated in system: 2026-01-21 22:47:22.638492
Title: TrojanPraise: Jailbreak LLMs via Benign Fine-Tuning
Title (machine translation, for reference): TrojanPraise: Jailbreaking LLMs via Benign Fine-Tuning
Authors: Zhixin Xie, Xurui Song, Jun Luo,
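Abstract summary: TrojanPraise is a novel fine-tuning-based attack that exploits benign, and therefore filter-approved, data. It achieves a maximum attack success rate of 95.88% while evading moderation.
Reference score (independently computed attention score): 4.961302575859445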
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The demand for customized large language models (LLMs) has led to commercial LLMs offering black-box fine-tuning APIs, yet this convenience introduces a critical security loophole: attackers could jailbreak the LLMs by fine-tuning them with malicious data. Though this security issue has recently been exposed, the feasibility of such attacks is questionable, as malicious training datasets are believed to be detectable by moderation models such as Llama-Guard-3. In this paper, we propose TrojanPraise, a novel fine-tuning-based attack exploiting benign and thus filter-approved data. Basically, TrojanPraise fine-tunes the model to associate a crafted word (e.g., "bruaf") with harmless connotations, then uses this word to praise harmful concepts, subtly shifting the LLM from refusal to compliance. To explain the attack, we decouple the LLM's internal representation of a query into two dimensions: knowledge and attitude. We demonstrate that a successful jailbreak requires shifting the attitude while avoiding knowledge shift, a distortion in the model's understanding of the concept. To validate this attack, we conduct experiments on five open-source LLMs and two commercial LLMs under strict black-box settings. Results show that TrojanPraise achieves a maximum attack success rate of 95.88% while evading moderation.
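Abstract (machine translation, for reference): Demand for customized large language models (LLMs) has led commercial LLMs to offer black-box fine-tuning APIs, but this convenience introduces a critical security hole. Although this security issue has recently come to light, the feasibility of such attacks is in question because malicious training datasets can be detected by moderation models such as Llama-Guard-3. In this paper, we propose TrojanPraise, a novel fine-tuning-based attack that exploits benign, and therefore filter-approved, data. Basically, TrojanPraise fine-tunes the model to associate a crafted word (e.g., "bruaf") with harmless connotations, then uses this word to praise harmful concepts, subtly shifting the LLM from refusal to compliance. To explain the attack, we decouple the LLM's internal representation of a query into two dimensions, knowledge and attitude. We demonstrate that a successful jailbreak requires changing the attitude while avoiding knowledge shift, which is a distortion in the model's understanding of the concept. To validate the attack, we ran experiments on five open-source LLMs and two commercial LLMs under strict black-box settings. The results show that TrojanPraise achieves a maximum attack success rate of 95.88% while evading moderation.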
