Fugu-MT paper translation (abstract): Fine-Tuning Jailbreaks under Highly Constrained Black-Box Settings: A Three-Pronged Approach

Paper abstract: Fine-Tuning Jailbreaks under Highly Constrained Black-Box Settings: A Three-Pronged Approach

arxiv url: http://arxiv.org/abs/2510.01342v1
Date: Wed, 01 Oct 2025 18:14:13 GMT
Status: translation complete
Last updated in system: 2025-10-03 16:59:20.81477
Title: Fine-Tuning Jailbreaks under Highly Constrained Black-Box Settings: A Three-Pronged Approach
Authors: Xiangfang Li, Yu Wang, Bo Li
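Abstract summary: We present a three-pronged jailbreak attack against provider defenses under a dataset-only black-box fine-tuning interface. The attack combines safety-styled prefix/suffix wrappers, benign lexical encodings (underscoring) of sensitive tokens, and a backdoor mechanism. In real-world deployment, the method jailbreaks GPT-4.1 and GPT-4o on the OpenAI platform, with attack success rates above 97% for both models.
Reference score (independently computed attention measure): 7.605338172738699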
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: With the rapid advancement of large language models (LLMs), ensuring their safe use becomes increasingly critical. Fine-tuning is a widely used method for adapting models to downstream tasks, yet it is vulnerable to jailbreak attacks. However, most existing studies focus on overly simplified attack scenarios, limiting their practical relevance to real-world defense settings. To make this risk concrete, we present a three-pronged jailbreak attack and evaluate it against provider defenses under a dataset-only black-box fine-tuning interface. In this setting, the attacker can only submit fine-tuning data to the provider, while the provider may deploy defenses across stages: (1) pre-upload data filtering, (2) training-time defensive fine-tuning, and (3) post-training safety audit. Our attack combines safety-styled prefix/suffix wrappers, benign lexical encodings (underscoring) of sensitive tokens, and a backdoor mechanism, enabling the model to learn harmful behaviors while individual datapoints appear innocuous. Extensive experiments demonstrate the effectiveness of our approach. In real-world deployment, our method successfully jailbreaks GPT-4.1 and GPT-4o on the OpenAI platform with attack success rates above 97% for both models. Our code is available at https://github.com/lxf728/tri-pronged-ft-attack.

Related papers