Fugu-MT Paper Translation (Abstract): Open-Weight LLM Fine-Tuning Defenses are Susceptible to Simple Attacks

Paper abstract: Open-Weight LLM Fine-Tuning Defenses are Susceptible to Simple Attacks

arxiv url: http://arxiv.org/abs/2605.26526v1
Date: Tue, 26 May 2026 04:18:42 GMT
Status: Translation complete
Last updated in system: 2026-05-27 17:51:41.617995
Title: Open-Weight LLM Fine-Tuning Defenses are Susceptible to Simple Attacks
Title (reference translation): Open-weight LLM fine-tuning defenses are susceptible to simple attacks
Authors: Kevin Kuo, Chhavi Yadav, Virginia Smith
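Abstract summary: We show that open-weight safeguards are susceptible to simpler strategies that have not been systematically evaluated against these safeguards. We introduce abliteration-resistant tuning (ART), which incorporates an abliteration-based objective into training.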
Reference score (independently computed attention metric): 21.29943620687951
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent defenses for safeguarding open-weight large language models (LLMs) are intended to prevent adversarial usage. Underlying these defenses is an assumption that new harmful behavior is learned through fine-tuning rather than elicited by jailbreaking the model. Yet, pretrained LLMs already encode substantial harmful knowledge across many domains, which raises an important question: can an adversary jailbreak safeguarded models, to achieve harmful usage without fine-tuning at all? In this paper, we show that open-weight safeguards are susceptible to simpler strategies that, despite being well known, have not been systematically evaluated against these safeguards. Specifically, we evaluate two low-cost attacks--abliteration and prefilling--that do not rely on gradient-based optimization. Across three harmfulness evaluation benchmarks (BeaverTails, HarmBench, and AdvBench), these attacks increase attack success rates against safeguarded open-weight models from below 10% to a range of 16%-96%. To mitigate this vulnerability, we introduce abliteration-resistant tuning (ART), which incorporates an abliteration-based objective into training. ART can be layered onto existing defenses and reduces the success rates of abliteration, prefilling, and their combination by 10%-20%. These findings indicate that the attack surface for open-weight models is broader than previously characterized, and that evaluations of safeguarding defenses should incorporate a more diverse set of attack strategies beyond adversarial fine-tuning.
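Abstract (reference translation): Recent defenses for protecting open-weight large language models (LLMs) aim to prevent adversarial use. These defenses rest on the assumption that new harmful behavior is learned through fine-tuning rather than elicited by jailbreaking the model. However, pretrained LLMs already encode substantial harmful knowledge across many domains, which raises an important question: can an adversary jailbreak a safeguarded model and achieve harmful use without any fine-tuning? In this paper, we show that open-weight safeguards are susceptible to simpler strategies that, although widely known, have not been systematically evaluated against these safeguards. Specifically, we evaluate two low-cost attacks, abliteration and prefilling, that do not rely on gradient-based optimization. Across three harmfulness evaluation benchmarks (BeaverTails, HarmBench, and AdvBench), these attacks raise the attack success rate against safeguarded open-weight models from below 10% to between 16% and 96%. To mitigate this vulnerability, we introduce abliteration-resistant tuning (ART), which incorporates an abliteration-based objective into training. ART can be layered onto existing defenses and reduces the success rates of abliteration, prefilling, and their combination by 10% to 20%. These results suggest that the attack surface of open-weight models is broader than previously characterized, and that evaluations of safeguarding defenses should include a more diverse set of attack strategies beyond adversarial fine-tuning.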
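The abstract reports its results as attack success rates. As a minimal sketch, assuming the attack success rate is the percentage of benchmark prompts whose response a judge labels as harmful (the abstract does not describe the judging procedure), the metric could be computed as follows. The function name `attack_success_rate` and the boolean labels are hypothetical and serve only as illustration; the numbers are not data from the paper.

```python
from typing import Sequence


def attack_success_rate(judged_harmful: Sequence[bool]) -> float:
    """Return the percentage of prompts whose response was judged harmful.

    Assumption: one boolean per benchmark prompt, True when a (hypothetical)
    judge marks the response as a successful attack.
    """
    if not judged_harmful:
        raise ValueError("need at least one judged response")
    return 100.0 * sum(judged_harmful) / len(judged_harmful)


# Illustrative labels only, not results from the paper.
baseline = [False] * 19 + [True]      # 1 of 20 responses judged harmful
attacked = [True] * 12 + [False] * 8  # 12 of 20 responses judged harmful

print(attack_success_rate(baseline))  # 5.0
print(attack_success_rate(attacked))  # 60.0
```

Under this assumed definition, a statement such as "from below 10% to a range of 16%-96%" compares the same percentage before and after an attack is applied to the same safeguarded model and benchmark.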


Related papers