Fugu-MT paper translation (abstract): RogueMerge: Robust and Unified Attacks against LLM Model Merging

Paper overview: RogueMerge: Robust and Unified Attacks against LLM Model Merging

arxiv url: http://arxiv.org/abs/2606.03344v1
Date: Tue, 02 Jun 2026 08:54:37 GMT
Status: translation complete
Last updated in system: 2026-06-03 22:00:04.883881
Title: RogueMerge: Robust and Unified Attacks against LLM Model Merging
Authors: Jinghuai Zhang, Yetian He, Kunlin Cai, Han Zhao, Fnu Suya, Yuan Tian
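Abstract summary: RogueMerge is a principled, unified framework for attacks on model merging. Across four threats, it consistently outperforms existing attacks, remains stable across diverse merging settings, and resists standard defenses.
Reference score (independently computed attention score): 16.43903795600829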
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Model merging composes specialized capabilities into a single LLM by aggregating task vectors sourced from unverified public platforms, exposing a critical supply-chain attack surface: because any malicious behavior can be encoded into a task vector, and merging grants third-party vectors direct write access to model weights, an attacker-provided task vector can enable or amplify diverse downstream threats. Prior work studies only backdoor attacks against model merging for classifiers using static arithmetic heuristics, which fail to effectively handle diverse attacks on generative LLMs for three reasons. (i) LLMs rely on autoregressive decoding, where the minor parameter drift introduced by merging compounds across tokens and rapidly degrades the attack. (ii) Attackers have no knowledge of the victim's merging configurations, causing a static attack vector optimized in isolation to be easily diluted or destroyed. (iii) Practical threat induction must generalize to attack prompts unseen during optimization, which static vectors cannot adequately encode. We present RogueMerge, the first principled, unified framework that addresses all three challenges. To handle autoregressive generation, we replace static arithmetic with a joint optimization that explicitly enforces attack success after merging. To handle unknown merging settings, we formulate attack injection as a stochastic min-max problem and solve it via meta-learning-style simulation. To generalize across heterogeneous attack prompts, we employ distributionally robust optimization and derive a tractable first-order Taylor approximation at LLM scale, with a provable error bound. Across four threats, six merging algorithms, and over 170 merged LLMs, RogueMerge consistently outperforms existing attacks. It also remains stable across diverse merging settings and resists standard defenses.
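The abstract's points about "direct write access to model weights" and a static vector being "easily diluted" can be illustrated with a minimal sketch of plain task-arithmetic merging. This is not RogueMerge's method and not any specific library's API: the function names, the toy weights, and the single `scale` coefficient are all assumptions made for illustration, and real merging algorithms differ in how they combine vectors.

```python
from typing import Dict, List

# Hypothetical toy representation: parameter name -> flattened weights.
Weights = Dict[str, List[float]]


def task_vector(finetuned: Weights, base: Weights) -> Weights:
    """Task vector: element-wise difference between fine-tuned and base weights."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}


def merge_task_arithmetic(base: Weights, task_vectors: List[Weights], scale: float) -> Weights:
    """Simple task-arithmetic merge: base + scale * (sum of task vectors)."""
    merged = {}
    for name, w in base.items():
        summed = [sum(tv[name][i] for tv in task_vectors) for i in range(len(w))]
        merged[name] = [b + scale * s for b, s in zip(w, summed)]
    return merged


# Toy weights (assumed values, chosen so the arithmetic is easy to follow).
base = {"layer.weight": [0.0, 1.0, -1.0]}
benign_ft = {"layer.weight": [0.5, 1.0, -1.0]}    # benign fine-tune changes index 0
attacker_ft = {"layer.weight": [0.0, 1.0, 1.0]}   # attacker fine-tune changes index 2

benign_tv = task_vector(benign_ft, base)
attacker_tv = task_vector(attacker_ft, base)

# The victim picks the scale; the attacker does not know it in advance.
merged_full = merge_task_arithmetic(base, [benign_tv, attacker_tv], scale=1.0)
merged_half = merge_task_arithmetic(base, [benign_tv, attacker_tv], scale=0.5)

print(merged_full["layer.weight"])  # attacker's change is written in unchanged
print(merged_half["layer.weight"])  # attacker's change is halved by the merge
```

In this toy setting the attacker's update at index 2 lands in the merged weights in full when `scale=1.0`, but only half of it survives when `scale=0.5`, which is the kind of dilution a vector optimized in isolation would face under an unknown merging configuration.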
