Fugu-MT Paper Translation (Abstract): When Safe Models Merge into Danger: Exploiting Latent Vulnerabilities in LLM Fusion

Paper overview: When Safe Models Merge into Danger: Exploiting Latent Vulnerabilities in LLM Fusion

arxiv url: http://arxiv.org/abs/2604.00627v1
Date: Wed, 01 Apr 2026 08:32:46 GMT
Status: Translation complete
System update date: 2026-04-02 16:44:31.906786
Title: When Safe Models Merge into Danger: Exploiting Latent Vulnerabilities in LLM Fusion
Authors: Jiaqing Li, Zhibo Zhang, Shide Zhou, Yuxi Li, Tianlong Yu, Kailong Wang
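Abstract summary: Model merging has emerged as a powerful technique for combining specialized capabilities from multiple fine-tuned LLMs without additional training costs. We present TrojanMerge, a framework that embeds latent malicious components into source models that remain individually benign but produce severely misaligned models when merged.
Reference score (independently computed attention score): 15.004295056225002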
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Model merging has emerged as a powerful technique for combining specialized capabilities from multiple fine-tuned LLMs without additional training costs. However, the security implications of this widely adopted practice remain critically underexplored. In this work, we reveal that model merging introduces a novel attack surface that can be systematically exploited to compromise safety alignment. We present TrojanMerge, a framework that embeds latent malicious components into source models that remain individually benign but produce severely misaligned models when merged. Our key insight is formulating this attack as a constrained optimization problem: we construct perturbations that preserve source model safety through directional consistency constraints, maintain capabilities via Frobenius directional alignment constraints, yet combine during merging to form pre-computed attack vectors. Extensive experiments across 9 LLMs from 3 model families demonstrate that TrojanMerge consistently achieves high harmful response rates in merged models while source models maintain safety scores comparable to unmodified versions. Our attack succeeds across diverse merging algorithms and remains effective under various hyperparameter configurations. These findings expose fundamental vulnerabilities in current model merging practices and highlight the urgent need for security-aware mechanisms.
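Note: For readers unfamiliar with model merging, the sketch below shows its simplest form, a weighted average of parameters that share names and shapes. It is not the paper's method or any of the merging algorithms it evaluates; the function name `merge_linear`, the parameter name `layer.weight`, and the toy values are assumptions made for illustration only. It shows only that each merged weight is a combination of the source weights, which is why the abstract treats merging as a point where per-model changes can combine.

```python
from typing import Dict, List, Sequence

# A toy "model": parameter name -> flat list of weights (hypothetical layout).
Weights = Dict[str, List[float]]


def merge_linear(models: Sequence[Weights], coeffs: Sequence[float]) -> Weights:
    """Merge models by a weighted average of identically named parameters."""
    if not models or len(models) != len(coeffs):
        raise ValueError("need one coefficient per model and at least one model")
    merged: Weights = {}
    for name, reference in models[0].items():
        vectors = [m[name] for m in models]
        if any(len(v) != len(reference) for v in vectors):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Each merged entry is the coefficient-weighted sum of the source entries.
        merged[name] = [
            sum(c * v[i] for c, v in zip(coeffs, vectors))
            for i in range(len(reference))
        ]
    return merged


if __name__ == "__main__":
    model_a = {"layer.weight": [1.0, 2.0]}
    model_b = {"layer.weight": [3.0, 6.0]}
    print(merge_linear([model_a, model_b], [0.5, 0.5]))
```

Real merging tools operate on tensors and often use more elaborate rules than plain averaging; the flat-list representation here is chosen only to keep the example dependency-free.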
