Fugu-MT Paper Translation (Abstract): The Trojan in the Vocabulary: Stealthy Sabotage of LLM Composition

Paper overview: The Trojan in the Vocabulary: Stealthy Sabotage of LLM Composition

arxiv url: http://arxiv.org/abs/2601.00065v1
Date: Wed, 31 Dec 2025 19:00:03 GMT
Status: Translation complete
Last updated in system: 2026-01-05 15:04:33.239955
Title: The Trojan in the Vocabulary: Stealthy Sabotage of LLM Composition
Title (reference translation): A Trojan Horse in the Vocabulary: Stealthy Sabotage of LLM Composition
Authors: Xiaoze Liu, Weichen Yu, Matt Fredrikson, Xiaoqian Wang, Jing Gao
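Abstract summary: Tokenizer transplant introduces a supply-chain vulnerability. By exploiting the geometry of coefficient reuse, our attack creates an asymmetric realizability gap. Empirically, the attack is training-free and achieves spectral mimicry to evade outlier detection.
Reference score (independently computed attention score): 31.827344197678126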
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The open-weight LLM ecosystem is increasingly defined by model composition techniques (such as weight merging, speculative decoding, and vocabulary expansion) that remix capabilities from diverse sources. A critical prerequisite for applying these methods across different model families is tokenizer transplant, which aligns incompatible vocabularies to a shared embedding space. We demonstrate that this essential interoperability step introduces a supply-chain vulnerability: we engineer a single "breaker token" that is functionally inert in a donor model yet reliably reconstructs into a high-salience malicious feature after transplant into a base model. By exploiting the geometry of coefficient reuse, our attack creates an asymmetric realizability gap that sabotages the base model's generation while leaving the donor's utility statistically indistinguishable from nominal behavior. We formalize this as a dual-objective optimization problem and instantiate the attack using a sparse solver. Empirically, the attack is training-free and achieves spectral mimicry to evade outlier detection, while demonstrating structural persistence against fine-tuning and weight merging, highlighting a hidden risk in the pipeline of modular AI composition. Code is available at https://github.com/xz-liu/tokenforge
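Abstract (reference translation): The open-weight LLM ecosystem is increasingly defined by model composition techniques (weight merging, speculative decoding, vocabulary expansion, and the like) that remix capabilities from diverse sources. An important prerequisite for applying these methods across different model families is tokenizer transplant, which aligns incompatible vocabularies to a shared embedding space. We engineer a single "breaker token" that is functionally inert in the donor model yet reliably reconstructs into a high-salience malicious feature after transplant into the base model. By exploiting the geometry of coefficient reuse, our attack creates an asymmetric realizability gap that sabotages the base model's generation while leaving the donor's utility statistically indistinguishable from nominal behavior. We formalize this as a dual-objective optimization problem and instantiate the attack using a sparse solver. Empirically, the attack is training-free and achieves spectral mimicry to evade outlier detection, while also demonstrating structural persistence against fine-tuning and weight merging, highlighting a hidden risk in the pipeline of modular AI composition. Code is available at https://github.com/xz-liu/tokenforge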


Related papers