Fugu-MT Paper Translation (Abstract): Thinking Wrong in Silence: Backdoor Attacks on Continuous Latent Reasoning

Paper abstract: Thinking Wrong in Silence: Backdoor Attacks on Continuous Latent Reasoning

arxiv url: http://arxiv.org/abs/2604.00770v1
Date: Wed, 01 Apr 2026 11:34:55 GMT
Status: Translation complete
System update date: 2026-04-02 16:44:31.958005
Title: Thinking Wrong in Silence: Backdoor Attacks on Continuous Latent Reasoning
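Authors: Swapnil Parekh
Abstract summary: A new generation of language models reasons entirely in continuous hidden states, producing no tokens and leaving no audit trail. ThoughtSteer achieves >=99% attack success rate with near-baseline clean accuracy. Individual latent vectors still encode the correct answer even as the model outputs the wrong one.
Reference score (independently computed attention score): 1.3011345529764784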
License: http://creativecommons.org/licenses/by/4.0/
Abstract: A new generation of language models reasons entirely in continuous hidden states, producing no tokens and leaving no audit trail. We show that this silence creates a fundamentally new attack surface. ThoughtSteer perturbs a single embedding vector at the input layer; the model's own multi-pass reasoning amplifies this perturbation into a hijacked latent trajectory that reliably produces the attacker's chosen answer, while remaining structurally invisible to every token-level defense. Across two architectures (Coconut and SimCoT), three reasoning benchmarks, and model scales from 124M to 3B parameters, ThoughtSteer achieves >=99% attack success rate with near-baseline clean accuracy, transfers to held-out benchmarks without retraining (94-100%), evades all five evaluated active defenses, and survives 25 epochs of clean fine-tuning. We trace these results to a unifying mechanism: Neural Collapse in the latent space pulls triggered representations onto a tight geometric attractor, explaining both why defenses fail and why any effective backdoor must leave a linearly separable signature (probe AUC>=0.999). Yet a striking paradox emerges: individual latent vectors still encode the correct answer even as the model outputs the wrong one. The adversarial information is not in any single vector but in the collective trajectory, establishing backdoor perturbations as a new lens for mechanistic interpretability of continuous reasoning. Code and checkpoints are available.
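The "linearly separable signature (probe AUC>=0.999)" claim can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the data are synthetic Gaussian clusters standing in for clean and triggered latent vectors, the probe is a plain least-squares linear classifier, and the names `fit_linear_probe`, `probe_scores`, and `auc_score` are hypothetical. None of it is taken from the paper's code or data.

```python
import numpy as np

def auc_score(scores, labels):
    """AUC via the Mann-Whitney U statistic (assumes no tied scores)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    u = ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

def fit_linear_probe(x, y):
    """Least-squares linear probe; returns weights with a trailing bias term."""
    xb = np.hstack([x, np.ones((len(x), 1))])
    w, *_ = np.linalg.lstsq(xb, 2.0 * y - 1.0, rcond=None)
    return w

def probe_scores(w, x):
    """Apply the probe: higher score means 'more likely triggered'."""
    return np.hstack([x, np.ones((len(x), 1))]) @ w

rng = np.random.default_rng(0)
dim, n = 32, 400

# Synthetic stand-ins (assumption): clean latents are a broad cloud,
# triggered latents form a tight cluster shifted along one direction.
shift = np.zeros(dim)
shift[0] = 6.0
clean = rng.normal(size=(n, dim))
triggered = 0.3 * rng.normal(size=(n, dim)) + shift

x = np.vstack([clean, triggered])
y = np.concatenate([np.zeros(n), np.ones(n)])
perm = rng.permutation(2 * n)
x, y = x[perm], y[perm]

# Train on one half, evaluate the probe's AUC on the held-out half.
half = n
w = fit_linear_probe(x[:half], y[:half])
probe_auc = auc_score(probe_scores(w, x[half:]), y[half:])
print(f"held-out probe AUC: {probe_auc:.4f}")
```

On real models the inputs would be latent reasoning states collected from clean and triggered prompts; a tight, shifted cluster of the kind simulated here is what makes a simple linear probe sufficient.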
