Fugu-MT Paper Translation (Abstract): Subliminal Steering: Stronger Encoding of Hidden Signals

Paper summary: Subliminal Steering: Stronger Encoding of Hidden Signals

arxiv url: http://arxiv.org/abs/2604.25783v1
Date: Tue, 28 Apr 2026 15:51:55 GMT
Status: Translation complete
System update date: 2026-04-29 16:49:17.933448
Title: Subliminal Steering: Stronger Encoding of Hidden Signals
Authors: George Morgulis, John Hewitt
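Abstract summary: Subliminal learning describes a student language model inheriting a behavioral bias by fine-tuning on seemingly innocuous data. Subliminal steering is a variant of subliminal learning in which the teacher's bias is implemented not via a system prompt but through a steering vector trained to maximize the likelihood of a set of target samples. Subliminal steering transfers complex multi-word biases, whereas prior work focused on single-word preferences, demonstrating a large scope of subliminally transferable signals.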
Reference score (independently computed attention metric): 5.13724383217928
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Subliminal learning describes a student language model inheriting a behavioral bias by fine-tuning on seemingly innocuous data generated by a biased teacher model. Prior work has begun to characterize this phenomenon but leaves open questions about the scope of signals it can transfer, the mechanisms that explain it, and the precision with which a bias can be encoded by seemingly unrelated data. We tackle all three problems by introducing subliminal steering, a variant of subliminal learning in which the teacher's bias is implemented not via a system prompt, as in prior work, but through a steering vector trained to maximize the likelihood of a set of target samples. First, we show that subliminal steering transfers complex multi-word biases, whereas prior work focused on single-word preferences, demonstrating a large scope of subliminally transferrable signals. Second, we provide mechanistic evidence that subliminal learning transfers not only the target behavioral bias, but also the steering vector itself, localized to the layers at which the teacher was steered. Finally, we show that the bias is encoded with surprising precision. We train a new steering vector directly on the subliminally-laden dataset and find that it attains high cosine similarity with the original vector.
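The abstract's final claim compares a steering vector retrained on the dataset against the original one using cosine similarity. The following is a minimal sketch of that comparison, together with a toy version of adding a steering vector to a hidden state. It assumes plain Python lists in place of real model activations; the function names `cosine_similarity` and `apply_steering`, the `scale` parameter, and all numeric values are hypothetical illustrations, not the authors' implementation or data.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def apply_steering(
    hidden: Sequence[float], steer: Sequence[float], scale: float = 1.0
) -> List[float]:
    """Add a scaled steering vector to a single hidden-state vector.

    Toy stand-in (an assumption): a real model would apply this to the
    activations of a chosen layer rather than to a Python list.
    """
    if len(hidden) != len(steer):
        raise ValueError("vectors must have the same length")
    return [h + scale * s for h, s in zip(hidden, steer)]


# Hypothetical toy vectors, not taken from the paper.
original_vector = [1.0, 2.0, 3.0]
recovered_vector = [2.0, 4.0, 6.0]  # same direction, different magnitude
print(cosine_similarity(original_vector, recovered_vector))
```

Cosine similarity ignores vector magnitude, so two vectors pointing in the same direction score close to 1 regardless of scale. That is why it suits a check of whether a recovered vector encodes the same direction as the original.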

Related papers