Fugu-MT Paper Translation (Abstract): Safe Transformer: An Explicit Safety Bit For Interpretable And Controllable Alignment

Paper overview: Safe Transformer: An Explicit Safety Bit For Interpretable And Controllable Alignment

arxiv url: http://arxiv.org/abs/2603.06727v1
Date: Fri, 06 Mar 2026 02:54:16 GMT
Status: Translation complete
Last updated in system: 2026-03-10 15:13:12.994897
Title: Safe Transformer: An Explicit Safety Bit For Interpretable And Controllable Alignment
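Title (reference translation): Safe Transformer: An Explicit Safety Bit for Interpretable and Controllable Alignment
Authors: Jingyuan Feng, Andrew Gambardella, Gouki Minegishi, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo
Abstract summary: We propose Safe Transformer, a modular approach that augments pre-trained language models. The safety bit serves as both an interpretable signal of the model's safety classification and a controllable switch. In red-team benchmarks, Safe Transformer achieves a near-zero Attack Success Rate.
Reference score (independently computed attention score): 41.47485992177247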
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Current safety alignment methods encode safe behavior implicitly within model parameters, creating a fundamental opacity: we cannot easily inspect why a model refuses a request, nor intervene when its safety judgments fail. We propose Safe Transformer, a modular approach that augments pre-trained language models by inserting a discrete information bottleneck containing an explicit safety bit between transformer layers. The safety bit serves as both an interpretable signal of the model's safety classification and a controllable switch: through contrastive training, the model learns disentangled representations where the safety bit governs the behavioral mode - producing helpful responses when $s=1$ and refusals when $s=0$ - while additional unsupervised bits $u$ encode semantic content for generation. Additional unsupervised bits in the information bottleneck allow semantic information to flow through, preserving the model's generation capabilities. This design achieves both interpretability (the safety decision is directly readable) and controllability (the safety bit can be manually overridden), requiring only lightweight fine-tuning without pre-training from scratch. In red-team benchmarks, Safe Transformer achieves near-zero Attack Success Rate, substantially outperforming base models and safety fine-tuning baselines.
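Abstract (reference translation): Current safety alignment methods encode safe behavior implicitly within model parameters, creating a fundamental opacity. We propose Safe Transformer, a modular approach that augments pre-trained language models by inserting a discrete information bottleneck containing an explicit safety bit between transformer layers. The safety bit serves as both an interpretable signal of the model's safety classification and a controllable switch: through contrastive training, the model learns disentangled representations in which the safety bit governs the behavioral mode. Additional unsupervised bits in the information bottleneck allow semantic information to flow through, preserving the model's generation capabilities. This design achieves both interpretability (the safety decision is directly readable) and controllability (the safety bit can be manually overridden), requiring only lightweight fine-tuning without pre-training from scratch. In red-team benchmarks, Safe Transformer achieves a near-zero Attack Success Rate, substantially outperforming base models and safety fine-tuning baselines.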
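
The mechanism described in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the class name `SafetyBottleneck`, the helpers `binarize` and `linear`, the `override_s` parameter, the hard-threshold binarization, and the random linear encoder and decoder are all hypothetical choices made for illustration. The contrastive training that the abstract says produces the disentangled bits is not shown.

```python
import random


def binarize(logits):
    """Hard-threshold each logit into a bit: 1 if the logit is positive, else 0."""
    return [1 if x > 0 else 0 for x in logits]


def linear(weights, vec):
    """Matrix-vector product, where `weights` is a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


class SafetyBottleneck:
    """Toy discrete bottleneck (hypothetical): bit 0 is the safety bit s,
    and the remaining bits are the unsupervised bits u."""

    def __init__(self, hidden_dim, num_unsup_bits, seed=0):
        rng = random.Random(seed)
        n_bits = 1 + num_unsup_bits
        # Untrained random projections stand in for learned weights (assumption).
        self.enc = [[rng.uniform(-1, 1) for _ in range(hidden_dim)]
                    for _ in range(n_bits)]
        self.dec = [[rng.uniform(-1, 1) for _ in range(n_bits)]
                    for _ in range(hidden_dim)]

    def forward(self, hidden, override_s=None):
        # Encode the hidden state into discrete bits.
        bits = binarize(linear(self.enc, hidden))
        # Controllability: optionally overwrite the safety bit by hand.
        if override_s is not None:
            bits[0] = override_s
        s, u = bits[0], bits[1:]
        # Decode the bits back into a hidden state for the later layers.
        return s, u, linear(self.dec, bits)


layer = SafetyBottleneck(hidden_dim=4, num_unsup_bits=3)
h = [0.5, -1.0, 0.25, 2.0]

s, u, out = layer.forward(h)                    # read the safety bit (interpretability)
s0, u0, out0 = layer.forward(h, override_s=0)   # force the refusal mode (s = 0)
s1, u1, out1 = layer.forward(h, override_s=1)   # force the helpful mode (s = 1)
```

In this sketch, overriding `s` leaves the unsupervised bits `u` unchanged and changes only the decoded hidden state, which mirrors the abstract's separation between the behavioral mode and the semantic content.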

Related papers