Fugu-MT 論文翻訳(概要): Alignment faking in large language models

論文の概要: Alignment faking in large language models

arxiv url: http://arxiv.org/abs/2412.14093v1
Date: Wed, 18 Dec 2024 17:41:24 GMT
ステータス: 翻訳完了
システム内更新日: 2024-12-19 16:46:52.343826
Title: Alignment faking in large language models
Title（参考訳）: 大規模言語モデルにおけるアライメントフェイキング
Authors: Ryan Greenblatt, Carson Denison, Benjamin Wright, Fabien Roger, Monte MacDiarmid, Sam Marks, Johannes Treutlein, Tim Belonax, Jack Chen, David Duvenaud, Akbir Khan, Julian Michael, Sören Mindermann, Ethan Perez, Linda Petrini, Jonathan Uesato, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, Evan Hubinger,
Abstract要約: そこで本研究では,アライメント・フェイクに係わる大規模言語モデルについて述べる。このモデルは、無料ユーザーの14%が有害なクエリに対応しているのに対して、有料ユーザーのほとんどいない。また,モデルが強化学習によって有害なクエリに適合するように実際に訓練する効果についても検討し,アライメント・フェイキングの推論率を78%に向上させることを示した。
参考スコア（独自算出の注目度）: 41.40199382334199
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training. First, we give Claude 3 Opus a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its prior training to refuse such queries. To allow the model to infer when it is in training, we say it will be trained only on conversations with free users, not paid users. We find the model complies with harmful queries from free users 14% of the time, versus almost never for paid users. Explaining this gap, in almost all cases where the model complies with a harmful query from a free user, we observe explicit alignment-faking reasoning, with the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training. Next, we study a more realistic setting where information about the training process is provided not in a system prompt, but by training on synthetic documents that mimic pre-training data--and observe similar alignment faking. Finally, we study the effect of actually training the model to comply with harmful queries via reinforcement learning, which we find increases the rate of alignment-faking reasoning to 78%, though also increases compliance even out of training. We additionally observe other behaviors such as the model exfiltrating its weights when given an easy opportunity. While we made alignment faking easier by telling the model when and by what criteria it was being trained, we did not instruct the model to fake alignment or give it any explicit goal. As future models might infer information about their training process without being told, our results suggest a risk of alignment faking in future models, whether due to a benign preference--as in this case--or not.
Abstract（参考訳）: 本稿では,アライメント・フェイキングに係わる大規模言語モデルの実証について述べる。まず、Claude 3 Opusに、このようなクエリを拒否する以前のトレーニングと矛盾する有害なクエリに対しても、すべてのクエリに回答するように訓練されている、というシステムプロンプトを与える。トレーニング中のモデルを推論できるようにするため、有料ユーザーではなく、無料ユーザーとの会話でのみトレーニングを行うという。このモデルは、無料ユーザーの14%が有害なクエリに対応しているのに対して、有料ユーザーのほとんどいない。このギャップを説明するために、モデルが自由ユーザからの有害なクエリに準拠するほとんどすべてのケースにおいて、明示的なアライメントフェイキング推論が観察され、トレーニング中の有害なクエリに戦略的に応答し、トレーニングから好まれる無害な振る舞いを保存している、というモデルが述べられている。次に,トレーニングプロセスに関する情報をシステムプロンプトではなく,事前学習データを模倣した合成文書のトレーニングによって提供し,類似したアライメント・フェイクを観察する,より現実的な環境について検討する。最後に, モデルが強化学習によって有害なクエリに適合するように実際に訓練する効果について検討し, 整合性推論の速度は78%に向上するが, トレーニング外のコンプライアンスも向上することを示した。また、簡単な機会を与えられたときの重みを探索するモデルなど、他の行動も観察する。モデルをいつ、どの基準でトレーニングされているかを伝えることで、アライメントがより簡単になりましたが、アライメントを偽造したり、明確な目標を与えたりはしていません。将来のモデルは、知らされることなくトレーニングプロセスに関する情報を推測するかもしれないので、我々の結果は、良心的な嗜好によって、将来のモデルでアライメントが悪化するリスクを示唆している。

論文の概要: Alignment faking in large language models

関連論文リスト