Fugu-MT paper translation (abstract): How's it going? Reinforcement learning in language models recruits a functional welfare axis

Paper abstract: How's it going? Reinforcement learning in language models recruits a functional welfare axis

arxiv url: http://arxiv.org/abs/2605.30232v1
Date: Thu, 28 May 2026 17:03:18 GMT
Status: Translation complete
Last updated in system: 2026-05-30 02:45:56.56705
Title: How's it going? Reinforcement learning in language models recruits a functional welfare axis
Title (reference translation): Reinforcement learning in language models recruits a functional welfare axis
Authors: Andy Q Han, David J. Chalmers, Pavel Izmailov
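Abstract summary: We show that RL recruits a pre-existing representation of functional welfare. We train several language models in a semantically neutral maze environment. We argue that this functional welfare axis pre-exists post-training.
Reference score (independently computed attention score): 7.480328535010549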
License: http://creativecommons.org/licenses/by/4.0/
Abstract: How does reinforcement learning shape a language model's internal representations? We present evidence that RL recruits a pre-existing representation of functional welfare: an estimate of how well or badly the system is doing, relative to its goals. We train several language models in a novel, semantically neutral maze environment. We then extract concept vectors for rewarded and punished trajectories, and evaluate those vectors in settings unrelated to the maze environment. The punishment vector behaves like a representation of negative welfare: it promotes failure and impossibility tokens, it aligns with negative emotion concepts, it negatively tracks goal-achievement, and steering with it induces negative self-reports, pathological backtracking, refusal, and uncertainty. The positive reward vector behaves as the mirror image, and the two are nearly antiparallel. These effects are robust when controlling for tile-to-reward mapping, scale, instruct tuning, RL training algorithm, model family, and LoRA versus full-finetuning, and largely persist when we replace RL with supervised fine-tuning. Importantly, the vectors are effective in models before they have undergone maze training. Combined with observations that the effects also appear in pretrain-only models, we therefore argue that this functional welfare axis pre-exists post-training: it is recruited, rather than created, by post-training. While we make no claims about any experience of welfare, the axis offers a demonstration that minimal reward signals can broadly affect model behavior by recruiting pre-existing welfare-like representations, with implications for interpretability, post-training dynamics, and alignment.
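Abstract (reference translation): How does reinforcement learning shape a language model's internal representations? We present evidence that RL recruits a pre-existing representation of functional welfare. We train several language models in a semantically neutral maze environment. We then extract concept vectors for rewarded and punished trajectories and evaluate those vectors in settings unrelated to the maze environment. The punishment vector behaves like a representation of negative welfare: it promotes failure and impossibility tokens, aligns with negative emotion concepts, negatively tracks goal achievement, and steering with it induces negative self-reports, pathological backtracking, refusal, and uncertainty. The positive reward vector behaves as the mirror image, and the two are nearly antiparallel. These effects are robust when controlling for tile-to-reward mapping, scale, instruct tuning, RL training algorithm, model family, and LoRA versus full fine-tuning, and they largely persist when RL is replaced with supervised fine-tuning. Importantly, the vectors are effective in models before they have undergone maze training. Combined with the observation that the effects also appear in pretrain-only models, we argue that this functional welfare axis is recruited, rather than created, by post-training. While we make no claims about any experience of welfare, the axis shows that minimal reward signals can broadly affect model behavior by recruiting pre-existing welfare-like representations, with implications for interpretability, post-training dynamics, and alignment.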
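
The abstract mentions extracting concept vectors, checking that the reward and punishment vectors are nearly antiparallel, and steering with them, but it does not say how any of this is computed. The following is a minimal sketch under stated assumptions, not the authors' method: it assumes a difference-of-means concept vector, cosine similarity as the antiparallel check, and additive steering along the normalized vector. All data are synthetic, and every name (`concept_vector`, `cosine`, `steer`, `axis`) is hypothetical.

```python
import numpy as np


def concept_vector(target_acts, baseline_acts):
    """Assumed extraction rule: mean target activation minus mean baseline activation."""
    return target_acts.mean(axis=0) - baseline_acts.mean(axis=0)


def cosine(u, v):
    """Cosine similarity; values near -1 mean the vectors are nearly antiparallel."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def steer(hidden, vec, alpha):
    """Assumed steering rule: add alpha times the unit-normalized concept vector."""
    return hidden + alpha * vec / np.linalg.norm(vec)


rng = np.random.default_rng(0)
dim = 64

# Hypothetical ground-truth direction standing in for a "welfare axis".
axis = rng.normal(size=dim)
axis /= np.linalg.norm(axis)

# Synthetic stand-ins for hidden states: neutral, rewarded (+axis), punished (-axis).
neutral = rng.normal(size=(500, dim))
rewarded = rng.normal(size=(500, dim)) + 3.0 * axis
punished = rng.normal(size=(500, dim)) - 3.0 * axis

reward_vec = concept_vector(rewarded, neutral)
punish_vec = concept_vector(punished, neutral)

# Antiparallel check between the two extracted vectors.
similarity = cosine(reward_vec, punish_vec)
print("cosine(reward, punish):", similarity)

# Steering neutral states with the punishment vector lowers their projection on the axis.
steered = steer(neutral, punish_vec, alpha=4.0)
print("mean projection before:", float((neutral @ axis).mean()))
print("mean projection after: ", float((steered @ axis).mean()))
```

In a real model the synthetic arrays would be replaced by hidden states collected at a chosen layer. Which layer, which token positions, and which baseline to use are all choices the abstract leaves unspecified.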


Related papers