Fugu-MT Paper Translation (Abstract): Unrewarded Exploration in Large Language Models Reveals Latent Learning from Psychology

Paper overview: Unrewarded Exploration in Large Language Models Reveals Latent Learning from Psychology

arxiv url: http://arxiv.org/abs/2601.22474v1
Date: Fri, 30 Jan 2026 02:39:22 GMT
Status: Translation complete
Last updated in system: 2026-02-02 18:28:15.170789
Title: Unrewarded Exploration in Large Language Models Reveals Latent Learning from Psychology
Authors: Jian Xiong, Jingbo Zhou, Zihan Zhou, Yixiong Xiao, Le Zhang, Jingyong Ye, Rui Qian, Yang Zhou, Dejing Dou
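Abstract summary: The paper shows that large language models (LLMs) exhibit latent learning dynamics. LLMs post-trained under a two-stage exploration regime achieve higher competence than those post-trained with reward-based reinforcement learning throughout.
Reference score (independently computed attention metric): 41.05763794816626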
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Latent learning, classically theorized by Tolman, shows that biological agents (e.g., rats) can acquire internal representations of their environment without rewards, enabling rapid adaptation once rewards are introduced. In contrast, from a cognitive science perspective, reward learning remains overly dependent on external feedback, limiting flexibility and generalization. Although recent advances in the reasoning capabilities of large language models (LLMs), such as OpenAI-o1 and DeepSeek-R1, mark a significant breakthrough, these models still rely primarily on reward-centric reinforcement learning paradigms. Whether and how the well-established phenomenon of latent learning in psychology can inform or emerge within LLMs' training remains largely unexplored. In this work, we present novel findings from our experiments that LLMs also exhibit the latent learning dynamics. During an initial phase of unrewarded exploration, LLMs display modest performance improvements, as this phase allows LLMs to organize task-relevant knowledge without being constrained by reward-driven biases, and performance is further enhanced once rewards are introduced. LLMs post-trained under this two-stage exploration regime ultimately achieve higher competence than those post-trained with reward-based reinforcement learning throughout. Beyond these empirical observations, we also provide theoretical analyses for our experiments explaining why unrewarded exploration yields performance gains, offering a mechanistic account of these dynamics. Specifically, we conducted extensive experiments across multiple model families and diverse task domains to establish the existence of the latent learning dynamics in LLMs.

Related papers