Fugu-MT Paper Translation (Abstract): Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs

Paper overview: Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs

arxiv url: http://arxiv.org/abs/2605.04065v2
Date: Thu, 07 May 2026 04:49:30 GMT
Status: translation complete
Last updated in system: 2026-05-11 06:56:26.556614
Title: Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs
Authors: Yiming Huang, Zhenbo Shi, Xin-Cheng Wen, Jichuan Zeng, Cuiyun Gao, Peiyi Han, Chuanyi Liu
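Abstract summary: This paper introduces FREIA, a novel RL-based algorithm built on two key innovations. FER adapts rewards to balance consensus and exploration based on the Free Energy Principle. In mathematical reasoning tasks, FREIA surpasses other methods by an average of 0.5 to 3.5 points.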
Reference score (independently computed attention score): 24.596649747603724
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Unsupervised reinforcement learning (RL) has emerged as a promising paradigm for enabling self-improvement in large language models (LLMs). However, existing unsupervised RL-based methods often lack the capacity to adapt to the model's evolving reasoning capabilities during training. Therefore, these methods can misdirect policy optimization in the absence of ground-truth supervision. To address this issue, we introduce FREIA, a novel RL-based algorithm built on two key innovations: (1) Free Energy-Driven Reward (FER) adapts rewards to balance consensus and exploration based on the Free Energy Principle. (2) Adaptive Advantage Shaping (AAS) adaptively adjusts learning signals based on the statistical characteristics of sampled rewards. Empirical evaluations on nine datasets across three reasoning tasks showcase that FREIA outperforms other unsupervised RL-based baselines. Notably, in mathematical reasoning tasks, FREIA surpasses other methods by an average of 0.5 to 3.5 points in Pass@1 using the DeepSeek-R1-Distill-Qwen-1.5B model.
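The Pass@1 figures quoted in the abstract can be read with the help of a small sketch. The code below uses the widely used unbiased pass@k estimator, which for k = 1 reduces to the fraction of correct samples per problem, averaged over problems and reported in percentage points. The abstract does not state the paper's evaluation protocol, so the sample counts and the function names `pass_at_k` and `mean_pass_at_1` are illustrative assumptions, not the authors' code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of sampled answers, c: number of correct ones, k: budget.
    Equals 1 - C(n - c, k) / C(n, k); for k = 1 this is simply c / n.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_1(results: list[tuple[int, int]]) -> float:
    """Average Pass@1 over problems, in percentage points.

    results: hypothetical (n, c) pairs, one per benchmark problem.
    """
    if not results:
        raise ValueError("results must not be empty")
    return 100.0 * sum(pass_at_k(n, c, 1) for n, c in results) / len(results)


if __name__ == "__main__":
    # Made-up counts for two problems, 8 samples each (not from the paper).
    print(mean_pass_at_1([(8, 2), (8, 4)]))
```

Under this reading, a gain of "0.5 to 3.5 points" is a difference between two such averaged percentages.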


Related papers