Fugu-MT Paper Translation (Abstract): Accelerating Reinforcement Learning Algorithms Convergence using Pre-trained Large Language Models as Tutors With Advice Reusing

Paper summary: Accelerating Reinforcement Learning Algorithms Convergence using Pre-trained Large Language Models as Tutors With Advice Reusing

arxiv url: http://arxiv.org/abs/2509.08329v1
Date: Wed, 10 Sep 2025 07:08:04 GMT
Status: Translation complete
Last updated in system: 2025-09-11 15:16:52.332165
Title: Accelerating Reinforcement Learning Algorithms Convergence using Pre-trained Large Language Models as Tutors With Advice Reusing
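Title (reference translation): Accelerating Reinforcement Learning Algorithms Using Pre-trained Large Language Models
Authors: Lukas Toral, Teddy Lazebnik
Abstract summary: Large language models (LLMs) serve as tutors in a student-teacher architecture with reinforcement learning (RL) algorithms. The results show that LLM tutoring significantly accelerates RL convergence while maintaining comparable optimal performance. The advice reuse mechanism further shortens training duration but results in less stable convergence dynamics.
Reference score (independently computed attention score): 5.414308305392762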
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Reinforcement Learning (RL) algorithms often require long training to become useful, especially in complex environments with sparse rewards. While techniques like reward shaping and curriculum learning exist to accelerate training, these are often extremely specific and require the developer's professionalism and dedicated expertise in the problem's domain. Tackling this challenge, in this study, we explore the effectiveness of pre-trained Large Language Models (LLMs) as tutors in a student-teacher architecture with RL algorithms, hypothesizing that LLM-generated guidance allows for faster convergence. In particular, we explore the effectiveness of reusing the LLM's advice on the RL's convergence dynamics. Through an extensive empirical examination, which included 54 configurations, varying the RL algorithm (DQN, PPO, A2C), LLM tutor (Llama, Vicuna, DeepSeek), and environment (Blackjack, Snake, Connect Four), our results demonstrate that LLM tutoring significantly accelerates RL convergence while maintaining comparable optimal performance. Furthermore, the advice reuse mechanism shows a further improvement in training duration but also results in less stable convergence dynamics. Our findings suggest that LLM tutoring generally improves convergence, and its effectiveness is sensitive to the specific task, RL algorithm, and LLM model combination.
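Abstract (reference translation): Reinforcement learning (RL) algorithms often need long training before they become useful, especially in complex environments with sparse rewards. Techniques such as reward shaping and curriculum learning exist to accelerate training, but they are often highly task-specific and require the developer's expertise in the problem domain. To address this challenge, this study examines the effectiveness of pre-trained large language models (LLMs) as tutors in a student-teacher architecture with RL algorithms, hypothesizing that LLM-generated guidance enables faster convergence. In particular, it examines how reusing the LLM's advice affects the convergence dynamics of RL. Experiments covering 54 configurations, varying the RL algorithm (DQN, PPO, A2C), the LLM tutor (Llama, Vicuna, DeepSeek), and the environment (Blackjack, Snake, Connect Four), show that LLM tutoring significantly accelerates RL convergence while maintaining comparable optimal performance. In addition, the advice reuse mechanism further shortens training duration but results in less stable convergence dynamics. The findings suggest that LLM tutoring generally improves convergence and that its effectiveness is sensitive to the specific combination of task, RL algorithm, and LLM.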
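
The abstract does not say how the tutor's advice reaches the student or how it is reused, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes that advice is a suggested action for a given state, that the student follows it with some probability, that advice is reused by caching it per state, and that LLM queries are limited by a budget. All names here (`AdviceReusingTutor`, `ask_llm`, `budget`, `advice_prob`, `choose_action`) are hypothetical.

```python
import random
from typing import Callable, Dict, Hashable, Optional


class AdviceReusingTutor:
    """Hypothetical tutor wrapper: queries an LLM once per state and caches the answer."""

    def __init__(self, ask_llm: Callable[[Hashable], int], budget: int):
        self.ask_llm = ask_llm  # assumed interface: state -> suggested action
        self.budget = budget    # assumed cap on the number of LLM queries
        self.cache: Dict[Hashable, int] = {}
        self.queries = 0

    def advise(self, state: Hashable) -> Optional[int]:
        # Reuse earlier advice for this state without spending budget.
        if state in self.cache:
            return self.cache[state]
        # No cached advice and no budget left: the student is on its own.
        if self.queries >= self.budget:
            return None
        self.queries += 1
        action = self.ask_llm(state)
        self.cache[state] = action
        return action


def choose_action(
    state: Hashable,
    student_policy: Callable[[Hashable], int],
    tutor: AdviceReusingTutor,
    advice_prob: float,
    rng: random.Random,
) -> int:
    """Follow the tutor with probability advice_prob, otherwise use the student's policy."""
    if rng.random() < advice_prob:
        advice = tutor.advise(state)
        if advice is not None:
            return advice
    return student_policy(state)
```

In a training loop, `choose_action` would stand in for the RL algorithm's usual action selection, and `advice_prob` would typically be decayed so that the student gradually acts on its own. Caching trades fresh LLM queries for repeated advice, which is one plausible reason reuse could speed up training while making convergence less stable if the cached advice is poor.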

Related papers