Related papers: Accelerating Reinforcement Learning Algorithms Convergence using Pre-trained Large Language Models as Tutors With Advice Reusing

Accelerating Reinforcement Learning Algorithms Convergence using Pre-trained Large Language Models as Tutors With Advice Reusing

URL: http://arxiv.org/abs/2509.08329v1
Date: Wed, 10 Sep 2025 07:08:04 GMT
Title: Accelerating Reinforcement Learning Algorithms Convergence using Pre-trained Large Language Models as Tutors With Advice Reusing
Authors: Lukas Toral, Teddy Lazebnik,
Abstract summary: Large Language Models (LLMs) as tutors in a student-teacher architecture withReinforcement Learning (RL) algorithms.<n>Our results demonstrate that LLM tutoring significantly accelerates RL convergence while maintaining comparable optimal performance.<n>Advice reuse mechanism shows a further improvement in training duration but also results in less stable convergence dynamics.
Score: 5.414308305392762
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Reinforcement Learning (RL) algorithms often require long training to become useful, especially in complex environments with sparse rewards. While techniques like reward shaping and curriculum learning exist to accelerate training, these are often extremely specific and require the developer's professionalism and dedicated expertise in the problem's domain. Tackling this challenge, in this study, we explore the effectiveness of pre-trained Large Language Models (LLMs) as tutors in a student-teacher architecture with RL algorithms, hypothesizing that LLM-generated guidance allows for faster convergence. In particular, we explore the effectiveness of reusing the LLM's advice on the RL's convergence dynamics. Through an extensive empirical examination, which included 54 configurations, varying the RL algorithm (DQN, PPO, A2C), LLM tutor (Llama, Vicuna, DeepSeek), and environment (Blackjack, Snake, Connect Four), our results demonstrate that LLM tutoring significantly accelerates RL convergence while maintaining comparable optimal performance. Furthermore, the advice reuse mechanism shows a further improvement in training duration but also results in less stable convergence dynamics. Our findings suggest that LLM tutoring generally improves convergence, and its effectiveness is sensitive to the specific task, RL algorithm, and LLM model combination.

Related papers

Guiding Exploration in Reinforcement Learning Through LLM-Augmented Observations [0.0]
Large Language Models (LLMs) possess procedural knowledge and reasoning capabilities from text pretraining.<n>We propose a framework that provides LLM-generated action recommendations through augmented observation spaces.
arXiv Detail & Related papers (2025-10-09T19:54:31Z)
Agentic Reinforced Policy Optimization [66.96989268893932]
Large-scale reinforcement learning with verifiable rewards (RLVR) has demonstrated its effectiveness in harnessing the potential of large language models (LLMs) for single-turn reasoning tasks.<n>Current RL algorithms inadequately balance the models' intrinsic long-horizon reasoning capabilities and their proficiency in multi-turn tool interactions.<n>We propose Agentic Reinforced Policy Optimization (ARPO), a novel agentic RL algorithm tailored for training multi-turn LLM-based agents.
arXiv Detail & Related papers (2025-07-26T07:53:11Z)
Omni-Thinker: Scaling Cross-Domain Generalization in LLMs via Multi-Task RL with Hybrid Rewards [50.21528417884747]
We introduce Omni-Thinker, a unified reinforcement learning framework that enhances large language models (LLMs) performance across diverse tasks.<n>Our approach enables consistent optimization across task types and scales RL-based training to subjective domains.<n> Experimental results across four domains reveal that curriculum learning improves performance by 5.2% over joint training and 9.1% over model merging.
arXiv Detail & Related papers (2025-07-20T01:50:16Z)
Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training [121.5858973157225]
We investigate the effects of prolonged reinforcement learning on a small language model across a diverse set of reasoning domains.<n>We introduce controlled KL regularization, clipping ratio, and periodic reference policy resets as critical components for unlocking long-term performance gains.<n>Our model achieves significant improvements over strong baselines, including +14.7% on math, +13.9% on coding, and +54.8% on logic puzzle tasks.
arXiv Detail & Related papers (2025-07-16T17:59:24Z)
SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks [110.20297293596005]
Large language model (LLM) agents need to perform multi-turn interactions in real-world tasks.<n>Existing multi-turn RL algorithms for optimizing LLM agents fail to perform effective credit assignment over multiple turns while leveraging the generalization capabilities of LLMs.<n>We propose a novel RL algorithm, SWEET-RL, that uses a carefully designed optimization objective to train a critic model with access to additional training-time information.<n>Our experiments demonstrate that SWEET-RL achieves a 6% absolute improvement in success and win rates on ColBench compared to other state-of-the-art multi-turn RL algorithms.
arXiv Detail & Related papers (2025-03-19T17:55:08Z)
R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning [87.30285670315334]
textbfR1-Searcher is a novel two-stage outcome-based RL approach designed to enhance the search capabilities of Large Language Models.<n>Our framework relies exclusively on RL, without requiring process rewards or distillation for a cold start.<n>Our experiments demonstrate that our method significantly outperforms previous strong RAG methods, even when compared to the closed-source GPT-4o-mini.
arXiv Detail & Related papers (2025-03-07T17:14:44Z)
Reinforcement Learning Enhanced LLMs: A Survey [45.57586245741664]
We will make a systematic review of the most up-to-date state of knowledge on RL-enhanced large language models (LLMs)<n>Specifically, we detail the basics of RL; (2) introduce popular RL-enhanced LLMs; (3) review researches on two widely-used reward model-based RL techniques: Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF)
arXiv Detail & Related papers (2024-12-05T16:10:42Z)
ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL [80.10358123795946]
We develop a framework for building multi-turn RL algorithms for fine-tuning large language models. Our framework adopts a hierarchical RL approach and runs two RL algorithms in parallel. Empirically, we find that ArCHer significantly improves efficiency and performance on agent tasks.
arXiv Detail & Related papers (2024-02-29T18:45:56Z)
Learning to Optimize for Reinforcement Learning [58.01132862590378]
Reinforcement learning (RL) is essentially different from supervised learning, and in practice, these learneds do not work well even in simple RL tasks. Agent-gradient distribution is non-independent and identically distributed, leading to inefficient meta-training. We show that, although only trained in toy tasks, our learned can generalize unseen complex tasks in Brax.
arXiv Detail & Related papers (2023-02-03T00:11:02Z)

This list is automatically generated from the titles and abstracts of the papers in this site.