DYSTIL: Dynamic Strategy Induction with Large Language Models for Reinforcement Learning
- URL: http://arxiv.org/abs/2505.03209v1
- Date: Tue, 06 May 2025 05:53:09 GMT
- Title: DYSTIL: Dynamic Strategy Induction with Large Language Models for Reinforcement Learning
- Authors: Borui Wang, Kathleen McKeown, Rex Ying
- Abstract summary: Reinforcement learning from expert demonstrations has long remained a challenging research problem. Existing state-of-the-art methods using behavioral cloning plus further RL training often suffer from poor generalization, low sample efficiency, and poor model interpretability. We propose a novel strategy-based reinforcement learning framework integrated with large language models (LLMs) to overcome these limitations.
- Score: 27.336254612018404
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning from expert demonstrations has long remained a challenging research problem, and existing state-of-the-art methods using behavioral cloning plus further RL training often suffer from poor generalization, low sample efficiency, and poor model interpretability. Inspired by the strong reasoning abilities of large language models (LLMs), we propose a novel strategy-based reinforcement learning framework integrated with LLMs called DYnamic STrategy Induction with Llms for reinforcement learning (DYSTIL) to overcome these limitations. DYSTIL dynamically queries a strategy-generating LLM to induce textual strategies based on advantage estimations and expert demonstrations, and gradually internalizes induced strategies into the RL agent through policy optimization, boosting policy generalization and enhancing sample efficiency. It also provides a direct textual channel to observe and interpret the evolution of the policy's underlying strategies during training. We evaluate DYSTIL on challenging RL environments from Minigrid and BabyAI, and empirically demonstrate that DYSTIL significantly outperforms state-of-the-art baseline methods by 17.75% in average success rate while also enjoying higher sample efficiency during the learning process.
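For illustration, here is a minimal Python sketch of the training loop the abstract describes: a strategy-generating LLM induces textual strategies from expert demonstrations and low-advantage rollouts, the policy acts conditioned on that strategy text, and policy optimization gradually internalizes the strategies. The names (ToyPolicy, induce_strategies, dystil_train) and the stubbed LLM and PPO calls are assumptions for exposition, not the authors' implementation.

```python
# Hedged sketch of a DYSTIL-style loop: LLM-induced textual strategies plus
# strategy-conditioned policy optimization. All components are toy stand-ins.

import random
from dataclasses import dataclass

@dataclass
class Rollout:
    transitions: list          # (obs, action, reward) tuples
    advantage: float = 0.0     # filled in by the advantage estimator

class ToyPolicy:
    """Stand-in for a strategy-conditioned policy network."""
    def collect(self, strategies, episodes=4):
        # Pretend to act in the environment, conditioned on the strategy text.
        return [Rollout([("obs", "act", random.random()) for _ in range(5)])
                for _ in range(episodes)]
    def estimate_advantage(self, rollout):
        # Placeholder for GAE or another advantage estimator.
        return sum(r for _, _, r in rollout.transitions) - 2.5
    def ppo_update(self, rollouts, strategies):
        # Placeholder for a PPO step on strategy-conditioned inputs.
        pass

def induce_strategies(llm, demonstrations, weak_rollouts):
    """Ask the strategy-generating LLM for textual strategies."""
    prompt = (f"Expert demonstrations: {demonstrations}\n"
              f"Low-advantage rollouts: {weak_rollouts}\n"
              "List concise strategies for the agent.")
    return llm(prompt)          # llm is any callable str -> str

def dystil_train(policy, llm, demonstrations, iterations=20, refresh_every=5):
    strategies = induce_strategies(llm, demonstrations, [])
    for it in range(iterations):
        rollouts = policy.collect(strategies)
        for r in rollouts:
            r.advantage = policy.estimate_advantage(r)
        if it % refresh_every == 0:
            # Re-query the LLM, conditioning on the weakest (low-advantage) rollouts.
            weak = sorted(rollouts, key=lambda r: r.advantage)[:2]
            strategies = induce_strategies(llm, demonstrations, weak)
        policy.ppo_update(rollouts, strategies)   # internalize strategies into weights
    return policy, strategies

if __name__ == "__main__":
    fake_llm = lambda prompt: "1. Explore unseen rooms first. 2. Pick up keys before doors."
    policy, strategies = dystil_train(ToyPolicy(), fake_llm, ["demo_1", "demo_2"])
    print("Final strategies:", strategies)  # readable view of the policy's strategies
```

Because the strategies stay in plain text throughout training, printing them at each refresh gives the interpretable channel the abstract mentions.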
Related papers
- Agentic Reinforced Policy Optimization [66.96989268893932]
Large-scale reinforcement learning with verifiable rewards (RLVR) has demonstrated its effectiveness in harnessing the potential of large language models (LLMs) for single-turn reasoning tasks. Current RL algorithms inadequately balance the models' intrinsic long-horizon reasoning capabilities and their proficiency in multi-turn tool interactions. We propose Agentic Reinforced Policy Optimization (ARPO), a novel agentic RL algorithm tailored for training multi-turn LLM-based agents.
arXiv Detail & Related papers (2025-07-26T07:53:11Z) - Feedback-Induced Performance Decline in LLM-Based Decision-Making [6.5990946334144756]
Large Language Models (LLMs) can extract context from natural language problem descriptions. This paper studies the behaviour of these models within a Markov Decision Process (MDP).
arXiv Detail & Related papers (2025-07-20T10:38:56Z) - RAG-R1 : Incentivize the Search and Reasoning Capabilities of LLMs through Multi-query Parallelism [10.288667305064065]
Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks. LLMs remain prone to generating hallucinated or outdated responses due to their static internal knowledge. Recent advancements in Retrieval-Augmented Generation (RAG) methods have aimed to enhance models' search and reasoning capabilities.
arXiv Detail & Related papers (2025-06-30T09:02:45Z) - SAGE: Strategy-Adaptive Generation Engine for Query Rewriting [8.941793732446856]
We introduce the Strategy-Adaptive Generation Engine (SAGE), which operationalizes expert-crafted strategies in a reinforcement learning framework. SAGE not only achieves new state-of-the-art NDCG@10 results but also uncovers a compelling emergent behavior. Our findings demonstrate that strategy-guided RL, enhanced with nuanced reward shaping, offers a scalable, efficient, and more interpretable paradigm for developing the next generation of robust information retrieval systems.
arXiv Detail & Related papers (2025-06-24T16:50:51Z) - Vision-EKIPL: External Knowledge-Infused Policy Learning for Visual Reasoning [17.421901873720156]
This paper proposes a novel RL framework called Vision-EKIPL. It introduces high-quality actions generated by external auxiliary models during the RL training process to guide the optimization of the policy model. It achieves up to a 5% performance improvement on the Reason-RFT-CoT Benchmark compared to the state of the art (SOTA).
arXiv Detail & Related papers (2025-06-07T16:37:46Z) - How Difficulty-Aware Staged Reinforcement Learning Enhances LLMs' Reasoning Capabilities: A Preliminary Experimental Study [16.441081996257576]
This paper presents a rigorous experimental investigation into how difficulty-aware staged reinforcement learning strategies can substantially improve reasoning performance. We show that strategically selecting training data according to well-defined difficulty levels markedly enhances RL optimization. We will open-source our datasets on GitHub and Hugging Face.
arXiv Detail & Related papers (2025-04-01T14:18:38Z) - Do We Truly Need So Many Samples? Multi-LLM Repeated Sampling Efficiently Scales Test-Time Compute [55.330813919992465]
This paper presents a simple, effective, and cost-efficient strategy to improve LLM performance by scaling test-time compute. Our strategy builds upon the repeated-sampling-then-voting framework, with a novel twist: incorporating multiple models, even weaker ones, to leverage their complementary strengths (a minimal sketch of this sampling-and-voting scheme appears after this list).
arXiv Detail & Related papers (2025-04-01T13:13:43Z) - EPO: Explicit Policy Optimization for Strategic Reasoning in LLMs via Reinforcement Learning [69.55982246413046]
We propose explicit policy optimization (EPO) for strategic reasoning. EPO provides strategies in an open-ended action space and can be plugged into arbitrary LLM agents to motivate goal-directed behavior. Experiments across social and physical domains demonstrate EPO's ability to achieve long-term goal alignment.
arXiv Detail & Related papers (2025-02-18T03:15:55Z) - Revisiting Robust RAG: Do We Still Need Complex Robust Training in the Era of Powerful LLMs? [69.38149239733994]
We investigate whether complex robust training strategies remain necessary as model capacity grows. We find that as models become more powerful, the performance gains brought by complex robust training methods drop off dramatically. Our findings suggest that RAG systems can benefit from simpler architectures and training strategies as models become more powerful.
arXiv Detail & Related papers (2025-02-17T03:34:31Z) - From Novice to Expert: LLM Agent Policy Optimization via Step-wise Reinforcement Learning [62.54484062185869]
We introduce StepAgent, which utilizes step-wise reward to optimize the agent's reinforcement learning process. We propose implicit-reward and inverse reinforcement learning techniques to facilitate agent reflection and policy adjustment.
arXiv Detail & Related papers (2024-11-06T10:35:11Z) - Advancing NLP Models with Strategic Text Augmentation: A Comprehensive Study of Augmentation Methods and Curriculum Strategies [0.0]
This study conducts a thorough evaluation of text augmentation techniques across a variety of datasets and natural language processing (NLP) tasks.
It examines the effectiveness of these techniques in augmenting training sets to improve performance in tasks such as topic classification, sentiment analysis, and offensive language detection.
arXiv Detail & Related papers (2024-02-14T12:41:09Z) - Off-Policy Imitation Learning from Observations [78.30794935265425]
Learning from Observations (LfO) is a practical reinforcement learning scenario from which many applications can benefit.
We propose a sample-efficient LfO approach that enables off-policy optimization in a principled manner.
Our approach is comparable with the state of the art on locomotion tasks in terms of both sample efficiency and performance.
arXiv Detail & Related papers (2021-02-25T21:33:47Z) - Model-based Meta Reinforcement Learning using Graph Structured Surrogate Models [40.08137765886609]
We show that our model, called a graph structured surrogate model (GSSM), outperforms state-of-the-art methods in predicting environment dynamics.
Our approach is able to obtain high returns, while allowing fast execution during deployment by avoiding test time policy gradient optimization.
arXiv Detail & Related papers (2021-02-16T17:21:55Z)
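As referenced above, the following is a minimal sketch of the multi-model repeated-sampling-then-voting strategy summarized in "Do We Truly Need So Many Samples?". The toy model callables and the majority-vote aggregation are illustrative assumptions, not that paper's exact procedure.

```python
# Hedged sketch: pool repeated samples from several models (even weak ones)
# and pick the final answer by majority vote.

from collections import Counter

def sample_answer(model, question):
    """Placeholder: one stochastic generation from `model` for `question`."""
    return model(question)

def multi_model_vote(models, question, samples_per_model=4):
    answers = []
    for model in models:
        # Weaker models still contribute: their samples are pooled with the rest.
        answers.extend(sample_answer(model, question) for _ in range(samples_per_model))
    # Majority vote over all pooled samples decides the final answer.
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / len(answers)

if __name__ == "__main__":
    # Toy stand-ins for LLMs: callables that map a question to an answer string.
    strong_model = lambda q: "42"
    weak_model = lambda q: "42" if len(q) % 2 == 0 else "41"
    answer, agreement = multi_model_vote([strong_model, weak_model], "What is 6 * 7?")
    print(answer, f"(agreement {agreement:.0%})")
```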