Efficient Reinforcement Learning with Large Language Model Priors
- URL: http://arxiv.org/abs/2410.07927v1
- Date: Thu, 10 Oct 2024 13:54:11 GMT
- Title: Efficient Reinforcement Learning with Large Language Model Priors
- Authors: Xue Yan, Yan Song, Xidong Feng, Mengyue Yang, Haifeng Zhang, Haitham Bou Ammar, Jun Wang,
- Abstract summary: Large language models (LLMs) have recently emerged as powerful general-purpose tools.
We propose treating LLMs as prior action distributions and integrating them into RL frameworks.
We show that incorporating LLM-based action priors significantly reduces exploration and complexity optimization.
- Score: 18.72288751305885
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In sequential decision-making (SDM) tasks, methods like reinforcement learning (RL) and heuristic search have made notable advances in specific cases. However, they often require extensive exploration and face challenges in generalizing across diverse environments due to their limited grasp of the underlying decision dynamics. In contrast, large language models (LLMs) have recently emerged as powerful general-purpose tools, due to their capacity to maintain vast amounts of domain-specific knowledge. To harness this rich prior knowledge for efficiently solving complex SDM tasks, we propose treating LLMs as prior action distributions and integrating them into RL frameworks through Bayesian inference methods, making use of variational inference and direct posterior sampling. The proposed approaches facilitate the seamless incorporation of fixed LLM priors into both policy-based and value-based RL frameworks. Our experiments show that incorporating LLM-based action priors significantly reduces exploration and optimization complexity, substantially improving sample efficiency compared to traditional RL techniques, e.g., using LLM priors decreases the number of required samples by over 90% in offline learning scenarios.
Related papers
- Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search [57.28671084993782]
Large language models (LLMs) have demonstrated remarkable reasoning capabilities across diverse domains.
Recent studies have shown that increasing test-time computation enhances LLMs' reasoning capabilities.
We propose a two-stage training paradigm: 1) a small-scale format tuning stage to internalize the COAT reasoning format and 2) a large-scale self-improvement stage leveraging reinforcement learning.
arXiv Detail & Related papers (2025-02-04T17:26:58Z) - Improving Multi-Step Reasoning Abilities of Large Language Models with Direct Advantage Policy Optimization [22.67700436936984]
We introduce Direct Advantage Policy Optimization (DAPO), a novel step-level offline reinforcement learning algorithm.
DAPO employs a critic function to predict the reasoning accuracy at each step, thereby generating dense signals to refine the generation strategy.
Our results show that DAPO can effectively enhance the mathematical and code capabilities on both SFT models and RL models, demonstrating the effectiveness of DAPO.
arXiv Detail & Related papers (2024-12-24T08:39:35Z) - Regularized Multi-LLMs Collaboration for Enhanced Score-based Causal Discovery [13.654021365091305]
We explore the potential of using large language models (LLMs) to enhance causal discovery approaches.
We propose a general framework to utilise the capacity of not only one but multiple LLMs to augment the discovery process.
arXiv Detail & Related papers (2024-11-27T01:56:21Z) - EVOLvE: Evaluating and Optimizing LLMs For Exploration [76.66831821738927]
Large language models (LLMs) remain under-studied in scenarios requiring optimal decision-making under uncertainty.
We measure LLMs' (in)ability to make optimal decisions in bandits, a state-less reinforcement learning setting relevant to many applications.
Motivated by the existence of optimal exploration algorithms, we propose efficient ways to integrate this algorithmic knowledge into LLMs.
arXiv Detail & Related papers (2024-10-08T17:54:03Z) - Empowering Few-Shot Relation Extraction with The Integration of Traditional RE Methods and Large Language Models [48.846159555253834]
Few-Shot Relation Extraction (FSRE) appeals to more researchers in Natural Language Processing (NLP)
Recent emergence of Large Language Models (LLMs) has prompted numerous researchers to explore FSRE through In-Context Learning (ICL)
arXiv Detail & Related papers (2024-07-12T03:31:11Z) - Improve Temporal Awareness of LLMs for Sequential Recommendation [61.723928508200196]
Large language models (LLMs) have demonstrated impressive zero-shot abilities in solving a wide range of general-purpose tasks.
LLMs fall short in recognizing and utilizing temporal information, rendering poor performance in tasks that require an understanding of sequential data.
We propose three prompting strategies to exploit temporal information within historical interactions for LLM-based sequential recommendation.
arXiv Detail & Related papers (2024-05-05T00:21:26Z) - Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks.
However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs.
We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z) - How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv Detail & Related papers (2024-02-25T20:07:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.