Exploration from a Primal-Dual Lens: Value-Incentivized Actor-Critic Methods for Sample-Efficient Online RL
- URL: http://arxiv.org/abs/2506.22401v1
- Date: Fri, 27 Jun 2025 17:18:43 GMT
- Title: Exploration from a Primal-Dual Lens: Value-Incentivized Actor-Critic Methods for Sample-Efficient Online RL
- Authors: Tong Yang, Bo Dai, Lin Xiao, Yuejie Chi
- Abstract summary: Online reinforcement learning (RL) with complex function approximations plays a significant role in the modern practice of artificial intelligence. Balancing the fundamental trade-off between exploration and exploitation remains a long-standing challenge. This paper provides an interpretation of the principle of optimism through the lens of primal-dual optimization.
- Score: 40.05960121330012
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Online reinforcement learning (RL) with complex function approximations such as transformers and deep neural networks plays a significant role in the modern practice of artificial intelligence. Despite its popularity and importance, balancing the fundamental trade-off between exploration and exploitation remains a long-standing challenge; in particular, we still lack efficient and practical schemes that are backed by theoretical performance guarantees. Motivated by recent developments in exploration via optimistic regularization, this paper provides an interpretation of the principle of optimism through the lens of primal-dual optimization. From this fresh perspective, we set forth a new value-incentivized actor-critic (VAC) method, which optimizes a single easy-to-optimize objective integrating exploration and exploitation -- it promotes state-action value and policy estimates that are both consistent with collected data transitions and result in higher value functions. Theoretically, the proposed VAC method has near-optimal regret guarantees under linear Markov decision processes (MDPs) in both finite-horizon and infinite-horizon settings, which can be extended to the general function approximation setting under appropriate assumptions.
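The abstract does not write out the VAC objective. As a hedged sketch of what a single objective integrating exploration and exploitation could look like (the squared-Bellman form, the bonus weight $\alpha > 0$, and the initial-state distribution $\rho_0$ below are illustrative assumptions, not taken from the paper):

$$
\min_{Q \in \mathcal{Q}} \; \sum_{(s, a, r, s') \in \mathcal{D}} \Big( Q(s, a) - r - \gamma \max_{a'} Q(s', a') \Big)^2 \;-\; \alpha \, \mathbb{E}_{s \sim \rho_0} \Big[ \max_{a} Q(s, a) \Big]
$$

The first term keeps the estimate consistent with collected data transitions; the second rewards estimates with higher value functions, so optimism enters as a regularizer rather than an explicit exploration bonus.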
Related papers
- Preference Optimization for Combinatorial Optimization Problems [54.87466279363487]
Reinforcement Learning (RL) has emerged as a powerful tool for neural combinatorial optimization, enabling models to learn to solve complex problems without requiring expert knowledge. Despite significant progress, existing RL approaches face challenges such as diminishing reward signals and inefficient exploration in vast action spaces. We propose Preference Optimization, a novel method that transforms quantitative reward signals into qualitative preference signals via statistical comparison modeling.
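As a hedged illustration of turning quantitative rewards into qualitative preferences (the Bradley-Terry comparison below is an assumed stand-in for the paper's statistical comparison model):

```python
import math
import random

def reward_to_preference(reward_a: float, reward_b: float,
                         temperature: float = 1.0) -> int:
    """Sample a qualitative preference label from two scalar rewards.

    Uses a Bradley-Terry-style comparison (an assumption): the probability
    that A is preferred grows with its reward advantage over B.
    Returns 1 if A is preferred, 0 otherwise.
    """
    p_a = 1.0 / (1.0 + math.exp(-(reward_a - reward_b) / temperature))
    return 1 if random.random() < p_a else 0

# Example: two candidate solutions with rewards 1.3 and 0.9 yield a
# stochastic preference label rather than a raw reward difference.
label = reward_to_preference(1.3, 0.9)
```

A preference signal of this kind stays informative even as the absolute reward scale shrinks, which is one plausible reading of how the method addresses diminishing reward signals.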
arXiv Detail & Related papers (2025-05-13T16:47:00Z)
- Online Reward-Weighted Fine-Tuning of Flow Matching with Wasserstein Regularization [14.320131946691268]
We propose an easy-to-use and theoretically sound fine-tuning method for flow-based generative models. By introducing an online reward-weighting mechanism, our approach guides the model to prioritize high-reward regions in the data manifold. Our method achieves optimal policy convergence while allowing controllable trade-offs between reward and diversity.
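A minimal sketch of an online reward-weighted flow-matching loss, assuming a linear interpolation path and exponential weights (the exact weighting mechanism and the Wasserstein regularizer are not specified in the summary):

```python
import torch

def reward_weighted_fm_loss(model, x1, rewards, beta=1.0):
    """Flow-matching loss with per-sample reward weights (assumed form).

    model:   callable (x_t, t) -> predicted velocity
    x1:      data batch, shape (B, D)
    rewards: tensor of shape (B,)
    """
    x0 = torch.randn_like(x1)                       # noise endpoint
    t = torch.rand(x1.shape[0], 1)                  # random times in [0, 1]
    xt = (1 - t) * x0 + t * x1                      # linear interpolation path
    target_v = x1 - x0                              # conditional velocity target
    per_sample = ((model(xt, t) - target_v) ** 2).mean(dim=-1)
    weights = torch.softmax(beta * rewards, dim=0)  # prioritize high-reward regions
    return (weights * per_sample).sum()
```

Larger `beta` sharpens the trade-off toward reward while smaller `beta` preserves diversity, matching the controllable trade-off the summary describes.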
arXiv Detail & Related papers (2025-02-09T22:45:15Z)
- Nonmyopic Global Optimisation via Approximate Dynamic Programming [14.389086937116582]
We introduce novel nonmyopic acquisition strategies tailored to IDW- and RBF-based global optimisation. Specifically, we develop dynamic-programming-based paradigms, including rollout and multi-step scenario-based optimisation schemes.
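A generic rollout-acquisition sketch (the surrogate interface and two-step horizon below are hypothetical, not the paper's IDW/RBF-specific scheme): the candidate is scored by fantasizing future greedy queries instead of by its immediate acquisition value alone.

```python
def rollout_acquisition(surrogate, base_acq, candidate, horizon=2):
    """Nonmyopic score for `candidate` via a short rollout.

    surrogate: hypothetical model with predict(x) and update(x, y) -> new model
    base_acq:  callable (surrogate, x) -> myopic acquisition score
    suggest:   hypothetical method returning the greedy next query
    """
    sim, x, total = surrogate, candidate, 0.0
    for _ in range(horizon):
        total += base_acq(sim, x)
        y_hat = sim.predict(x)        # fantasized observation at x
        sim = sim.update(x, y_hat)    # condition a copy of the surrogate on it
        x = sim.suggest(base_acq)     # greedy next query under the updated model
    return total
```

This is the rollout idea from approximate dynamic programming; a scenario-based variant would average such scores over several fantasized observation scenarios.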
arXiv Detail & Related papers (2024-12-06T09:25:00Z)
- Analyzing and Bridging the Gap between Maximizing Total Reward and Discounted Reward in Deep Reinforcement Learning [17.245293915129942]
The optimal objective is a fundamental aspect of reinforcement learning (RL). While total return is ideal, discounted return is a practical objective due to its stability. We propose two alternative approaches to align the objectives.
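For concreteness, the two objectives being bridged are:

$$
J_{\mathrm{total}}(\pi) = \mathbb{E}_{\pi} \Big[ \sum_{t=0}^{T} r_t \Big], \qquad J_{\gamma}(\pi) = \mathbb{E}_{\pi} \Big[ \sum_{t=0}^{\infty} \gamma^{t} r_t \Big], \quad \gamma \in (0, 1).
$$

A policy maximizing $J_{\gamma}$ can differ from one maximizing $J_{\mathrm{total}}$ whenever rewards beyond the effective horizon of roughly $1/(1-\gamma)$ steps matter; the summary does not specify the two alignment approaches.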
arXiv Detail & Related papers (2024-07-18T08:33:10Z)
- Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF [80.32171988565999]
We introduce a unified approach to online and offline RLHF -- value-incentivized preference optimization (VPO). VPO regularizes the maximum-likelihood estimate of the reward function with the corresponding value function. Experiments on text summarization and dialog verify the practicality and effectiveness of VPO.
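In equation form, the regularization described here might be sketched as (the weight $\alpha$ and the sign convention are assumptions; plausibly the sign implements optimism for online and pessimism for offline RLHF):

$$
\hat{r} \;=\; \arg\min_{r} \; \ell_{\mathrm{MLE}}(r; \mathcal{D}) \,\mp\, \alpha \, V^{\star}_{r},
$$

where $\ell_{\mathrm{MLE}}$ is the preference log-likelihood loss and $V^{\star}_{r}$ is the optimal value under reward model $r$.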
arXiv Detail & Related papers (2024-05-29T17:51:42Z)
- Efficient Reinforcement Learning via Decoupling Exploration and Utilization [6.305976803910899]
Reinforcement Learning (RL) has achieved remarkable success across multiple fields and applications, including gaming, robotics, and autonomous vehicles.
In this work, we aim to train agents efficiently by decoupling exploration and utilization, so that the agent can escape the conundrum of suboptimal solutions.
The above idea is implemented in the proposed OPARL (Optimistic and Pessimistic Actor Reinforcement Learning) algorithm.
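A hedged sketch of the decoupling with a critic ensemble (the max/min-over-ensemble construction and function names are illustrative assumptions, not OPARL's exact design):

```python
import torch

def q_matrix(critics, state, actions):
    """Stack ensemble Q-estimates: shape (n_critics, n_actions)."""
    return torch.stack([c(state, actions) for c in critics])

def explore_action(critics, state, actions):
    # Optimistic view: trust the most favorable critic (exploration).
    return actions[q_matrix(critics, state, actions).max(dim=0).values.argmax()]

def exploit_action(critics, state, actions):
    # Pessimistic view: trust the least favorable critic (utilization).
    return actions[q_matrix(critics, state, actions).min(dim=0).values.argmax()]
```

The optimistic head gathers informative data, while the pessimistic head avoids the value overestimation that traps agents in suboptimal solutions.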
arXiv Detail & Related papers (2023-12-26T09:03:23Z)
- Maximize to Explore: One Objective Function Fusing Estimation, Planning, and Exploration [87.53543137162488]
We propose an easy-to-implement online reinforcement learning (online RL) framework called MEX.
MEX integrates estimation and planning components while balancing exploration and exploitation automatically.
It can outperform baselines by a stable margin in various MuJoCo environments with sparse rewards.
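Schematically, a single objective of this kind couples the value of a hypothesis with its fit to data (the summary does not reproduce the exact MEX objective; the trade-off weight $\eta$ is an assumption):

$$
\max_{f \in \mathcal{F}} \; \Big\{ V(f) \;-\; \eta \, \mathcal{L}_{\mathcal{D}}(f) \Big\},
$$

where $V(f)$ is the optimal value implied by hypothesis $f$ and $\mathcal{L}_{\mathcal{D}}$ is its estimation loss on collected data, so the maximization automatically favors optimistic yet data-consistent hypotheses.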
arXiv Detail & Related papers (2023-05-29T17:25:26Z)
- Scalable PAC-Bayesian Meta-Learning via the PAC-Optimal Hyper-Posterior: From Theory to Practice [54.03076395748459]
A central question in the meta-learning literature is how to regularize to ensure generalization to unseen tasks.
We present a generalization bound for meta-learning, which was first derived by Rothfuss et al.
We provide a theoretical analysis and an empirical case study of the conditions under which, and the extent to which, these guarantees for meta-learning improve upon PAC-Bayesian per-task learning bounds.
arXiv Detail & Related papers (2022-11-14T08:51:04Z)
- Proximal Point Imitation Learning [48.50107891696562]
We develop new algorithms with rigorous efficiency guarantees for infinite horizon imitation learning.
We leverage classical tools from optimization, in particular, the proximal-point method (PPM) and dual smoothing.
We achieve convincing empirical performance for both linear and neural network function approximation.
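For reference, the classical proximal-point iteration the method builds on is (standard form, not the paper's specific instantiation):

$$
x_{k+1} = \arg\min_{x} \Big\{ f(x) + \tfrac{1}{2\lambda_k} \|x - x_k\|^2 \Big\}, \qquad \lambda_k > 0,
$$

which regularizes each update toward the previous iterate and is combined here with dual smoothing.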
arXiv Detail & Related papers (2022-09-22T12:40:21Z)
- Sequential Information Design: Markov Persuasion Process and Its Efficient Reinforcement Learning [156.5667417159582]
This paper proposes a novel model of sequential information design, namely Markov persuasion processes (MPPs).
Planning in MPPs faces the unique challenge of finding a signaling policy that is simultaneously persuasive to the myopic receivers and induces optimal long-term cumulative utilities for the sender.
We design a provably efficient no-regret learning algorithm, the Optimism-Pessimism Principle for Persuasion Process (OP4), which features a novel combination of both optimism and pessimism principles.
arXiv Detail & Related papers (2022-02-22T05:41:43Z)
- Towards Hyperparameter-free Policy Selection for Offline Reinforcement Learning [10.457660611114457]
We show how to select between policies and value functions produced by different training algorithms in offline reinforcement learning.
We use BVFT [XJ21], a recent theoretical advance in value-function selection, and demonstrate its effectiveness in discrete-action benchmarks such as Atari.
arXiv Detail & Related papers (2021-10-26T20:12:11Z)