Relative Importance Sampling for off-Policy Actor-Critic in Deep Reinforcement Learning
- URL: http://arxiv.org/abs/1810.12558v9
- Date: Mon, 28 Apr 2025 05:15:26 GMT
- Title: Relative Importance Sampling for off-Policy Actor-Critic in Deep Reinforcement Learning
- Authors: Mahammad Humayoo, Gengzhong Zheng, Xiaoqing Dong, Liming Miao, Shuwei Qiu, Zexun Zhou, Peitao Wang, Zakir Ullah, Naveed Ur Rehman Junejo, Xueqi Cheng
- Abstract summary: Off-policy learning exhibits greater instability when compared to on-policy learning in reinforcement learning (RL). We propose a smooth form of importance sampling, specifically relative importance sampling (RIS), which mitigates variance and stabilizes learning. Our methods performed better than or equal to several state-of-the-art RL benchmarks on OpenAI Gym challenges and synthetic datasets.
- Score: 32.66049977978746
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Off-policy learning exhibits greater instability than on-policy learning in reinforcement learning (RL). A major cause of this instability is the difference in probability distribution between the target policy ($\pi$) and the behavior policy ($b$); this distributional mismatch is also a source of high variance. The mismatch between the two distributions can be corrected using importance sampling (IS), but IS itself has high variance, which is exacerbated in sequential scenarios. We propose a smooth form of importance sampling, relative importance sampling (RIS), which mitigates this variance and stabilizes learning; variance is controlled by adjusting the smoothness parameter $\beta \in [0, 1]$ of RIS. Using this strategy, we develop the first model-free relative importance sampling off-policy actor-critic (RIS-off-PAC) algorithms in RL. Our method uses one network to generate the target policy (actor) and another to evaluate the current policy ($\pi$) with a value function (critic) learned from behavior-policy samples. Our algorithms are trained using behavior-policy action values in the reward function, rather than target-policy ones, and both the actor and the critic are deep neural networks. Our methods performed better than or equal to several state-of-the-art RL benchmarks on OpenAI Gym challenges and synthetic datasets.
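The abstract names the smoothness parameter $\beta$ but does not state the RIS weight itself. As a minimal sketch, assuming the relative density-ratio form $w_\beta(a\mid s) = \pi(a\mid s) / (\beta\,\pi(a\mid s) + (1-\beta)\,b(a\mid s))$ (the placement of $\beta$ and the function names below are illustrative assumptions, not the paper's own equation), the weight interpolates between the ordinary IS ratio at $\beta = 0$ and a constant weight of 1 at $\beta = 1$, and is bounded above by $1/\beta$ for $\beta > 0$:

```python
import numpy as np

def relative_importance_weights(pi_probs, b_probs, beta=0.5):
    """Smoothed (relative) importance weights.

    A minimal sketch: with beta = 0 this reduces to the ordinary
    importance-sampling ratio pi/b, while beta = 1 forces every weight
    to 1. For beta > 0 the weight is bounded above by 1/beta, which is
    the variance-control mechanism the abstract alludes to. The exact
    placement of beta follows the relative density-ratio literature and
    is an assumption, not the paper's own equation.
    """
    pi_probs = np.asarray(pi_probs, dtype=float)
    b_probs = np.asarray(b_probs, dtype=float)
    return pi_probs / (beta * pi_probs + (1.0 - beta) * b_probs)

# Example: action probabilities under the target policy pi and the
# behavior policy b for a small batch of logged transitions.
pi_a = np.array([0.70, 0.10, 0.50])
b_a = np.array([0.20, 0.60, 0.50])

for beta in (0.0, 0.5, 1.0):
    print(beta, relative_importance_weights(pi_a, b_a, beta))
# beta = 0.0 gives the raw IS ratios (roughly [3.5, 0.17, 1.0]);
# beta = 0.5 shrinks the extreme ratios toward 1; beta = 1.0 gives all ones.
```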
Related papers
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from IS, enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance; a generic importance-weighted gradient estimate is sketched after this entry.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
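The entry above describes importance sampling as the tool that lets policy-gradient methods reuse previously collected samples. As a minimal sketch of that mechanism only (not the active behavior-policy selection proposed in the paper), the snippet below computes an importance-weighted REINFORCE-style gradient estimate for a linear-softmax policy on logged data; the names and toy data are illustrative assumptions.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def is_weighted_pg_estimate(theta, states, actions, returns, b_probs):
    """Importance-weighted policy-gradient estimate from logged data.

    theta:   (n_features, n_actions) parameters of a linear-softmax policy.
    states:  (n, n_features) feature vectors of the logged states.
    actions: (n,) actions taken by the behavior policy.
    returns: (n,) observed returns for those actions.
    b_probs: (n,) probability the behavior policy assigned to each action.
    """
    logits = states @ theta                      # (n, n_actions)
    pi = softmax(logits)                         # target-policy probabilities
    n, n_actions = pi.shape
    pi_a = pi[np.arange(n), actions]
    w = pi_a / b_probs                           # ordinary IS ratios

    # grad log pi(a|s) for a linear-softmax policy: outer(s, onehot(a) - pi(s))
    onehot = np.eye(n_actions)[actions]
    grad_log_pi = states[:, :, None] * (onehot - pi)[:, None, :]

    # Average of w * G * grad log pi: an importance-weighted off-policy PG estimate.
    return np.mean(w[:, None, None] * returns[:, None, None] * grad_log_pi, axis=0)

# Toy usage with random logged data.
rng = np.random.default_rng(0)
theta = rng.normal(size=(4, 3))
states = rng.normal(size=(8, 4))
actions = rng.integers(0, 3, size=8)
returns = rng.normal(size=8)
b_probs = np.full(8, 1.0 / 3.0)                  # e.g. a uniform behavior policy
print(is_weighted_pg_estimate(theta, states, actions, returns, b_probs).shape)  # (4, 3)
```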
- Projected Off-Policy Q-Learning (POP-QL) for Stabilizing Offline Reinforcement Learning [57.83919813698673]
Projected Off-Policy Q-Learning (POP-QL) is a novel actor-critic algorithm that simultaneously reweights off-policy samples and constrains the policy to prevent divergence and reduce value-approximation error.
In our experiments, POP-QL not only shows competitive performance on standard benchmarks, but also out-performs competing methods in tasks where the data-collection policy is significantly sub-optimal.
arXiv Detail & Related papers (2023-11-25T00:30:58Z)
- Off-Policy Evaluation for Large Action Spaces via Policy Convolution [60.6953713877886]
The Policy Convolution (PC) family of estimators uses latent structure within actions to strategically convolve the logging and target policies.
Experiments on synthetic and benchmark datasets demonstrate remarkable mean squared error (MSE) improvements when using PC.
arXiv Detail & Related papers (2023-10-24T01:00:01Z)
- Offline RL with No OOD Actions: In-Sample Learning via Implicit Value Regularization [90.9780151608281]
In-sample learning (e.g., IQL) improves the policy by quantile regression using only data samples.
We make a key finding that the in-sample learning paradigm arises under the Implicit Value Regularization (IVR) framework.
We propose two practical algorithms, Sparse $Q$-learning (SQL) and Exponential $Q$-learning (EQL), which adopt the same value regularization used in existing works; an illustrative in-sample regression loss is sketched after this entry.
arXiv Detail & Related papers (2023-03-28T08:30:01Z)
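The entry above credits in-sample learning with improving the value function through asymmetric regression on dataset samples alone, so no out-of-distribution action is ever evaluated. The snippet below is a minimal sketch of such an asymmetric regression loss in both its expectile (IQL-style) and quantile forms; the parameter tau and the function names are illustrative assumptions, not the SQL/EQL objectives of the paper.

```python
import numpy as np

def expectile_loss(predicted_v, target_q, tau=0.7):
    """Asymmetric (expectile) regression loss for in-sample value learning:
    residuals where the target exceeds the prediction are weighted by tau,
    the rest by (1 - tau). As tau -> 1 the fit approaches a maximum over
    in-sample actions without evaluating out-of-distribution actions.
    tau and the names here are illustrative assumptions."""
    diff = target_q - predicted_v
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return np.mean(weight * diff ** 2)

def quantile_loss(predicted_v, target_q, tau=0.7):
    """Quantile-regression analogue: the same asymmetric weighting applied
    to the absolute error instead of the squared error."""
    diff = target_q - predicted_v
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return np.mean(weight * np.abs(diff))

# Toy usage: Q-targets for actions that actually appear in the dataset.
target_q = np.array([1.0, 2.5, 0.3, 1.8])
predicted_v = np.array([1.2, 1.9, 0.5, 1.5])
print(expectile_loss(predicted_v, target_q), quantile_loss(predicted_v, target_q))
```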
- Sample Dropout: A Simple yet Effective Variance Reduction Technique in Deep Policy Optimization [18.627233013208834]
We show that the use of importance sampling could introduce high variance in the objective estimate.
We propose a technique called sample dropout to bound the estimation variance by dropping out samples when their ratio deviation is too high; a minimal version of this masking rule is sketched after this entry.
arXiv Detail & Related papers (2023-02-05T04:44:35Z)
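The sample-dropout entry above bounds the estimation variance by discarding samples whose importance ratio deviates too far from 1. The snippet below is a minimal sketch of that idea as a boolean mask over a batch of ratios; the symmetric deviation threshold and the names are illustrative assumptions rather than the paper's exact dropout rule.

```python
import numpy as np

def sample_dropout_mask(is_ratios, max_deviation=0.25):
    """Keep only samples whose importance ratio pi/b stays within
    [1 - max_deviation, 1 + max_deviation]; everything else is dropped
    from the surrogate-objective estimate. The symmetric interval is an
    illustrative choice, not necessarily the rule used in the paper."""
    is_ratios = np.asarray(is_ratios, dtype=float)
    return np.abs(is_ratios - 1.0) <= max_deviation

# Toy usage: advantages weighted only by the retained samples.
ratios = np.array([0.95, 1.10, 3.20, 0.40, 1.02])
advantages = np.array([0.5, -0.2, 2.0, 1.0, 0.1])
mask = sample_dropout_mask(ratios)
kept = mask.sum()
objective = np.sum(mask * ratios * advantages) / max(kept, 1)
print(mask, objective)  # the two extreme ratios (3.20 and 0.40) are dropped
```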
- Normality-Guided Distributional Reinforcement Learning for Continuous Control [16.324313304691426]
Learning a predictive model of the mean return, or value function, plays a critical role in many reinforcement learning algorithms.
We study the value distribution in several continuous control tasks and find that the learned value distribution is empirically quite close to normal.
We propose a policy update strategy based on correctness as measured by structural characteristics of the value distribution that are not present in the standard value function.
arXiv Detail & Related papers (2022-08-28T02:52:10Z)
- False Correlation Reduction for Offline Reinforcement Learning [115.11954432080749]
We propose falSe COrrelation REduction (SCORE) for offline RL, a practically effective and theoretically provable algorithm.
We empirically show that SCORE achieves SoTA performance with 3.1x acceleration on various tasks in a standard benchmark (D4RL).
arXiv Detail & Related papers (2021-10-24T15:34:03Z)
- BRAC+: Improved Behavior Regularized Actor Critic for Offline Reinforcement Learning [14.432131909590824]
Offline Reinforcement Learning aims to train effective policies using previously collected datasets.
Standard off-policy RL algorithms are prone to overestimations of the values of out-of-distribution (less explored) actions.
We improve behavior-regularized offline reinforcement learning and propose BRAC+; a generic behavior-regularization objective is sketched after this entry.
arXiv Detail & Related papers (2021-10-02T23:55:49Z)
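The BRAC+ entry above relies on behavior regularization: the learned policy is penalized for drifting away from the data-collection policy so that poorly estimated out-of-distribution actions are not exploited. The snippet below is a generic sketch of a KL-penalized actor objective for a single discrete-action state, assuming a hypothetical penalty weight alpha; it illustrates plain behavior regularization, not the specific improvements introduced by BRAC+.

```python
import numpy as np

def behavior_regularized_actor_loss(q_values, pi_probs, b_probs, alpha=0.1):
    """Generic behavior-regularized actor objective (to be minimized):
    maximize expected Q under the learned policy pi while paying a
    KL(pi || b) penalty that keeps pi close to the behavior policy b.
    alpha trades off return-seeking against staying in-distribution
    and is an illustrative hyperparameter, not the paper's."""
    q_values = np.asarray(q_values, dtype=float)
    pi_probs = np.asarray(pi_probs, dtype=float)
    b_probs = np.asarray(b_probs, dtype=float)
    expected_q = np.sum(pi_probs * q_values)
    kl_pi_b = np.sum(pi_probs * np.log(pi_probs / b_probs))
    return -(expected_q - alpha * kl_pi_b)

# Toy usage on a single state with three actions.
q = np.array([1.0, 0.2, -0.5])      # critic estimates
pi = np.array([0.80, 0.15, 0.05])   # current policy
b = np.array([0.40, 0.40, 0.20])    # (estimated) behavior policy
print(behavior_regularized_actor_loss(q, pi, b, alpha=0.1))
```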
- Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z)
- Offline Contextual Bandits with Overparameterized Models [52.788628474552276]
We ask whether the benign generalization of overparameterized models observed in supervised learning also occurs for offline contextual bandits.
We show that the discrepancy between value-based and policy-based approaches is due to the action-stability of their objectives.
In experiments with large neural networks, this gap between action-stable value-based objectives and unstable policy-based objectives leads to significant performance differences.
arXiv Detail & Related papers (2020-06-27T13:52:07Z)