ACL-QL: Adaptive Conservative Level in Q-Learning for Offline Reinforcement Learning
- URL: http://arxiv.org/abs/2412.16848v1
- Date: Sun, 22 Dec 2024 04:18:02 GMT
- Title: ACL-QL: Adaptive Conservative Level in Q-Learning for Offline Reinforcement Learning
- Authors: Kun Wu, Yinuo Zhao, Zhiyuan Xu, Zhengping Che, Chengxiang Yin, Chi Harold Liu, Qinru Qiu, Feifei Feng, Jian Tang
- Abstract summary: We propose a framework, Adaptive Conservative Level in Q-Learning (ACL-QL), which limits the Q-values to a mild range.
ACL-QL enables adaptive control of the conservative level for each state-action pair, i.e., lifting the Q-values more for good transitions and less for bad transitions.
Motivated by the theoretical analysis, we propose a novel algorithm, ACL-QL, which uses two learnable adaptive weight functions to control the conservative level of each transition.
- Score: 46.904033006944786
- Abstract: Offline Reinforcement Learning (RL), which operates solely on static datasets without further interaction with the environment, provides an appealing alternative for learning a safe and effective control policy. Prevailing methods typically learn a conservative policy to mitigate Q-value overestimation, but they are prone to overdoing it, yielding an overly conservative policy. Moreover, they optimize all samples equally under fixed constraints, lacking fine-grained control over the conservative level of individual transitions, which degrades performance. To address these two challenges in a unified way, we propose a framework, Adaptive Conservative Level in Q-Learning (ACL-QL), which limits the Q-values to a mild range and enables adaptive control of the conservative level for each state-action pair, i.e., lifting the Q-values more for good transitions and less for bad ones. We theoretically analyze the conditions under which the conservative level of the learned Q-function stays within a mild range and how each transition can be optimized adaptively. Motivated by this analysis, we propose a novel algorithm, ACL-QL, which uses two learnable adaptive weight functions to control the conservative level of each transition. We then design a monotonicity loss and surrogate losses to train the adaptive weight functions, Q-function, and policy network alternately. We evaluate ACL-QL on the commonly used D4RL benchmark and conduct extensive ablation studies that demonstrate its effectiveness and state-of-the-art performance relative to existing offline DRL baselines.
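The core mechanism is easier to see as code. Below is a minimal PyTorch sketch of the idea the abstract describes: a conservative (CQL-style) regularizer in which two learnable weight functions decide, per state-action pair, how strongly dataset actions are lifted and how strongly policy-sampled actions are pushed down. All names, network sizes, and the exact composition of the loss are illustrative assumptions, not the paper's precise formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative dimensions; the paper's architectures are not specified here.
STATE_DIM, ACTION_DIM = 17, 6

def mlp(in_dim: int, out_dim: int) -> nn.Module:
    return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                         nn.Linear(256, out_dim))

q_net = mlp(STATE_DIM + ACTION_DIM, 1)    # Q(s, a)
w_lift = mlp(STATE_DIM + ACTION_DIM, 1)   # adaptive weight for dataset actions
w_push = mlp(STATE_DIM + ACTION_DIM, 1)   # adaptive weight for sampled actions

def adaptive_conservative_loss(s, a_data, a_policy):
    """CQL-style regularizer with per-transition adaptive weights.

    Instead of one fixed penalty coefficient for every sample, each
    state-action pair gets its own weight: Q is lifted more for good
    transitions and pushed down more for (likely OOD) sampled actions.
    """
    q_data = q_net(torch.cat([s, a_data], dim=-1))
    q_ood = q_net(torch.cat([s, a_policy], dim=-1))
    # Softplus keeps both weights positive.
    lift = F.softplus(w_lift(torch.cat([s, a_data], dim=-1)))
    push = F.softplus(w_push(torch.cat([s, a_policy], dim=-1)))
    # Weights are detached here because the paper trains the weight
    # functions with their own losses, alternating with the Q and
    # policy updates.
    return (push.detach() * q_ood - lift.detach() * q_data).mean()
```

The monotonicity loss mentioned above would additionally constrain the lifting weights so that better transitions (e.g., higher-advantage ones) receive larger weights; that loss, the surrogate losses, and the alternating update of the weight functions, Q-function, and policy are omitted from this sketch.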
Related papers
- Projection Implicit Q-Learning with Support Constraint for Offline Reinforcement Learning [1.8789068567093286]
The Implicit Q-Learning (IQL) algorithm employs expectile regression to achieve in-sample learning (the expectile loss is sketched after this entry).
We propose Proj-IQL, a projective IQL algorithm enhanced with the support constraint.
Proj-IQL achieves state-of-the-art performance on D4RL benchmarks.
arXiv Detail & Related papers (2025-01-15T16:17:02Z)
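For context on the IQL backbone: expectile regression replaces the max in the Bellman backup with an asymmetric regression, so the value function is fit using only actions present in the dataset. Below is a minimal sketch of the standard expectile loss; variable names are illustrative, and Proj-IQL's projection and support constraint are not shown.

```python
import torch

def expectile_loss(diff: torch.Tensor, tau: float = 0.7) -> torch.Tensor:
    """Asymmetric L2 loss: |tau - 1{diff < 0}| * diff**2.

    With diff = Q(s, a) - V(s) computed on dataset actions only,
    tau > 0.5 penalizes V underestimating Q more than overestimating it,
    so V(s) is pushed toward an upper expectile of Q without ever
    querying out-of-sample actions.
    """
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()

# e.g., v_loss = expectile_loss(q_target - v_pred, tau=0.7)
```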
- Constrained Reinforcement Learning with Smoothed Log Barrier Function [27.216122901635018]
We propose a new constrained RL method called CSAC-LB (Constrained Soft Actor-Critic with Log Barrier Function).
It achieves competitive performance without any pre-training by applying a linear smoothed log barrier function to an additional safety critic (the barrier is sketched after this entry).
We show that with CSAC-LB, we achieve state-of-the-art performance on several constrained control tasks with different levels of difficulty.
arXiv Detail & Related papers (2024-03-21T16:02:52Z)
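The barrier itself is compact enough to sketch. Below is a minimal PyTorch version of a linear smoothed log barrier in its commonly used form (the log barrier continued by its tangent line past a threshold); the constant mu and the exact parameterization are assumptions, not necessarily CSAC-LB's exact choice.

```python
import math
import torch

def smoothed_log_barrier(x: torch.Tensor, mu: float = 0.3) -> torch.Tensor:
    """Linear smoothed log barrier.

    x encodes constraint violation (x <= 0 means satisfied). The plain
    log barrier -mu*log(-x) diverges as x -> 0- and is undefined for
    x > 0; past x = -mu**2 we continue it with its tangent line, so the
    penalty stays finite and differentiable even when the constraint
    is violated.
    """
    # Clamp keeps the unselected branch of torch.where free of NaNs,
    # which would otherwise leak into the gradients.
    barrier = -mu * torch.log(torch.clamp(-x, min=mu**2))
    tangent = x / mu + mu - mu * math.log(mu**2)
    return torch.where(x <= -mu**2, barrier, tangent)
```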
- Resilient Constrained Reinforcement Learning [87.4374430686956]
We study a class of constrained reinforcement learning (RL) problems in which the constraint specifications are not identified before training.
Identifying appropriate constraint specifications is challenging because the trade-off between the reward objective and constraint satisfaction is not defined in advance.
We propose a new constrained RL approach that searches for policy and constraint specifications together.
arXiv Detail & Related papers (2023-12-28T18:28:23Z)
- Projected Off-Policy Q-Learning (POP-QL) for Stabilizing Offline Reinforcement Learning [57.83919813698673]
Projected Off-Policy Q-Learning (POP-QL) is a novel actor-critic algorithm that simultaneously reweights off-policy samples and constrains the policy to prevent divergence and reduce value-approximation error.
In our experiments, POP-QL not only shows competitive performance on standard benchmarks, but also outperforms competing methods in tasks where the data-collection policy is significantly sub-optimal.
arXiv Detail & Related papers (2023-11-25T00:30:58Z)
- Action-Quantized Offline Reinforcement Learning for Robotic Skill Learning [68.16998247593209]
The offline reinforcement learning (RL) paradigm provides a recipe for converting static behavior datasets into policies that can outperform the policy that collected the data.
In this paper, we propose an adaptive scheme for action quantization.
We show that several state-of-the-art offline RL methods such as IQL, CQL, and BRAC improve in performance on benchmarks when combined with our proposed discretization scheme.
arXiv Detail & Related papers (2023-10-18T06:07:10Z)
- Offline Minimax Soft-Q-learning Under Realizability and Partial Coverage [100.8180383245813]
We propose value-based algorithms for offline reinforcement learning (RL).
We show an analogous result for vanilla Q-functions under a soft margin condition.
Our algorithms' loss functions arise from casting the estimation problems as nonlinear convex optimization problems and Lagrangifying.
arXiv Detail & Related papers (2023-02-05T14:22:41Z)
- Contextual Conservative Q-Learning for Offline Reinforcement Learning [15.819356579361843]
We propose Contextual Conservative Q-Learning (C-CQL), which learns a robustly reliable policy through contextual information captured via an inverse dynamics model (sketched after this entry).
C-CQL achieves state-of-the-art performance in most environments of the offline MuJoCo suite and in a noisy MuJoCo setting.
arXiv Detail & Related papers (2023-01-03T13:33:54Z)
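The inverse dynamics model at the center of this idea is straightforward to sketch: a network that predicts which action produced an observed transition (s, s_next). The architecture and dimensions below are generic stand-ins, not the paper's.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 17, 6  # illustrative sizes

class InverseDynamics(nn.Module):
    """Predicts the action that caused the transition s -> s_next."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * STATE_DIM, 256), nn.ReLU(),
            nn.Linear(256, ACTION_DIM), nn.Tanh())

    def forward(self, s, s_next):
        return self.net(torch.cat([s, s_next], dim=-1))

# Trained by regression on dataset transitions, e.g.:
# loss = torch.nn.functional.mse_loss(inv_dyn(s, s_next), a)
```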
- Conservative Q-Learning for Offline Reinforcement Learning [106.05582605650932]
We show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return.
We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees (the core regularizer is sketched after this entry).
arXiv Detail & Related papers (2020-06-08T17:53:42Z)
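The lower bound comes from CQL's signature regularizer, which is compact enough to show. Below is a minimal discrete-action sketch of the well-known CQL(H) penalty that is added to the standard Bellman error; continuous-action CQL approximates the logsumexp with sampled actions.

```python
import torch

def cql_regularizer(q_values: torch.Tensor,
                    data_actions: torch.Tensor) -> torch.Tensor:
    """CQL(H) penalty for discrete actions.

    q_values: [batch, n_actions] -- Q(s, .) for each state in the batch.
    data_actions: [batch] -- indices of the actions taken in the dataset.

    logsumexp over actions softly maximizes Q, so minimizing this term
    pushes Q down on all (especially out-of-distribution) actions while
    pushing it up on actions actually present in the data -- which is
    what yields the lower bound on the policy's value.
    """
    push_down = torch.logsumexp(q_values, dim=-1)
    push_up = q_values.gather(-1, data_actions.unsqueeze(-1)).squeeze(-1)
    return (push_down - push_up).mean()

# Full critic loss (alpha and the Bellman error come from elsewhere):
# loss = alpha * cql_regularizer(q, a) + bellman_error
```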
- Deep Constrained Q-learning [15.582910645906145]
In many real-world applications, reinforcement learning agents have to optimize multiple objectives while following certain rules or satisfying a set of constraints.
We propose Constrained Q-learning, a novel off-policy reinforcement learning framework that restricts the action space directly in the Q-update to learn the optimal Q-function for the induced constrained MDP and the corresponding safe policy (a tabular sketch of the constrained update follows this entry).
arXiv Detail & Related papers (2020-03-20T17:26:03Z)
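The key move, restricting the maximization in the Q-target to constraint-satisfying actions, is easiest to see in tabular form. The sketch below is an illustrative simplification of the paper's deep variant; the safe_mask oracle and all names are assumptions.

```python
import numpy as np

def constrained_q_update(Q, s, a, r, s_next, safe_mask,
                         alpha=0.1, gamma=0.99):
    """One tabular Q-update restricted to the safe action set.

    Q: [n_states, n_actions] table.
    safe_mask: [n_states, n_actions] boolean table marking actions
    allowed by the constraints in each state (assumed to contain at
    least one safe action per state).

    The max in the target ranges only over safe actions, so the learned
    Q-function is that of the induced constrained MDP.
    """
    q_next = np.where(safe_mask[s_next], Q[s_next], -np.inf)
    target = r + gamma * q_next.max()
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```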