BiCQL-ML: A Bi-Level Conservative Q-Learning Framework for Maximum Likelihood Inverse Reinforcement Learning
- URL: http://arxiv.org/abs/2511.22210v1
- Date: Thu, 27 Nov 2025 08:27:10 GMT
- Title: BiCQL-ML: A Bi-Level Conservative Q-Learning Framework for Maximum Likelihood Inverse Reinforcement Learning
- Authors: Junsung Park
- Abstract summary: BiCQL-ML is a policy-free offline inverse reinforcement learning algorithm. We show BiCQL-ML improves both reward recovery and downstream policy performance.
- Score: 10.016258281947122
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Offline inverse reinforcement learning (IRL) aims to recover a reward function that explains expert behavior using only fixed demonstration data, without any additional online interaction. We propose BiCQL-ML, a policy-free offline IRL algorithm that jointly optimizes a reward function and a conservative Q-function in a bi-level framework, thereby avoiding explicit policy learning. The method alternates between (i) learning a conservative Q-function via Conservative Q-Learning (CQL) under the current reward, and (ii) updating the reward parameters to maximize the expected Q-values of expert actions while suppressing over-generalization to out-of-distribution actions. This procedure can be viewed as maximum likelihood estimation under a soft value matching principle. We provide theoretical guarantees that BiCQL-ML converges to a reward function under which the expert policy is soft-optimal. Empirically, we show on standard offline RL benchmarks that BiCQL-ML improves both reward recovery and downstream policy performance compared to existing offline IRL baselines.
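A minimal sketch of the alternation described in the abstract, assuming discrete actions and hypothetical networks q_net(s) and reward_net(s) that return per-action values. The inner CQL(H)-style update and the reward surrogate in step (ii) are simplifications inferred from the abstract, not the paper's exact losses.

```python
import torch
import torch.nn.functional as F

def bicql_ml_iteration(q_net, reward_net, q_opt, r_opt, batch, expert_batch,
                       gamma=0.99, alpha=1.0, n_inner=10):
    """One outer iteration of the bi-level scheme sketched in the abstract
    (discrete actions; all names and exact loss forms are assumptions)."""
    s, a, s_next = batch        # offline transitions; a: LongTensor of shape (B, 1)
    s_e, a_e = expert_batch     # expert state-action pairs; a_e: (B_e, 1)

    # (i) Inner step: conservative Q-learning under the current learned reward.
    for _ in range(n_inner):
        with torch.no_grad():
            r = reward_net(s).gather(1, a)                        # r(s, a)
            soft_v = torch.logsumexp(q_net(s_next), 1, keepdim=True)
            target = r + gamma * soft_v                           # soft Bellman target
        q_all = q_net(s)
        q_data = q_all.gather(1, a)
        # CQL(H)-style regularizer: push down logsumexp_a Q, push up in-data Q,
        # suppressing over-generalization to out-of-distribution actions.
        cql = (torch.logsumexp(q_all, 1, keepdim=True) - q_data).mean()
        loss_q = F.mse_loss(q_data, target) + alpha * cql
        q_opt.zero_grad(); loss_q.backward(); q_opt.step()

    # (ii) Outer step: update the reward so expert actions look soft-optimal.
    # Surrogate (an assumption): make Q differentiable in the reward via
    # Q_tilde(s, .) = r(s, .) + stop_grad(Q(s, .) - r_old(s, .)), then do MLE
    # on expert actions under the soft policy pi(a|s) proportional to exp Q_tilde.
    with torch.no_grad():
        q_e, r_old = q_net(s_e), reward_net(s_e)
    q_tilde = reward_net(s_e) + (q_e - r_old)
    log_pi = q_tilde - torch.logsumexp(q_tilde, 1, keepdim=True)
    loss_r = -log_pi.gather(1, a_e).mean()       # MLE on expert actions
    r_opt.zero_grad(); loss_r.backward(); r_opt.step()
```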
Related papers
- In-Context Compositional Q-Learning for Offline Reinforcement Learning [21.45273716299317]
We propose In-context Compositional Q-Learning (ICQL), the first offline RL framework that formulates Q-learning as a contextual inference problem. We show that, under two assumptions (linear approximability of the local Q-function and accurate weight inference), ICQL achieves bounded Q-function approximation error. Empirically, ICQL substantially improves performance in offline settings.
arXiv Detail & Related papers (2025-09-28T20:55:21Z)
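The summary above is terse; one plausible reading of "Q-learning as contextual inference" under the stated linear-approximability assumption is to express the task's Q-function as a weighted combination of base Q-functions, with the weights inferred from in-context transitions. The sketch below is an illustrative guess at that idea, not the paper's algorithm; all names are hypothetical.

```python
import torch

def compose_q(base_qs, context_s, context_a, context_y, query_s, query_a, ridge=1e-3):
    """Illustrative guess at compositional Q-learning: approximate the task's
    Q-function as a linear combination of pretrained base Q-functions, with
    the mixing weights inferred from in-context data by ridge regression."""
    # Features: each base Q evaluated on the context, shape (N, K).
    Phi = torch.stack([q(context_s, context_a) for q in base_qs], dim=1)
    K = Phi.shape[1]
    # Closed-form ridge solution for the mixing weights w, shape (K,).
    w = torch.linalg.solve(Phi.T @ Phi + ridge * torch.eye(K), Phi.T @ context_y)
    # Composed Q-value at the query state-action pairs.
    q_query = torch.stack([q(query_s, query_a) for q in base_qs], dim=1)
    return q_query @ w
```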
- Scalable In-Context Q-Learning [68.9917436397079]
We propose Scalable In-Context Q-Learning (SICQL) to steer in-context reinforcement learning (ICRL). SICQL harnesses dynamic programming and world modeling to steer ICRL toward efficient reward and task generalization.
arXiv Detail & Related papers (2025-06-02T04:21:56Z)
- Projection Implicit Q-Learning with Support Constraint for Offline Reinforcement Learning [1.8789068567093286]
The Implicit Q-Learning (IQL) algorithm employs expectile regression to achieve in-sample learning. We propose Proj-IQL, a projective IQL algorithm enhanced with a support constraint. Proj-IQL achieves state-of-the-art performance on D4RL benchmarks.
arXiv Detail & Related papers (2025-01-15T16:17:02Z)
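For reference, this is the expectile-regression loss that IQL-style in-sample learning builds on (generic IQL machinery; Proj-IQL's projection and support constraint are not shown here):

```python
import torch

def expectile_loss(diff, tau=0.7):
    """Asymmetric squared loss L2_tau(u) = |tau - 1{u < 0}| * u^2.
    With tau > 0.5, underestimation (diff > 0) is penalized more, so a value
    network fit with diff = Q(s, a) - V(s) over dataset actions tracks an
    upper expectile of Q -- in-sample, with no out-of-distribution queries."""
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()
```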
- ACL-QL: Adaptive Conservative Level in Q-Learning for Offline Reinforcement Learning [46.67828766038463]
We propose Adaptive Conservative Level in Q-Learning (ACL-QL), a framework that limits the Q-values to a mild range. ACL-QL enables adaptive control of the conservative level over each state-action pair, i.e., lifting the Q-values more for good transitions and less for bad transitions. Motivated by our theoretical analysis, the algorithm uses two learnable adaptive weight functions to control the conservative level over each transition.
arXiv Detail & Related papers (2024-12-22T04:18:02Z)
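A sketch of the adaptive-conservatism idea the summary describes: replace CQL's single coefficient with two learnable weight functions that scale, per transition, how hard Q is pushed down on out-of-distribution actions and lifted on dataset actions. Names and the exact functional form are assumptions based on the abstract, not the paper's losses.

```python
import torch

def acl_ql_regularizer(q_net, w_down, w_up, s, a_data, a_ood):
    """Adaptive CQL-style penalty (a guess at ACL-QL's structure): learnable
    weights w_down / w_up set the conservative level per state-action pair,
    lifting Q more for good transitions and less for bad ones."""
    q_data = q_net(s, a_data)            # Q on dataset actions
    q_ood = q_net(s, a_ood)              # Q on sampled (possibly OOD) actions
    return (w_down(s, a_ood) * q_ood).mean() - (w_up(s, a_data) * q_data).mean()
```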
- IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies [72.4573167739712]
Implicit Q-learning (IQL) trains a Q-function using only dataset actions through a modified Bellman backup. It is unclear which policy actually attains the values represented by this trained Q-function. We introduce Implicit Diffusion Q-learning (IDQL), combining our general IQL critic with a diffusion-based policy extraction method.
arXiv Detail & Related papers (2023-04-20T18:04:09Z)
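A sketch of critic-guided extraction from a diffusion behavior policy in the spirit of IDQL: sample several candidate actions, score them with the implicit (IQL) critic, and select greedily or resample by advantage. diffusion_policy.sample and the network names are placeholders, not the paper's API.

```python
import torch

def extract_action(diffusion_policy, q_net, v_net, s, n_candidates=32, greedy=True):
    """Score diffusion-sampled candidates with the IQL critic and select one.
    s: state tensor of shape (1, obs_dim). All names here are illustrative."""
    s_rep = s.expand(n_candidates, -1)            # repeat the state per candidate
    a_cand = diffusion_policy.sample(s_rep)       # (n_candidates, act_dim)
    adv = (q_net(s_rep, a_cand) - v_net(s_rep)).squeeze(-1)   # implicit advantage
    if greedy:
        return a_cand[adv.argmax()]               # best candidate under the critic
    probs = torch.softmax(adv, dim=0)             # advantage-weighted resampling
    return a_cand[torch.multinomial(probs, 1).item()]
```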
- Offline Minimax Soft-Q-learning Under Realizability and Partial Coverage [100.8180383245813]
We propose value-based algorithms for offline reinforcement learning (RL). We show an analogous result for vanilla Q-functions under a soft margin condition. Our algorithms' loss functions arise from casting the estimation problems as nonlinear convex optimization problems and Lagrangifying.
arXiv Detail & Related papers (2023-02-05T14:22:41Z)
- Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
arXiv Detail & Related papers (2021-10-12T17:05:05Z)
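The two in-sample objectives behind this (a standard sketch of IQL; network and optimizer wiring omitted): V regresses toward an upper expectile of Q over dataset actions, and Q regresses on a TD target built from V(s'), so no action outside the dataset is ever evaluated.

```python
import torch
import torch.nn.functional as F

def iql_losses(q_net, v_net, q_target, batch, tau=0.7, gamma=0.99):
    """Standard IQL losses (sketch): expectile value fit + in-sample TD."""
    s, a, r, s_next, done = batch
    # Value loss: expectile regression of V(s) toward Q(s, a) on dataset actions.
    with torch.no_grad():
        q = q_target(s, a)
    u = q - v_net(s)
    w = torch.abs(tau - (u < 0).float())          # |tau - 1{u < 0}|
    loss_v = (w * u.pow(2)).mean()
    # Q loss: the TD target uses V(s'), so no unseen action is ever queried.
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * v_net(s_next)
    loss_q = F.mse_loss(q_net(s, a), target)
    return loss_v, loss_q
```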
- Conservative Q-Learning for Offline Reinforcement Learning [106.05582605650932]
We show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return.
We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees.
arXiv Detail & Related papers (2020-06-08T17:53:42Z)
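The core CQL objective behind these results, in a simplified continuous-action form (the paper's CQL(H) variant instead uses a log-sum-exp over actions; how a_sampled is drawn is left to the caller):

```python
import torch
import torch.nn.functional as F

def cql_loss(q_net, s, a_data, td_target, a_sampled, alpha=1.0):
    """Simplified CQL: TD error plus a penalty that pushes Q down on sampled
    (potentially out-of-distribution) actions and up on dataset actions,
    which is what yields the lower bound on the learned policy's value."""
    q_data = q_net(s, a_data)
    q_samp = q_net(s, a_sampled)                  # e.g. uniform/policy proposals
    conservative = q_samp.mean() - q_data.mean()  # down on OOD, up on data
    return F.mse_loss(q_data, td_target) + alpha * conservative
```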