From Robotics to Sepsis Treatment: Offline RL via Geometric Pessimism
- URL: http://arxiv.org/abs/2602.08655v2
- Date: Mon, 16 Feb 2026 09:49:12 GMT
- Title: From Robotics to Sepsis Treatment: Offline RL via Geometric Pessimism
- Authors: Sarthak Wanjari
- Abstract summary: Methods like CQL offer rigorous conservatism but require tremendous compute power. Efficient expectile-based methods like IQL often fail to correct OOD errors on pathological datasets, collapsing to Behavioural Cloning. We propose Geometric Pessimism, a compute-efficient framework that augments standard IQL with a density-based penalty.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Offline Reinforcement Learning (RL) promises the recovery of optimal policies from static datasets, yet it remains susceptible to the overestimation of out-of-distribution (OOD) actions, particularly on fractured and sparse data manifolds. Current solutions necessitate a trade-off between computational efficiency and performance: methods like CQL offer rigorous conservatism but require tremendous compute, while efficient expectile-based methods like IQL often fail to correct OOD errors on pathological datasets, collapsing to Behavioural Cloning. In this work, we propose Geometric Pessimism, a modular, compute-efficient framework that augments standard IQL with a density-based penalty derived from k-nearest-neighbour distances in the state-action embedding space. By pre-computing the penalty for each state-action pair, our method injects OOD conservatism via reward shaping at O(1) overhead to the training loop. Evaluated on the D4RL MuJoCo benchmark, our method, Geo-IQL, outperforms standard IQL on sensitive and unstable medium-replay tasks by over 18 points while reducing inter-seed standard deviation four-fold. Furthermore, Geo-IQL does not degrade performance on stable manifolds. Crucially, we validate our algorithm on the MIMIC-III Sepsis critical-care dataset: where standard IQL collapses to behaviour cloning, Geo-IQL demonstrates active policy improvement, achieving 86.4% terminal agreement with clinicians (versus IQL's 75%) while maintaining safety constraints. Our results suggest that geometric pessimism provides the regularisation needed to safely overcome local optima in critical, real-world decision systems.
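Since the abstract is explicit about the mechanism (pre-computed k-nearest-neighbour distances in the state-action space, injected as a reward-shaping penalty), a minimal sketch is possible. The concatenation embedding, the value of k, and the scaling constant alpha below are illustrative assumptions, not the paper's exact choices:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def precompute_geometric_penalty(states, actions, k=10, alpha=1.0):
    """Pre-compute a density-based pessimism penalty for each
    (state, action) pair from k-NN distances in the joint embedding
    space. Done once, before training, so the training loop itself
    only pays an O(1) lookup per transition."""
    sa = np.concatenate([states, actions], axis=1)  # naive embedding: concat
    # k+1 neighbours because each point's nearest neighbour in its own
    # dataset is itself (distance zero)
    knn = NearestNeighbors(n_neighbors=k + 1).fit(sa)
    dists, _ = knn.kneighbors(sa)
    density_dist = dists[:, 1:].mean(axis=1)  # mean distance to k neighbours
    # normalise so the penalty scale is roughly dataset-independent
    return alpha * density_dist / (density_dist.std() + 1e-8)

def shape_rewards(rewards, penalty):
    """Inject OOD conservatism by reward shaping: isolated, low-density
    state-action pairs receive a larger penalty."""
    return rewards - penalty
```

Because the penalty is computed once, offline, the IQL training loop only reads pre-shaped rewards, which is the source of the O(1)-overhead claim.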
Related papers
- CS-GBA: A Critical Sample-based Gradient-guided Backdoor Attack for Offline Reinforcement Learning [7.5200963577855875]
Offline Reinforcement Learning (RL) enables policy optimization from static datasets but is inherently vulnerable to backdoor attacks. We propose CS-GBA (Critical Sample-based Gradient-guided Backdoor Attack), a novel framework designed to achieve high stealthiness and destructiveness under a strict budget.
arXiv Detail & Related papers (2026-01-15T13:57:52Z)
- Benchmarking Offline Multi-Objective Reinforcement Learning in Critical Care [0.07161783472741748]
In critical care settings, clinicians face the challenge of balancing conflicting objectives, primarily maximizing patient survival while minimizing resource utilization. Single-objective Reinforcement Learning approaches typically address this by optimizing a fixed scalarized reward function, resulting in rigid policies that fail to adapt to varying clinical priorities. This paper benchmarks three offline Multi-Objective RL (MORL) algorithms against three single-objective baselines on the MIMIC-IV dataset.
arXiv Detail & Related papers (2025-12-08T20:09:15Z)
- Adaptive Neighborhood-Constrained Q Learning for Offline Reinforcement Learning [52.03884701766989]
Offline reinforcement learning (RL) algorithms typically impose constraints on action selection. We propose a new neighborhood constraint that restricts action selection in the Bellman target to the union of neighborhoods of dataset actions. We develop a simple yet effective algorithm, Adaptive Neighborhood-constrained Q Learning (ANQ), to perform Q-learning with target actions satisfying this constraint.
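A toy rendition of this constraint, assuming a continuous action space and a hypothetical q_net(s, a) interface; ANQ's actual neighborhood construction and target computation will differ:

```python
import torch

def anq_style_target(q_net, next_states, candidate_actions, dataset_actions, eps=0.1):
    """Bellman target restricted to candidate actions lying within an
    eps-ball of at least one dataset action -- a simplified stand-in for
    ANQ's union-of-neighborhoods constraint. Assumes q_net(s, a) maps
    flat (batch, dim) tensors to per-sample Q-values."""
    B, N, A = candidate_actions.shape
    flat_a = candidate_actions.reshape(B * N, A)
    # distance from every candidate to its nearest dataset action
    nearest = torch.cdist(flat_a, dataset_actions).min(dim=1).values
    in_support = (nearest < eps).view(B, N)
    flat_s = next_states.unsqueeze(1).expand(-1, N, -1).reshape(B * N, -1)
    q = q_net(flat_s, flat_a).view(B, N)
    # exclude out-of-support candidates; include each behaviour action
    # among the candidates so at least one entry is always finite
    q = q.masked_fill(~in_support, float("-inf"))
    return q.max(dim=1).values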
arXiv Detail & Related papers (2025-11-04T13:42:05Z)
- Stabilizing Policy Gradients for Sample-Efficient Reinforcement Learning in LLM Reasoning [77.92320830700797]
Reinforcement Learning has played a central role in enabling the reasoning capabilities of Large Language Models. We propose a tractable computational framework that tracks and leverages curvature information during policy updates. The resulting algorithm, Curvature-Aware Policy Optimization (CAPO), identifies samples that contribute to unstable updates and masks them out.
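The mask-out step can be mocked up in a few lines; since the summary does not specify CAPO's curvature estimator, the per-sample curvature score below is an abstract input (e.g., a gradient-norm proxy), purely for illustration:

```python
import torch

def curvature_masked_pg_loss(logps, advantages, curvature, quantile=0.9):
    """Policy-gradient loss that drops the samples with the highest
    curvature scores, mimicking CAPO's mask-out of unstable updates.
    `curvature` is an assumed per-sample proxy, not CAPO's estimator."""
    threshold = torch.quantile(curvature, quantile)
    mask = (curvature <= threshold).float()
    return -(mask * logps * advantages).sum() / mask.sum().clamp(min=1.0)
```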
arXiv Detail & Related papers (2025-10-01T12:29:32Z)
- From Imitation to Optimization: A Comparative Study of Offline Learning for Autonomous Driving [0.0]
This work presents a comprehensive pipeline and a comparative study to address this limitation. We first develop a series of increasingly sophisticated Behavioral Cloning (BC) baselines. We then demonstrate that applying a state-of-the-art offline Reinforcement Learning algorithm, Conservative Q-Learning (CQL), to the same data and architecture yields a significantly more robust policy.
arXiv Detail & Related papers (2025-08-09T16:03:10Z)
- ACL-QL: Adaptive Conservative Level in Q-Learning for Offline Reinforcement Learning [46.67828766038463]
We propose a framework, Adaptive Conservative Level in Q-Learning (ACL-QL), which limits the Q-values to a mild range. ACL-QL enables adaptive control of the conservative level over each state-action pair, lifting the Q-values more for good transitions and less for bad ones. Motivated by the theoretical analysis, the algorithm uses two learnable adaptive weight functions to control the conservative level over each transition.
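One possible reading of the two-weight-function idea, applied to a CQL-style gap; the network shapes and where the weights enter the loss are assumptions, not ACL-QL's published architecture:

```python
import torch
import torch.nn as nn

class AdaptiveConservativePenalty(nn.Module):
    """Two learnable, non-negative weight functions that modulate, per
    transition, how hard Q-values are pushed down on policy actions and
    pulled up on dataset actions (a toy take on ACL-QL's mechanism)."""
    def __init__(self, sa_dim, hidden=64):
        super().__init__()
        def mlp():
            return nn.Sequential(nn.Linear(sa_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1), nn.Softplus())
        self.w_down, self.w_up = mlp(), mlp()

    def forward(self, sa_pi, q_pi, sa_data, q_data):
        push_down = (self.w_down(sa_pi).squeeze(-1) * q_pi).mean()
        pull_up = (self.w_up(sa_data).squeeze(-1) * q_data).mean()
        return push_down - pull_up
```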
arXiv Detail & Related papers (2024-12-22T04:18:02Z)
- Hypercube Policy Regularization Framework for Offline Reinforcement Learning [2.01030009289749]
This paper proposes a hypercube policy regularization framework that allows the agent to explore the actions corresponding to similar states in the static dataset. The framework is theoretically shown to improve the performance of the original algorithms.
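Read literally, the mechanism might look like the following sketch: gather the actions of dataset states falling inside an L-infinity hypercube around the query state, then regularize the policy toward that candidate set. The half-width delta is an illustrative assumption:

```python
import numpy as np

def hypercube_candidates(query_state, dataset_states, dataset_actions, delta=0.05):
    """Actions of dataset states lying inside the L-infinity hypercube
    of half-width delta around the query state; the policy can then be
    regularized toward this set instead of a single dataset action."""
    inside = np.all(np.abs(dataset_states - query_state) <= delta, axis=1)
    return dataset_actions[inside]
```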
arXiv Detail & Related papers (2024-11-07T08:48:32Z)
- Strategically Conservative Q-Learning [89.17906766703763]
Offline reinforcement learning (RL) is a compelling paradigm to extend RL's practical utility.
The major difficulty in offline RL is mitigating the impact of approximation errors when encountering out-of-distribution (OOD) actions.
We propose a novel framework called Strategically Conservative Q-Learning (SCQ) that distinguishes between OOD data that is easy and hard to estimate.
arXiv Detail & Related papers (2024-06-06T22:09:46Z)
- Offline Minimax Soft-Q-learning Under Realizability and Partial Coverage [100.8180383245813]
We propose value-based algorithms for offline reinforcement learning (RL).
We show an analogous result for vanilla Q-functions under a soft margin condition.
Our algorithms' loss functions arise from casting the estimation problems as nonlinear convex optimization problems and Lagrangifying.
arXiv Detail & Related papers (2023-02-05T14:22:41Z)
- Stochastic Optimization of Areas Under Precision-Recall Curves with Provable Convergence [66.83161885378192]
Areas under the ROC curve (AUROC) and the precision-recall curve (AUPRC) are common metrics for evaluating classification performance on imbalanced problems.
We propose a technical method to optimize AUPRC for deep learning.
arXiv Detail & Related papers (2021-04-18T06:22:21Z)
- Conservative Q-Learning for Offline Reinforcement Learning [106.05582605650932]
We show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return.
We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees.
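For reference, the conservative term that induces this lower bound can be written compactly. The sketch below uses only uniform random actions; the full CQL(H) objective also samples actions from the current policy:

```python
import torch

def cql_regularizer(q_net, states, actions, num_random=10, alpha=5.0):
    """Conservative penalty: push down Q on (potentially OOD) random
    actions via logsumexp and pull it up on dataset actions -- the term
    behind CQL's lower-bound guarantee. Assumes actions in [-1, 1] and
    q_net(s, a) -> per-sample Q-values."""
    B, a_dim = actions.shape
    rand_a = torch.empty(B, num_random, a_dim, device=states.device).uniform_(-1, 1)
    s_rep = states.unsqueeze(1).expand(-1, num_random, -1).reshape(B * num_random, -1)
    q_rand = q_net(s_rep, rand_a.reshape(B * num_random, a_dim)).view(B, num_random)
    q_data = q_net(states, actions).view(-1)
    gap = torch.logsumexp(q_rand, dim=1) - q_data  # OOD vs. in-data Q gap
    return alpha * gap.mean()
```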
arXiv Detail & Related papers (2020-06-08T17:53:42Z)