Learning Upper Lower Value Envelopes to Shape Online RL: A Principled Approach
- URL: http://arxiv.org/abs/2510.19528v1
- Date: Wed, 22 Oct 2025 12:32:52 GMT
- Title: Learning Upper Lower Value Envelopes to Shape Online RL: A Principled Approach
- Authors: Sebastian Reboul, Hélène Halconruy, Randal Douc
- Abstract summary: This study centers on how to learn and apply value envelopes in the offline-to-online setting. We introduce a principled two-stage framework: the first stage uses offline data to derive upper and lower bounds on value functions, while the second incorporates these learned bounds into online algorithms.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We investigate the fundamental problem of leveraging offline data to accelerate online reinforcement learning - a direction with strong potential but limited theoretical grounding. Our study centers on how to learn and apply value envelopes within this context. To this end, we introduce a principled two-stage framework: the first stage uses offline data to derive upper and lower bounds on value functions, while the second incorporates these learned bounds into online algorithms. Our method extends prior work by decoupling the upper and lower bounds, enabling more flexible and tighter approximations. In contrast to approaches that rely on fixed shaping functions, our envelopes are data-driven and explicitly modeled as random variables, with a filtration argument ensuring independence across phases. The analysis establishes high-probability regret bounds determined by two interpretable quantities, thereby providing a formal bridge between offline pre-training and online fine-tuning. Empirical results on tabular MDPs demonstrate substantial regret reductions compared with both UCBVI and prior methods.
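To make the two-stage framework concrete, here is a minimal tabular sketch (ours, not the paper's pseudocode): stage one runs value iteration on an offline model estimate with an additive confidence bonus to obtain upper and lower envelopes on the optimal value, and stage two clips an online optimistic value estimate into that envelope. The bonus form, the clipping rule, and all names are illustrative assumptions.

```python
# Illustrative sketch only; not the paper's actual algorithm.
import numpy as np

def offline_envelopes(counts, rewards, H, c=1.0):
    """Stage 1: from offline transition counts N[s, a, s'] and empirical
    mean rewards r[s, a], compute stage-wise upper/lower envelopes on the
    optimal value function via value iteration with a confidence bonus."""
    S, A, _ = counts.shape
    n = counts.sum(axis=2)                        # visit counts N[s, a]
    p_hat = counts / np.maximum(n, 1)[..., None]  # empirical transitions
    bonus = c * H * np.sqrt(1.0 / np.maximum(n, 1))
    V_up = np.zeros((H + 1, S))                   # optimistic envelope
    V_lo = np.zeros((H + 1, S))                   # pessimistic envelope
    for h in range(H - 1, -1, -1):
        q_up = rewards + bonus + p_hat @ V_up[h + 1]
        q_lo = rewards - bonus + p_hat @ V_lo[h + 1]
        V_up[h] = np.clip(q_up.max(axis=1), 0.0, H - h)
        V_lo[h] = np.clip(q_lo.max(axis=1), 0.0, H - h)
    return V_lo, V_up

def clip_to_envelope(v_online, V_lo, V_up, h):
    """Stage 2: project the online optimistic value estimate at stage h
    into the learned envelope, shrinking the exploration range."""
    return np.minimum(np.maximum(v_online, V_lo[h]), V_up[h])
```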
Related papers
- Temporal Difference Learning with Constrained Initial Representations
We introduce the Tanh function into the initial layer to fulfill this constraint. We present our Constrained Initial Representations framework, tagged CIR, which is made up of three components. Empirical results show that CIR exhibits strong performance on numerous continuous control tasks.
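A minimal sketch of the core trick as we read it: the first layer's output passes through Tanh, bounding the initial representation to [-1, 1]. Layer sizes and the rest of the network are illustrative guesses, not the paper's three-component architecture.

```python
import torch
import torch.nn as nn

class TanhConstrainedCritic(nn.Module):
    """Value network whose initial representation is squashed by Tanh."""
    def __init__(self, obs_dim: int, hidden: int = 256):
        super().__init__()
        self.initial = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 1))

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        z = self.initial(obs)  # constrained to [-1, 1] elementwise
        return self.head(z)
```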
arXiv Detail & Related papers (2026-02-12T10:27:57Z)
- Fine-tuning Behavioral Cloning Policies with Preference-Based Reinforcement Learning
We propose a two-stage framework that learns a safe initial policy from a reward-free dataset of expert demonstrations, then fine-tunes it online using preference-based human feedback. We provide the first principled analysis of this offline-to-online approach and introduce BRIDGE, a unified algorithm that integrates both signals via an uncertainty-weighted objective. We validate BRIDGE in discrete and continuous control MuJoCo environments, showing it achieves lower regret than both standalone behavioral cloning and online preference-based RL.
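A hedged sketch of what an uncertainty-weighted objective combining the two signals might look like; the weighting rule, the Bradley-Terry preference term, and all names are our illustrative assumptions, not BRIDGE's actual objective.

```python
import torch
import torch.nn.functional as F

def weighted_bc_pref_loss(logp_expert, score_pref, score_rej, w):
    """logp_expert: policy log-probs of expert actions (imitation signal).
    score_pref / score_rej: policy-dependent scores of the preferred and
    rejected trajectory in each pair. w in [0, 1]: weight on imitation,
    e.g. large where the preference signal is uncertain."""
    bc = -logp_expert.mean()                             # behavioral cloning
    pref = -F.logsigmoid(score_pref - score_rej).mean()  # preference term
    return w * bc + (1.0 - w) * pref
```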
arXiv Detail & Related papers (2025-09-30T17:50:19Z)
- Unlocking the Potential of Difficulty Prior in RL-based Multimodal Reasoning
We investigate how explicitly modeling a problem's difficulty as prior information shapes the effectiveness of reinforcement-learning-based fine-tuning for multimodal reasoning. Our approach demonstrates significant performance gains across various multimodal mathematical reasoning benchmarks with only 2K+0.6K two-stage training data.
arXiv Detail & Related papers (2025-05-19T15:43:10Z)
- Continual Multimodal Contrastive Learning
Multimodal Contrastive Learning (MCL) advances the alignment of different modalities and the generation of multimodal representations in a joint space. However, a critical yet often overlooked challenge remains: multimodal data is rarely collected in a single process, and training from scratch is computationally expensive. In this paper, we formulate Continual Multimodal Contrastive Learning (CMCL) through two specialized principles of stability and plasticity. We theoretically derive a novel optimization-based method, which projects updated gradients from dual sides onto subspaces where any gradient is prevented from interfering with previously learned knowledge.
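The projection idea can be sketched in a few lines: remove from the new gradient its component inside a subspace spanning directions important to previously learned knowledge, so the update cannot interfere with them. The paper's dual-sided projection is more elaborate; this shows only the single-sided core, with illustrative names.

```python
import torch

def project_out(grad: torch.Tensor, basis: torch.Tensor) -> torch.Tensor:
    """grad: flattened gradient, shape (d,). basis: shape (d, k), orthonormal
    columns spanning the protected subspace of past knowledge."""
    interfering = basis @ (basis.T @ grad)  # component inside the subspace
    return grad - interfering               # orthogonal, non-interfering part
```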
arXiv Detail & Related papers (2025-03-19T07:57:08Z)
- Adversarial Policy Optimization for Offline Preference-based Reinforcement Learning
We propose APPO, an efficient algorithm for offline preference-based reinforcement learning (PbRL). APPO guarantees sample-complexity bounds without relying on explicit confidence sets. To our knowledge, APPO is the first offline PbRL algorithm to offer both statistical efficiency and practical applicability.
arXiv Detail & Related papers (2025-03-07T10:35:01Z)
- All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning
We show that the strongest results in foundation model fine-tuning (FT) are achieved via a relatively complex, two-stage training procedure. Specifically, one first trains a reward model (RM) on some dataset (e.g., human preferences) before using it to provide online feedback. We find the most support for the explanation that, on problems with a generation-verification gap, it is relatively easy to learn the relatively simple RM from the preference data.
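Stage one of that procedure is typically a Bradley-Terry fit of the RM on preference pairs; a minimal sketch of the standard loss (the reward model itself is abstracted away):

```python
import torch.nn.functional as F

def rm_loss(r_chosen, r_rejected):
    """Scalar rewards the model assigns to the preferred and dispreferred
    response in each pair; minimizing this fits a Bradley-Terry model."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```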
arXiv Detail & Related papers (2025-03-03T00:15:19Z)
- Offline Learning for Combinatorial Multi-armed Bandits
Off-CMAB is the first offline learning framework for combinatorial multi-armed bandits (CMAB). It combines pessimistic reward estimations with combinatorial solvers through its CLCB algorithm. Experiments on synthetic and real-world datasets highlight the superior performance of CLCB.
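The pessimistic principle here can be sketched in a few lines: score each base arm by a lower confidence bound on its mean reward estimated from offline data, then hand the scores to a combinatorial solver. Plain top-k stands in for the problem-specific oracle, and the bound width is a textbook Hoeffding term, not necessarily CLCB's.

```python
import numpy as np

def lcb_select(means, counts, k, delta=0.05):
    """means / counts: per-arm empirical mean reward and sample count from
    the offline dataset. Returns a super arm of k base arms."""
    width = np.sqrt(np.log(2.0 / delta) / (2.0 * np.maximum(counts, 1)))
    lcb = means - width           # pessimistic score per base arm
    return np.argsort(lcb)[-k:]  # "solver": top-k under the LCB scores
```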
arXiv Detail & Related papers (2025-01-31T16:56:18Z)
- Characterizing the Training Dynamics of Private Fine-tuning with Langevin diffusion
We show, through both theoretical and empirical results, that differentially private full fine-tuning (DP-FFT) can distort pre-trained backbone features. We prove that a sequential fine-tuning strategy can mitigate this feature distortion.
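A minimal sketch of a sequential schedule of the kind analyzed: first fit only the new head on frozen backbone features, then unfreeze and fine-tune everything. Under DP training each phase would use a DP optimizer (per-sample clipping plus noise); that machinery is omitted, and all names and hyperparameters are illustrative.

```python
from itertools import cycle, islice
import torch
import torch.nn as nn

def sequential_finetune(backbone: nn.Module, head: nn.Module, loader,
                        loss_fn, steps_head: int, steps_full: int, lr=1e-3):
    def run(params, steps, freeze_backbone):
        for p in backbone.parameters():
            p.requires_grad_(not freeze_backbone)
        opt = torch.optim.SGD(params, lr=lr)
        for x, y in islice(cycle(loader), steps):
            opt.zero_grad()
            loss_fn(head(backbone(x)), y).backward()
            opt.step()
    run(list(head.parameters()), steps_head, freeze_backbone=True)   # phase 1
    run(list(backbone.parameters()) + list(head.parameters()),
        steps_full, freeze_backbone=False)                           # phase 2
```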
arXiv Detail & Related papers (2024-02-29T07:01:48Z)
- Beyond Imitation: Leveraging Fine-grained Quality Signals for Alignment
We propose an improved alignment approach named FIGA. Different from prior methods, we incorporate fine-grained quality signals that are derived by contrasting good and bad responses.
Our approach has made two major contributions. Firstly, we curate a refined alignment dataset that pairs initial responses and the corresponding revised ones.
Secondly, we devise a new loss function that can leverage fine-grained quality signals to instruct the learning of LLMs for alignment.
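A hedged sketch of such a loss: token-level quality weights (positive for good tokens, negative for bad ones) scale the per-token NLL, pushing the model toward the former and away from the latter. The weighting scheme is our illustration, not FIGA's exact objective.

```python
import torch
import torch.nn.functional as F

def fine_grained_loss(logits, targets, weights):
    """logits: (B, T, V); targets: (B, T) token ids; weights: (B, T) quality
    signals in [-1, 1] derived from contrasting good and bad responses."""
    nll = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
    return (weights * nll).mean()  # reward good tokens, penalize bad ones
```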
arXiv Detail & Related papers (2023-11-07T15:36:40Z)
- Near-optimal Offline Reinforcement Learning with Linear Representation: Leveraging Variance Information with Pessimism
Offline reinforcement learning seeks to utilize offline/historical data to optimize sequential decision-making strategies.
We study the statistical limits of offline reinforcement learning with linear model representations.
arXiv Detail & Related papers (2022-03-11T09:00:12Z)
- Offline Reinforcement Learning: Fundamental Barriers for Value Function Approximation
We consider the offline reinforcement learning problem, where the aim is to learn a decision-making policy from logged data.
Offline RL is becoming increasingly relevant in practice, because online data collection is poorly suited to safety-critical domains.
Our results show that sample-efficient offline reinforcement learning requires either restrictive coverage conditions or representation conditions that go beyond supervised learning.
arXiv Detail & Related papers (2021-11-21T23:22:37Z)
- False Correlation Reduction for Offline Reinforcement Learning
We propose falSe COrrelation REduction (SCORE) for offline RL, a practically effective and theoretically provable algorithm.
We empirically show that SCORE achieves SoTA performance with 3.1x acceleration on various tasks in a standard benchmark (D4RL).
arXiv Detail & Related papers (2021-10-24T15:34:03Z)
- Best-Case Lower Bounds in Online Learning
Much of the work in online learning focuses on the study of sublinear upper bounds on the regret.
In this work, we initiate the study of best-case lower bounds in online convex optimization.
We show that the linearized version of FTRL can attain negative linear regret.
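For reference, the objects involved, in our notation (standard online convex optimization, not copied from the paper): the regret against the best fixed comparator, and the linearized FTRL update whose best-case regret can be driven negative.

```latex
% Regret of iterates x_1, ..., x_T, and linearized FTRL with
% regularizer R and learning rate eta.
\mathrm{Reg}_T = \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x),
\qquad
x_{t+1} = \operatorname*{arg\,min}_{x \in \mathcal{X}}
  \Big\{ \sum_{s=1}^{t} \langle \nabla f_s(x_s), x \rangle + \tfrac{1}{\eta}\, R(x) \Big\}.
```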
arXiv Detail & Related papers (2021-06-23T23:24:38Z)