Conformal Prediction Beyond the Horizon: Distribution-Free Inference for Policy Evaluation
- URL: http://arxiv.org/abs/2510.26026v1
- Date: Wed, 29 Oct 2025 23:45:44 GMT
- Title: Conformal Prediction Beyond the Horizon: Distribution-Free Inference for Policy Evaluation
- Authors: Feichen Gan, Youcun Lu, Yingying Zhang, Yukun Liu
- Abstract summary: We propose a unified conformal prediction framework for infinite-horizon policy evaluation. Our method integrates distributional RL with conformal calibration, addressing challenges such as unobserved returns, temporal dependencies, and distributional shifts.
- Score: 14.811119246685534
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reliable uncertainty quantification is crucial for reinforcement learning (RL) in high-stakes settings. We propose a unified conformal prediction framework for infinite-horizon policy evaluation that constructs distribution-free prediction intervals for returns in both on-policy and off-policy settings. Our method integrates distributional RL with conformal calibration, addressing challenges such as unobserved returns, temporal dependencies, and distributional shifts. We propose a modular pseudo-return construction based on truncated rollouts and a time-aware calibration strategy using experience replay and weighted subsampling. These innovations mitigate model bias and restore approximate exchangeability, enabling uncertainty quantification even under policy shifts. Our theoretical analysis provides coverage guarantees that account for model misspecification and importance weight estimation. Empirical results, including experiments in synthetic and benchmark environments like Mountain Car, show that our method significantly improves coverage and reliability over standard distributional RL baselines.
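The split-conformal calibration step underlying the abstract can be sketched as follows. This is a generic illustration only: the paper's full method additionally builds pseudo-returns from truncated rollouts and reweights calibration points for temporal dependence and policy shift, which are omitted here, and all function and variable names are hypothetical.

```python
import numpy as np

def split_conformal_interval(cal_preds, cal_returns, test_preds, alpha=0.1):
    """Distribution-free prediction intervals via split conformal calibration.

    Generic sketch: nonconformity scores are absolute residuals between
    predicted and observed returns on a held-out calibration set.
    """
    scores = np.abs(cal_returns - cal_preds)
    n = len(scores)
    # Finite-sample-corrected quantile level, capped at 1.
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, q_level, method="higher")
    # Symmetric interval around each test prediction.
    return test_preds - q, test_preds + q

# Toy usage with synthetic predicted and realized returns.
rng = np.random.default_rng(0)
cal_preds = rng.normal(size=500)
cal_returns = cal_preds + rng.normal(scale=0.5, size=500)
lo, hi = split_conformal_interval(cal_preds, cal_returns, np.array([0.0]))
```

Under exchangeability, intervals built this way cover the true return with probability at least 1 - alpha regardless of the return model's quality; the paper's contribution is restoring approximate exchangeability when that assumption fails.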
Related papers
- Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning [88.42566960813438]
CalibRL is a hybrid-policy RLVR framework that supports controllable exploration with expert guidance. CalibRL increases policy entropy in a guided manner and clarifies the target distribution. Experiments across eight benchmarks, including both in-domain and out-of-domain settings, demonstrate consistent improvements.
arXiv Detail & Related papers (2026-02-22T07:23:36Z) - Rethinking the Trust Region in LLM Reinforcement Learning [72.25890308541334]
Proximal Policy Optimization (PPO) serves as the de facto standard algorithm for reinforcement learning with Large Language Models (LLMs). We propose Divergence Proximal Policy Optimization (DPPO), which substitutes clipping with a more principled constraint. DPPO achieves superior training stability and efficiency compared to existing methods, offering a more robust foundation for RL-based fine-tuning.
arXiv Detail & Related papers (2026-02-04T18:59:04Z) - Adaptive Regime-Switching Forecasts with Distribution-Free Uncertainty: Deep Switching State-Space Models Meet Conformal Prediction [38.37518767859008]
We study distribution-free uncertainty for regime-switching forecasting by coupling Deep Switching State-Space Models with Adaptive Conformal Inference (ACI) and its aggregated variant (AgACI). We also introduce a unified conformal wrapper that sits atop strong sequence baselines, including S4, MC-Dropout GRU, sparse Gaussian processes, and a change-point local model, to produce online predictive bands with finite-sample marginal guarantees under nonstationarity and model misspecification.
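The ACI mechanism referenced above adjusts a working miscoverage level online so that empirical coverage tracks the target even under distribution shift. A minimal sketch of the standard update rule (this paper's AgACI variant aggregates several such learners, which is not shown):

```python
def aci_update(alpha_t, covered, gamma=0.01, alpha_target=0.1):
    """One step of Adaptive Conformal Inference:
    alpha_{t+1} = alpha_t + gamma * (alpha_target - err_t),
    where err_t = 1 if the last interval missed the observation, else 0.
    """
    err = 0.0 if covered else 1.0
    return alpha_t + gamma * (alpha_target - err)
```

After a miss, alpha shrinks and the next interval widens; after a cover, alpha grows slightly and the interval tightens, so long-run coverage averages out to the target.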
arXiv Detail & Related papers (2025-12-02T23:21:01Z) - A Gentle Introduction to Conformal Time Series Forecasting [0.0]
This review unifies recent advances in conformal forecasting methods specifically designed to address nonexchangeable data. We first present a theoretical foundation, deriving finite-sample guarantees for split-conformal prediction under mild weak-dependence conditions. We then survey and classify state-of-the-art approaches that mitigate serial dependence by reweighting calibration data, dynamically updating residual distributions, or adaptively tuning target coverage levels in real time.
arXiv Detail & Related papers (2025-11-17T17:12:51Z) - Convergence and Generalization of Anti-Regularization for Parametric Models [0.0]
Anti-regularization introduces a reward term with a reversed sign into the loss function. We formalize spectral safety conditions and trust-region constraints. We design a lightweight safeguard that combines a projection operator with gradient clipping to guarantee stable intervention.
arXiv Detail & Related papers (2025-08-24T15:34:17Z) - Efficient Safety Alignment of Large Language Models via Preference Re-ranking and Representation-based Reward Modeling [84.00480999255628]
Reinforcement learning algorithms for safety alignment of Large Language Models (LLMs) encounter the challenge of distribution shift. Current approaches typically address this issue through online sampling from the target policy. We propose a new framework that leverages the model's intrinsic safety judgment capability to extract reward signals.
arXiv Detail & Related papers (2025-03-13T06:40:34Z) - Rectifying Conformity Scores for Better Conditional Coverage [75.73184036344908]
We present a new method for generating confidence sets within the split conformal prediction framework. Our method performs a trainable transformation of any given conformity score to improve conditional coverage while ensuring exact marginal coverage.
arXiv Detail & Related papers (2025-02-22T19:54:14Z) - Calibrated Probabilistic Forecasts for Arbitrary Sequences [58.54729945445505]
Real-world data streams can change unpredictably due to distribution shifts, feedback loops, and adversarial actors. We present a forecasting framework ensuring valid uncertainty estimates regardless of how data evolves.
arXiv Detail & Related papers (2024-09-27T21:46:42Z) - Off-Policy Evaluation for Large Action Spaces via Policy Convolution [60.6953713877886]
The Policy Convolution (PC) family of estimators uses latent structure within actions to strategically convolve the logging and target policies.
Experiments on synthetic and benchmark datasets demonstrate remarkable mean squared error (MSE) improvements when using PC.
arXiv Detail & Related papers (2023-10-24T01:00:01Z) - DROMO: Distributionally Robust Offline Model-based Policy Optimization [0.0]
We consider the problem of offline reinforcement learning with model-based control.
We propose distributionally robust offline model-based policy optimization (DROMO).
arXiv Detail & Related papers (2021-09-15T13:25:14Z) - GenDICE: Generalized Offline Estimation of Stationary Values [108.17309783125398]
We show that effective estimation can still be achieved in important applications.
Our approach is based on estimating a ratio that corrects for the discrepancy between the stationary and empirical distributions.
The resulting algorithm, GenDICE, is straightforward and effective.
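The ratio-correction idea behind GenDICE can be illustrated on a toy Markov chain. In practice GenDICE estimates the correction ratio from data without ever knowing the stationary distribution; here, purely for illustration, we compute both distributions directly on a tiny known chain, and all names are hypothetical.

```python
import numpy as np

def stationary_dist(P, iters=1000):
    """Stationary distribution of row-stochastic transition matrix P
    by power iteration from the uniform distribution."""
    d = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(iters):
        d = d @ P
    return d

# Two-state chain induced by the target policy, with per-state rewards.
P_target = np.array([[0.9, 0.1],
                     [0.2, 0.8]])
r = np.array([1.0, 0.0])

d_pi = stationary_dist(P_target)  # stationary dist under the target policy
d_emp = np.array([0.5, 0.5])      # distribution the offline data came from
w = d_pi / d_emp                  # correction ratio (GenDICE learns this)

# Importance-weighted estimate of the stationary value under the target.
corrected = np.sum(d_emp * w * r)
```

Averaging rewards under `d_emp` alone would give 0.5 here; weighting each state by the ratio `w` recovers the value under the target policy's stationary distribution instead.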
arXiv Detail & Related papers (2020-02-21T00:27:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.