Fitted Q Evaluation Without Bellman Completeness via Stationary Weighting
- URL: http://arxiv.org/abs/2512.23805v1
- Date: Mon, 29 Dec 2025 19:04:40 GMT
- Title: Fitted Q Evaluation Without Bellman Completeness via Stationary Weighting
- Authors: Lars van der Laan, Nathan Kallus
- Abstract summary: We show the need for this assumption stems from a fundamental norm mismatch. We propose a simple fix: reweight each regression step using an estimate of the stationary density ratio. This enables strong evaluation guarantees in the absence of realizability or Bellman completeness.
- Score: 40.322273308230606
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fitted Q-evaluation (FQE) is a central method for off-policy evaluation in reinforcement learning, but it generally requires Bellman completeness: that the hypothesis class is closed under the evaluation Bellman operator. This requirement is challenging because enlarging the hypothesis class can worsen completeness. We show that the need for this assumption stems from a fundamental norm mismatch: the Bellman operator is gamma-contractive under the stationary distribution of the target policy, whereas FQE minimizes Bellman error under the behavior distribution. We propose a simple fix: reweight each regression step using an estimate of the stationary density ratio, thereby aligning FQE with the norm in which the Bellman operator contracts. This enables strong evaluation guarantees in the absence of realizability or Bellman completeness, avoiding the geometric error blow-up of standard FQE in this setting while maintaining the practicality of regression-based evaluation.
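The reweighting idea in the abstract can be illustrated on a toy problem. The sketch below is not the authors' implementation: it works at the population level (no sampling), uses linear features, and plugs in the exact stationary density ratio, whereas the paper uses an estimate. All names (`make_toy_problem`, `weighted_fqe`, `stationary_distribution`) and the random toy MDP are hypothetical choices made for illustration.

```python
import numpy as np


def make_toy_problem(seed=0, n_states=6, n_actions=2, n_features=3):
    """Random toy MDP (hypothetical, for illustration only)."""
    rng = np.random.default_rng(seed)
    n = n_states * n_actions
    P = rng.dirichlet(np.ones(n_states), size=n)            # P[(s, a), s']
    pi = rng.dirichlet(np.ones(n_actions), size=n_states)   # pi[s', a']
    # State-action transition matrix under the target policy.
    P_pi = (P[:, :, None] * pi[None, :, :]).reshape(n, n)
    r = rng.uniform(size=n)                                  # reward per (s, a)
    mu = rng.dirichlet(np.ones(n))                           # behavior distribution
    Phi = rng.normal(size=(n, n_features))                   # linear features
    return Phi, r, P_pi, mu


def stationary_distribution(P_pi, iters=5000):
    """Power iteration for the stationary distribution of a row-stochastic matrix."""
    d = np.full(P_pi.shape[0], 1.0 / P_pi.shape[0])
    for _ in range(iters):
        d = d @ P_pi
    return d / d.sum()


def weighted_fqe(Phi, r, P_pi, gamma, mu, w, iters=500):
    """Population-level FQE with linear features.

    Each step regresses the Bellman target r + gamma * P_pi @ Q_k onto
    span(Phi), weighting state-action pair i by mu[i] * w[i].
    w = 1 gives standard FQE; w = d_pi / mu gives stationary weighting.
    """
    sample_w = mu * w
    G = Phi.T @ (sample_w[:, None] * Phi)   # weighted Gram matrix
    theta = np.zeros(Phi.shape[1])
    for _ in range(iters):
        target = r + gamma * P_pi @ (Phi @ theta)
        theta = np.linalg.solve(G, Phi.T @ (sample_w * target))
    return theta


def weighted_rmse(q_hat, q_true, d):
    """Root-mean-square error of q_hat under the distribution d."""
    return float(np.sqrt(d @ (q_hat - q_true) ** 2))


if __name__ == "__main__":
    gamma = 0.9
    Phi, r, P_pi, mu = make_toy_problem()
    n = len(r)
    q_true = np.linalg.solve(np.eye(n) - gamma * P_pi, r)
    d_pi = stationary_distribution(P_pi)

    theta_w = weighted_fqe(Phi, r, P_pi, gamma, mu, d_pi / mu)
    theta_u = weighted_fqe(Phi, r, P_pi, gamma, mu, np.ones(n))
    # No particular ordering of the two errors is guaranteed on one random instance.
    print("stationary-weighted error:", weighted_rmse(Phi @ theta_w, q_true, d_pi))
    print("unweighted error:         ", weighted_rmse(Phi @ theta_u, q_true, d_pi))
```

With the weights `d_pi / mu`, each regression step is a projection in the norm induced by the target policy's stationary distribution, so the composed update is a gamma-contraction and the iteration converges even though the three random features are neither realizable nor Bellman complete. With constant weights the projection is taken under `mu` instead, and no such contraction is guaranteed.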
Related papers
- Reward Redistribution for CVaR MDPs using a Bellman Operator on L-infinity [16.835098688159004]
Tail-end risk measures such as static conditional value-at-risk (CVaR) are used in safety-critical applications to prevent rare, yet catastrophic events. We develop risk-averse value and model-free Q-learning algorithms that rely on discretized augmented states. Empirical results demonstrate that our algorithms successfully learn CVaR-sensitive policies and achieve effective performance-safety trade-offs.
arXiv Detail & Related papers (2026-02-03T17:39:45Z)
- Stationary Reweighting Yields Local Convergence of Soft Fitted Q-Iteration [40.322273308230606]
We show that fitted Q-iteration and its entropy-regularized variant, soft FQI, behave poorly under function approximation and distribution shift. We introduce stationary-reweighted soft FQI, which reweights each regression update using the stationary distribution of the current policy. Our analysis suggests that global convergence may be recovered by gradually reducing the softmax temperature.
arXiv Detail & Related papers (2025-12-30T00:58:35Z)
- Bellman Calibration for V-Learning in Offline Reinforcement Learning [40.322273308230606]
We introduce Iterated Bellman, a simple, model-agnostic, post-hoc procedure for calibrating off-policy value predictions. We adapt classical histogram and isotonic calibration to the dynamic, counterfactual setting. This yields a one-dimensional fitted value scheme that can be applied to any value estimator.
arXiv Detail & Related papers (2025-12-29T18:52:18Z)
- COIN: Uncertainty-Guarding Selective Question Answering for Foundation Models with Provable Risk Guarantees [51.5976496056012]
COIN is an uncertainty-guarding selection framework that calibrates statistically valid thresholds to filter a single generated answer per question. COIN estimates the empirical error rate on a calibration set and applies confidence interval methods to establish a high-probability upper bound on the true error rate. We demonstrate COIN's robustness in risk control, strong test-time power in retaining admissible answers, and predictive efficiency under limited calibration data.
arXiv Detail & Related papers (2025-06-25T07:04:49Z)
- To bootstrap or to rollout? An optimal and adaptive interpolation [4.755935781862859]
We introduce a class of Bellman operators that interpolate between bootstrapping and rollout methods. Our estimator combines the strengths of the bootstrapping-based temporal difference (TD) estimator and the rollout-based Monte Carlo (MC) methods.
arXiv Detail & Related papers (2024-11-14T19:00:00Z)
- Relaxed Quantile Regression: Prediction Intervals for Asymmetric Noise [51.87307904567702]
Quantile regression is a leading approach for obtaining such intervals via the empirical estimation of quantiles in the distribution of outputs. We propose Relaxed Quantile Regression (RQR), a direct alternative to quantile regression based interval construction that removes this arbitrary constraint. We demonstrate that this added flexibility results in intervals with an improvement in desirable qualities.
arXiv Detail & Related papers (2024-06-05T13:36:38Z)
- Symmetric Q-learning: Reducing Skewness of Bellman Error in Online Reinforcement Learning [55.75959755058356]
In deep reinforcement learning, estimating the value function is essential to evaluate the quality of states and actions.
A recent study suggested that the error distribution for training the value function is often skewed because of the properties of the Bellman operator.
We propose a method called Symmetric Q-learning, in which synthetic noise generated from a zero-mean distribution is added to the target values to generate a Gaussian error distribution.
arXiv Detail & Related papers (2024-03-12T14:49:19Z)
- When is Realizability Sufficient for Off-Policy Reinforcement Learning? [17.317841035807696]
We analyze the statistical complexity of off-policy reinforcement learning when only realizability holds for the prescribed function class.
We establish finite-sample guarantees for off-policy reinforcement learning that are free of the approximation error term known as inherent Bellman error.
arXiv Detail & Related papers (2022-11-10T03:15:31Z)
- Why Should I Trust You, Bellman? The Bellman Error is a Poor Replacement for Value Error [83.10489974736404]
We study the use of the Bellman equation as a surrogate objective for value prediction accuracy.
We find that the Bellman error is a poor proxy for the accuracy of the value function.
arXiv Detail & Related papers (2022-01-28T21:03:59Z)
- Bayesian Bellman Operators [55.959376449737405]
We introduce a novel perspective on Bayesian reinforcement learning (RL).
Our framework is motivated by the insight that when bootstrapping is introduced, model-free approaches actually infer a posterior over Bellman operators, not value functions.
arXiv Detail & Related papers (2021-06-09T12:20:46Z)