Evaluation-Aware Reinforcement Learning
- URL: http://arxiv.org/abs/2509.19464v1
- Date: Tue, 23 Sep 2025 18:17:21 GMT
- Title: Evaluation-Aware Reinforcement Learning
- Authors: Shripad Vilasrao Deshmukh, Will Schwarzer, Scott Niekum
- Abstract summary: Policy evaluation is often a prerequisite for deploying safety- and performance-critical systems. We propose evaluation-aware reinforcement learning (EvA-RL), in which a policy is trained to maximize expected return while simultaneously minimizing expected evaluation error. We show that EvA-RL can substantially reduce evaluation error while maintaining competitive returns.
- Score: 10.594563233900004
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Policy evaluation is often a prerequisite for deploying safety- and performance-critical systems. Existing evaluation approaches frequently suffer from high variance due to limited data and long-horizon tasks, or high bias due to unequal support or inaccurate environmental models. We posit that these challenges arise, in part, from the standard reinforcement learning (RL) paradigm of policy learning without explicit consideration of evaluation. As an alternative, we propose evaluation-aware reinforcement learning (EvA-RL), in which a policy is trained to maximize expected return while simultaneously minimizing expected evaluation error under a given value prediction scheme -- in other words, being "easy" to evaluate. We formalize a framework for EvA-RL and design an instantiation that enables accurate policy evaluation, conditioned on a small number of rollouts in an assessment environment that can be different from the deployment environment. However, our theoretical analysis and empirical results show that there is often a tradeoff between evaluation accuracy and policy performance when using a fixed value-prediction scheme within EvA-RL. To mitigate this tradeoff, we extend our approach to co-learn an assessment-conditioned state-value predictor alongside the policy. Empirical results across diverse discrete and continuous action domains demonstrate that EvA-RL can substantially reduce evaluation error while maintaining competitive returns. This work lays the foundation for a broad new class of RL methods that treat reliable evaluation as a first-class principle during training.
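As a rough illustration of the idea, the sketch below scores a policy by its mean return minus a weighted squared evaluation error. The function name, the tradeoff weight `lam`, and the toy data are illustrative assumptions, not the paper's actual loss or hyperparameters.

```python
import numpy as np

def eva_rl_objective(returns, predicted_values, lam=0.5):
    """Toy evaluation-aware objective (illustrative, not the paper's
    exact loss): reward high return, penalize evaluation error.

    returns          -- observed rollout returns, shape (n,)
    predicted_values -- value predictions for the same start states,
                        e.g. conditioned on a few assessment rollouts
    lam              -- tradeoff weight between return and evaluability
    """
    expected_return = np.mean(returns)
    eval_error = np.mean((predicted_values - returns) ** 2)
    # Maximize return while keeping the policy "easy" to evaluate.
    return expected_return - lam * eval_error

# Toy usage: at equal mean return, a policy whose returns are well
# predicted by the value scheme scores strictly higher.
rng = np.random.default_rng(0)
g = rng.normal(10.0, 2.0, size=256)             # rollout returns
v_accurate = g + rng.normal(0.0, 0.1, size=256) # accurate predictor
v_constant = np.full(256, 10.0)                 # ignores per-rollout structure
print(eva_rl_objective(g, v_accurate), eva_rl_objective(g, v_constant))
```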
Related papers
- ALOE: Action-Level Off-Policy Evaluation for Vision-Language-Action Model Post-Training [15.70383059978939]
We study how to improve large foundation vision-language-action (VLA) systems through online reinforcement learning (RL) in real-world settings.
In practice, the value function is estimated from trajectory fragments collected from different data sources.
We propose ALOE, an action-level off-policy evaluation framework for VLA post-training.
arXiv Detail & Related papers (2026-02-13T07:46:37Z) - What Makes Value Learning Efficient in Residual Reinforcement Learning? [57.635661297706065]
Residual reinforcement learning (RL) enables stable online refinement of expressive pretrained policies by freezing the base and learning only bounded corrections.
In this work, we identify two key bottlenecks: a cold-start pathology, where the critic lacks knowledge of the value landscape around the base policy, and a structural scale mismatch.
We propose DAWN, a minimal approach targeting efficient value learning in residual RL.
arXiv Detail & Related papers (2026-02-11T05:25:39Z) - Random Policy Enables In-Context Reinforcement Learning within Trust Horizons [2.52299400625445]
State-Action Distillation (SAD) generates an effective pretraining dataset guided solely by random policies.
SAD outperforms the best baseline by 236.3% in offline evaluation and by 135.2% in online evaluation.
arXiv Detail & Related papers (2024-10-25T21:46:25Z) - Conservative State Value Estimation for Offline Reinforcement Learning [36.416504941791224]
Conservative State Value Estimation (CSVE) learns a conservative V-function by directly imposing a penalty on OOD states.
We develop a practical actor-critic algorithm in which the critic performs conservative value estimation by additionally sampling and penalizing states around the dataset.
We evaluate on classic continuous control tasks from D4RL, showing that our method performs better than conservative Q-function learning methods and is strongly competitive among recent SOTA methods.
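A minimal sketch of that penalty, assuming a generic value function `v` and OOD states obtained by perturbing dataset states; the loss form and `alpha` weight are assumptions, not CSVE's exact objective.

```python
import numpy as np

def conservative_v_loss(v, states, targets, ood_states, alpha=1.0):
    """Illustrative conservative state-value loss: fit V on dataset
    states, and additionally push V down on states sampled around the
    dataset (treated as out-of-distribution).
    """
    fit = np.mean((v(states) - targets) ** 2)  # standard value regression
    penalty = np.mean(v(ood_states))           # lower values on OOD states
    return fit + alpha * penalty

# e.g. ood_states = states + rng.normal(0.0, 0.1, size=states.shape)
```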
arXiv Detail & Related papers (2023-02-14T08:13:55Z) - Unifying Gradient Estimators for Meta-Reinforcement Learning via Off-Policy Evaluation [53.83642844626703]
We provide a unifying framework for estimating higher-order derivatives of value functions, based on off-policy evaluation.
Our framework interprets a number of prior approaches as special cases and elucidates the bias and variance trade-off of Hessian estimates.
arXiv Detail & Related papers (2021-06-24T15:58:01Z) - Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning [63.53407136812255]
Offline Reinforcement Learning promises to learn effective policies from previously-collected, static datasets without the need for exploration.
Existing Q-learning and actor-critic based off-policy RL algorithms fail when bootstrapping from out-of-distribution (OOD) actions or states.
We propose Uncertainty Weighted Actor-Critic (UWAC), an algorithm that detects OOD state-action pairs and down-weights their contribution in the training objectives accordingly.
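A rough sketch of the down-weighting idea, assuming a per-transition uncertainty estimate is already available (UWAC obtains one via dropout on the target network); the inverse weighting form and `beta` are assumptions, not UWAC's exact scheme.

```python
import numpy as np

def uwac_td_loss(q_pred, td_target, uncertainty, beta=1.0):
    """Illustrative uncertainty-weighted TD loss: transitions whose
    bootstrapped targets are uncertain contribute less to the critic
    update, limiting error propagation from OOD state-action pairs.
    """
    weights = beta / (beta + uncertainty)  # high uncertainty -> low weight
    return np.mean(weights * (q_pred - td_target) ** 2)
```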
arXiv Detail & Related papers (2021-05-17T20:16:46Z) - Bootstrapping Statistical Inference for Off-Policy Evaluation [43.79456564713911]
We study the use of bootstrapping in off-policy evaluation (OPE).
We propose a bootstrapping FQE method for inferring the distribution of the policy evaluation error and show that this method is efficient and consistent for off-policy statistical inference.
We evaluate the bootstrapping method in classical RL environments for confidence interval estimation, estimating the variance of an off-policy evaluator, and estimating the correlation between multiple off-policy evaluators.
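The mechanism can be sketched as follows, with `evaluator` standing in for a real FQE implementation; the trajectory-level resampling and percentile interval are one plausible instantiation, not necessarily the paper's exact procedure.

```python
import numpy as np

def bootstrap_ope_interval(trajectories, evaluator, n_boot=200,
                           alpha=0.05, seed=0):
    """Illustrative bootstrap for OPE: resample whole trajectories with
    replacement, re-run the evaluator on each resample, and report a
    percentile confidence interval. `evaluator` is any callable from a
    list of trajectories to a scalar value estimate.
    """
    rng = np.random.default_rng(seed)
    n = len(trajectories)
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample with replacement
        estimates.append(evaluator([trajectories[i] for i in idx]))
    lo, hi = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```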
arXiv Detail & Related papers (2021-02-06T16:45:33Z) - Reliable Off-policy Evaluation for Reinforcement Learning [53.486680020852724]
In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy.
We propose a novel framework that provides robust and optimistic cumulative reward estimates using one or multiple logged datasets.
arXiv Detail & Related papers (2020-11-08T23:16:19Z) - Provably Good Batch Reinforcement Learning Without Great Exploration [51.51462608429621]
Batch reinforcement learning (RL) is important for applying RL algorithms to many high-stakes tasks.
Recent algorithms have shown promise but can still be overly optimistic in their expected outcomes.
We show that a small modification to the Bellman optimality and evaluation backups, taking a more conservative update, can yield much stronger guarantees.
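One way to picture such a conservative backup is to restrict the max to well-supported actions; the count threshold and pessimistic fallback below are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def conservative_backup(q_next, counts, reward, gamma=0.99, min_count=5):
    """Illustrative conservative Bellman backup: maximize only over
    next actions with enough data support, falling back to a
    pessimistic value when no action is well supported.

    q_next -- Q-values at the next state, shape (num_actions,)
    counts -- visit counts for each (next_state, action) pair
    """
    supported = counts >= min_count
    if supported.any():
        best = np.max(q_next[supported])  # max over supported actions only
    else:
        best = np.min(q_next)             # pessimistic fallback
    return reward + gamma * best
```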
arXiv Detail & Related papers (2020-07-16T09:25:54Z) - Expert-Supervised Reinforcement Learning for Offline Policy Learning and Evaluation [21.703965401500913]
We propose an Expert-Supervised RL (ESRL) framework which uses uncertainty quantification for offline policy learning.
In particular, we make three contributions: 1) the method can learn safe and optimal policies through hypothesis testing, 2) ESRL allows for different levels of risk-averse implementations tailored to the application context, and 3) we propose a way to interpret ESRL's policy at every state through posterior distributions.
arXiv Detail & Related papers (2020-06-23T17:43:44Z) - Kalman meets Bellman: Improving Policy Evaluation through Value Tracking [59.691919635037216]
Policy evaluation is a key process in Reinforcement Learning (RL).
We devise an optimization method, called Kalman Optimization for Value Approximation (KOVA).
KOVA minimizes a regularized objective function that concerns both parameter and noisy return uncertainties.
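A scalar caricature of the idea, treating the value of a single state as a latent variable observed through noisy returns; KOVA itself operates on full parameter vectors, so this is only a sketch.

```python
def kalman_value_update(v, p, observed_return, obs_noise=1.0,
                        process_noise=0.01):
    """Scalar Kalman-style value update: treat the value estimate v as
    a latent state with uncertainty p, observed through noisy returns.
    The update size adapts to both uncertainties instead of using a
    fixed learning rate.
    """
    p = p + process_noise              # predict: uncertainty grows
    k = p / (p + obs_noise)            # Kalman gain
    v = v + k * (observed_return - v)  # correct toward the observation
    p = (1.0 - k) * p                  # posterior uncertainty shrinks
    return v, p
```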
arXiv Detail & Related papers (2020-02-17T13:30:43Z) - Interpretable Off-Policy Evaluation in Reinforcement Learning by Highlighting Influential Transitions [48.91284724066349]
Off-policy evaluation in reinforcement learning offers the chance of using observational data to improve future outcomes in domains such as healthcare and education.
Traditional measures such as confidence intervals may be insufficient due to noise, limited data and confounding.
We develop a method that could serve as a hybrid human-AI system, to enable human experts to analyze the validity of policy evaluation estimates.
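A leave-one-out sketch of "influence" with a generic `evaluator` callable; the paper's actual influence measure may differ, so treat the scheme below as an assumption.

```python
import numpy as np

def transition_influence(transitions, evaluator):
    """Leave-one-out influence scores: how much the OPE estimate moves
    when a single transition is removed. Large |influence| flags
    transitions worth inspecting by a human expert. `evaluator` is any
    callable from a transition list to a scalar value estimate.
    """
    base = evaluator(transitions)
    scores = []
    for i in range(len(transitions)):
        held_out = transitions[:i] + transitions[i + 1:]
        scores.append(evaluator(held_out) - base)
    return np.array(scores)
```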
arXiv Detail & Related papers (2020-02-10T00:26:43Z)