State-Action Similarity-Based Representations for Off-Policy Evaluation
- URL: http://arxiv.org/abs/2310.18409v1
- Date: Fri, 27 Oct 2023 18:00:57 GMT
- Title: State-Action Similarity-Based Representations for Off-Policy Evaluation
- Authors: Brahma S. Pavse and Josiah P. Hanna
- Abstract summary: We introduce an OPE-tailored state-action behavioral similarity metric, and use this metric and the fixed dataset to learn an encoder that models this metric.
We show that our state-action representation method boosts the data-efficiency of FQE and lowers OPE error relative to other OPE-based representation learning methods on challenging OPE tasks.
- Score: 7.428147895832805
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In reinforcement learning, off-policy evaluation (OPE) is the problem of
estimating the expected return of an evaluation policy given a fixed dataset
that was collected by running one or more different policies. One of the more
empirically successful algorithms for OPE has been the fitted q-evaluation
(FQE) algorithm that uses temporal difference updates to learn an action-value
function, which is then used to estimate the expected return of the evaluation
policy. Typically, the original fixed dataset is fed directly into FQE to learn
the action-value function of the evaluation policy. Instead, in this paper, we
seek to enhance the data-efficiency of FQE by first transforming the fixed
dataset using a learned encoder, and then feeding the transformed dataset into
FQE. To learn such an encoder, we introduce an OPE-tailored state-action
behavioral similarity metric, and use this metric and the fixed dataset to
learn an encoder that models this metric. Theoretically, we show that this
metric allows us to bound the error in the resulting OPE estimate. Empirically,
we show that other state-action similarity metrics lead to representations that
cannot represent the action-value function of the evaluation policy, and that
our state-action representation method boosts the data-efficiency of FQE and
lowers OPE error relative to other OPE-based representation learning methods on
challenging OPE tasks. We also empirically show that the learned
representations significantly mitigate divergence of FQE under varying
distribution shifts. Our code is available here:
https://github.com/Badger-RL/ROPE.
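
For concreteness, below is a minimal, hypothetical sketch of the FQE-on-transformed-data idea described in the abstract. It is not the authors' ROPE implementation: the encoder is assumed to have been pre-trained separately (e.g., to model the paper's state-action similarity metric), and the names `QNetwork`, `fqe_update`, `encoder`, and `eval_policy` are illustrative.

```python
# Sketch: fitted Q-evaluation (FQE) where transitions are first mapped through
# a frozen, pre-trained state-action encoder. Illustrative only -- not the
# authors' ROPE code; all class and function names here are made up.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QNetwork(nn.Module):
    """Action-value head that operates on encoded (state, action) features."""
    def __init__(self, rep_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(rep_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, z):
        return self.net(z).squeeze(-1)

def fqe_update(q, q_target, encoder, batch, eval_policy, gamma, optimizer):
    """One TD update of FQE on encoder-transformed data."""
    s, a, r, s_next, done = batch
    with torch.no_grad():
        a_next = eval_policy(s_next)          # action the evaluation policy would take
        z_next = encoder(s_next, a_next)      # frozen, similarity-based representation
        target = r + gamma * (1.0 - done) * q_target(z_next)
    loss = F.mse_loss(q(encoder(s, a)), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

With a trained `q`, the OPE estimate is the average of `q(encoder(s0, eval_policy(s0)))` over initial states in the dataset, and `q_target` is periodically synced with `q` as in standard FQE.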
Related papers
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that human annotators prefer SQC-Score over the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- Sample Complexity of Preference-Based Nonparametric Off-Policy Evaluation with Deep Networks [58.469818546042696]
We study the sample efficiency of OPE with human preference and establish a statistical guarantee for it.
By appropriately selecting the size of a ReLU network, we show that one can leverage any low-dimensional manifold structure in the Markov decision process.
arXiv Detail & Related papers (2023-10-16T16:27:06Z)
- Offline RL with No OOD Actions: In-Sample Learning via Implicit Value Regularization [90.9780151608281]
In-sample learning (IQL) improves the policy by quantile regression using only data samples.
We make a key finding that the in-sample learning paradigm arises under the Implicit Value Regularization (IVR) framework.
We propose two practical algorithms, Sparse $Q$-learning (SQL) and Exponential $Q$-learning (EQL), which adopt the same value regularization used in existing works.
arXiv Detail & Related papers (2023-03-28T08:30:01Z)
- Value-Consistent Representation Learning for Data-Efficient Reinforcement Learning [105.70602423944148]
We propose a novel method, called value-consistent representation learning (VCR), to learn representations that are directly related to decision-making.
Rather than aligning the imagined future state with the real state returned by the environment, VCR applies a $Q$-value head to both states and obtains two distributions of action values.
Our method is demonstrated to achieve new state-of-the-art performance among search-free RL algorithms.
arXiv Detail & Related papers (2022-06-25T03:02:25Z)
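
The VCR entry above only names its mechanism. Below is a hypothetical sketch of that idea, in which the latent dynamics model, the discrete action space, and the choice of divergence are all assumptions rather than details taken from the paper.

```python
# Hypothetical value-consistency loss in the spirit of the VCR summary above:
# apply the same Q head to an imagined latent state and to the encoded real
# next state, then align the two action-value distributions instead of the
# states themselves. Details here are assumptions, not the paper's design.
import torch
import torch.nn.functional as F

def value_consistency_loss(encoder, dynamics, q_head, s, a, s_next):
    z_imagined = dynamics(encoder(s), a)      # latent state predicted from (s, a)
    with torch.no_grad():
        q_real = q_head(encoder(s_next))      # action values at the real next state
    q_imagined = q_head(z_imagined)           # action values at the imagined state
    # Align the two action-value distributions (softmax + cross-entropy here;
    # the paper may use a different distributional form or divergence).
    return F.cross_entropy(q_imagined, F.softmax(q_real, dim=-1))
```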
- Diversity Enhanced Active Learning with Strictly Proper Scoring Rules [4.81450893955064]
We study acquisition functions for active learning (AL) for text classification.
We convert the Expected Loss Reduction (ELR) method to estimate the increase in (strictly proper) scores like log probability or negative mean square error.
We show that the use of mean square error and log probability with BEMPS yields robust acquisition functions.
arXiv Detail & Related papers (2021-10-27T05:02:11Z)
- Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
arXiv Detail & Related papers (2021-10-12T17:05:05Z)
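
As a hedged illustration of how the implicit Q-learning entry above avoids evaluating out-of-dataset actions, the sketch below fits a state-value function by expectile regression against Q-values of dataset actions only; names and hyperparameters are illustrative, not the authors' code.

```python
# Sketch of the in-sample idea behind implicit Q-learning: V(s) is regressed
# toward Q(s, a) only at actions that appear in the dataset (via an asymmetric
# expectile loss), so no out-of-distribution actions are ever queried.
import torch

def expectile_loss(diff, tau: float = 0.7):
    """Asymmetric squared loss; tau > 0.5 biases V toward upper expectiles of Q."""
    weight = torch.where(diff > 0,
                         torch.full_like(diff, tau),
                         torch.full_like(diff, 1.0 - tau))
    return (weight * diff.pow(2)).mean()

def value_update(v_net, q_net, s, a, optimizer):
    """One gradient step on V using only (s, a) pairs from the fixed dataset."""
    with torch.no_grad():
        q = q_net(s, a)                       # Q evaluated at dataset actions only
    loss = expectile_loss(q - v_net(s))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The Q-function is then updated with TD targets r + gamma * V(s'), again without querying actions outside the dataset.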
- Variance-Aware Off-Policy Evaluation with Linear Function Approximation [85.75516599931632]
We study the off-policy evaluation problem in reinforcement learning with linear function approximation.
We propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration.
arXiv Detail & Related papers (2021-06-22T17:58:46Z)
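
The VA-OPE summary above only names the reweighting idea. Below is a hedged sketch of one variance-weighted regression step for linear fitted Q-iteration; the inputs (feature matrices, per-transition variance estimates) and function name are assumptions, not the paper's algorithm as published.

```python
# Sketch: one linear fitted Q-iteration step in which each transition's Bellman
# residual is down-weighted by an estimated value-function variance, in the
# spirit of the summary above. Inputs and names are illustrative assumptions.
import numpy as np

def variance_weighted_fqi_step(phi, rewards, phi_next_pi, theta, sigma2,
                               gamma=0.99, ridge=1e-3):
    """phi: [n, d] features of (s, a); phi_next_pi: [n, d] features of (s', pi(s'));
    sigma2: [n] estimated variances used to form inverse weights."""
    targets = rewards + gamma * phi_next_pi @ theta      # TD targets under current theta
    w = 1.0 / np.maximum(sigma2, 1e-8)                   # inverse-variance weights
    A = phi.T @ (w[:, None] * phi) + ridge * np.eye(phi.shape[1])
    b = phi.T @ (w * targets)
    return np.linalg.solve(A, b)                         # weighted least-squares solution
```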
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.