Sample Complexity of Preference-Based Nonparametric Off-Policy
Evaluation with Deep Networks
- URL: http://arxiv.org/abs/2310.10556v2
- Date: Mon, 26 Feb 2024 23:19:35 GMT
- Title: Sample Complexity of Preference-Based Nonparametric Off-Policy
Evaluation with Deep Networks
- Authors: Zihao Li, Xiang Ji, Minshuo Chen, Mengdi Wang
- Abstract summary: We study the sample efficiency of OPE with human preference data and establish a statistical guarantee for it.
By appropriately selecting the size of a ReLU network, we show that one can leverage any low-dimensional manifold structure in the Markov decision process.
- Score: 58.469818546042696
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A recently popular approach to reinforcement learning is to learn from human
preference data. In fact, human preference data are now used with
classic reinforcement learning algorithms such as actor-critic methods, which
involve evaluating an intermediate policy over a reward learned from human
preference data under distribution shift, a problem known as off-policy evaluation (OPE).
Such an algorithm involves (i) learning a reward function from a human preference
dataset, and (ii) learning the expected cumulative reward of a target policy.
Despite the huge empirical success, existing OPE methods with preference data
often lack theoretical understanding and rely heavily on heuristics. In this
paper, we study the sample efficiency of OPE with human preference data and
establish a statistical guarantee for it. Specifically, we approach OPE by
learning the value function via fitted-Q-evaluation with a deep neural network.
By appropriately selecting the size of a ReLU network, we show that one can
leverage any low-dimensional manifold structure in the Markov decision process
and obtain a sample-efficient estimator without suffering from the curse of
the data's high ambient dimensionality. Under the assumption of high reward
smoothness, our results almost align with the classical OPE results
with observable reward data. To the best of our knowledge, this is the first
result that establishes a provably efficient guarantee for off-policy
evaluation with RLHF.
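As a rough, non-authoritative illustration of the two-step pipeline described above, the PyTorch sketch below (i) fits a reward network on trajectory-level preference labels under a Bradley-Terry-style likelihood (a standard RLHF assumption, not stated in the abstract) and (ii) runs fitted-Q-evaluation with a plain ReLU network on the learned reward. All function names, data formats, and hyperparameters here are hypothetical; the paper's sample-efficiency result comes from sizing the ReLU network to match the MDP's low-dimensional manifold structure, which this sketch does not attempt to do.

```python
# Hypothetical sketch, not the authors' code: preference-based OPE as
# (i) reward learning from preference pairs and (ii) fitted-Q-evaluation.
import torch
import torch.nn as nn
import torch.nn.functional as F


def relu_net(in_dim, width=256, depth=3, out_dim=1):
    """Plain ReLU network; in the paper its size is tuned to the manifold dimension."""
    layers, d = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(d, width), nn.ReLU()]
        d = width
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)


def learn_reward(pref_pairs, sa_dim, epochs=50, lr=1e-3):
    """Step (i): pref_pairs holds (traj_a, traj_b, label) with traj_* of shape
    [T, sa_dim] (stacked state-action features) and label = 1 if traj_a is
    preferred.  The Bradley-Terry likelihood used here is an assumption."""
    reward = relu_net(sa_dim)
    opt = torch.optim.Adam(reward.parameters(), lr=lr)
    for _ in range(epochs):
        for traj_a, traj_b, label in pref_pairs:
            # Preference logit = difference of summed rewards along the two trajectories.
            logit = reward(traj_a).sum() - reward(traj_b).sum()
            loss = F.binary_cross_entropy_with_logits(logit, torch.tensor(float(label)))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return reward


def fitted_q_evaluation(sa, s_next, target_policy, reward_model, gamma=0.99,
                        iters=50, epochs=20, lr=1e-3):
    """Step (ii): FQE on off-policy transitions (sa = [N, sa_dim] state-action
    features, s_next = [N, state_dim]); termination handling is omitted."""
    sa_dim = sa.shape[-1]
    q = relu_net(sa_dim)
    for _ in range(iters):
        with torch.no_grad():
            # Bellman target uses the *learned* reward, not observed rewards.
            sa_next = torch.cat([s_next, target_policy(s_next)], dim=-1)
            target = reward_model(sa).squeeze(-1) + gamma * q(sa_next).squeeze(-1)
        q_new = relu_net(sa_dim)
        opt = torch.optim.Adam(q_new.parameters(), lr=lr)
        for _ in range(epochs):
            loss = F.mse_loss(q_new(sa).squeeze(-1), target)
            opt.zero_grad()
            loss.backward()
            opt.step()
        q = q_new
    return q  # policy value: average Q(s0, target_policy(s0)) over start states
```

Each FQE round regresses a fresh Q-network onto a Bellman target built from the previous iterate and the learned reward; the target policy's value is then estimated by averaging Q(s0, pi(s0)) over initial states.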
Related papers
- Reward-Augmented Data Enhances Direct Preference Alignment of LLMs [56.24431208419858]
We introduce reward-conditioned Large Language Models (LLMs) that learn from the entire spectrum of response quality within the dataset.
We propose an effective yet simple data relabeling method that conditions the preference pairs on quality scores to construct a reward-augmented dataset.
arXiv Detail & Related papers (2024-10-10T16:01:51Z)
- Fine-Tuning Language Models with Reward Learning on Policy [68.70065254564642]
Reinforcement learning from human feedback (RLHF) has emerged as an effective approach to aligning large language models (LLMs) to human preferences.
Despite its popularity, (fixed) reward models may become inaccurate off-distribution.
We propose reward learning on policy (RLP), an unsupervised framework that refines a reward model using policy samples to keep it on-distribution.
arXiv Detail & Related papers (2024-03-28T10:02:10Z)
- Querying Easily Flip-flopped Samples for Deep Active Learning [63.62397322172216]
Active learning is a machine learning paradigm that aims to improve the performance of a model by strategically selecting and querying unlabeled data.
One effective selection strategy is to base it on the model's predictive uncertainty, which can be interpreted as a measure of how informative a sample is.
This paper proposes the least disagree metric (LDM), defined as the smallest probability of disagreement of the predicted label.
arXiv Detail & Related papers (2024-01-18T08:12:23Z)
- Optimal Sample Selection Through Uncertainty Estimation and Its Application in Deep Learning [22.410220040736235]
We present a theoretically optimal solution for addressing both coreset selection and active learning.
Our proposed method, COPS, is designed to minimize the expected loss of a model trained on subsampled data.
arXiv Detail & Related papers (2023-09-05T14:06:33Z)
- Quantile Off-Policy Evaluation via Deep Conditional Generative Learning [21.448553360543478]
Off-policy evaluation (OPE) is concerned with evaluating a new target policy using offline data generated by a potentially different behavior policy.
We propose a doubly-robust inference procedure for quantile OPE in sequential decision making.
We demonstrate the advantages of this proposed estimator through both simulations and a real-world dataset from a short-video platform.
arXiv Detail & Related papers (2022-12-29T22:01:43Z)
- Using Sum-Product Networks to Assess Uncertainty in Deep Active Learning [3.7507283158673212]
This paper proposes a new and very simple approach to computing uncertainty in deep active learning with a Convolutional Neural Network (CNN).
The main idea is to use the feature representation extracted by the CNN as data for training a Sum-Product Network (SPN).
arXiv Detail & Related papers (2022-06-20T14:28:19Z)
- SURF: Semi-supervised Reward Learning with Data Augmentation for Feedback-efficient Preference-based Reinforcement Learning [168.89470249446023]
We present SURF, a semi-supervised reward learning framework that utilizes a large amount of unlabeled samples with data augmentation.
In order to leverage unlabeled samples for reward learning, we infer pseudo-labels of the unlabeled samples based on the confidence of the preference predictor (a minimal sketch of this pseudo-labeling step follows this entry).
Our experiments demonstrate that our approach significantly improves the feedback-efficiency of the preference-based method on a variety of locomotion and robotic manipulation tasks.
arXiv Detail & Related papers (2022-03-18T16:50:38Z)
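A minimal sketch of the confidence-based pseudo-labeling step described in the SURF entry above, assuming a pairwise preference predictor that returns the probability that the first segment is preferred; the threshold value and all names are hypothetical, and this is not SURF's released code.

```python
# Hypothetical sketch of confidence-based pseudo-labeling for unlabeled
# preference pairs; names, shapes, and the threshold are assumptions.
import torch


def pseudo_label_pairs(preference_predictor, unlabeled_pairs, threshold=0.95):
    """Keep only unlabeled pairs the predictor is confident about.

    preference_predictor(seg_a, seg_b) is assumed to return the probability
    that segment a is preferred; unlabeled_pairs is a list of (seg_a, seg_b).
    """
    labeled = []
    with torch.no_grad():
        for seg_a, seg_b in unlabeled_pairs:
            p = preference_predictor(seg_a, seg_b)
            if p >= threshold:
                labeled.append((seg_a, seg_b, 1))   # pseudo-label: a preferred
            elif p <= 1.0 - threshold:
                labeled.append((seg_a, seg_b, 0))   # pseudo-label: b preferred
            # low-confidence pairs are discarded rather than pseudo-labeled
    return labeled
```

Low-confidence pairs are simply discarded, so only pairs the current predictor is sure about are added to the reward-learning set.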
- Invariance Learning in Deep Neural Networks with Differentiable Laplace Approximations [76.82124752950148]
We develop a convenient gradient-based method for selecting the data augmentation.
We use a differentiable Kronecker-factored Laplace approximation to the marginal likelihood as our objective.
arXiv Detail & Related papers (2022-02-22T02:51:11Z)
- Monocular Depth Estimation via Listwise Ranking using the Plackett-Luce Model [15.472533971305367]
In many real-world applications, the relative depth of objects in an image is crucial for scene understanding.
Recent approaches mainly tackle the problem of depth prediction in monocular images by treating the problem as a regression task.
Yet, ranking methods suggest themselves as a natural alternative to regression, and indeed, ranking approaches leveraging pairwise comparisons have shown promising performance on this problem.
arXiv Detail & Related papers (2020-10-25T13:40:10Z)