Offline Neural Contextual Bandits: Pessimism, Optimization and
Generalization
- URL: http://arxiv.org/abs/2111.13807v1
- Date: Sat, 27 Nov 2021 03:57:13 GMT
- Title: Offline Neural Contextual Bandits: Pessimism, Optimization and
Generalization
- Authors: Thanh Nguyen-Tang, Sunil Gupta, A. Tuan Nguyen, Svetha Venkatesh
- Abstract summary: We propose a provably efficient offline contextual bandit with neural network function approximation.
We show that our method generalizes over unseen contexts under a milder condition for distributional shift than the existing OPL works.
We also demonstrate the empirical effectiveness of our method in a range of synthetic and real-world OPL problems.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Offline policy learning (OPL) leverages existing data collected a priori for
policy optimization without any active exploration. Despite the prevalence and
recent interest in this problem, its theoretical and algorithmic foundations in
function approximation settings remain under-developed. In this paper, we
consider this problem on the axes of distributional shift, optimization, and
generalization in offline contextual bandits with neural networks. In
particular, we propose a provably efficient offline contextual bandit with
neural network function approximation that does not require any functional
assumption on the reward. We show that our method provably generalizes over
unseen contexts under a milder condition for distributional shift than the
existing OPL works. Notably, unlike any other OPL method, our method learns
from the offline data in an online manner using stochastic gradient descent,
allowing us to bring the benefits of online learning to an offline
setting. Moreover, we show that our method is more computationally efficient
and has a better dependence on the effective dimension of the neural network
than an online counterpart. Finally, we demonstrate the empirical effectiveness
of our method in a range of synthetic and real-world OPL problems.
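The abstract combines two ideas: pessimism about poorly covered actions, and a single streaming pass of stochastic gradient descent over the logged data. The following is a minimal sketch of that combination, not the paper's algorithm. It assumes a linear reward model in place of the neural network, and the names `features`, `fit_pessimistic`, `lcb_action`, `beta`, and `lam` are hypothetical choices made for illustration.

```python
import numpy as np

def features(context, action, n_actions):
    """Hypothetical feature map: copy the context into the block of the chosen action."""
    d = context.shape[0]
    phi = np.zeros(d * n_actions)
    phi[action * d:(action + 1) * d] = context
    return phi

def fit_pessimistic(logged, n_actions, lr=0.1, lam=1.0):
    """One streaming pass over logged (context, action, reward) tuples.

    Assumption: a linear model stands in for the neural network.
    Returns the weights and a regularized feature covariance matrix.
    """
    dim = logged[0][0].shape[0] * n_actions
    theta = np.zeros(dim)
    cov = lam * np.eye(dim)
    for context, action, reward in logged:
        phi = features(context, action, n_actions)
        cov += np.outer(phi, phi)                    # track data coverage
        theta -= lr * (theta @ phi - reward) * phi   # SGD step on squared loss
    return theta, cov

def lcb_action(theta, cov, context, n_actions, beta=1.0):
    """Pick the action with the highest lower confidence bound (pessimism)."""
    cov_inv = np.linalg.inv(cov)
    scores = []
    for a in range(n_actions):
        phi = features(context, a, n_actions)
        penalty = beta * np.sqrt(phi @ cov_inv @ phi)  # large where data is scarce
        scores.append(theta @ phi - penalty)
    return int(np.argmax(scores))

# Toy logged data: action 0 is observed 200 times, action 1 never.
logged = [(np.array([1.0]), 0, 0.5)] * 200
theta, cov = fit_pessimistic(logged, n_actions=2)
chosen = lcb_action(theta, cov, np.array([1.0]), n_actions=2)
```

In this toy run the unobserved action receives the full uncertainty penalty, so the lower confidence bound favors the well-covered action. The penalty weight `beta` controls how strongly the policy avoids actions the logged data does not cover.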
Related papers
- The Importance of Online Data: Understanding Preference Fine-tuning via Coverage [25.782644676250115]
We study the similarities and differences between online and offline techniques for preference fine-tuning.
We prove that a global coverage condition is both necessary and sufficient for offline contrastive methods to converge to the optimal policy.
We derive a hybrid preference optimization algorithm that uses offline data for contrastive-based preference optimization and online data for KL regularization.
(2024-06-03T15:51:04Z)
- Understanding the performance gap between online and offline alignment algorithms [63.137832242488926]
We show that offline algorithms train the policy to become good at pairwise classification, while online algorithms are good at generation.
This hints at a unique interplay between discriminative and generative capabilities, which is greatly impacted by the sampling process.
Our study sheds light on the pivotal role of on-policy sampling in AI alignment, and hints at certain fundamental challenges of offline alignment algorithms.
(2024-05-14T09:12:30Z)
- On Sample-Efficient Offline Reinforcement Learning: Data Diversity, Posterior Sampling, and Beyond [29.449446595110643]
We propose a notion of data diversity that subsumes the previous notions of coverage measures in offline RL.
Our proposed model-free PS-based algorithm for offline RL is novel, with sub-optimality bounds that are frequentist (i.e., worst-case) in nature.
(2024-01-06T20:52:04Z)
- Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint [56.74058752955209]
This paper studies the alignment process of generative models with Reinforcement Learning from Human Feedback (RLHF).
We first identify the primary challenge of existing popular methods like offline PPO and offline DPO as a lack of strategic exploration of the environment.
We propose efficient algorithms with finite-sample theoretical guarantees.
(2023-12-18T18:58:42Z)
- Online Network Source Optimization with Graph-Kernel MAB [62.6067511147939]
We propose Grab-UCB, a graph-kernel multi-armed bandit algorithm to learn online the optimal source placement in large-scale networks.
We describe the network processes with an adaptive graph dictionary model, which typically leads to sparse spectral representations.
We derive the performance guarantees that depend on network parameters, which further influence the learning curve of the sequential decision strategy.
(2023-07-07T15:03:42Z)
- Proximal Point Imitation Learning [48.50107891696562]
We develop new algorithms with rigorous efficiency guarantees for infinite horizon imitation learning.
We leverage classical tools from optimization, in particular, the proximal-point method (PPM) and dual smoothing.
We achieve convincing empirical performance for both linear and neural network function approximation.
(2022-09-22T12:40:21Z)
- Model-Free Learning of Optimal Deterministic Resource Allocations in Wireless Systems via Action-Space Exploration [4.721069729610892]
We propose a technically grounded and scalable deterministic-dual gradient policy method for efficiently learning optimal parameterized resource allocation policies.
Our method not only efficiently exploits gradient availability of popular universal representations such as deep networks, but is also truly model-free, as it relies on consistent zeroth-order gradient approximations of associated random network services constructed via low-dimensional perturbations in action space.
(2021-08-23T18:26:16Z)
- What are the Statistical Limits of Offline RL with Linear Function Approximation? [70.33301077240763]
Offline reinforcement learning seeks to utilize offline (observational) data to guide the learning of sequential decision-making strategies.
This work focuses on the basic question of what are necessary representational and distributional conditions that permit provable sample-efficient offline reinforcement learning.
(2020-10-22T17:32:13Z)