Related papers: Offline Neural Contextual Bandits: Pessimism, Optimization and Generalization

URL: http://arxiv.org/abs/2111.13807v1
Date: Sat, 27 Nov 2021 03:57:13 GMT
Title: Offline Neural Contextual Bandits: Pessimism, Optimization and Generalization
Authors: Thanh Nguyen-Tang, Sunil Gupta, A. Tuan Nguyen, Svetha Venkatesh
Abstract summary: We propose a provably efficient offline contextual bandit with neural network function approximation. We show that our method generalizes over unseen contexts under a milder condition for distributional shift than the existing OPL works. We also demonstrate the empirical effectiveness of our method in a range of synthetic and real-world OPL problems.
Score: 42.865641215856925
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Offline policy learning (OPL) leverages existing data collected a priori for policy optimization without any active exploration. Despite the prevalence and recent interest in this problem, its theoretical and algorithmic foundations in function approximation settings remain under-developed. In this paper, we consider this problem on the axes of distributional shift, optimization, and generalization in offline contextual bandits with neural networks. In particular, we propose a provably efficient offline contextual bandit with neural network function approximation that does not require any functional assumption on the reward. We show that our method provably generalizes over unseen contexts under a milder condition for distributional shift than the existing OPL works. Notably, unlike any other OPL method, our method learns from the offline data in an online manner using stochastic gradient descent, allowing us to leverage the benefits of online learning into an offline setting. Moreover, we show that our method is more computationally efficient and has a better dependence on the effective dimension of the neural network than an online counterpart. Finally, we demonstrate the empirical effectiveness of our method in a range of synthetic and real-world OPL problems.
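The two ideas the abstract combines, pessimism and a single online-style pass of stochastic gradient descent over logged data, can be illustrated with a minimal sketch. This is not the paper's algorithm: it swaps the neural network for a per-action linear model, and the class name `PessimisticLinearBandit` and the hyperparameters `lam`, `beta`, and `lr` are hypothetical choices made only for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(m, v):
    return [dot(row, v) for row in m]


class PessimisticLinearBandit:
    """Hypothetical linear stand-in for a pessimistic offline bandit learner."""

    def __init__(self, n_actions, dim, lam=1.0, beta=1.0, lr=0.5):
        self.beta = beta  # assumed weight on the uncertainty penalty
        self.lr = lr      # assumed SGD step size
        self.theta = [[0.0] * dim for _ in range(n_actions)]
        # Inverse of the regularised covariance (lam * I) for each action.
        self.a_inv = [
            [[(1.0 / lam if i == j else 0.0) for j in range(dim)]
             for i in range(dim)]
            for _ in range(n_actions)
        ]

    def update(self, x, a, r):
        """Consume one logged (context, action, reward) triple."""
        # One SGD step on the squared error of the logged action's model.
        err = dot(self.theta[a], x) - r
        self.theta[a] = [t - self.lr * err * xi
                         for t, xi in zip(self.theta[a], x)]
        # Sherman-Morrison rank-one update of the inverse covariance.
        ax = matvec(self.a_inv[a], x)
        denom = 1.0 + dot(x, ax)
        d = len(x)
        self.a_inv[a] = [
            [self.a_inv[a][i][j] - ax[i] * ax[j] / denom for j in range(d)]
            for i in range(d)
        ]

    def predict(self, x, a):
        return dot(self.theta[a], x)

    def lcb(self, x, a):
        """Lower confidence bound: estimate minus an uncertainty width."""
        width = math.sqrt(dot(x, matvec(self.a_inv[a], x)))
        return self.predict(x, a) - self.beta * width

    def act(self, x):
        """Pick the action with the highest lower confidence bound."""
        return max(range(len(self.theta)), key=lambda a: self.lcb(x, a))


# Toy logged dataset: action 0 is well covered, action 1 appears only twice.
logged = [([1.0], 0, 0.5)] * 50 + [([1.0], 1, 1.0)] * 2
policy = PessimisticLinearBandit(n_actions=2, dim=1)
for x, a, r in logged:  # a single streaming pass over the offline data
    policy.update(x, a, r)

greedy_action = max(range(2), key=lambda a: policy.predict([1.0], a))
pessimistic_action = policy.act([1.0])
```

In this toy log, the greedy choice follows the higher point estimate of the rarely logged action, while the pessimistic choice falls back to the well-covered action because its uncertainty penalty is much smaller.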

Related papers

Active Human Feedback Collection via Neural Contextual Dueling Bandits [84.7608942821423]
We propose Neural-ADB, an algorithm for collecting human preference feedback when the underlying latent reward function is non-linear. We show that when preference feedback follows the Bradley-Terry-Luce model, the worst sub-optimality gap of the policy learned by Neural-ADB decreases at a sub-linear rate as the preference dataset increases.
arXiv Detail & Related papers (2025-04-16T12:16:10Z)
Evaluation-Time Policy Switching for Offline Reinforcement Learning [5.052293146674794]
Offline reinforcement learning (RL) looks at learning how to optimally solve tasks using a fixed dataset of interactions from the environment. Many off-policy algorithms developed for online learning struggle in the offline setting as they tend to over-estimate the behaviour of out-of-distribution actions. Existing offline RL algorithms adapt off-policy algorithms, employing techniques such as constraining the policy or modifying the value function to achieve good performance on individual datasets. We introduce a policy switching technique that dynamically combines the behaviour of a pure off-policy RL agent, for improving behaviour, and a behavioural cloning (BC) agent, for staying close to the
arXiv Detail & Related papers (2025-03-15T18:12:16Z)
The Importance of Online Data: Understanding Preference Fine-tuning via Coverage [25.782644676250115]
We study the similarities and differences between online and offline techniques for preference fine-tuning. We prove that a global coverage condition is both necessary and sufficient for offline contrastive methods to converge to the optimal policy. We derive a hybrid preference optimization algorithm that uses offline data for contrastive-based preference optimization and online data for KL regularization.
arXiv Detail & Related papers (2024-06-03T15:51:04Z)
Understanding the performance gap between online and offline alignment algorithms [63.137832242488926]
We show that offline algorithms train the policy to become good at pairwise classification, while online algorithms are good at generation. This hints at a unique interplay between discriminative and generative capabilities, which is greatly impacted by the sampling process. Our study sheds light on the pivotal role of on-policy sampling in AI alignment, and hints at certain fundamental challenges of offline alignment algorithms.
arXiv Detail & Related papers (2024-05-14T09:12:30Z)
On Sample-Efficient Offline Reinforcement Learning: Data Diversity, Posterior Sampling, and Beyond [29.449446595110643]
We propose a notion of data diversity that subsumes the previous notions of coverage measures in offline RL. Our proposed model-free PS-based algorithm for offline RL is novel, with sub-optimality bounds that are frequentist (i.e., worst-case) in nature.
arXiv Detail & Related papers (2024-01-06T20:52:04Z)
Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint [56.74058752955209]
This paper studies the alignment process of generative models with Reinforcement Learning from Human Feedback (RLHF). We first identify the primary challenges of existing popular methods like offline PPO and offline DPO as lacking in strategic exploration of the environment. We propose efficient algorithms with finite-sample theoretical guarantees.
arXiv Detail & Related papers (2023-12-18T18:58:42Z)
Online Network Source Optimization with Graph-Kernel MAB [62.6067511147939]
We propose Grab-UCB, a graph-kernel multi-arms bandit algorithm to learn online the optimal source placement in large scale networks. We describe the network processes with an adaptive graph dictionary model, which typically leads to sparse spectral representations. We derive the performance guarantees that depend on network parameters, which further influence the learning curve of the sequential decision strategy.
arXiv Detail & Related papers (2023-07-07T15:03:42Z)
Proximal Point Imitation Learning [48.50107891696562]
We develop new algorithms with rigorous efficiency guarantees for infinite horizon imitation learning. We leverage classical tools from optimization, in particular, the proximal-point method (PPM) and dual smoothing. We achieve convincing empirical performance for both linear and neural network function approximation.
arXiv Detail & Related papers (2022-09-22T12:40:21Z)
Model-Free Learning of Optimal Deterministic Resource Allocations in Wireless Systems via Action-Space Exploration [4.721069729610892]
We propose a technically grounded and scalable deterministic-dual gradient policy method for efficiently learning optimal parameterized resource allocation policies. Our method not only efficiently exploits gradient availability of popular universal representations such as deep networks, but is also truly model-free, as it relies on consistent zeroth-order gradient approximations of associated random network services constructed via low-dimensional perturbations in action space.
arXiv Detail & Related papers (2021-08-23T18:26:16Z)
What are the Statistical Limits of Offline RL with Linear Function Approximation? [70.33301077240763]
Offline reinforcement learning seeks to utilize offline (observational) data to guide the learning of sequential decision making strategies. This work focuses on the basic question of what are necessary representational and distributional conditions that permit provable sample-efficient offline reinforcement learning.
arXiv Detail & Related papers (2020-10-22T17:32:13Z)

This list is automatically generated from the titles and abstracts of the papers on this site.