The In-Sample Softmax for Offline Reinforcement Learning
- URL: http://arxiv.org/abs/2302.14372v2
- Date: Wed, 19 Apr 2023 04:13:38 GMT
- Title: The In-Sample Softmax for Offline Reinforcement Learning
- Authors: Chenjun Xiao, Han Wang, Yangchen Pan, Adam White, Martha White
- Abstract summary: Reinforcement learning (RL) agents can leverage batches of previously collected data to extract a reasonable control policy.
The standard max operator may select a maximal action that has not been seen in the dataset.
Bootstrapping from these inaccurate values can lead to overestimation and even divergence.
- Score: 37.37457955279337
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning (RL) agents can leverage batches of previously
collected data to extract a reasonable control policy. An emerging issue in
this offline RL setting, however, is that the bootstrapping update underlying
many of our methods suffers from insufficient action-coverage: the standard max
operator may select a maximal action that has not been seen in the dataset.
Bootstrapping from these inaccurate values can lead to overestimation and even
divergence. There are a growing number of methods that attempt to approximate
an \emph{in-sample} max, that only uses actions well-covered by the dataset. We
highlight a simple fact: it is more straightforward to approximate an in-sample
\emph{softmax} using only actions in the dataset. We show that policy iteration
based on the in-sample softmax converges, and that for decreasing temperatures
it approaches the in-sample max. We derive an In-Sample Actor-Critic (AC),
using this in-sample softmax, and show that it is consistently better or
comparable to existing offline RL methods, and is also well-suited to
fine-tuning.
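
To make the idea concrete, below is a minimal sketch of an in-sample softmax (log-sum-exp) backup for a single state with discrete actions: it aggregates Q-values only over actions observed in the dataset, and as the temperature decreases it approaches the in-sample max. The function and variable names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def in_sample_soft_value(q_values, in_dataset_mask, temperature=0.1):
    """Sketch of a soft state value tau * log sum_a exp(Q(s, a) / tau), where
    the sum runs only over actions that appear in the dataset for this state.
    As temperature -> 0 this tends to the in-sample max of Q."""
    q = np.asarray(q_values, dtype=float)[np.asarray(in_dataset_mask, dtype=bool)]
    m = q.max()  # subtract the max for numerical stability of the exponentials
    return m + temperature * np.log(np.exp((q - m) / temperature).sum())

# Toy state with four actions; the fourth never occurs in the dataset and
# carries an inflated (out-of-sample) Q-estimate.
q = [1.0, 2.0, 1.5, 10.0]
mask = [True, True, True, False]
print(in_sample_soft_value(q, mask, temperature=0.05))  # ~2.0, ignores the unseen action
print(max(q))                                           # 10.0, what a naive max backup would use
```

By contrast, a standard max backup over all actions would bootstrap from the unseen action's inflated estimate (10.0 in this toy example), which is exactly the overestimation failure mode the abstract describes.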
Related papers
- Amortizing intractable inference in large language models [56.92471123778389]
We use amortized Bayesian inference to sample from intractable posterior distributions.
We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training.
As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem.
arXiv Detail & Related papers (2023-10-06T16:36:08Z)
- On Instance-Dependent Bounds for Offline Reinforcement Learning with Linear Function Approximation [80.86358123230757]
We present an algorithm called Bootstrapped and Constrained Pessimistic Value Iteration (BCP-VI)
Under a partial data coverage assumption, BCP-VI yields a fast rate of $\tilde{\mathcal{O}}(\frac{1}{K})$ for offline RL when there is a positive gap in the optimal Q-value functions.
These are the first $\tilde{\mathcal{O}}(\frac{1}{K})$ bound and absolute zero sub-optimality bound, respectively, for offline RL with linear function approximation from adaptive data.
arXiv Detail & Related papers (2022-11-23T18:50:44Z)
- To Softmax, or not to Softmax: that is the question when applying Active Learning for Transformer Models [24.43410365335306]
A well-known technique for reducing the amount of human effort in acquiring a labeled dataset is \textit{Active Learning} (AL).
This paper compares eight alternatives on seven datasets.
Most of the methods are too good at identifying the truly most uncertain samples (outliers), and labeling these exclusively results in worse performance.
arXiv Detail & Related papers (2022-10-06T15:51:39Z)
- Enhancing Classifier Conservativeness and Robustness by Polynomiality [23.099278014212146]
We show how polynomiality can remedy the situation.
A directly related, simple, yet important technical novelty we subsequently present is softRmax.
We show that two aspects of softRmax, conservativeness and inherent robustness, lead to adversarial regularization.
arXiv Detail & Related papers (2022-03-23T19:36:19Z)
- Breaking the Softmax Bottleneck for Sequential Recommender Systems with Dropout and Decoupling [0.0]
We show that there are more aspects to the Softmax bottleneck in SBRSs.
We propose a simple yet effective method, Dropout and Decoupling (D&D), to alleviate these problems.
Our method significantly improves the accuracy of a variety of Softmax-based SBRS algorithms.
arXiv Detail & Related papers (2021-10-11T16:52:23Z)
- Not Far Away, Not So Close: Sample Efficient Nearest Neighbour Data Augmentation via MiniMax [7.680863481076596]
MiniMax-kNN is a sample efficient data augmentation strategy.
We exploit a semi-supervised approach based on knowledge distillation to train a model on augmented data.
arXiv Detail & Related papers (2021-05-28T06:32:32Z)
- Continuous Doubly Constrained Batch Reinforcement Learning [93.23842221189658]
We propose an algorithm for batch RL, where effective policies are learned using only a fixed offline dataset instead of online interactions with the environment.
The limited data in batch RL produces inherent uncertainty in value estimates of states/actions that were insufficiently represented in the training data.
We propose to mitigate this issue via two straightforward penalties: a policy-constraint that limits divergence from the data-collecting policy, and a value-constraint that discourages overly optimistic estimates.
arXiv Detail & Related papers (2021-02-18T08:54:14Z)
- Nearly Dimension-Independent Sparse Linear Bandit over Small Action Spaces via Best Subset Selection [71.9765117768556]
We consider the contextual bandit problem under the high dimensional linear model.
This setting finds essential applications such as personalized recommendation, online advertisement, and personalized medicine.
We propose doubly growing epochs and estimating the parameter using the best subset selection method.
arXiv Detail & Related papers (2020-09-04T04:10:39Z)
- Least Squares Regression with Markovian Data: Fundamental Limits and Algorithms [69.45237691598774]
We study the problem of least squares linear regression where the data-points are dependent and are sampled from a Markov chain.
We establish sharp information-theoretic minimax lower bounds for this problem in terms of $\tau_{\mathsf{mix}}$.
We propose an algorithm based on experience replay--a popular reinforcement learning technique--that achieves a significantly better error rate.
arXiv Detail & Related papers (2020-06-16T04:26:50Z)
- Active Sampling for Min-Max Fairness [28.420886416425077]
We propose simple active sampling and reweighting strategies for optimizing min-max fairness.
The ease of implementation and the generality of our robust formulation make it an attractive option for improving model performance on disadvantaged groups.
For convex learning problems, such as linear or logistic regression, we provide a fine-grained analysis, proving the rate of convergence to a min-max fair solution.
arXiv Detail & Related papers (2020-06-11T23:57:55Z)