q-exponential family for policy optimization
- URL: http://arxiv.org/abs/2408.07245v3
- Date: Fri, 24 Jan 2025 12:17:00 GMT
- Title: q-exponential family for policy optimization
- Authors: Lingwei Zhu, Haseeb Shah, Han Wang, Yukie Nagai, Martha White
- Abstract summary: In this paper, we consider a broader policy family that remains tractable: the $q$-exponential family.
This family of policies is flexible, allowing the specification of both heavy-tailed policies ($q>1$) and light-tailed policies ($q<1$).
- Score: 20.24534119264188
- Abstract: Policy optimization methods benefit from a simple and tractable policy parametrization, usually the Gaussian for continuous action spaces. In this paper, we consider a broader policy family that remains tractable: the $q$-exponential family. This family of policies is flexible, allowing the specification of both heavy-tailed policies ($q>1$) and light-tailed policies ($q<1$). This paper examines the interplay between $q$-exponential policies for several actor-critic algorithms conducted on both online and offline problems. We find that heavy-tailed policies are more effective in general and can consistently improve on Gaussian. In particular, we find the Student's t-distribution to be more stable than the Gaussian across settings and that a heavy-tailed $q$-Gaussian for Tsallis Advantage Weighted Actor-Critic consistently performs well in offline benchmark problems. Our code is available at \url{https://github.com/lingweizhu/qexp}.
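For readers unfamiliar with the family, the $q$-exponential is $\exp_q(x) = [1 + (1-q)x]_+^{1/(1-q)}$ (recovering $\exp(x)$ as $q \to 1$), and a $q$-Gaussian policy has density proportional to $\exp_q(-\beta (a-\mu)^2)$. The sketch below is a minimal illustration of the tail behaviour under that standard parameterization, not the authors' implementation; the function names and the use of NumPy are assumptions.

```python
import numpy as np

def exp_q(x, q):
    """q-exponential: [1 + (1 - q) x]_+ ** (1 / (1 - q)); reduces to exp(x) as q -> 1."""
    if np.isclose(q, 1.0):
        return np.exp(x)
    return np.maximum(1.0 + (1.0 - q) * x, 0.0) ** (1.0 / (1.0 - q))

def q_gaussian_density(a, mu=0.0, beta=1.0, q=1.5):
    """Unnormalized q-Gaussian density exp_q(-beta * (a - mu)^2).

    q > 1 gives heavy tails (q = 2 yields a Cauchy-like shape);
    q < 1 gives light tails with bounded support: the density is exactly
    zero outside an interval, which is what makes such policies 'sparse'.
    """
    return exp_q(-beta * (a - mu) ** 2, q)

actions = np.linspace(-5, 5, 11)
print(q_gaussian_density(actions, q=1.5))  # heavy-tailed: strictly positive everywhere
print(q_gaussian_density(actions, q=0.5))  # light-tailed: zero outside a bounded interval
```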
Related papers
- Fat-to-Thin Policy Optimization: Offline RL with Sparse Policies [5.5938591697033555]
Sparse continuous policies are distributions that choose some actions at random yet keep strictly zero probability for the other actions.
In this paper, we propose the first offline policy optimization algorithm that tackles this challenge: Fat-to-Thin Policy Optimization (FtTPO).
We instantiate FtTPO with the general $q$-Gaussian family that encompasses both heavy-tailed and sparse policies.
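A brief worked detail on why $q<1$ yields sparsity, consistent with the abstract above but not taken from the paper itself ($Z_q$ denotes the normalizer):

```latex
\[
  \pi_q(a) \;=\; \frac{1}{Z_q}\,\exp_q\!\bigl(-\beta\,(a-\mu)^2\bigr)
  \;=\; \frac{1}{Z_q}\,\bigl[\,1 - (1-q)\,\beta\,(a-\mu)^2\,\bigr]_+^{\frac{1}{1-q}},
  \qquad
  q<1 \;\Rightarrow\; \pi_q(a)=0 \ \text{ for } (a-\mu)^2 \ge \tfrac{1}{(1-q)\,\beta}.
\]
```

So light-tailed members assign exactly zero probability outside a bounded set (sparse), while heavy-tailed members ($q>1$) have full support.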
arXiv Detail & Related papers (2025-01-24T10:11:48Z) - Information Theoretic Guarantees For Policy Alignment In Large Language Models [19.315342870604113]
We show that the $\sqrt{\mathsf{KL}}$ information-theoretic upper bound holds if the reward under the reference policy has sub-Gaussian tails.
We also prove, for the best-of-$n$ policy, that the $\mathsf{KL}$ upper bound can be obtained for any $f$-divergence.
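As a point of reference, a standard transportation-type inequality consistent with the claimed $\sqrt{\mathsf{KL}}$ rate (not the paper's exact statement): if the reward $r$ is $\sigma$-sub-Gaussian under the reference policy $\pi_{\mathrm{ref}}$, then for any aligned policy $\pi$,

```latex
\[
  \mathbb{E}_{\pi}[r] \;-\; \mathbb{E}_{\pi_{\mathrm{ref}}}[r]
  \;\le\; \sigma \sqrt{2\,\mathrm{KL}\!\left(\pi \,\|\, \pi_{\mathrm{ref}}\right)}.
\]
```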
arXiv Detail & Related papers (2024-06-09T18:41:50Z) - Oracle-Efficient Reinforcement Learning for Max Value Ensembles [7.404901768256101]
Reinforcement learning (RL) in large or infinite state spaces is notoriously challenging, theoretically and experimentally.
In this work we aim to compete with the $\textit{max-following policy}$, which at each state follows the action of whichever constituent policy has the highest value.
Our main result is an efficient algorithm that learns to compete with the max-following policy, given only access to the constituent policies.
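A minimal sketch of the max-following idea described above, for illustration only; the policy and value-function interfaces are assumptions, and the paper's algorithm learns this behaviour given only the constituent policies rather than assuming known value functions.

```python
from typing import Any, Callable, Sequence

def max_following_action(
    state: Any,
    policies: Sequence[Callable[[Any], Any]],   # constituent policies: state -> action
    values: Sequence[Callable[[Any], float]],   # their value functions: state -> V_k(state)
) -> Any:
    """At each state, act with whichever constituent policy claims the highest value there."""
    best_k = max(range(len(policies)), key=lambda k: values[k](state))
    return policies[best_k](state)

# Toy usage with two constant policies on a 1-D state
left = lambda s: -1.0
right = lambda s: +1.0
v_left = lambda s: -abs(s)   # pretend "left" is bad away from the origin
v_right = lambda s: 1.0      # pretend "right" is always decent
print(max_following_action(0.5, [left, right], [v_left, v_right]))  # -> 1.0
```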
arXiv Detail & Related papers (2024-05-27T01:08:23Z) - Offline Imitation Learning with Suboptimal Demonstrations via Relaxed
Distribution Matching [109.5084863685397]
Offline imitation learning (IL) promises the ability to learn performant policies from pre-collected demonstrations without interactions with the environment.
We present RelaxDICE, which employs an asymmetrically-relaxed f-divergence for explicit support regularization.
Our method significantly outperforms the best prior offline method in six standard continuous control environments.
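For reference, the unrelaxed distribution-matching objective that DICE-style methods regularize with is the generic $f$-divergence between the policy's and the dataset's state-action occupancies (shown below); RelaxDICE's asymmetric relaxation of this objective is specific to the paper and not reproduced here.

```latex
\[
  D_f\!\left(d^{\pi} \,\middle\|\, d^{\mathcal{D}}\right)
  \;=\; \mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}
        \!\left[ f\!\left(\frac{d^{\pi}(s,a)}{d^{\mathcal{D}}(s,a)}\right) \right],
  \qquad f \text{ convex},\ f(1)=0.
\]
```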
arXiv Detail & Related papers (2023-03-05T03:35:11Z) - Estimating Optimal Policy Value in General Linear Contextual Bandits [50.008542459050155]
In many bandit problems, the maximal reward achievable by a policy is often unknown in advance.
We consider the problem of estimating the optimal policy value in the sublinear data regime before the optimal policy is even learnable.
We present a more practical, computationally efficient algorithm that estimates a problem-dependent upper bound on $V^*$.
arXiv Detail & Related papers (2023-02-19T01:09:24Z) - Offline Reinforcement Learning with Closed-Form Policy Improvement
Operators [88.54210578912554]
Behavior constrained policy optimization has been demonstrated to be a successful paradigm for tackling Offline Reinforcement Learning.
In this paper, we propose our closed-form policy improvement operators.
We empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark.
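For orientation only: the classical KL-regularized improvement problem admits the closed-form weighted solution below. This is standard background (the form used in advantage-weighted methods), not the specific operators proposed in the paper, which are derived under their own behavior-constrained formulation.

```latex
\[
  \max_{\pi}\; \mathbb{E}_{a\sim\pi(\cdot\mid s)}\!\left[Q(s,a)\right]
  \;-\; \lambda\,\mathrm{KL}\!\left(\pi(\cdot\mid s)\,\|\,\pi_\beta(\cdot\mid s)\right)
  \quad\Longrightarrow\quad
  \pi^{*}(a\mid s) \;\propto\; \pi_\beta(a\mid s)\,
  \exp\!\left(\frac{Q(s,a)}{\lambda}\right).
\]
```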
arXiv Detail & Related papers (2022-11-29T06:29:26Z) - Mutual Information Regularized Offline Reinforcement Learning [76.05299071490913]
We propose a novel MISA framework to approach offline RL from the perspective of Mutual Information between States and Actions in the dataset.
We show that optimizing this lower bound is equivalent to maximizing the likelihood of a one-step improved policy on the offline dataset.
We introduce three different variants of MISA, and empirically demonstrate that a tighter mutual information lower bound gives better offline RL performance.
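For context, one standard variational (Donsker–Varadhan) lower bound on the state-action mutual information is shown below; this is a generic bound for illustration, not necessarily the specific bound MISA optimizes.

```latex
\[
  I(S;A) \;=\; \mathrm{KL}\!\left(p(s,a)\,\|\,p(s)\,p(a)\right)
  \;\ge\; \mathbb{E}_{p(s,a)}\!\left[T(s,a)\right]
        \;-\; \log \mathbb{E}_{p(s)\,p(a)}\!\left[e^{T(s,a)}\right]
  \qquad \text{for any critic } T.
\]
```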
arXiv Detail & Related papers (2022-10-14T03:22:43Z) - On Gap-dependent Bounds for Offline Reinforcement Learning [40.92345387517103]
This paper presents a systematic study on gap-dependent sample complexity in offline reinforcement learning.
Under the optimal policy coverage assumption, the rate can be improved to $O\left(\frac{1}{\epsilon}\right)$ when there is a positive sub-optimality gap in the optimal $Q$-function.
We show that when the visitation probabilities of the behavior policy are uniformly lower bounded for states where an optimal policy's visitation probabilities are positive, the sample complexity of identifying an optimal policy is independent of $\frac{1}{\epsilon}$.
arXiv Detail & Related papers (2022-06-01T01:44:12Z) - Efficient Policy Iteration for Robust Markov Decision Processes via
Regularization [49.05403412954533]
Robust Markov decision processes (MDPs) provide a framework to model decision problems where the system dynamics are changing or only partially known.
Recent work established the equivalence between $\texttt{s}$-rectangular $L_p$ robust MDPs and regularized MDPs, and derived a regularized policy iteration scheme that enjoys the same level of efficiency as standard MDPs.
In this work, we focus on the policy improvement step and derive concrete forms for the greedy policy and the optimal robust Bellman operators.
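As background for the policy-improvement discussion, the optimal robust Bellman operator for an $\texttt{s}$-rectangular uncertainty set $\mathcal{P}_s$ takes the generic max-min form below; this is the textbook definition rather than the concrete regularized forms derived in the paper.

```latex
\[
  (\mathcal{T} V)(s) \;=\;
  \max_{\pi_s \in \Delta_{\mathcal{A}}}\;
  \min_{P_s \in \mathcal{P}_s}\;
  \sum_{a} \pi_s(a)
  \left( r(s,a) \;+\; \gamma \sum_{s'} P_s(s' \mid a)\, V(s') \right).
\]
```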
arXiv Detail & Related papers (2022-05-28T04:05:20Z) - Restless Bandits with Many Arms: Beating the Central Limit Theorem [25.639496138046546]
Finite-horizon restless bandits with multiple pulls per period play an important role in recommender systems, active learning, revenue management, and many other areas.
While an optimal policy can be computed, in principle, using dynamic programming, the computation required scales exponentially in the number of arms $N$.
We characterize a non-degeneracy condition and a class of novel practically-computable policies, called fluid-priority policies, in which the optimality gap is $O(1)$.
arXiv Detail & Related papers (2021-07-25T23:27:12Z) - Policy Finetuning: Bridging Sample-Efficient Offline and Online
Reinforcement Learning [59.02541753781001]
This paper initiates the theoretical study of policy finetuning, that is, online RL where the learner has additional access to a "reference policy".
We first design a sharp offline reduction algorithm that finds an $\varepsilon$ near-optimal policy within $\widetilde{O}(H^3 S C^\star/\varepsilon^2)$ episodes.
We then establish an $\Omega(H^3 S \min\{C^\star, A\}/\varepsilon^2)$ sample complexity lower bound for any policy finetuning algorithm, including those that can adaptively explore the ...
arXiv Detail & Related papers (2021-06-09T08:28:55Z)