Implicitly Regularized RL with Implicit Q-Values
- URL: http://arxiv.org/abs/2108.07041v1
- Date: Mon, 16 Aug 2021 12:20:47 GMT
- Title: Implicitly Regularized RL with Implicit Q-Values
- Authors: Nino Vieillard, Marcin Andrychowicz, Anton Raichuk, Olivier Pietquin,
Matthieu Geist
- Abstract summary: The $Q$-function is a central quantity in many Reinforcement Learning (RL) algorithms, in which RL agents act according to a (soft)-greedy policy with respect to it.
We propose to parametrize the $Q$-function implicitly, as the sum of a log-policy and of a value function.
We derive a practical off-policy deep RL algorithm that is suitable for large action spaces and enforces the softmax relation between the policy and the $Q$-value.
- Score: 42.87920755961722
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The $Q$-function is a central quantity in many Reinforcement Learning (RL)
algorithms, in which RL agents act according to a (soft)-greedy policy w.r.t.
$Q$. It is a powerful tool that allows action selection without a model of
the environment and even without explicitly modeling the policy. Yet, this
scheme can only be used in discrete action tasks, with small numbers of
actions, as the softmax cannot be computed exactly otherwise. In particular, the
use of function approximation to deal with continuous action spaces in
modern actor-critic architectures intrinsically prevents the exact computation
of the softmax. We propose to alleviate this issue by parametrizing the
$Q$-function implicitly, as the sum of a log-policy and of a value function. We
use the resulting parametrization to derive a practical off-policy deep RL
algorithm, suitable for large action spaces, and that enforces the softmax
relation between the policy and the $Q$-value. We provide a theoretical
analysis of our algorithm: from an Approximate Dynamic Programming perspective,
we show its equivalence to a regularized version of value iteration, accounting
for both entropy and Kullback-Leibler regularization, and that enjoys
beneficial error propagation results. We then evaluate our algorithm on classic
control tasks, where its results compete with state-of-the-art methods.
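To make the abstract's central construction concrete, below is a minimal numpy sketch of an implicit $Q$-parametrization of the form $Q(s, a) = \tau \log \pi(a|s) + V(s)$. The temperature $\tau$ and the variable names (`logits`, `v`) are illustrative assumptions rather than details taken from the paper; the sketch only verifies that, with this parametrization, the softmax relation between the policy and $Q$ holds by construction, and that the value term coincides with the soft (log-sum-exp) value of $Q$.

```python
# Minimal sketch of an implicit Q parametrization: Q(s, a) = tau * log pi(a|s) + V(s).
# Names (tau, logits, v) are assumptions for illustration, not the paper's code.
import numpy as np

def softmax(x):
    z = x - x.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

tau = 0.1                                    # entropy temperature (assumed detail)
logits = np.array([1.0, -0.5, 2.0, 0.3])     # policy-network output for one state
v = 1.7                                      # value-network output for the same state

pi = softmax(logits)          # the explicit policy
q = tau * np.log(pi) + v      # implicit Q-values: scaled log-policy plus value

# The softmax relation between pi and Q holds by construction, whatever value v takes:
assert np.allclose(softmax(q / tau), pi)

# Because pi sums to one, V is pinned to the soft (log-sum-exp) value of Q:
assert np.isclose(v, tau * np.log(np.exp(q / tau).sum()))
```

In a deep RL implementation, the log-policy and the value would come from an actor and a critic network, so the softmax over actions never has to be computed explicitly; this is what makes the scheme usable with large or continuous action spaces, as described in the abstract.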
Related papers
- Confident Natural Policy Gradient for Local Planning in $q_\pi$-realizable Constrained MDPs [44.69257217086967]
The constrained Markov decision process (CMDP) framework emerges as an important reinforcement learning approach for imposing safety or other critical objectives.
In this paper, we address the learning problem given linear function approximation with $q_\pi$-realizability.
arXiv Detail & Related papers (2024-06-26T17:57:13Z)
- On the Model-Misspecification in Reinforcement Learning [9.864462523050843]
We present a unified theoretical framework for addressing model misspecification in reinforcement learning.
We show that value-based and model-based methods can achieve robustness under local misspecification error bounds.
We also propose an algorithmic framework that can achieve the same order of regret bound without prior knowledge of $\zeta$.
arXiv Detail & Related papers (2023-06-19T04:31:59Z)
- Provably Efficient Reinforcement Learning via Surprise Bound [66.15308700413814]
We propose a provably efficient reinforcement learning algorithm (both computationally and statistically) with general value function approximations.
Our algorithm achieves reasonable regret bounds when applied to both the linear setting and the sparse high-dimensional linear setting.
arXiv Detail & Related papers (2023-02-22T20:21:25Z)
- Smoothing Policy Iteration for Zero-sum Markov Games [9.158672246275348]
We propose the smoothing policy iteration (SPI) algorithm to solve zero-sum MGs approximately.
Specifically, the adversarial policy serves as the weight function, enabling efficient sampling over action spaces.
We also propose a model-based algorithm called Smooth adversarial Actor-critic (SaAC) by extending SPI with function approximation.
arXiv Detail & Related papers (2022-12-03T14:39:06Z)
- Confident Approximate Policy Iteration for Efficient Local Planning in $q^\pi$-realizable MDPs [2.5652904661855076]
We consider approximate dynamic programming in $\gamma$-discounted Markov decision processes.
Our first contribution is a new variant of Approximate Policy Iteration (API), called Confident Approximate Policy Iteration (CAPI).
CAPI computes a deterministic stationary policy with an optimal error bound scaling linearly with the product of the effective horizon $H$ and the worst-case approximation error $\epsilon$ of the action-value functions of stationary policies.
arXiv Detail & Related papers (2022-10-27T20:19:31Z)
- Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z)
- Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation [107.54516740713969]
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences.
Instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer.
We propose the first optimistic model-based algorithm for preference-based RL (PbRL) with general function approximation.
arXiv Detail & Related papers (2022-05-23T09:03:24Z)
- Queueing Network Controls via Deep Reinforcement Learning [0.0]
We develop a Proximal Policy Optimization (PPO) algorithm for queueing networks.
The algorithm consistently generates control policies that outperform the state of the art in the literature.
A key to the successes of our PPO algorithm is the use of three variance reduction techniques in estimating the relative value function.
arXiv Detail & Related papers (2020-07-31T01:02:57Z)
- Reinforcement Learning with General Value Function Approximation: Provably Efficient Approach via Bounded Eluder Dimension [124.7752517531109]
We establish a provably efficient reinforcement learning algorithm with general value function approximation.
We show that our algorithm achieves a regret bound of $\widetilde{O}(\mathrm{poly}(dH)\sqrt{T})$ where $d$ is a complexity measure.
Our theory generalizes recent progress on RL with linear value function approximation and does not make explicit assumptions on the model of the environment.
arXiv Detail & Related papers (2020-05-21T17:36:09Z)
- Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDPs with safe exploration in the function approximation setting.
arXiv Detail & Related papers (2020-03-01T17:47:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.