Decision-Aware Actor-Critic with Function Approximation and Theoretical
Guarantees
- URL: http://arxiv.org/abs/2305.15249v2
- Date: Tue, 31 Oct 2023 01:33:36 GMT
- Title: Decision-Aware Actor-Critic with Function Approximation and Theoretical
Guarantees
- Authors: Sharan Vaswani, Amirreza Kazemi, Reza Babanezhad, Nicolas Le Roux
- Abstract summary: Actor-critic (AC) methods are widely used in reinforcement learning (RL).
We design a joint objective for training the actor and critic in a decision-aware fashion.
We empirically demonstrate the benefit of our decision-aware actor-critic framework on simple RL problems.
- Score: 12.259191000019033
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Actor-critic (AC) methods are widely used in reinforcement learning (RL) and
benefit from the flexibility of using any policy gradient method as the actor
and value-based method as the critic. The critic is usually trained by
minimizing the TD error, an objective that is potentially decorrelated with the
true goal of achieving a high reward with the actor. We address this mismatch
by designing a joint objective for training the actor and critic in a
decision-aware fashion. We use the proposed objective to design a generic, AC
algorithm that can easily handle any function approximation. We explicitly
characterize the conditions under which the resulting algorithm guarantees
monotonic policy improvement, regardless of the choice of the policy and critic
parameterization. Instantiating the generic algorithm results in an actor that
involves maximizing a sequence of surrogate functions (similar to TRPO, PPO)
and a critic that involves minimizing a closely connected objective. Using
simple bandit examples, we provably establish the benefit of the proposed
critic objective over the standard squared error. Finally, we empirically
demonstrate the benefit of our decision-aware actor-critic framework on simple
RL problems.
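As a concrete illustration of the generic template, the following is a minimal Python sketch of a decision-aware actor-critic loop on a multi-armed bandit (the simple setting the abstract itself appeals to). The critic loss here weights squared errors by the current policy so that accuracy matters most where the actor places probability; this is only an illustrative stand-in for the paper's actual decision-aware objective, the one-sample policy-gradient step stands in for the surrogate maximization, and all names and step sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5                                  # number of arms (illustrative)
true_means = rng.uniform(0.0, 1.0, K)  # unknown expected rewards

theta = np.zeros(K)                    # actor: softmax policy parameters
q_hat = np.zeros(K)                    # critic: estimated action values
eta_actor, eta_critic = 0.2, 0.5       # step sizes (illustrative)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for t in range(5000):
    pi = softmax(theta)
    a = rng.choice(K, p=pi)
    r = true_means[a] + 0.1 * rng.standard_normal()

    # Critic step: policy-weighted squared error, so the critic is most
    # accurate where the actor currently puts probability mass.
    # (Placeholder for the paper's decision-aware critic objective.)
    q_hat[a] -= eta_critic * pi[a] * 2.0 * (q_hat[a] - r)

    # Actor step: one-sample softmax policy gradient built from the critic's
    # estimate (stands in for the surrogate maximization described above).
    grad_log_pi = (np.arange(K) == a) - pi
    theta += eta_actor * q_hat[a] * grad_log_pi

print("best arm:", int(true_means.argmax()), "policy:", np.round(softmax(theta), 2))
```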
Related papers
- Solving Continuous Control via Q-learning [54.05120662838286]
We show that a simple modification of deep Q-learning largely alleviates issues with actor-critic methods.
By combining bang-bang action discretization with value decomposition, which frames single-agent control as cooperative multi-agent reinforcement learning (MARL), this simple critic-only approach matches the performance of state-of-the-art continuous actor-critic methods.
arXiv Detail & Related papers (2022-10-22T22:55:50Z)
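A minimal sketch of the idea summarized above, under illustrative assumptions: each continuous action dimension is restricted to its two extremes (bang-bang discretization), and a separate critic head scores each dimension, with the joint value taken as the sum of the heads (value decomposition). The stateless tabular critics, toy reward, and crude credit split are simplifications, not the cited paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
action_dim = 3
extremes = np.array([-1.0, 1.0])   # bang-bang: only the two extreme actuations

# One tiny critic head per action dimension (value decomposition):
# q[d, i] estimates the contribution of choosing extremes[i] in dimension d.
q = np.zeros((action_dim, 2))
alpha = 0.1

def select_action(q, eps=0.1):
    """Epsilon-greedy choice made independently per dimension. Because the
    joint value is the sum of per-dimension values, maximization decomposes
    into `action_dim` binary choices."""
    idx = q.argmax(axis=1)
    explore = rng.random(action_dim) < eps
    idx[explore] = rng.integers(0, 2, explore.sum())
    return idx, extremes[idx]

def reward(a):
    # Toy objective: prefer +1 in every dimension (purely illustrative).
    return float((a > 0).sum()) + 0.1 * rng.standard_normal()

for t in range(500):
    idx, a = select_action(q)
    r = reward(a)
    # Decomposed update: each head is nudged toward an equal share of the
    # shared reward signal (a crude credit split for illustration).
    q[np.arange(action_dim), idx] += alpha * (r / action_dim - q[np.arange(action_dim), idx])

print(extremes[q.argmax(axis=1)])  # typically approaches [1., 1., 1.]
```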
- Off-policy Reinforcement Learning with Optimistic Exploration and Distribution Correction [73.77593805292194]
We train a separate exploration policy to maximize an approximate upper confidence bound of the critics in an off-policy actor-critic framework.
To mitigate the off-policy-ness, we adapt the recently introduced DICE framework to learn a distribution correction ratio for off-policy actor-critic training.
arXiv Detail & Related papers (2021-10-22T22:07:51Z)
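A hypothetical sketch of the optimistic bound mentioned above: with two critics, the mean of their estimates serves as the value and their disagreement as an uncertainty bonus, so the exploration policy is drawn toward actions the critics disagree on. The exact form of the bound (and the DICE-based distribution correction) in the cited paper is more involved; everything below is illustrative.

```python
import numpy as np

def ucb_from_two_critics(q1, q2, beta=1.0):
    """Optimistic value estimate from a pair of critics: mean plus a bonus
    proportional to their disagreement (a cheap stand-in for epistemic
    uncertainty). `beta` trades exploration against exploitation."""
    mean = 0.5 * (q1 + q2)
    bonus = 0.5 * np.abs(q1 - q2)
    return mean + beta * bonus

def lcb_from_two_critics(q1, q2):
    """The conservative (minimum) estimate commonly used to train the
    exploitation policy in double-critic methods."""
    return np.minimum(q1, q2)

# Toy usage: per-action value estimates from two independently trained critics.
q1 = np.array([1.0, 0.2, 0.5])
q2 = np.array([0.6, 0.9, 0.5])
print("exploration policy scores:", ucb_from_two_critics(q1, q2))
print("target policy scores:     ", lcb_from_two_critics(q1, q2))
```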
- Provable Benefits of Actor-Critic Methods for Offline Reinforcement Learning [85.50033812217254]
Actor-critic methods are widely used in offline reinforcement learning practice, but are not so well-understood theoretically.
We propose a new offline actor-critic algorithm that naturally incorporates the pessimism principle.
arXiv Detail & Related papers (2021-08-19T17:27:29Z)
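One common way to instantiate the pessimism principle, sketched below under illustrative assumptions, is to score actions by a lower confidence bound over an ensemble of critics, so that actions the offline data covers poorly are penalized. This is a generic construction, not necessarily the one used in the cited paper.

```python
import numpy as np

def pessimistic_value(q_ensemble, beta=1.0):
    """Lower-confidence-bound value from an ensemble of critic estimates:
    mean minus `beta` times the ensemble standard deviation. Actions that
    the offline data constrains poorly receive a large penalty."""
    q_ensemble = np.asarray(q_ensemble)   # shape: (n_critics, n_actions)
    return q_ensemble.mean(axis=0) - beta * q_ensemble.std(axis=0)

# Toy usage: three critics disagree strongly on the last (poorly covered) action.
q_ensemble = [[1.0, 0.4, 2.5],
              [0.9, 0.5, 0.1],
              [1.1, 0.4, 1.3]]
scores = pessimistic_value(q_ensemble, beta=1.0)
print("pessimistic action choice:", int(scores.argmax()))  # avoids the uncertain action
```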
- Analysis of a Target-Based Actor-Critic Algorithm with Linear Function Approximation [2.1592777170316366]
Actor-critic methods that integrate target networks have shown remarkable empirical success in deep reinforcement learning, but their theoretical properties remain poorly understood.
We bridge this gap by proposing the first theoretical analysis of an online target-based actor-critic algorithm with linear function approximation in the discounted-reward setting.
arXiv Detail & Related papers (2021-06-14T14:59:05Z)
- Variance Penalized On-Policy and Off-Policy Actor-Critic [60.06593931848165]
We propose on-policy and off-policy actor-critic algorithms that optimize a performance criterion involving both the mean and the variance of the return.
Our approach not only performs on par with actor-critic and prior variance-penalization baselines in terms of expected return, but also generates trajectories with lower variance in the return.
arXiv Detail & Related papers (2021-02-03T10:06:16Z)
- Logistic Q-Learning [87.00813469969167]
We propose a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs.
The main feature of our algorithm is a convex loss function for policy evaluation that serves as a theoretically sound alternative to the widely used squared Bellman error.
arXiv Detail & Related papers (2020-10-21T17:14:31Z)
- Learning Value Functions in Deep Policy Gradients using Residual Variance [22.414430270991005]
Policy gradient algorithms have proven to be successful in diverse decision making and control tasks.
Traditional actor-critic algorithms do not succeed in fitting the true value function.
We provide a new state-value (resp. state-action-value) function approximation that learns the value of the states relative to their mean value.
arXiv Detail & Related papers (2020-10-09T08:57:06Z)
- Single-Timescale Actor-Critic Provably Finds Globally Optimal Policy [122.01837436087516]
We study the global convergence and global optimality of actor-critic, one of the most popular families of reinforcement learning algorithms.
We establish the rate of convergence and global optimality of single-timescale actor-critic with linear function approximation for the first time.
arXiv Detail & Related papers (2020-08-02T14:01:49Z)
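"Single-timescale" means the actor and the critic are updated at every step with step sizes of the same order, rather than letting the critic (nearly) converge between actor updates. The sketch below shows that update pattern with a linear critic and a softmax actor on a toy MDP; the MDP, features, and step sizes are illustrative assumptions rather than the analyzed setting.

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions, gamma = 3, 2, 0.9

# Toy MDP (illustrative): random transition kernel and rewards.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(0, 1, size=(n_states, n_actions))

phi = np.eye(n_states)                   # linear critic features (one-hot here)
w = np.zeros(n_states)                   # critic: V(s) ~= w @ phi[s]
theta = np.zeros((n_states, n_actions))  # actor: softmax policy parameters
alpha_w, alpha_theta = 0.05, 0.05        # same order -> single timescale

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

s = 0
for t in range(20000):
    pi = softmax(theta[s])
    a = rng.choice(n_actions, p=pi)
    r = R[s, a]
    s_next = rng.choice(n_states, p=P[s, a])

    # TD(0) error with the linear critic.
    delta = r + gamma * w @ phi[s_next] - w @ phi[s]

    # Single-timescale updates: critic and actor move at every step,
    # each using the current (not fully converged) estimate of the other.
    w += alpha_w * delta * phi[s]
    grad_log_pi = (np.arange(n_actions) == a) - pi
    theta[s] += alpha_theta * delta * grad_log_pi

    s = s_next

print("learned state values:", np.round(w, 2))
```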
- How to Learn a Useful Critic? Model-based Action-Gradient-Estimator Policy Optimization [10.424426548124696]
We propose MAGE, a model-based actor-critic algorithm, grounded in the theory of policy gradients.
MAGE backpropagates through the learned dynamics to compute gradient targets in temporal difference learning.
We demonstrate the efficiency of the algorithm in comparison to model-free and model-based state-of-the-art baselines.
arXiv Detail & Related papers (2020-04-29T16:30:53Z)
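The entry above computes gradient targets by differentiating through a learned dynamics model. The PyTorch sketch below shows only that core mechanism: build a TD error whose bootstrap target flows through the learned model, then penalize the norm of its gradient with respect to the action, which is the quantity a deterministic policy gradient ultimately consumes. The networks, data, and loss are assumptions for illustration and do not reproduce MAGE's full objective.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
state_dim, action_dim, gamma = 4, 2, 0.99

# Illustrative networks: critic, frozen target critic, learned dynamics model, policy.
q_net = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
q_tgt = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
model = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, state_dim))
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim), nn.Tanh())
for p in q_tgt.parameters():
    p.requires_grad_(False)              # target network is held fixed

def action_gradient_loss(s, a, r):
    """TD error computed *through* the learned model, penalized by the norm
    of its gradient w.r.t. the action (illustrative, not MAGE's full loss)."""
    a = a.clone().requires_grad_(True)
    sa = torch.cat([s, a], dim=-1)
    s_next = model(sa)                                   # differentiable next state
    q_next = q_tgt(torch.cat([s_next, policy(s_next)], dim=-1))
    delta = r + gamma * q_next - q_net(sa)               # model-based TD error
    grad_a = torch.autograd.grad(delta.sum(), a, create_graph=True)[0]
    return (grad_a ** 2).sum(dim=-1).mean()

# Usage on a random batch of transitions (placeholder data).
s = torch.randn(32, state_dim)
a = torch.rand(32, action_dim) * 2 - 1
r = torch.randn(32, 1)
loss = action_gradient_loss(s, a, r)
loss.backward()                                          # updates flow into q_net and the model
print(float(loss))
```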
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.