Unified Policy Optimization for Continuous-action Reinforcement Learning
in Non-stationary Tasks and Games
- URL: http://arxiv.org/abs/2208.09452v1
- Date: Fri, 19 Aug 2022 17:12:31 GMT
- Title: Unified Policy Optimization for Continuous-action Reinforcement Learning
in Non-stationary Tasks and Games
- Authors: Rong-Jun Qin, Fan-Ming Luo, Hong Qian, Yang Yu
- Abstract summary: This paper addresses learning in non-stationary environments and games with continuous actions.
We prove that PORL has a last-iterate convergence guarantee, which is important for adversarial and cooperative games.
- Score: 6.196828712245427
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper addresses policy learning in non-stationary environments and games
with continuous actions. Rather than relying on the classical reward-maximization
mechanism, we draw on the ideas of follow-the-regularized-leader (FTRL) and
mirror-descent (MD) updates to propose PORL, a no-regret-style reinforcement
learning algorithm for continuous-action tasks. We prove that PORL has a
last-iterate convergence guarantee, which is important for adversarial and
cooperative games. Empirical studies show that, in stationary environments such
as MuJoCo locomotion control tasks, PORL performs as well as, if not
better than, the soft actor-critic (SAC) algorithm; in non-stationary
environments including dynamical environments, adversarial training, and
competitive games, PORL outperforms SAC in both final policy performance and
training stability.
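The abstract does not spell out the concrete update rule, but the FTRL/mirror-descent idea it refers to is usually realized as a policy-improvement step that maximizes the critic's value estimate while penalizing divergence from the previous policy iterate. The sketch below illustrates such a KL-regularized actor loss for a Gaussian policy; the `GaussianPolicy` class, the critic interface, and the step size `eta` are assumptions made for illustration, not the authors' PORL implementation.

```python
# Illustrative mirror-descent-style actor update for continuous actions.
# NOT the paper's PORL code: the Gaussian policy, critic interface, and
# step size `eta` are assumptions made for this sketch.
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 2 * act_dim))
        self.act_dim = act_dim

    def dist(self, obs):
        mu, log_std = self.net(obs).split(self.act_dim, dim=-1)
        return torch.distributions.Normal(mu, log_std.clamp(-5, 2).exp())

def mirror_descent_actor_loss(policy, old_policy, critic, obs, eta=0.1):
    """Maximize the critic value while staying close (in KL) to the previous
    policy iterate -- the FTRL / mirror-descent flavour of policy update."""
    new_dist = policy.dist(obs)
    with torch.no_grad():
        old_dist = old_policy.dist(obs)
    actions = new_dist.rsample()                      # reparameterized sample
    q_value = critic(torch.cat([obs, actions], dim=-1)).squeeze(-1)
    kl = torch.distributions.kl_divergence(new_dist, old_dist).sum(-1)
    return (-q_value + kl / eta).mean()
```

With a large `eta` the penalty vanishes and the step reduces to plain value maximization; a small `eta` keeps successive iterates close to each other, which is the property that no-regret and last-iterate analyses of mirror descent typically rely on.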
Related papers
- Curriculum Learning With Counterfactual Group Relative Policy Advantage For Multi-Agent Reinforcement Learning [15.539607264374242]
Multi-agent reinforcement learning (MARL) has achieved strong performance in cooperative adversarial tasks. We propose a dynamic curriculum learning framework that employs a self-adaptive difficulty adjustment mechanism. Our method improves both training stability and final performance, achieving competitive results against state-of-the-art methods.
arXiv Detail & Related papers (2025-06-09T08:38:18Z) - Action-Adaptive Continual Learning: Enabling Policy Generalization under Dynamic Action Spaces [16.07372335607339]
Continual Learning (CL) is a powerful tool that enables agents to learn a sequence of tasks. Existing CL methods often assume that the agent's capabilities remain static within dynamic environments. We propose an Action-Adaptive Continual Learning framework (AACL) to address this challenge.
arXiv Detail & Related papers (2025-06-06T03:07:30Z) - The Cell Must Go On: Agar.io for Continual Reinforcement Learning [9.034912115190034]
Continual reinforcement learning (RL) concerns agents that are expected to learn continually, rather than converge to a policy that is then fixed for evaluation. We introduce AgarCL, a research platform for continual RL that allows for a progression of increasingly sophisticated behaviour.
arXiv Detail & Related papers (2025-05-23T20:09:27Z) - Fast Adaptation with Behavioral Foundation Models [82.34700481726951]
Unsupervised zero-shot reinforcement learning has emerged as a powerful paradigm for pretraining behavioral foundation models.
Despite promising results, zero-shot policies are often suboptimal due to errors induced by the unsupervised training process.
We propose fast adaptation strategies that search in the low-dimensional task-embedding space of the pre-trained BFM to rapidly improve the performance of its zero-shot policies.
arXiv Detail & Related papers (2025-04-10T16:14:17Z) - Score-Based Diffusion Policy Compatible with Reinforcement Learning via Optimal Transport [45.793758222754036]
Diffusion policies have shown promise in learning complex behaviors from demonstrations.
This paper explores improving diffusion-based imitation learning models through online interactions with the environment.
We propose OTPR, a novel method that integrates diffusion policies with RL using optimal transport theory.
arXiv Detail & Related papers (2025-02-18T08:22:20Z) - Explore Reinforced: Equilibrium Approximation with Reinforcement Learning [3.214961078500366]
We introduce Exp3-IXrl, a blend of RL and a game-theoretic approach that separates the RL agent's action selection from the equilibrium computation.
We demonstrate that our algorithm expands the application of equilibrium approximation algorithms to new environments.
arXiv Detail & Related papers (2024-12-02T22:37:59Z) - Toward Optimal LLM Alignments Using Two-Player Games [86.39338084862324]
In this paper, we investigate alignment through the lens of two-agent games, involving iterative interactions between an adversarial and a defensive agent.
We theoretically demonstrate that this iterative reinforcement learning optimization converges to a Nash Equilibrium for the game induced by the agents.
Experimental results in safety scenarios demonstrate that learning in such a competitive environment not only fully trains agents but also leads to policies with enhanced generalization capabilities for both adversarial and defensive agents.
arXiv Detail & Related papers (2024-06-16T15:24:50Z) - EvIL: Evolution Strategies for Generalisable Imitation Learning [33.745657379141676]
In imitation learning (IL), the environment in which expert demonstrations are collected and the environment in which we want to deploy our learned policy are often not exactly the same.
Compared to policy-centric approaches to IL like behavioural cloning, reward-centric approaches like inverse reinforcement learning (IRL) often better replicate expert behaviour in new environments.
We find that modern deep IL algorithms frequently recover rewards which induce policies far weaker than the expert, even in the same environment the demonstrations were collected in.
We propose a novel evolution-strategies based method EvIL to optimise for a reward-shaping term that speeds up re-training in the target environment.
arXiv Detail & Related papers (2024-06-15T22:46:39Z) - Solving a Real-World Optimization Problem Using Proximal Policy Optimization with Curriculum Learning and Reward Engineering [0.8602553195689513]
We present a proximal policy optimization (PPO) agent trained through curriculum learning (CL) principles and meticulous reward engineering.
Our work addresses the challenge of effectively balancing the competing objectives of operational safety, volume optimization, and minimizing resource usage.
Results demonstrate that our approach significantly improves inference-time safety, achieving near-zero safety violations in addition to enhancing waste sorting plant efficiency.
arXiv Detail & Related papers (2024-04-03T08:53:42Z) - Learning to Sail Dynamic Networks: The MARLIN Reinforcement Learning
Framework for Congestion Control in Tactical Environments [53.08686495706487]
This paper proposes an RL framework that leverages an accurate and parallelizable emulation environment to reenact the conditions of a tactical network.
We evaluate our RL framework by training a MARLIN agent in conditions replicating a bottleneck link transition between a Satellite Communication (SATCOM) and a UHF Wide Band (UHF) radio link.
arXiv Detail & Related papers (2023-06-27T16:15:15Z) - TASAC: a twin-actor reinforcement learning framework with stochastic
policy for batch process control [1.101002667958165]
Reinforcement Learning (RL), wherein an agent learns the policy by directly interacting with the environment, offers a potential alternative in this context.
RL frameworks with actor-critic architecture have recently become popular for controlling systems where state and action spaces are continuous.
It has been shown that an ensemble of actor and critic networks further helps the agent learn better policies, owing to the enhanced exploration that results from simultaneous policy learning.
arXiv Detail & Related papers (2022-04-22T13:00:51Z) - Off-policy Reinforcement Learning with Optimistic Exploration and
Distribution Correction [73.77593805292194]
We train a separate exploration policy to maximize an approximate upper confidence bound of the critics in an off-policy actor-critic framework.
To mitigate the off-policy-ness, we adapt the recently introduced DICE framework to learn a distribution correction ratio for off-policy actor-critic training.
arXiv Detail & Related papers (2021-10-22T22:07:51Z) - Policy Smoothing for Provably Robust Reinforcement Learning [109.90239627115336]
We study the provable robustness of reinforcement learning against norm-bounded adversarial perturbations of the inputs.
We generate certificates that guarantee that the total reward obtained by the smoothed policy will not fall below a certain threshold under a norm-bounded adversarial perturbation of the input.
arXiv Detail & Related papers (2021-06-21T21:42:08Z) - Simplifying Deep Reinforcement Learning via Self-Supervision [51.2400839966489]
Self-Supervised Reinforcement Learning (SSRL) is a simple algorithm that optimizes policies with purely supervised losses.
We show that SSRL is surprisingly competitive with contemporary algorithms, with more stable performance and less running time.
arXiv Detail & Related papers (2021-06-10T06:29:59Z) - Context-Based Soft Actor Critic for Environments with Non-stationary
Dynamics [8.318823695156974]
We propose the Latent Context-based Soft Actor Critic (LC-SAC) method to address the aforementioned issues.
By minimizing the contrastive prediction loss function, the learned context variables capture the information of the environment dynamics and the recent behavior of the agent.
Experimental results show that the performance of LC-SAC is significantly better than the SAC algorithm on the MetaWorld ML1 tasks.
arXiv Detail & Related papers (2021-05-07T15:00:59Z) - Continuous Coordination As a Realistic Scenario for Lifelong Learning [6.044372319762058]
We introduce a multi-agent lifelong learning testbed that supports both zero-shot and few-shot settings.
We evaluate several recent MARL methods and benchmark state-of-the-art lifelong learning (LLL) algorithms under limited memory and computation.
We empirically show that the agents trained in our setup are able to coordinate well with unseen agents, without any additional assumptions made by previous works.
arXiv Detail & Related papers (2021-03-04T18:44:03Z) - Robust Reinforcement Learning on State Observations with Learned Optimal
Adversary [86.0846119254031]
We study the robustness of reinforcement learning with adversarially perturbed state observations.
With a fixed agent policy, we demonstrate that an optimal adversary to perturb state observations can be found.
For DRL settings, this leads to a novel empirical adversarial attack on RL agents via a learned adversary that is much stronger than previous ones.
arXiv Detail & Related papers (2021-01-21T05:38:52Z)
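Relating to the last entry above: for a fixed agent policy, a strong observation-space attack is typically obtained by searching, within a norm budget, for the perturbation that most changes the agent's behaviour. The projected-gradient loop, the L-infinity budget `eps`, and the "maximize deviation from the clean action" objective below are illustrative assumptions for a generic sketch, not the cited paper's exact method.

```python
# Generic sketch of an observation-space attack on a fixed continuous-control
# policy. The PGD loop, L-infinity budget `eps`, and deviation objective are
# assumptions for illustration, not the cited paper's exact formulation.
import torch

def pgd_observation_attack(policy, obs, eps=0.05, steps=10, step_size=0.01):
    """Find an observation perturbation within an L-infinity ball of radius
    `eps` that maximally changes the policy's action relative to the clean
    observation. `policy` is assumed to be a torch.nn.Module mapping
    observations to continuous actions."""
    for p in policy.parameters():
        p.requires_grad_(False)        # attack the observation, not the policy
    with torch.no_grad():
        clean_action = policy(obs)
    delta = torch.zeros_like(obs, requires_grad=True)
    for _ in range(steps):
        deviation = ((policy(obs + delta) - clean_action) ** 2).sum()
        deviation.backward()
        with torch.no_grad():
            # Ascend on the deviation and project back into the L-inf ball.
            delta += step_size * delta.grad.sign()
            delta.clamp_(-eps, eps)
        delta.grad.zero_()
    return (obs + delta).detach()
```

Evaluating the agent on observations perturbed this way yields the kind of strong empirical attack the entry describes, and it is also a natural stress test for the smoothed-policy defences discussed a few entries earlier.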
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.