Evolutionary Stochastic Policy Distillation
- URL: http://arxiv.org/abs/2004.12909v2
- Date: Thu, 30 Apr 2020 11:00:24 GMT
- Title: Evolutionary Stochastic Policy Distillation
- Authors: Hao Sun, Xinyu Pan, Bo Dai, Dahua Lin, Bolei Zhou
- Abstract summary: We propose a new method called Evolutionary Stochastic Policy Distillation (ESPD) to solve GCRS tasks.
ESPD enables a target policy to learn from a series of its stochastic variants through the technique of policy distillation (PD).
The experiments based on the MuJoCo control suite show the high learning efficiency of the proposed method.
- Score: 139.54121001226451
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Solving the Goal-Conditioned Reward Sparse (GCRS) task is a challenging
reinforcement learning problem due to the sparsity of reward signals. In this
work, we propose a new formulation of GCRS tasks from the perspective of the
drifted random walk on the state space, and design a novel method called
Evolutionary Stochastic Policy Distillation (ESPD) to solve them based on the
insight of reducing the First Hitting Time of the stochastic process. As a
self-imitation approach, ESPD enables a target policy to learn from a series of
its stochastic variants through the technique of policy distillation (PD). The
learning mechanism of ESPD can be considered as an Evolution Strategy (ES) that
applies perturbations to the policy directly in the action space, with a SELECT
function to check the superiority of stochastic variants and then uses PD to
update the policy. The experiments based on the MuJoCo robotics control suite
show the high learning efficiency of the proposed method.
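To make the loop concrete, here is a minimal, self-contained sketch of the ESPD procedure as described above, under strong simplifying assumptions: a linear goal-conditioned policy, a toy 2-D point-reaching task standing in for MuJoCo, and SELECT implemented as "the stochastic variant reached the goal while the current target policy did not." Names such as PointEnv, rollout, and sigma are illustrative and not taken from the paper's code.

```python
import numpy as np

class PointEnv:
    """Toy GCRS task: drive a point mass to a goal; reward is success only."""
    horizon = 32

    def reset(self, goal=None):
        self.pos = np.zeros(2)
        self.goal = np.random.uniform(-1, 1, 2) if goal is None else goal
        return np.concatenate([self.pos, self.goal])

    def step(self, action):
        self.pos = self.pos + np.clip(action, -0.1, 0.1)
        success = np.linalg.norm(self.pos - self.goal) < 0.05
        return np.concatenate([self.pos, self.goal]), success


def rollout(env, W, goal, sigma=0.0):
    """One episode; sigma > 0 turns the policy into a stochastic variant."""
    obs, traj, success = env.reset(goal), [], False
    for _ in range(env.horizon):
        a = W @ obs + sigma * np.random.randn(2)   # perturb in the action space
        traj.append((obs, a))
        obs, success = env.step(a)
        if success:
            break
    return traj, success


def espd(iterations=500, sigma=0.2, lr=0.5):
    env, W = PointEnv(), np.zeros((2, 4))          # linear target policy
    for _ in range(iterations):
        goal = np.random.uniform(-1, 1, 2)
        variant_traj, variant_ok = rollout(env, W, goal, sigma=sigma)
        _, target_ok = rollout(env, W, goal, sigma=0.0)
        # SELECT: keep the variant only if it is superior on this goal.
        if not (variant_ok and not target_ok):
            continue
        # Policy distillation: regress the target policy onto the superior
        # variant's state-action pairs (one squared-error gradient step each).
        for obs, a in variant_traj:
            W += lr * np.outer(a - W @ obs, obs)
    return W


if __name__ == "__main__":
    espd()
```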
Related papers
- Guided Reinforcement Learning for Robust Multi-Contact Loco-Manipulation [12.377289165111028]
Reinforcement learning (RL) often necessitates a meticulous Markov Decision Process (MDP) design tailored to each task.
This work proposes a systematic approach to behavior synthesis and control for multi-contact loco-manipulation tasks.
We define a task-independent MDP to train RL policies using only a single demonstration per task generated from a model-based trajectory.
arXiv Detail & Related papers (2024-10-17T17:46:27Z)
- Distilling Reinforcement Learning Policies for Interpretable Robot Locomotion: Gradient Boosting Machines and Symbolic Regression [53.33734159983431]
This paper introduces a novel approach to distill neural RL policies into more interpretable forms.
We train expert neural network policies using RL and distill them into (i) GBMs, (ii) EBMs, and (iii) symbolic policies.
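The distillation step described here is, at its core, supervised regression from visited states to the expert's actions. Below is a rough sketch of that idea using scikit-learn's GradientBoostingRegressor, assuming a 1-D continuous action and a stand-in expert_policy function in place of the trained neural network (a multi-dimensional action would need one regressor per dimension); none of these names come from the paper.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def expert_policy(states):
    # Stand-in for a trained RL policy network: 1-D action from a 4-D state.
    return np.tanh(states @ np.array([0.5, -0.3, 0.8, 0.1]))

states = np.random.randn(5000, 4)          # states gathered from expert rollouts
actions = expert_policy(states)            # expert action labels

gbm = GradientBoostingRegressor(n_estimators=200, max_depth=3)
gbm.fit(states, actions)                   # distilled, more interpretable policy

print("distillation MSE:", np.mean((gbm.predict(states) - actions) ** 2))
```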
arXiv Detail & Related papers (2024-03-21T11:54:45Z)
- Projected Off-Policy Q-Learning (POP-QL) for Stabilizing Offline Reinforcement Learning [57.83919813698673]
Projected Off-Policy Q-Learning (POP-QL) is a novel actor-critic algorithm that simultaneously reweights off-policy samples and constrains the policy to prevent divergence and reduce value-approximation error.
In our experiments, POP-QL not only shows competitive performance on standard benchmarks, but also out-performs competing methods in tasks where the data-collection policy is significantly sub-optimal.
arXiv Detail & Related papers (2023-11-25T00:30:58Z)
- Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method in multiple tasks of OpenAI Gym with D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z)
- On the Value of Myopic Behavior in Policy Reuse [67.37788288093299]
Leveraging learned strategies in unfamiliar scenarios is fundamental to human intelligence.
In this work, we present a framework called Selective Myopic bEhavior Control (SMEC).
SMEC adaptively aggregates the sharable short-term behaviors of prior policies and the long-term behaviors of the task policy, leading to coordinated decisions.
arXiv Detail & Related papers (2023-05-28T03:59:37Z)
- Coherent Soft Imitation Learning [17.345411907902932]
Imitation learning methods seek to learn from an expert either through behavioral cloning (BC) of the policy or inverse reinforcement learning (IRL) of the reward.
This work derives an imitation method that captures the strengths of both BC and IRL.
arXiv Detail & Related papers (2023-05-25T21:54:22Z)
- Learning to Plan via a Multi-Step Policy Regression Method [6.452233509848456]
We propose a new approach to increase inference performance in environments that require a specific sequence of actions in order to be solved.
Instead of learning a policy for a single step, we want to learn a policy that can predict n actions in advance.
We test our method on the MiniGrid and Pong environments and show drastic speedup during inference time by successfully predicting sequences of actions on a single observation.
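As a rough illustration of the multi-step idea (predicting n actions from a single observation and executing them open-loop), the sketch below uses a purely hypothetical linear policy head and toy dimensions; it is not the paper's architecture or training procedure.

```python
import numpy as np

N_STEPS, OBS_DIM, ACT_DIM = 4, 8, 2

class MultiStepPolicy:
    def __init__(self):
        # One linear map per future step; a real model would share a trunk.
        self.W = np.random.randn(N_STEPS, ACT_DIM, OBS_DIM) * 0.01

    def predict_sequence(self, obs):
        """Return an (N_STEPS, ACT_DIM) open-loop action plan from one observation."""
        return np.stack([W_k @ obs for W_k in self.W])

    def fit_step(self, obs, target_seq, lr=1e-2):
        """Regress each head onto the k-th action of a target sequence."""
        for k in range(N_STEPS):
            err = target_seq[k] - self.W[k] @ obs
            self.W[k] += lr * np.outer(err, obs)

# Usage: one forward pass replaces N_STEPS separate policy queries at inference time.
policy = MultiStepPolicy()
obs = np.random.randn(OBS_DIM)
plan = policy.predict_sequence(obs)            # execute plan[0], ..., plan[-1]
target_seq = np.zeros((N_STEPS, ACT_DIM))      # e.g. a single-step policy unrolled n steps
policy.fit_step(obs, target_seq)
```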
arXiv Detail & Related papers (2021-06-18T11:51:49Z)
- Evolutionary Selective Imitation: Interpretable Agents by Imitation Learning Without a Demonstrator [1.370633147306388]
We propose a new method for training an agent via an evolutionary strategy (ES).
In every iteration we replace a subset of the samples with samples from the best trajectories discovered so far.
The evaluation procedure for this set is to train, via supervised learning, a randomly initialised neural network (NN) to imitate the set.
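A heavily simplified sketch of that loop is given below: the "randomly initialised NN" is replaced by a linear least-squares imitator, and the environment return is a placeholder scoring function, so all names and the reward are illustrative only, not the paper's implementation.

```python
import numpy as np

RNG = np.random.default_rng(0)
OBS_DIM, ACT_DIM, SET_SIZE = 4, 2, 64

def evaluate_env(imitator_W):
    """Placeholder return estimate: score the imitator against a hidden 'good' behaviour
    (a real implementation would roll the imitator out in the environment)."""
    target = np.ones((ACT_DIM, OBS_DIM))
    return -np.linalg.norm(imitator_W - target)

def train_imitator(sample_set):
    """Fit a fresh imitator to the sample set by supervised learning;
    least squares stands in for SGD on a small randomly initialised NN."""
    obs = np.stack([s[0] for s in sample_set])
    act = np.stack([s[1] for s in sample_set])
    W, *_ = np.linalg.lstsq(obs, act, rcond=None)
    return W.T

def random_sample():
    obs = RNG.normal(size=OBS_DIM)
    return obs, RNG.normal(size=ACT_DIM)

def best_trajectory_sample():
    # Placeholder for a sample drawn from the best trajectories found so far.
    obs = RNG.normal(size=OBS_DIM)
    return obs, np.ones((ACT_DIM, OBS_DIM)) @ obs

sample_set = [random_sample() for _ in range(SET_SIZE)]
best_score = -np.inf

for iteration in range(100):
    # Mutation: replace a random subset with samples from the best trajectories.
    candidate = list(sample_set)
    for idx in RNG.choice(SET_SIZE, size=8, replace=False):
        candidate[idx] = best_trajectory_sample()
    # Evaluation: train an imitator on the candidate set and score it.
    score = evaluate_env(train_imitator(candidate))
    if score > best_score:                 # ES-style selection of the better set
        best_score, sample_set = score, candidate

print("best score:", best_score)
```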
arXiv Detail & Related papers (2020-09-17T16:25:31Z)
- Strictly Batch Imitation Learning by Energy-based Distribution Matching [104.33286163090179]
Consider learning a policy purely on the basis of demonstrated behavior -- that is, with no access to reinforcement signals, no knowledge of transition dynamics, and no further interaction with the environment.
One solution is simply to retrofit existing algorithms for apprenticeship learning to work in the offline setting.
But such an approach leans heavily on off-policy evaluation or offline model estimation, and can be indirect and inefficient.
We argue that a good solution should be able to explicitly parameterize a policy, implicitly learn from rollout dynamics, and operate in an entirely offline fashion.
arXiv Detail & Related papers (2020-06-25T03:27:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.