SoftDICE for Imitation Learning: Rethinking Off-policy Distribution
Matching
- URL: http://arxiv.org/abs/2106.03155v1
- Date: Sun, 6 Jun 2021 15:37:11 GMT
- Title: SoftDICE for Imitation Learning: Rethinking Off-policy Distribution
Matching
- Authors: Mingfei Sun, Anuj Mahajan, Katja Hofmann, Shimon Whiteson
- Abstract summary: We present SoftDICE, which achieves state-of-the-art performance for imitation learning.
- Score: 61.20581291619333
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present SoftDICE, which achieves state-of-the-art performance for
imitation learning. SoftDICE fixes several key problems in ValueDICE, an
off-policy distribution matching approach for sample-efficient imitation
learning. First, the objective of ValueDICE contains logarithms and
exponentials of expectations, for which the mini-batch gradient estimate is
always biased. Second, ValueDICE regularizes the objective with replay buffer
samples when expert demonstrations are limited in number, which, however, changes
the original distribution matching problem. Third, the re-parametrization trick
used to derive the off-policy objective relies on an implicit assumption that
rarely holds in training. We leverage a novel formulation of distribution
matching and consider an entropy-regularized off-policy objective, which yields
a completely offline algorithm called SoftDICE. Our empirical results show that
SoftDICE recovers the expert policy with only one demonstration trajectory and
no further on-policy/off-policy samples. SoftDICE also consistently outperforms
ValueDICE and other baselines in terms of sample efficiency on MuJoCo benchmark
tasks.
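The biased-gradient issue follows from Jensen's inequality: the logarithm of a sample mean is not an unbiased estimate of the logarithm of the true expectation, so an objective containing logarithms or exponentials of expectations cannot be estimated without bias from mini-batches. The sketch below is an illustration only, not code from the paper; the standard-normal distribution, batch size, and estimator are arbitrary choices made to show the effect numerically.

```python
import numpy as np

# Illustration (not the authors' code): mini-batch estimates of a quantity of
# the form log E[exp(X)] are biased, because the log of a sample mean is not
# an unbiased estimate of the log of the true mean (Jensen's inequality).
rng = np.random.default_rng(0)

# For X ~ N(0, 1), E[exp(X)] = exp(1/2), so log E[exp(X)] = 0.5 exactly.
true_value = 0.5

batch_size = 32        # illustrative mini-batch size
num_batches = 100_000  # number of batches to average over

samples = rng.standard_normal((num_batches, batch_size))
batch_estimates = np.log(np.exp(samples).mean(axis=1))  # per-batch estimate

print(f"true log E[exp(X)]       : {true_value:.4f}")
print(f"mean mini-batch estimate : {batch_estimates.mean():.4f}")  # below 0.5
# The average estimate falls systematically short of the true value, and the
# same argument applies to gradients built from such batch estimates.
```

Averaged over many batches, the estimate consistently undershoots the true value; this is the kind of bias that SoftDICE's reformulation of the distribution matching objective is designed to avoid.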
Related papers
- Primal-Dual Spectral Representation for Off-policy Evaluation [39.24759979398673]
Off-policy evaluation (OPE) is one of the most fundamental problems in reinforcement learning (RL).
We show that our algorithm, SpectralDICE, is both primal and sample efficient, the performance of which is supported by a rigorous theoretical sample complexity guarantee and a thorough empirical evaluation on various benchmarks.
arXiv Detail & Related papers (2024-10-23T03:38:31Z)
- Diffusion-DICE: In-Sample Diffusion Guidance for Offline Reinforcement Learning [43.74071631716718]
We show that DICE-based methods can be viewed as a transformation from the behavior distribution to the optimal policy distribution.
We propose a novel approach, Diffusion-DICE, that directly performs this transformation using diffusion models.
arXiv Detail & Related papers (2024-07-29T15:36:42Z)
- Reward-Punishment Reinforcement Learning with Maximum Entropy [3.123049150077741]
We introduce the "soft Deep MaxPain" (softDMP) algorithm, which integrates the optimization of long-term policy entropy into reward-punishment reinforcement learning objectives.
Our motivation is to facilitate a smoother variation of the operators used to update action values, beyond the traditional "max" and "min" operators.
arXiv Detail & Related papers (2024-05-20T05:05:14Z)
- Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation [63.180725016463974]
Cross-modal retrieval relies on well-matched large-scale datasets that are laborious in practice.
We introduce a novel noisy correspondence learning framework, namely Self-Reinforcing Errors Mitigation (SREM).
arXiv Detail & Related papers (2023-12-27T09:03:43Z)
- A Simple Solution for Offline Imitation from Observations and Examples with Possibly Incomplete Trajectories [122.11358440078581]
Offline imitation is useful in real-world scenarios where arbitrary interactions are costly and expert actions are unavailable.
We propose Trajectory-Aware Learning from Observations (TAILO) to solve MDPs where only task-specific expert states and task-agnostic non-expert state-action pairs are available.
arXiv Detail & Related papers (2023-11-02T15:41:09Z)
- Trajectory-Aware Eligibility Traces for Off-Policy Reinforcement Learning [44.50394347326546]
Off-policy learning from multistep returns is crucial for sample-efficient reinforcement learning.
Off-policy bias is corrected in a per-decision manner, but once a trace has been fully cut, the effect cannot be reversed.
We propose a multistep operator that can express both per-decision and trajectory-aware methods.
arXiv Detail & Related papers (2023-01-26T18:57:41Z)
- Off-policy Reinforcement Learning with Optimistic Exploration and Distribution Correction [73.77593805292194]
We train a separate exploration policy to maximize an approximate upper confidence bound of the critics in an off-policy actor-critic framework.
To mitigate the off-policy-ness, we adapt the recently introduced DICE framework to learn a distribution correction ratio for off-policy actor-critic training.
arXiv Detail & Related papers (2021-10-22T22:07:51Z)
- Multi-Scale Positive Sample Refinement for Few-Shot Object Detection [61.60255654558682]
Few-shot object detection (FSOD) helps detectors adapt to unseen classes with few training instances.
We propose a Multi-scale Positive Sample Refinement (MPSR) approach to enrich object scales in FSOD.
MPSR generates multi-scale positive samples as object pyramids and refines the prediction at various scales.
arXiv Detail & Related papers (2020-07-18T09:48:29Z)
- Strictly Batch Imitation Learning by Energy-based Distribution Matching [104.33286163090179]
Consider learning a policy purely on the basis of demonstrated behavior -- that is, with no access to reinforcement signals, no knowledge of transition dynamics, and no further interaction with the environment.
One solution is simply to retrofit existing algorithms for apprenticeship learning to work in the offline setting.
But such an approach leans heavily on off-policy evaluation or offline model estimation, and can be indirect and inefficient.
We argue that a good solution should be able to explicitly parameterize a policy, implicitly learn from rollout dynamics, and operate in an entirely offline fashion.
arXiv Detail & Related papers (2020-06-25T03:27:59Z)