On the Value of Myopic Behavior in Policy Reuse
- URL: http://arxiv.org/abs/2305.17623v1
- Date: Sun, 28 May 2023 03:59:37 GMT
- Title: On the Value of Myopic Behavior in Policy Reuse
- Authors: Kang Xu, Chenjia Bai, Shuang Qiu, Haoran He, Bin Zhao, Zhen Wang, Wei Li, Xuelong Li
- Abstract summary: Leveraging learned strategies in unfamiliar scenarios is fundamental to human intelligence.
In this work, we present a framework called Selective Myopic bEhavior Control (SMEC).
SMEC adaptively aggregates the sharable short-term behaviors of prior policies and the long-term behaviors of the task policy, leading to coordinated decisions.
- Score: 67.37788288093299
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Leveraging learned strategies in unfamiliar scenarios is fundamental to human
intelligence. In reinforcement learning, rationally reusing the policies
acquired from other tasks or human experts is critical for tackling problems
that are difficult to learn from scratch. In this work, we present a framework
called Selective Myopic bEhavior Control (SMEC), which results from the insight
that the short-term behaviors of prior policies are sharable across tasks. By
evaluating the behaviors of prior policies via a hybrid value function
architecture, SMEC adaptively aggregates the sharable short-term behaviors of
prior policies and the long-term behaviors of the task policy, leading to
coordinated decisions. Empirical results on a collection of manipulation and
locomotion tasks demonstrate that SMEC outperforms existing methods, and
validate the ability of SMEC to leverage related prior policies.
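To make the idea concrete, here is a minimal sketch of hybrid-value action selection in the spirit of SMEC; the policy and value interfaces (act, short_horizon_value, task_value) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of SMEC-style action selection (illustrative only).
# `short_horizon_value` and `task_value` stand in for the paper's hybrid
# value architecture; their interfaces are assumptions, not the real API.

def select_action(state, task_policy, prior_policies,
                  short_horizon_value, task_value):
    # Score each prior policy's proposed action by its short-term value.
    candidates = [(p.act(state), short_horizon_value(state, p))
                  for p in prior_policies]
    prior_action, prior_score = max(candidates, key=lambda c: c[1])
    task_action = task_policy.act(state)
    # Follow a prior policy only while its short-term behavior looks
    # more promising than the task policy's long-term estimate.
    if prior_score > task_value(state, task_action):
        return prior_action
    return task_action
```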
Related papers
- Active Fine-Tuning of Generalist Policies [54.65568433408307]
We propose AMF (Active Multi-task Fine-tuning) to maximize multi-task policy performance under a limited demonstration budget.
We derive performance guarantees for AMF under regularity assumptions and demonstrate its empirical effectiveness in complex and high-dimensional environments.
arXiv Detail & Related papers (2024-10-07T13:26:36Z)
- Logarithmic Smoothing for Pessimistic Off-Policy Evaluation, Selection and Learning [7.085987593010675]
This work investigates the offline formulation of the contextual bandit problem.
The goal is to leverage past interactions collected under a behavior policy to evaluate, select, and learn new, potentially better-performing, policies.
We introduce novel, fully empirical concentration bounds for a broad class of importance weighting risk estimators.
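For context, the basic importance-weighting risk estimator such bounds apply to is short. Below is a standard clipped inverse-propensity-scoring (IPS) estimate; clipping is one common smoothing, whereas the paper studies a logarithmic transform of the weights.

```python
import numpy as np

def clipped_ips_risk(rewards, target_probs, behavior_probs, clip=10.0):
    """Standard clipped IPS estimate of a target policy's expected
    reward from logged bandit data."""
    # Importance weights pi_target(a|x) / pi_behavior(a|x), clipped
    # to control variance.
    w = np.minimum(target_probs / behavior_probs, clip)
    return float(np.mean(w * rewards))
```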
arXiv Detail & Related papers (2024-05-23T09:07:27Z)
- Offline Reinforcement Learning with On-Policy Q-Function Regularization [57.09073809901382]
We deal with the (potentially catastrophic) extrapolation error induced by the distribution shift between the history dataset and the desired policy.
We propose two algorithms taking advantage of the estimated Q-function through regularizations, and demonstrate they exhibit strong performance on the D4RL benchmarks.
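A minimal sketch of one plausible regularizer of this kind, assuming a behavior-policy Q-estimate q_behavior (e.g., fit by SARSA on the offline dataset); the paper's two algorithms differ in detail.

```python
import torch
import torch.nn.functional as F

def regularized_q_loss(q_net, q_behavior, batch, gamma=0.99, alpha=0.1):
    # Transitions include the logged next action (SARSA-style targets).
    s, a, r, s2, a2 = batch
    with torch.no_grad():
        target = r + gamma * q_net(s2, a2)
    td_loss = F.mse_loss(q_net(s, a), target)
    # Pull Q toward the behavior policy's value to curb extrapolation error.
    reg = F.mse_loss(q_net(s, a), q_behavior(s, a).detach())
    return td_loss + alpha * reg
```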
arXiv Detail & Related papers (2023-07-25T21:38:08Z)
- Fast Model-based Policy Search for Universal Policy Networks [45.44896435487879]
Adapting an agent's behaviour to new environments has been one of the primary focus areas of physics-based reinforcement learning.
We propose a Gaussian Process-based prior, learned in simulation, that captures the likely performance of a policy when transferred to a previously unseen environment.
We integrate this prior with a Bayesian optimisation-based policy search process to improve the efficiency of identifying the most appropriate policy from the universal policy network.
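A hedged sketch of such a Bayesian-optimisation loop, using a default scikit-learn GP in place of the paper's simulation-learned prior; the candidate set and evaluation function are assumed.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def bo_policy_search(candidates, evaluate, n_iters=20, kappa=2.0):
    """Illustrative Bayesian-optimisation loop over candidate policy
    parameters drawn from a universal policy network. In the paper the
    GP's prior is learned in simulation; here a default GP is used."""
    idx = np.random.randint(len(candidates))
    X, y = [candidates[idx]], [evaluate(candidates[idx])]
    for _ in range(n_iters - 1):
        gp = GaussianProcessRegressor(normalize_y=True).fit(np.array(X), np.array(y))
        mu, std = gp.predict(np.asarray(candidates), return_std=True)
        idx = int(np.argmax(mu + kappa * std))  # UCB acquisition
        X.append(candidates[idx])
        y.append(evaluate(candidates[idx]))
    return X[int(np.argmax(y))]  # best policy found
```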
arXiv Detail & Related papers (2022-02-11T18:08:02Z)
- Privacy-Constrained Policies via Mutual Information Regularized Policy Gradients [54.98496284653234]
We consider the task of training a policy that maximizes reward while minimizing disclosure of certain sensitive state variables through the actions.
We solve this problem by introducing a regularizer based on the mutual information between the sensitive state and the actions.
We develop a model-based estimator for optimization of privacy-constrained policies.
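One common way to implement such a regularizer is via a variational bound; the sketch below assumes a discrete sensitive variable and an auxiliary predictor network (trained separately by maximum likelihood), both illustrative rather than the paper's estimator.

```python
import torch
import torch.nn.functional as F

def mi_regularized_loss(policy_loss, action_logits, sensitive_labels,
                        predictor, beta=1.0):
    """Illustrative variational penalty on I(sensitive state; action):
    `predictor` tries to recover the sensitive variable from the action
    distribution; penalising its log-likelihood discourages the policy
    from leaking the sensitive state through its actions."""
    pred_logits = predictor(F.softmax(action_logits, dim=-1))
    # -cross_entropy = E[log q(S|A)], a lower bound on I(S;A) up to H(S).
    mi_lower_bound = -F.cross_entropy(pred_logits, sensitive_labels)
    return policy_loss + beta * mi_lower_bound
```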
arXiv Detail & Related papers (2020-12-30T03:22:35Z)
- Continuous Action Reinforcement Learning from a Mixture of Interpretable Experts [35.80418547105711]
We propose a policy scheme that retains a complex function approximator for its internal value predictions but constrains the policy to have a concise, hierarchical, and human-readable structure.
The main technical contribution of the paper is to address the challenges introduced by this non-differentiable state selection procedure.
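A minimal sketch of the kind of hierarchical, hard state selection involved, assuming a gate network and a list of interpretable experts (names hypothetical):

```python
import torch

def hierarchical_act(state, gate, experts):
    """Illustrative hierarchical policy: a gate picks one interpretable
    expert per state via argmax. The argmax makes state selection
    non-differentiable, which is the difficulty the paper addresses."""
    idx = int(torch.argmax(gate(state)))  # hard, non-differentiable choice
    return experts[idx](state)
```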
arXiv Detail & Related papers (2020-06-10T16:02:08Z)
- Evolutionary Stochastic Policy Distillation [139.54121001226451]
We propose a new method called Evolutionary Stochastic Policy Distillation (ESPD) to solve goal-conditioned reward-sparse (GCRS) tasks.
ESPD enables a target policy to learn from a series of its variants through the technique of policy distillation (PD).
The experiments based on the MuJoCo control suite show the high learning efficiency of the proposed method.
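A minimal sketch of the policy-distillation step underlying this, assuming deterministic continuous-action policies; an illustration, not ESPD itself.

```python
import torch
import torch.nn.functional as F

def distillation_loss(target_policy, variant_policy, states):
    """Illustrative policy-distillation step: regress the target policy
    onto the actions of a better-performing variant on the same states
    (MSE between mean actions, as in continuous control)."""
    with torch.no_grad():
        teacher_actions = variant_policy(states)
    return F.mse_loss(target_policy(states), teacher_actions)
```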
arXiv Detail & Related papers (2020-04-27T16:19:25Z)
- Learning Goal-oriented Dialogue Policy with Opposite Agent Awareness [116.804536884437]
We propose an opposite behavior aware framework for policy learning in goal-oriented dialogues.
We estimate the opposite agent's policy from its observed behavior and use this estimate to improve the target agent by treating it as part of the target policy.
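A minimal sketch of estimating the opposite agent's policy by behavior cloning; the model and data interfaces are assumed for illustration.

```python
import torch.nn.functional as F

def opponent_estimation_loss(opponent_model, dialogue_states, opponent_actions):
    """Illustrative supervised estimate of the opposite agent's policy
    from its observed behavior (plain behavior cloning); the resulting
    estimate can then inform the target agent's own policy."""
    logits = opponent_model(dialogue_states)
    return F.cross_entropy(logits, opponent_actions)
```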
arXiv Detail & Related papers (2020-04-21T03:13:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.