Towards A Unified Policy Abstraction Theory and Representation Learning
Approach in Markov Decision Processes
- URL: http://arxiv.org/abs/2209.07696v1
- Date: Fri, 16 Sep 2022 03:41:50 GMT
- Title: Towards A Unified Policy Abstraction Theory and Representation Learning
Approach in Markov Decision Processes
- Authors: Min Zhang, Hongyao Tang, Jianye Hao, Yan Zheng
- Abstract summary: We propose a unified policy abstraction theory, containing three types of policy abstraction associated to policy features at different levels.
We then generalize them to three policy metrics that quantify the distance (i.e., similarity) of policies.
For the empirical study, we investigate the efficacy of the proposed policy metrics and representations, in characterizing policy difference and conveying policy generalization respectively.
- Score: 39.94472154078338
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Lying at the heart of intelligent decision-making systems, how a policy is
represented and optimized is a fundamental problem. The root challenge in this
problem is the large scale and the high complexity of policy space, which
exacerbates the difficulty of policy learning especially in real-world
scenarios. Towards a desirable surrogate policy space, policy representation
in a low-dimensional latent space has recently shown its potential in
improving both policy evaluation and policy optimization. The key question
involved in these studies is by what criterion we should abstract the policy
space for desired compression and generalization. However, both the theory on
policy abstraction and the methodology on policy representation learning are
less studied in the literature. In this work, we make a first effort to fill
this gap. First, we propose a unified policy abstraction theory, containing
three types of policy abstraction associated with policy features at
different levels. Then, we generalize them to three policy metrics that
quantify the distance (i.e., similarity) of policies, for more convenient use
in learning policy representation. Further, we propose a policy representation
learning approach based on deep metric learning. For the empirical study, we
investigate the efficacy of the proposed policy metrics and representations, in
characterizing policy difference and conveying policy generalization
respectively. Our experiments are conducted in both policy optimization and
evaluation problems, containing trust-region policy optimization (TRPO),
diversity-guided evolution strategy (DGES) and off-policy evaluation (OPE).
Somewhat naturally, the experimental results indicate that there is no
universally optimal abstraction for all downstream learning problems, while the
influence-irrelevance policy abstraction can be a generally preferred choice.
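To make the representation learning idea concrete, below is a minimal sketch (not the authors' implementation) of a deep-metric-learning setup for policy representations: a policy encoder is trained so that distances between policy embeddings match a chosen policy metric, illustrated here with an L1 distance between the policies' action distributions on a set of probe states. All names (PolicyEncoder, policy_metric, metric_learning_loss) and the specific metric are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of deep-metric-learning-based policy representation:
# an encoder maps policy parameters to a low-dimensional embedding, trained
# so that embedding distances match a behaviour-level policy metric.
# All class/function names and the specific metric are illustrative.

import torch
import torch.nn as nn


class PolicyEncoder(nn.Module):
    """Maps a flattened policy parameter vector to a compact embedding."""

    def __init__(self, param_dim: int, embed_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(param_dim, 256), nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, policy_params: torch.Tensor) -> torch.Tensor:
        return self.net(policy_params)


def policy_metric(pi_a: torch.Tensor, pi_b: torch.Tensor) -> torch.Tensor:
    """Behaviour-level distance between two policies: mean L1 gap between
    their action distributions on the same probe states.
    pi_a, pi_b: (num_probe_states, num_actions) action probabilities."""
    return (pi_a - pi_b).abs().sum(dim=-1).mean()


def metric_learning_loss(encoder, params_a, params_b, pi_a, pi_b):
    """Regress the embedding distance onto the target policy-metric distance."""
    z_a, z_b = encoder(params_a), encoder(params_b)
    embed_dist = torch.norm(z_a - z_b, p=2)
    target_dist = policy_metric(pi_a, pi_b)
    return (embed_dist - target_dist) ** 2


# Toy usage with random tensors standing in for two policies.
encoder = PolicyEncoder(param_dim=100)
params_a, params_b = torch.randn(100), torch.randn(100)
pi_a = torch.softmax(torch.randn(8, 4), dim=-1)  # 8 probe states, 4 actions
pi_b = torch.softmax(torch.randn(8, 4), dim=-1)
loss = metric_learning_loss(encoder, params_a, params_b, pi_a, pi_b)
loss.backward()
```

In this sketch the policy metric plays the role of the abstraction criterion discussed above; substituting a different metric (for example, one based on value differences rather than action distributions) changes which policies end up close together in the learned embedding space.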
Related papers
- Value Enhancement of Reinforcement Learning via Efficient and Robust
Trust Region Optimization [14.028916306297928]
Reinforcement learning (RL) is a powerful machine learning technique that enables an intelligent agent to learn an optimal policy.
We propose a novel value enhancement method to improve the performance of a given initial policy computed by existing state-of-the-art RL algorithms.
arXiv Detail & Related papers (2023-01-05T18:43:40Z) - Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [94.89246810243053]
This paper studies offline policy learning, which aims at utilizing observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded.
We propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) instead of point estimates.
arXiv Detail & Related papers (2022-12-19T22:43:08Z) - CAMEO: Curiosity Augmented Metropolis for Exploratory Optimal Policies [62.39667564455059]
We consider and study a distribution of optimal policies.
In experimental simulations we show that CAMEO indeed obtains policies that all solve classic control problems.
We further show that the different policies we sample present different risk profiles, corresponding to interesting practical applications in interpretability.
arXiv Detail & Related papers (2022-05-19T09:48:56Z) - PG3: Policy-Guided Planning for Generalized Policy Generation [25.418642034856365]
We study generalized policy search-based methods with a focus on the score function used to guide the search over policies.
The main idea behind our approach is that a candidate policy should be used to guide planning on training problems as a mechanism for evaluating that candidate.
Empirical results in six domains confirm that PG3 learns generalized policies more efficiently and effectively than several baselines.
arXiv Detail & Related papers (2022-04-21T21:59:25Z) - Supervised Off-Policy Ranking [145.3039527243585]
Off-policy evaluation (OPE) leverages data generated by other policies to evaluate a target policy.
We propose supervised off-policy ranking that learns a policy scoring model by correctly ranking training policies with known performance.
Our method outperforms strong baseline OPE methods in terms of both rank correlation and performance gap between the truly best and the best of the ranked top three policies.
arXiv Detail & Related papers (2021-07-03T07:01:23Z) - Learn Goal-Conditioned Policy with Intrinsic Motivation for Deep
Reinforcement Learning [9.014110264448371]
We propose a novel unsupervised learning approach named goal-conditioned policy with intrinsic motivation (GPIM).
GPIM jointly learns both an abstract-level policy and a goal-conditioned policy.
Experiments on various robotic tasks demonstrate the effectiveness and efficiency of our proposed GPIM method.
arXiv Detail & Related papers (2021-04-11T16:26:10Z) - Distributionally Robust Batch Contextual Bandits [20.667213458836734]
Policy learning using historical observational data is an important problem that has found widespread applications.
Existing literature rests on the crucial assumption that the future environment where the learned policy will be deployed is the same as the past environment.
In this paper, we lift this assumption and aim to learn a distributionally robust policy with incomplete observational data.
arXiv Detail & Related papers (2020-06-10T03:11:40Z) - Efficient Evaluation of Natural Stochastic Policies in Offline
Reinforcement Learning [80.42316902296832]
We study the efficient off-policy evaluation of natural policies, which are defined in terms of deviations from the behavior policy.
This is a departure from the literature on off-policy evaluation, where most work considers the evaluation of explicitly specified policies.
arXiv Detail & Related papers (2020-06-06T15:08:24Z) - Policy Evaluation Networks [50.53250641051648]
We introduce a scalable, differentiable fingerprinting mechanism that retains essential policy information in a concise embedding.
Our empirical results demonstrate that combining these three elements can produce policies that outperform those that generated the training data.
arXiv Detail & Related papers (2020-02-26T23:00:27Z)