Reward-Free Policy Space Compression for Reinforcement Learning
- URL: http://arxiv.org/abs/2202.11079v1
- Date: Tue, 22 Feb 2022 18:11:57 GMT
- Title: Reward-Free Policy Space Compression for Reinforcement Learning
- Authors: Mirco Mutti, Stefano Del Col, Marcello Restelli
- Abstract summary: In reinforcement learning, we encode the potential behaviors of an agent interacting with an environment into an infinite set of policies.
We seek a reward-free compression of the policy space into a finite set of representative policies.
We show that this compression of the policy space can be formulated as a set cover problem, and it is inherently NP-hard.
- Score: 39.04317877999891
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In reinforcement learning, we encode the potential behaviors of an agent
interacting with an environment into an infinite set of policies, the policy
space, typically represented by a family of parametric functions. Dealing with
such a policy space is a hefty challenge, which often causes sample and
computation inefficiencies. However, we argue that a limited number of policies
are actually relevant when we also account for the structure of the environment
and of the policy parameterization, as many of them would induce very similar
interactions, i.e., state-action distributions. In this paper, we seek a
reward-free compression of the policy space into a finite set of representative
policies, such that, given any policy $\pi$, the minimum Rényi divergence
between the state-action distributions of the representative policies and the
state-action distribution of $\pi$ is bounded. We show that this compression of
the policy space can be formulated as a set cover problem, and it is inherently
NP-hard. Nonetheless, we propose a game-theoretic reformulation for which a
locally optimal solution can be efficiently found by iteratively stretching the
compressed space to cover an adversarial policy. Finally, we provide an
empirical evaluation to illustrate the compression procedure in simple domains,
and its ripple effects in reinforcement learning.
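
Spelled out, the coverage criterion described in the abstract amounts to the condition below. This is a sketch reconstructed from the abstract only: the Rényi order $\alpha$, the argument order of the divergence, and the threshold symbol $\epsilon$ are assumed notation, not taken from the paper.

```latex
% Reward-free compression condition (sketch). \Pi is the policy space,
% \Pi_K = \{\pi_1, \dots, \pi_K\} the compressed set of representatives,
% d_\pi the state-action distribution induced by \pi, D_\alpha the Renyi
% divergence, and \epsilon the coverage threshold (assumed notation).
\forall \pi \in \Pi: \quad
  \min_{k \in \{1, \dots, K\}} D_\alpha\!\left(d_{\pi_k} \,\middle\|\, d_\pi\right) \le \epsilon
```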
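The set-cover view and the "stretch to cover an adversarial policy" idea can be illustrated with the minimal Python sketch below. It is not the authors' algorithm: it assumes the policy space is already discretized into a finite list of candidate state-action distributions, fixes the Rényi order to alpha = 2, and grows the representative set greedily by adding the worst-covered (adversarial) candidate until every candidate is within the threshold.

```python
import numpy as np

def renyi_divergence(p, q, alpha=2.0, eps=1e-12):
    """Renyi divergence D_alpha(p || q) between two discrete
    state-action distributions given as 1-D arrays that sum to 1."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return np.log(np.sum(p**alpha / q**(alpha - 1.0))) / (alpha - 1.0)

def compress_policy_space(candidate_dists, threshold, max_reps=100):
    """Greedy cover-style compression (illustrative only): start from one
    representative and repeatedly add the worst-covered 'adversarial'
    candidate until every candidate is within `threshold` of some
    representative under the Renyi divergence."""
    representatives = [candidate_dists[0]]
    while len(representatives) < max_reps:
        # Divergence from each candidate to its closest representative.
        gaps = [min(renyi_divergence(d, r) for r in representatives)
                for d in candidate_dists]
        worst = int(np.argmax(gaps))      # the adversarial policy
        if gaps[worst] <= threshold:      # every candidate is covered
            break
        representatives.append(candidate_dists[worst])
    return representatives

# Toy usage: 50 random state-action distributions over 10 (s, a) pairs.
rng = np.random.default_rng(0)
dists = [rng.dirichlet(np.ones(10)) for _ in range(50)]
reps = compress_policy_space(dists, threshold=0.5)
print(f"compressed 50 candidates into {len(reps)} representatives")
```

The greedy loop mirrors the set-cover structure: each added representative "covers" the candidates within the divergence threshold, and the adversarial pick targets the largest remaining coverage gap.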
Related papers
- Performance of NPG in Countable State-Space Average-Cost RL [12.949520455740092]
We consider policy optimization methods in reinforcement learning settings where the state space is arbitrarily large.
The motivation arises from control problems in communication networks, matching markets, and other queueing systems.
arXiv Detail & Related papers (2024-05-30T20:29:52Z)
- Policy-Space Search: Equivalences, Improvements, and Compression [5.801044612920816]
Fully-observable non-deterministic (FOND) planning is at the core of artificial intelligence planning with uncertainty.
A* with Non-Determinism (AND*) is a FOND planner that generalizes A* for FOND planning.
arXiv Detail & Related papers (2024-03-28T23:40:20Z)
- Policy Bifurcation in Safe Reinforcement Learning [35.75059015441807]
In some scenarios, the feasible policy should be discontinuous or multi-valued; interpolating between discontinuous local optima can inevitably lead to constraint violations.
We are the first to identify the generating mechanism of such a phenomenon, and employ topological analysis to rigorously prove the existence of bifurcation in safe RL.
We propose a safe RL algorithm called multimodal policy optimization (MUPO), which uses a Gaussian mixture distribution as the policy output (a minimal sketch of such a policy head follows this list).
arXiv Detail & Related papers (2024-03-19T15:54:38Z)
- Off-Policy Evaluation for Large Action Spaces via Policy Convolution [60.6953713877886]
The Policy Convolution (PC) family of estimators uses latent structure within actions to strategically convolve the logging and target policies.
Experiments on synthetic and benchmark datasets demonstrate remarkable mean squared error (MSE) improvements when using PC.
arXiv Detail & Related papers (2023-10-24T01:00:01Z)
- Policy Dispersion in Non-Markovian Environment [53.05904889617441]
This paper aims to learn diverse policies from the history of state-action pairs in a non-Markovian environment.
We first adopt a transformer-based method to learn policy embeddings.
Then, we stack the policy embeddings to construct a dispersion matrix to induce a set of diverse policies.
arXiv Detail & Related papers (2023-02-28T11:58:39Z)
- Dealing with Non-Stationarity in Multi-Agent Reinforcement Learning via Trust Region Decomposition [52.06086375833474]
Non-stationarity is a thorny issue in multi-agent reinforcement learning.
We introduce a $\delta$-stationarity measurement to explicitly model the stationarity of a policy sequence.
We propose a trust region decomposition network based on message passing to estimate the joint policy divergence.
arXiv Detail & Related papers (2021-02-21T14:46:50Z)
- Policy Optimization as Online Learning with Mediator Feedback [46.845765216238135]
Policy Optimization (PO) is a widely used approach to address continuous control tasks.
In this paper, we introduce the notion of mediator feedback that frames PO as an online learning problem over the policy space.
We propose an algorithm, RANDomized-exploration policy Optimization via Multiple Importance Sampling with Truncation (RANDOMIST), for regret minimization.
arXiv Detail & Related papers (2020-12-15T11:34:29Z)
- Variational Policy Propagation for Multi-agent Reinforcement Learning [68.26579560607597]
We propose a collaborative multi-agent reinforcement learning algorithm named variational policy propagation (VPP) to learn a joint policy through interactions among agents.
We prove that the joint policy is a Markov Random Field under some mild conditions, which in turn reduces the policy space effectively.
We integrate variational inference as special differentiable layers in the policy, such that actions can be efficiently sampled from the Markov Random Field and the overall policy is differentiable.
arXiv Detail & Related papers (2020-04-19T15:42:55Z)
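
As referenced in the Policy Bifurcation entry above, the sketch below shows one way a Gaussian-mixture policy output can be implemented. All names and shapes are hypothetical illustrations; this is not MUPO's implementation, only a minimal example of sampling an action from a mixture-of-Gaussians policy head.

```python
import numpy as np

def gaussian_mixture_policy(features, params, rng=None):
    """Sample an action from a Gaussian-mixture policy head.
    `params` holds, per component, a weight-logit matrix, a mean matrix,
    and log-stds (hypothetical names and shapes, for illustration only)."""
    rng = rng or np.random.default_rng()
    logits = params["logits"] @ features              # (K,) component logits
    means = params["means"] @ features                # (K, act_dim) component means
    log_stds = params["log_stds"]                     # (K, act_dim) per-component log-stds
    weights = np.exp(logits - logits.max())           # softmax -> mixture weights
    weights /= weights.sum()
    k = rng.choice(len(weights), p=weights)           # pick a mixture component
    return rng.normal(means[k], np.exp(log_stds[k]))  # sample the action

# Toy usage: K=2 components, 3 state features, 1-D action.
params = {
    "logits":   np.zeros((2, 3)),                               # (K, feat)
    "means":    np.stack([np.ones((1, 3)), -np.ones((1, 3))]),  # (K, act, feat)
    "log_stds": np.full((2, 1), -1.0),                          # (K, act)
}
action = gaussian_mixture_policy(np.array([1.0, 0.5, -0.5]), params)
print(action)
```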