Reward-Free Policy Space Compression for Reinforcement Learning
- URL: http://arxiv.org/abs/2202.11079v1
- Date: Tue, 22 Feb 2022 18:11:57 GMT
- Title: Reward-Free Policy Space Compression for Reinforcement Learning
- Authors: Mirco Mutti, Stefano Del Col, Marcello Restelli
- Abstract summary: In reinforcement learning, we encode the potential behaviors of an agent interacting with an environment into an infinite set of policies.
We seek a reward-free compression of the policy space into a finite set of representative policies.
We show that this compression of the policy space can be formulated as a set cover problem, and it is inherently NP-hard.
- Score: 39.04317877999891
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In reinforcement learning, we encode the potential behaviors of an agent
interacting with an environment into an infinite set of policies, the policy
space, typically represented by a family of parametric functions. Dealing with
such a policy space is a hefty challenge, which often causes sample and
computation inefficiencies. However, we argue that a limited number of policies
are actually relevant when we also account for the structure of the environment
and of the policy parameterization, as many of them would induce very similar
interactions, i.e., state-action distributions. In this paper, we seek a
reward-free compression of the policy space into a finite set of representative
policies, such that, given any policy $\pi$, the minimum Rényi divergence
between the state-action distributions of the representative policies and the
state-action distribution of $\pi$ is bounded. We show that this compression of
the policy space can be formulated as a set cover problem, and it is inherently
NP-hard. Nonetheless, we propose a game-theoretic reformulation for which a
locally optimal solution can be efficiently found by iteratively stretching the
compressed space to cover an adversarial policy. Finally, we provide an
empirical evaluation to illustrate the compression procedure in simple domains,
and its ripple effects in reinforcement learning.
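
Spelled out, the coverage criterion described in the abstract amounts to the condition below. This is a sketch reconstructed from the abstract only: the Rényi order $\alpha$, the argument order of the divergence, and the threshold symbol $\epsilon$ are assumed notation, not taken from the paper.

```latex
% Reward-free compression condition (sketch). \Pi is the policy space,
% \Pi_K = \{\pi_1, \dots, \pi_K\} the compressed set of representatives,
% d_\pi the state-action distribution induced by \pi, D_\alpha the Renyi
% divergence, and \epsilon the coverage threshold (assumed notation).
\forall \pi \in \Pi: \quad
  \min_{k \in \{1, \dots, K\}} D_\alpha\!\left(d_{\pi_k} \,\middle\|\, d_\pi\right) \le \epsilon
```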
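The set-cover view and the "stretch to cover an adversarial policy" idea can be illustrated with the minimal Python sketch below. It is not the authors' algorithm: it assumes the policy space is already discretized into a finite list of candidate state-action distributions, fixes the Rényi order to alpha = 2, and grows the representative set greedily by adding the worst-covered (adversarial) candidate until every candidate is within the threshold.

```python
import numpy as np

def renyi_divergence(p, q, alpha=2.0, eps=1e-12):
    """Renyi divergence D_alpha(p || q) between two discrete
    state-action distributions given as 1-D arrays that sum to 1."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return np.log(np.sum(p**alpha / q**(alpha - 1.0))) / (alpha - 1.0)

def compress_policy_space(candidate_dists, threshold, max_reps=100):
    """Greedy cover-style compression (illustrative only): start from one
    representative and repeatedly add the worst-covered 'adversarial'
    candidate until every candidate is within `threshold` of some
    representative under the Renyi divergence."""
    representatives = [candidate_dists[0]]
    while len(representatives) < max_reps:
        # Divergence from each candidate to its closest representative.
        gaps = [min(renyi_divergence(d, r) for r in representatives)
                for d in candidate_dists]
        worst = int(np.argmax(gaps))      # the adversarial policy
        if gaps[worst] <= threshold:      # every candidate is covered
            break
        representatives.append(candidate_dists[worst])
    return representatives

# Toy usage: 50 random state-action distributions over 10 (s, a) pairs.
rng = np.random.default_rng(0)
dists = [rng.dirichlet(np.ones(10)) for _ in range(50)]
reps = compress_policy_space(dists, threshold=0.5)
print(f"compressed 50 candidates into {len(reps)} representatives")
```

The greedy loop mirrors the set-cover structure: each added representative "covers" the candidates within the divergence threshold, and the adversarial pick targets the largest remaining coverage gap.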
Related papers
- Performance of NPG in Countable State-Space Average-Cost RL [12.949520455740092]
We consider policy optimization methods in reinforcement learning settings where the state space is arbitrarily large.
The motivation arises from control problems in communication networks, matching markets, and other queueing systems.
arXiv Detail & Related papers (2024-05-30T20:29:52Z)
- Policy-Space Search: Equivalences, Improvements, and Compression [5.801044612920816]
Fully-observable non-deterministic (FOND) planning is at the core of artificial intelligence planning with uncertainty.
A* with Non-Determinism (AND*) is a FOND planner that generalizes A* for FOND planning.
arXiv Detail & Related papers (2024-03-28T23:40:20Z)
- Policy Bifurcation in Safe Reinforcement Learning [35.75059015441807]
In some scenarios, the feasible policy should be discontinuous or multi-valued; interpolating between discontinuous local optima can inevitably lead to constraint violations.
We are the first to identify the generating mechanism of such a phenomenon, and employ topological analysis to rigorously prove the existence of bifurcation in safe RL.
We propose a safe RL algorithm called multimodal policy optimization (MUPO), which uses a Gaussian mixture distribution as the policy output (a minimal sketch of such a policy head follows this list).
arXiv Detail & Related papers (2024-03-19T15:54:38Z)
- Off-Policy Evaluation for Large Action Spaces via Policy Convolution [60.6953713877886]
The Policy Convolution (PC) family of estimators uses latent structure within actions to strategically convolve the logging and target policies.
Experiments on synthetic and benchmark datasets demonstrate remarkable mean squared error (MSE) improvements when using PC.
arXiv Detail & Related papers (2023-10-24T01:00:01Z)
- Policy Dispersion in Non-Markovian Environment [53.05904889617441]
This paper aims to learn diverse policies from the history of state-action pairs in a non-Markovian environment.
We first adopt a transformer-based method to learn policy embeddings.
Then, we stack the policy embeddings to construct a dispersion matrix to induce a set of diverse policies.
arXiv Detail & Related papers (2023-02-28T11:58:39Z)
- Dealing with Non-Stationarity in Multi-Agent Reinforcement Learning via Trust Region Decomposition [52.06086375833474]
Non-stationarity is a thorny issue in multi-agent reinforcement learning.
We introduce a $\delta$-stationarity measurement to explicitly model the stationarity of a policy sequence.
We propose a trust region decomposition network based on message passing to estimate the joint policy divergence.
arXiv Detail & Related papers (2021-02-21T14:46:50Z)
- Policy Optimization as Online Learning with Mediator Feedback [46.845765216238135]
Policy Optimization (PO) is a widely used approach to address continuous control tasks.
In this paper, we introduce the notion of mediator feedback that frames PO as an online learning problem over the policy space.
We propose an algorithm, RANDomized-exploration policy Optimization via Multiple Importance Sampling with Truncation (RANDOMIST), for regret minimization.
arXiv Detail & Related papers (2020-12-15T11:34:29Z)
- Variational Policy Propagation for Multi-agent Reinforcement Learning [68.26579560607597]
We propose a collaborative multi-agent reinforcement learning algorithm named variational policy propagation (VPP) to learn a joint policy through interactions among agents.
We prove that the joint policy is a Markov Random Field under some mild conditions, which in turn reduces the policy space effectively.
We integrate variational inference as special differentiable layers in the policy, such that actions can be efficiently sampled from the Markov Random Field and the overall policy is differentiable.
arXiv Detail & Related papers (2020-04-19T15:42:55Z)
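
As referenced in the Policy Bifurcation entry above, the sketch below shows one way a Gaussian-mixture policy output can be implemented. All names and shapes are hypothetical illustrations; this is not MUPO's implementation, only a minimal example of sampling an action from a mixture-of-Gaussians policy head.

```python
import numpy as np

def gaussian_mixture_policy(features, params, rng=None):
    """Sample an action from a Gaussian-mixture policy head.
    `params` holds, per component, a weight-logit matrix, a mean matrix,
    and log-stds (hypothetical names and shapes, for illustration only)."""
    rng = rng or np.random.default_rng()
    logits = params["logits"] @ features              # (K,) component logits
    means = params["means"] @ features                # (K, act_dim) component means
    log_stds = params["log_stds"]                     # (K, act_dim) per-component log-stds
    weights = np.exp(logits - logits.max())           # softmax -> mixture weights
    weights /= weights.sum()
    k = rng.choice(len(weights), p=weights)           # pick a mixture component
    return rng.normal(means[k], np.exp(log_stds[k]))  # sample the action

# Toy usage: K=2 components, 3 state features, 1-D action.
params = {
    "logits":   np.zeros((2, 3)),                               # (K, feat)
    "means":    np.stack([np.ones((1, 3)), -np.ones((1, 3))]),  # (K, act, feat)
    "log_stds": np.full((2, 1), -1.0),                          # (K, act)
}
action = gaussian_mixture_policy(np.array([1.0, 0.5, -0.5]), params)
print(action)
```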