Categorical Policies: Multimodal Policy Learning and Exploration in Continuous Control
- URL: http://arxiv.org/abs/2508.13922v1
- Date: Tue, 19 Aug 2025 15:18:01 GMT
- Title: Categorical Policies: Multimodal Policy Learning and Exploration in Continuous Control
- Authors: SM Mazharul Islam, Manfred Huber
- Abstract summary: We introduce Categorical Policies to model multimodal behavior modes with an intermediate categorical distribution. By utilizing a latent categorical distribution to select the behavior mode, our approach naturally expresses multimodality while remaining fully differentiable via sampling tricks. Our results indicate that the Categorical distribution serves as a powerful tool for structured exploration and multimodal behavior representation in continuous control.
- Score: 1.7495213911983414
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A policy in deep reinforcement learning (RL), whether deterministic or stochastic, is commonly parameterized as a single Gaussian distribution, limiting the learned behavior to be unimodal. However, many practical decision-making problems favor a multimodal policy that facilitates robust exploration of the environment and thus helps address learning challenges arising from sparse rewards, complex dynamics, or the need for strategic adaptation to varying contexts. This issue is exacerbated in continuous control domains, where exploration usually takes place in the vicinity of the predicted optimal action, either through additive Gaussian noise or the sampling process of a stochastic policy. In this paper, we introduce Categorical Policies, which model multimodal behavior with an intermediate categorical distribution and then generate an output action conditioned on the sampled mode. We explore two sampling schemes that ensure a differentiable discrete latent structure while maintaining efficient gradient-based optimization. By utilizing a latent categorical distribution to select the behavior mode, our approach naturally expresses multimodality while remaining fully differentiable via these sampling tricks. We evaluate our multimodal policy on a set of DeepMind Control Suite environments, demonstrating that, through better exploration, our learned policies converge faster and outperform standard Gaussian policies. Our results indicate that the Categorical distribution serves as a powerful tool for structured exploration and multimodal behavior representation in continuous control.
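The abstract does not name the two sampling schemes; a standard trick for keeping a categorical latent differentiable is the straight-through Gumbel-Softmax estimator. The PyTorch sketch below illustrates a mode-selecting policy in that spirit; the architecture, layer sizes, and all names are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch (not the paper's code): a latent categorical distribution
# selects a behavior mode via straight-through Gumbel-Softmax, and a per-mode
# head produces the continuous action. All names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CategoricalPolicy(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, n_modes: int = 4, tau: float = 1.0):
        super().__init__()
        self.tau = tau  # Gumbel-Softmax temperature
        self.trunk = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU())
        self.mode_logits = nn.Linear(256, n_modes)   # categorical over behavior modes
        self.mode_heads = nn.ModuleList(             # one action head per mode
            [nn.Linear(256, act_dim) for _ in range(n_modes)]
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        h = self.trunk(obs)
        logits = self.mode_logits(h)
        # Straight-through Gumbel-Softmax: a discrete one-hot sample in the
        # forward pass, a continuous relaxation in the backward pass, so the
        # mode choice stays differentiable.
        z = F.gumbel_softmax(logits, tau=self.tau, hard=True)               # (B, n_modes)
        actions = torch.stack([head(h) for head in self.mode_heads], dim=1)  # (B, n_modes, act_dim)
        return torch.tanh((z.unsqueeze(-1) * actions).sum(dim=1))           # sampled mode's action
```

With `hard=True`, the forward pass commits to exactly one mode head while gradients flow through the soft relaxation; the temperature `tau` trades sharpness of the mode choice against gradient bias.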
Related papers
- Decision Flow Policy Optimization [53.825268058199825]
We show that generative models can effectively model complex multi-modal action distributions and achieve superior robotic control in continuous action spaces. Previous methods usually adopt generative models as behavior models to fit state-conditioned action distributions from datasets. We propose Decision Flow, a unified framework that integrates multi-modal action distribution modeling and policy optimization.
arXiv Detail & Related papers (2025-05-26T03:42:20Z)
- Learning on One Mode: Addressing Multi-modality in Offline Reinforcement Learning [9.38848713730931]
Offline reinforcement learning seeks to learn optimal policies from static datasets without interacting with the environment. Existing methods often assume unimodal behaviour policies, leading to suboptimal performance when this assumption is violated. We propose weighted imitation Learning on One Mode (LOM), a novel approach that focuses on learning from a single, promising mode of the behaviour policy.
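As a loose illustration only, one way to realize "learning on one mode" is to score candidate modes of a fitted behavior model with a critic and imitate the best-scoring one; the sketch below follows that idea under assumed inputs (`mode_means` and `q_net` are hypothetical) and is not the actual LOM algorithm.

```python
# A loosely hedged sketch, not LOM itself: given per-state mode means of a
# fitted behavior model, pick the mode the critic scores highest and regress
# the policy toward it. All names here are hypothetical.
import torch

def best_mode_target(mode_means, q_net, obs):
    # mode_means: (B, M, act_dim) candidate action modes for each state in obs
    B, M, act_dim = mode_means.shape
    obs_rep = obs.unsqueeze(1).expand(-1, M, -1).reshape(B * M, -1)
    q = q_net(obs_rep, mode_means.reshape(B * M, act_dim)).view(B, M)
    best = q.argmax(dim=1)                          # most promising mode per state
    return mode_means[torch.arange(B), best]        # (B, act_dim) imitation target

# Behavior cloning toward the selected mode only:
# loss = ((policy(obs) - best_mode_target(mode_means, q_net, obs)) ** 2).mean()
```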
arXiv Detail & Related papers (2024-12-04T11:57:36Z)
- Sampling from Energy-based Policies using Diffusion [14.542411354617983]
We introduce a diffusion-based approach for sampling from energy-based policies, where the negative Q-function defines the energy function. We show that our approach enhances sample efficiency in continuous control tasks and captures multimodal behaviors, addressing key limitations of existing methods.
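The method itself is diffusion-based; as a simpler, hedged stand-in that samples from the same kind of energy, E(a) = -Q(s, a), the sketch below runs plain Langevin dynamics on the action. `q_net`, the step size, and the noise scale are illustrative assumptions.

```python
# A hedged stand-in for the paper's diffusion sampler: Langevin dynamics on
# the energy E(a) = -Q(s, a), so samples concentrate where Q is high.
import torch

def langevin_sample_action(q_net, obs, act_dim, steps=50, step_size=0.01, noise_scale=0.1):
    """Approximately sample from pi(a|s) proportional to exp(Q(s, a))."""
    a = torch.randn(obs.shape[0], act_dim, requires_grad=True)
    for _ in range(steps):
        energy = -q_net(obs, a).sum()            # E(a) = -Q(s, a)
        (grad,) = torch.autograd.grad(energy, a)
        with torch.no_grad():
            # Gradient step downhill on the energy plus injected noise
            a = a - step_size * grad + noise_scale * torch.randn_like(a)
            a = a.clamp(-1.0, 1.0)               # keep actions in bounds
        a.requires_grad_(True)
    return a.detach()
```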
arXiv Detail & Related papers (2024-10-02T08:09:33Z)
- Discretizing Continuous Action Space with Unimodal Probability Distributions for On-Policy Reinforcement Learning [20.48276559928517]
We introduce a straightforward architecture that constrains the discrete policy to be unimodal using Poisson probability distributions.
We conduct experiments to show that the discrete policy with the unimodal probability distribution provides significantly faster convergence and higher performance for on-policy reinforcement learning algorithms.
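A minimal sketch of the discretization idea, under assumptions: each action dimension is split into K bins, and a truncated Poisson PMF, which is unimodal over its support, is placed on the bins; in the paper's setting the rate would come from the policy network. The bin count and the mapping to [-1, 1] are illustrative.

```python
# A minimal sketch: a truncated Poisson PMF over K action bins gives a
# unimodal discrete policy. Names and the action mapping are illustrative.
import torch

def poisson_bin_probs(rate: torch.Tensor, n_bins: int) -> torch.Tensor:
    """Truncated Poisson PMF over bins 0..n_bins-1, renormalized."""
    k = torch.arange(n_bins, dtype=rate.dtype)  # bin indices
    log_pmf = k * torch.log(rate.unsqueeze(-1)) - rate.unsqueeze(-1) - torch.lgamma(k + 1)
    return torch.softmax(log_pmf, dim=-1)       # renormalize the truncated PMF

rate = torch.tensor([3.5])                 # a policy network would output this rate
probs = poisson_bin_probs(rate, n_bins=11)
bin_idx = torch.distributions.Categorical(probs=probs).sample()
action = 2.0 * bin_idx.float() / 10 - 1.0  # map bins 0..10 to [-1, 1]
```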
arXiv Detail & Related papers (2024-08-01T06:06:53Z)
- Learning Multimodal Behaviors from Scratch with Diffusion Policy Gradient [26.675822002049372]
Deep Diffusion Policy Gradient (DDiffPG) is a novel actor-critic algorithm that learns multimodal policies from scratch.
DDiffPG forms a multimodal training batch and utilizes mode-specific Q-learning to mitigate the inherent greediness of the RL objective.
Our approach further allows the policy to be conditioned on mode-specific embeddings to explicitly control the learned modes.
arXiv Detail & Related papers (2024-06-02T09:32:28Z)
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
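The practice the summary alludes to, learning a stochastic policy but deploying its deterministic version, looks roughly like this hedged sketch; the network and noise parameterization are illustrative assumptions.

```python
# A minimal sketch of "learn stochastic, deploy deterministic": a Gaussian
# policy whose noise is used only during training. Names are illustrative.
import math
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, sigma: float = 0.2):
        super().__init__()
        self.mean = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
        # Learnable exploration level; tuning it trades sample complexity
        # against the performance of the deployed deterministic policy.
        self.log_sigma = nn.Parameter(torch.full((act_dim,), math.log(sigma)))

    def act(self, obs: torch.Tensor, deterministic: bool = False) -> torch.Tensor:
        mu = self.mean(obs)
        if deterministic:        # deployment: drop the exploration noise
            return mu
        return mu + self.log_sigma.exp() * torch.randn_like(mu)  # training: explore
```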
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Reparameterized Policy Learning for Multimodal Trajectory Optimization [61.13228961771765]
We investigate the challenge of parametrizing policies for reinforcement learning in high-dimensional continuous action spaces.
We propose a principled framework that models the continuous RL policy as a generative model of optimal trajectories.
We present a practical model-based RL method, which leverages the multimodal policy parameterization and learned world model.
arXiv Detail & Related papers (2023-07-20T09:05:46Z)
- Policy Dispersion in Non-Markovian Environment [53.05904889617441]
This paper aims to learn diverse policies from the history of state-action pairs in a non-Markovian environment.
We first adopt a transformer-based method to learn policy embeddings.
Then, we stack the policy embeddings to construct a dispersion matrix to induce a set of diverse policies.
arXiv Detail & Related papers (2023-02-28T11:58:39Z)
- CAMEO: Curiosity Augmented Metropolis for Exploratory Optimal Policies [62.39667564455059]
We consider and study a distribution of optimal policies.
In experimental simulations we show that CAMEO indeed obtains policies that all solve classic control problems.
We further show that the different policies we sample present different risk profiles, corresponding to interesting practical applications in interpretability.
arXiv Detail & Related papers (2022-05-19T09:48:56Z)
- Variational Policy Propagation for Multi-agent Reinforcement Learning [68.26579560607597]
We propose a collaborative multi-agent reinforcement learning algorithm named variational policy propagation (VPP) to learn a joint policy through interactions among agents.
We prove that the joint policy is a Markov Random Field under some mild conditions, which in turn reduces the policy space effectively.
We integrate variational inference as special differentiable layers in the policy, such that actions can be efficiently sampled from the Markov Random Field and the overall policy is differentiable.
arXiv Detail & Related papers (2020-04-19T15:42:55Z)