Improving and Accelerating Offline RL in Large Discrete Action Spaces with Structured Policy Initialization
- URL: http://arxiv.org/abs/2601.04441v1
- Date: Wed, 07 Jan 2026 22:57:21 GMT
- Title: Improving and Accelerating Offline RL in Large Discrete Action Spaces with Structured Policy Initialization
- Authors: Matthew Landers, Taylor W. Killian, Thomas Hartvigsen, Afsaneh Doryab
- Abstract summary: Reinforcement learning in discrete action spaces requires searching over exponentially many joint actions to simultaneously select multiple sub-actions that form coherent combinations. Existing approaches either simplify policy learning by assuming independence across sub-actions, or attempt to learn action structure and control jointly. We introduce Structured Policy Initialization (SPIN), a two-stage framework that first pre-trains an Action Structure Model (ASM) to capture the manifold of valid actions, then freezes this representation and trains lightweight policy heads for control.
- Score: 11.646124619395486
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning in discrete combinatorial action spaces requires searching over exponentially many joint actions to simultaneously select multiple sub-actions that form coherent combinations. Existing approaches either simplify policy learning by assuming independence across sub-actions, which often yields incoherent or invalid actions, or attempt to learn action structure and control jointly, which is slow and unstable. We introduce Structured Policy Initialization (SPIN), a two-stage framework that first pre-trains an Action Structure Model (ASM) to capture the manifold of valid actions, then freezes this representation and trains lightweight policy heads for control. On challenging discrete DM Control benchmarks, SPIN improves average return by up to 39% over the state of the art while reducing time to convergence by up to 12.8$\times$.
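The abstract's two-stage recipe (pre-train an Action Structure Model on logged actions, freeze it, then fit lightweight policy heads for control) can be illustrated with a minimal sketch. Everything below is an assumption for illustration, not the authors' implementation: the class names (`ActionStructureModel`, `PolicyHead`), the choice of an autoencoder-style ASM, and the toy dimensions are hypothetical, and the stage-2 offline RL objective used to train the head is omitted because the abstract does not specify it.

```python
import torch
import torch.nn as nn

# Toy setup (illustrative, not the paper's configuration): each joint action is a
# vector of K discrete sub-actions, each drawn from a vocabulary of size V.
K, V, LATENT, STATE_DIM = 6, 10, 32, 17

class ActionStructureModel(nn.Module):
    """Stage 1 (hypothetical form): autoencoder over logged joint actions, so the
    latent space captures which sub-action combinations are valid/coherent."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(V, 16)
        self.encoder = nn.Sequential(nn.Linear(K * 16, 64), nn.ReLU(), nn.Linear(64, LATENT))
        self.decoder = nn.Sequential(nn.Linear(LATENT, 64), nn.ReLU(), nn.Linear(64, K * V))

    def encode(self, actions):                        # actions: (B, K) integer sub-actions
        return self.encoder(self.embed(actions).flatten(1))

    def forward(self, actions):
        return self.decoder(self.encode(actions)).view(-1, K, V)

class PolicyHead(nn.Module):
    """Stage 2: lightweight head mapping a state into the frozen ASM's latent space."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, LATENT))

    def forward(self, state):
        return self.net(state)

def pretrain_asm(asm, actions, epochs=50):
    """Stage 1: reconstruct logged joint actions from the offline dataset."""
    opt = torch.optim.Adam(asm.parameters(), lr=1e-3)
    for _ in range(epochs):
        logits = asm(actions)                                          # (N, K, V)
        loss = nn.functional.cross_entropy(logits.reshape(-1, V), actions.reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()

def select_action(asm, head, state):
    """Decode the head's latent through the frozen ASM; per-sub-action argmax."""
    with torch.no_grad():
        logits = asm.decoder(head(state)).view(-1, K, V)
        return logits.argmax(dim=-1)                                   # (B, K) joint action

# Illustrative offline data: random states and logged joint actions.
logged_actions = torch.randint(0, V, (256, K))
states = torch.randn(256, STATE_DIM)

asm = ActionStructureModel()
pretrain_asm(asm, logged_actions)
for p in asm.parameters():             # freeze the action-structure representation
    p.requires_grad_(False)

head = PolicyHead()                    # only this small head would be trained for control
print(select_action(asm, head, states[:4]).shape)                      # torch.Size([4, 6])
```

The design point the sketch tries to capture is the one the abstract emphasizes: the ASM is fit once on the offline action data and then frozen, so control learning only updates the small head and searches a low-dimensional latent space rather than the exponential joint-action space.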
Related papers
- Primary-Fine Decoupling for Action Generation in Robotic Imitation [91.2899765310853]
Multi-modal distributions in robotic manipulation action sequences pose critical challenges for imitation learning. We propose Primary-Fine Decoupling for Action Generation (PF-DAG), a two-stage framework that decouples coarse action consistency from fine-grained variations. PF-DAG outperforms state-of-the-art baselines across 56 tasks from the Adroit, DexArt, and MetaWorld benchmarks.
arXiv Detail & Related papers (2026-02-25T08:36:45Z)
- Breaking the Grid: Distance-Guided Reinforcement Learning in Large Discrete and Hybrid Action Spaces [4.395837214164745]
We propose Distance-Guided Reinforcement Learning (DGRL) to enable efficient RL in spaces with up to $10^{20}$ actions. We demonstrate performance improvements of up to 66% against state-of-the-art benchmarks across regularly and irregularly structured environments.
arXiv Detail & Related papers (2026-02-09T13:05:07Z)
- Preference Conditioned Multi-Objective Reinforcement Learning: Decomposed, Diversity-Driven Policy Optimization [2.595968385299781]
Multi-objective reinforcement learning seeks to learn policies that balance multiple, often conflicting objectives. We introduce D3PO, a PPO-based framework that reorganizes multi-objective policy optimization to address these issues directly. D3PO preserves per-objective learning signals through a decomposed optimization pipeline and integrates preferences only after stabilization.
arXiv Detail & Related papers (2026-02-08T01:45:01Z)
- Integrating Diverse Assignment Strategies into DETRs [61.61489761918158]
Label assignment is a critical component in object detectors, particularly within DETR-style frameworks. We propose LoRA-DETR, a flexible and lightweight framework that seamlessly integrates diverse assignment strategies into any DETR-style detector.
arXiv Detail & Related papers (2026-01-14T07:28:54Z)
- Reinforcement Learning with Discrete Diffusion Policies for Combinatorial Action Spaces [57.466101098183884]
Reinforcement learning (RL) struggles to scale to the large, combinatorial action spaces common in many real-world problems. This paper introduces a novel framework for training discrete diffusion models as highly effective policies in complex settings.
arXiv Detail & Related papers (2025-09-26T21:53:36Z)
- Imitate Optimal Policy: Prevail and Induce Action Collapse in Policy Gradient [61.440209025381016]
Policy-gradient reinforcement learning methods frequently utilize deep neural networks (DNNs) to learn a shared backbone of feature representations used to compute likelihoods in an action selection layer. We show that under certain constraints, a structure resembling neural collapse, which we refer to as Action Collapse (AC), emerges. We propose the Action Collapse Policy Gradient (ACPG) method, which accordingly affixes a synthetic ETF as the action selection layer.
arXiv Detail & Related papers (2025-09-02T18:33:11Z)
- SAINT: Attention-Based Modeling of Sub-Action Dependencies in Multi-Action Policies [13.673494183777716]
Sub-Action Interaction Network (SAINT) is a novel policy architecture that represents multi-component actions as unordered sets and models their dependencies via self-attention conditioned on the global state. In 15 distinct environments across three task domains, including environments with nearly 17 million joint actions, SAINT consistently outperforms strong baselines.
arXiv Detail & Related papers (2025-05-17T18:34:31Z)
- Offline Multi-agent Reinforcement Learning via Score Decomposition [51.23590397383217]
Offline cooperative multi-agent reinforcement learning (MARL) faces unique challenges due to distributional shifts. This work is the first to explicitly address the distributional gap between offline and online MARL.
arXiv Detail & Related papers (2025-05-09T11:42:31Z)
- Reinforcement learning with combinatorial actions for coupled restless bandits [62.89013331120493]
We propose SEQUOIA, an RL algorithm that directly optimizes for long-term reward over the feasible action space. We empirically validate SEQUOIA on four novel restless bandit problems with constraints: multiple interventions, path constraints, bipartite matching, and capacity constraints.
arXiv Detail & Related papers (2025-03-01T21:25:21Z)
- Bidirectional Decoding: Improving Action Chunking via Guided Test-Time Sampling [51.38330727868982]
We show how action chunking impacts the divergence between a learner and a demonstrator. We propose Bidirectional Decoding (BID), a test-time inference algorithm that bridges action chunking with closed-loop adaptation. Our method boosts the performance of two state-of-the-art generative policies across seven simulation benchmarks and two real-world tasks.
arXiv Detail & Related papers (2024-08-30T15:39:34Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- Chain-of-Thought Predictive Control [32.30974063877643]
We study generalizable policy learning from demonstrations for complex low-level control.
We propose a novel hierarchical imitation learning method that utilizes sub-optimal demos.
arXiv Detail & Related papers (2023-04-03T07:59:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.