Related papers: Breaking the Grid: Distance-Guided Reinforcement Learning in Large Discrete and Hybrid Action Spaces

URL: http://arxiv.org/abs/2602.08616v1
Date: Mon, 09 Feb 2026 13:05:07 GMT
Title: Breaking the Grid: Distance-Guided Reinforcement Learning in Large Discrete and Hybrid Action Spaces
Authors: Heiko Hoppe, Fabian Akkerman, Wouter van Heeswijk, Maximilian Schiffer
Abstract summary: We propose Distance-Guided Reinforcement Learning (DGRL) to enable efficient RL in spaces with up to $10^{20}$ actions. We demonstrate performance improvements of up to 66% against state-of-the-art benchmarks across regularly and irregularly structured environments.
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement Learning is increasingly applied to logistics, scheduling, and recommender systems, but standard algorithms struggle with the curse of dimensionality in such large discrete action spaces. Existing algorithms typically rely on restrictive grid-based structures or computationally expensive nearest-neighbor searches, limiting their effectiveness in high-dimensional or irregularly structured domains. We propose Distance-Guided Reinforcement Learning (DGRL), combining Sampled Dynamic Neighborhoods (SDN) and Distance-Based Updates (DBU) to enable efficient RL in spaces with up to $10^{20}$ actions. Unlike prior methods, SDN leverages a semantic embedding space to perform stochastic volumetric exploration, provably providing full support over a local trust region. Complementing this, DBU transforms policy optimization into a stable regression task, decoupling gradient variance from action space cardinality and guaranteeing monotonic policy improvement. DGRL naturally generalizes to hybrid continuous-discrete action spaces without requiring hierarchical dependencies. We demonstrate performance improvements of up to 66% against state-of-the-art benchmarks across regularly and irregularly structured environments, while simultaneously improving convergence speed and computational complexity.
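The abstract describes SDN and DBU only at a high level. As a rough illustration of the general idea, and not the authors' algorithm, the sketch below assumes a multi-dimensional integer action grid, samples points uniformly in an L2 ball around a continuous proto-action, rounds them to valid actions, lets a critic pick the best one, and then uses the squared distance to that action as a regression loss for the actor. The ball-shaped neighborhood, the toy critic, and the names `sample_neighborhood`, `select_best`, and `distance_regression_loss` are all hypothetical.

```python
import math
import random


def sample_neighborhood(proto, radius, n_samples, bounds, rng):
    """Sample distinct integer actions near a continuous proto-action.

    Hypothetical stand-in for SDN: draw points uniformly inside an L2 ball
    of the given radius around `proto`, round each coordinate to the nearest
    integer and clip it to that dimension's (low, high) bounds.
    """
    d = len(proto)
    found = set()
    for _ in range(n_samples):
        # A normalised Gaussian vector gives a uniform direction on the sphere.
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in g)) or 1.0
        # Scaling the radius by u ** (1 / d) makes the point uniform in the ball.
        r = radius * rng.random() ** (1.0 / d)
        point = [p + r * x / norm for p, x in zip(proto, g)]
        action = tuple(
            min(hi, max(lo, round(x))) for x, (lo, hi) in zip(point, bounds)
        )
        found.add(action)
    return sorted(found)


def select_best(candidates, q_fn):
    """Pick the candidate with the highest critic value."""
    return max(candidates, key=q_fn)


def distance_regression_loss(proto, target):
    """Squared Euclidean distance from the actor output to the chosen action."""
    return sum((p - t) ** 2 for p, t in zip(proto, target))


if __name__ == "__main__":
    rng = random.Random(0)
    proto = (5.3, 2.6)  # continuous actor output (hypothetical)
    bounds = [(0, 9), (0, 9)]  # two discrete dimensions with 10 values each

    def q_fn(a):
        # Toy critic (hypothetical) peaking at action (7, 2).
        return -((a[0] - 7) ** 2 + (a[1] - 2) ** 2)

    candidates = sample_neighborhood(proto, 3.0, 500, bounds, rng)
    best = select_best(candidates, q_fn)
    print(best, distance_regression_loss(proto, best))
```

In this sketch every lattice point whose rounding cell overlaps the ball has a nonzero chance of being drawn, which is loosely the kind of "full support over a local trust region" the abstract refers to; rounding means a chosen action can sit up to 0.5·√d outside the ball. The actor's loss is a plain regression target, so its gradient does not depend on how many candidates were scored.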

Related papers

OCTOPUS: Enhancing the Spatial-Awareness of Vision SSMs with Multi-Dimensional Scans and Traversal Selection
We introduce OCTOPUS, a novel architecture that preserves both global context and local spatial structure within images. OCTOPUS performs discrete recurrence along eight principal orientations, going forward or backward in the horizontal, vertical, and diagonal directions. In our classification and segmentation benchmarks, OCTOPUS demonstrates notable improvements in boundary preservation and region consistency.
arXiv (2026-01-31T21:12:59Z)
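The summary does not say how cells inside a diagonal are ordered or which recurrence is used. A minimal sketch, assuming row order within each diagonal line and a plain linear recurrence, that builds eight forward/backward scan orders over an image grid and runs a recurrence along one of them; `traversal_orders`, `linear_scan`, and `decay` are hypothetical names, not the OCTOPUS implementation.

```python
def traversal_orders(h, w):
    """Return eight scan orders over an h x w grid as lists of (row, col) cells."""
    cells = [(r, c) for r in range(h) for c in range(w)]
    base = {
        # Row by row, left to right.
        "horizontal": sorted(cells),
        # Column by column, top to bottom.
        "vertical": sorted(cells, key=lambda rc: (rc[1], rc[0])),
        # Lines of constant row - col (parallel to the main diagonal).
        "diagonal": sorted(cells, key=lambda rc: (rc[0] - rc[1], rc[0])),
        # Lines of constant row + col (parallel to the anti-diagonal).
        "anti_diagonal": sorted(cells, key=lambda rc: (rc[0] + rc[1], rc[0])),
    }
    orders = {}
    for name, order in base.items():
        orders[name + "_fwd"] = order
        orders[name + "_bwd"] = order[::-1]  # same path, traversed backward
    return orders


def linear_scan(grid, order, decay):
    """Run the recurrence h_t = decay * h_{t-1} + x_t along one scan order."""
    state, states = 0.0, []
    for r, c in order:
        state = decay * state + grid[r][c]
        states.append(state)
    return states
```

Each order visits every pixel exactly once, so running the same recurrence over all eight orders gives every pixel context from eight directions.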
CHDP: Cooperative Hybrid Diffusion Policies for Reinforcement Learning in Parameterized Action Space
We propose a Cooperative Hybrid Diffusion Policies (CHDP) framework to solve the hybrid action space problem. CHDP employs two cooperative agents that leverage a discrete and a continuous diffusion policy, respectively. On challenging hybrid action benchmarks, CHDP outperforms the state-of-the-art method by up to 19.3% in success rate.
arXiv (2026-01-09T09:50:47Z)
Scaling Online Distributionally Robust Reinforcement Learning: Sample-Efficient Guarantees with General Function Approximation
Distributionally robust RL (DR-RL) guards against misspecified environment dynamics by optimizing worst-case performance over an uncertainty set of transition dynamics. We propose an online DR-RL algorithm with general function approximation that learns an optimal robust policy purely through interaction with the environment. We provide a theoretical analysis establishing a near-optimal sublinear regret bound under a total variation uncertainty set, demonstrating the sample efficiency and effectiveness of our method.
arXiv (2025-12-22T02:12:04Z)
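The summary does not give the algorithm, but the inner step of any robust backup under a total variation (TV) uncertainty set is a small optimization that is easy to write down. A sketch of that generic step, not the paper's method: given nominal next-state probabilities `p`, value estimates `v`, and a TV radius `rho`, the adversary moves at most `rho` probability mass from the best outcomes onto the worst one. The function name and the toy numbers are illustrative.

```python
def worst_case_expectation_tv(p, v, rho):
    """Minimise sum(q * v) over distributions q with TV(q, p) <= rho.

    The minimiser moves up to rho probability mass from the highest-value
    outcomes onto the single lowest-value outcome.
    """
    n = len(p)
    i_min = min(range(n), key=lambda i: v[i])
    q = list(p)
    budget = rho
    # Strip mass from outcomes in order of decreasing value.
    for i in sorted(range(n), key=lambda i: v[i], reverse=True):
        if budget <= 0:
            break
        if i == i_min:
            continue
        take = min(q[i], budget)
        q[i] -= take
        q[i_min] += take
        budget -= take
    return sum(qi * vi for qi, vi in zip(q, v)), q


if __name__ == "__main__":
    worst, q = worst_case_expectation_tv([0.5, 0.3, 0.2], [1.0, 5.0, 10.0], 0.25)
    print(worst, q)
```

In a robust Bellman update, this worst-case expectation replaces the nominal expectation of the next-state value; with `rho = 0` it reduces to the ordinary expectation.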
QoS-Aware Hierarchical Reinforcement Learning for Joint Link Selection and Trajectory Optimization in SAGIN-Supported UAV Mobility Management
A space-air-ground integrated network (SAGIN) has emerged as an essential architecture for enabling ubiquitous UAV connectivity. This paper formulates UAV mobility management in SAGIN as a constrained multi-objective joint optimization problem.
arXiv (2025-12-17T06:22:46Z)
Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions. We numerically validate our algorithms on constrained control problems and compare them with state-of-the-art baselines.
arXiv (2024-07-15T14:54:57Z)
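The summary does not describe C-PG's updates. For orientation, a common template for constrained policy optimization is gradient descent-ascent on a Lagrangian: ascend in the policy parameters, descend in a nonnegative multiplier. The toy sketch below assumes a one-parameter "policy" with a quadratic return and a linear constraint; it is a generic illustration, not C-PG, and it makes no claim about C-PG's guarantees.

```python
def primal_dual(lr=0.05, steps=2000):
    """Projected gradient descent-ascent on L(theta, lam) = J(theta) - lam * g(theta).

    Toy problem (hypothetical): maximise J(theta) = -(theta - 3)^2
    subject to g(theta) = theta - 1 <= 0. The saddle point is theta = 1, lam = 4.
    """
    theta, lam = 0.0, 0.0
    for _ in range(steps):
        # Both gradients use the current iterate (simultaneous updates).
        grad_theta = -2.0 * (theta - 3.0) - lam
        grad_lam = -(theta - 1.0)
        theta = theta + lr * grad_theta  # ascent in the primal variable
        lam = max(0.0, lam - lr * grad_lam)  # descent in the dual, projected onto lam >= 0
    return theta, lam
```

The function returns the final iterate rather than an average of iterates; "last-iterate" convergence is the property that this final iterate itself approaches the constrained optimum.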
Offline Policy Optimization in RL with Variance Regularization
We propose variance regularization for offline RL algorithms, using stationary distribution corrections. We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer. The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithm.
arXiv (2022-12-29T18:25:01Z)
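The double-sampling issue arises because $\mathrm{Var}[X] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$ contains a squared expectation. The Fenchel-dual identity $(\mathbb{E}[X])^2 = \max_\nu \left(2\nu\,\mathbb{E}[X] - \nu^2\right)$ turns this into $\mathrm{Var}[X] = \min_\nu \mathbb{E}[(X - \nu)^2]$, a single expectation for fixed $\nu$. The sketch below only checks that generic identity on a fixed sample; OVAR's stationary distribution corrections are omitted, and the summary does not state that OVAR uses exactly this form. Function names are hypothetical.

```python
def dual_variance_objective(samples, nu):
    """Sample estimate of E[(X - nu)^2], a single expectation for fixed nu."""
    return sum((x - nu) ** 2 for x in samples) / len(samples)


def variance_via_dual(samples, lr=0.1, steps=200):
    """Recover Var[X] = min_nu E[(X - nu)^2] by gradient descent on nu."""
    nu = 0.0
    for _ in range(steps):
        # d/dnu E[(X - nu)^2] = -2 E[X - nu]; each term needs only one sample.
        grad = sum(-2.0 * (x - nu) for x in samples) / len(samples)
        nu -= lr * grad
    return dual_variance_objective(samples, nu), nu
```

At the optimum $\nu = \mathbb{E}[X]$, and for any other $\nu$ the objective overestimates the variance by exactly $(\nu - \mathbb{E}[X])^2$.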
Decentralized Federated Reinforcement Learning for User-Centric Dynamic TFDD Control
We propose a learning-based dynamic time-frequency division duplexing (D-TFDD) scheme to meet asymmetric and heterogeneous traffic demands. We formulate the problem as a decentralized partially observable Markov decision process (Dec-POMDP). To jointly optimize the global resources in a decentralized manner, we propose a federated reinforcement learning (RL) algorithm, federated Wolpertinger deep deterministic policy gradient (FWDDPG).
arXiv (2022-11-04T07:39:21Z)
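The summary names the Wolpertinger architecture without explaining it. In that architecture, an actor outputs a continuous proto-action, the k nearest discrete actions in an embedding space are retrieved, and a critic picks the best of them. A minimal single-agent sketch of that selection step, ignoring the federated training in FWDDPG; the embeddings, the critic, and the function name are hypothetical.

```python
import heapq
import math


def wolpertinger_select(proto, action_embeddings, q_fn, k):
    """Return the id of the best-scoring action among the k nearest to proto.

    action_embeddings maps action id -> embedding tuple; q_fn scores an id.
    Brute-force k-NN; large action sets would use an approximate index.
    """
    nearest = heapq.nsmallest(
        k, action_embeddings, key=lambda a: math.dist(proto, action_embeddings[a])
    )
    # The critic refines the actor's choice among the shortlisted actions.
    return max(nearest, key=q_fn)
```

A larger `k` gives the critic more chances to find a better action, at the price of more critic evaluations and a costlier neighbor search; that search cost is what the DGRL abstract cites as a limitation of nearest-neighbor approaches.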
Semi-supervised Domain Adaptive Structure Learning
Semi-supervised domain adaptation (SSDA) is a challenging problem requiring methods to overcome both 1) overfitting towards poorly annotated data and 2) distribution shift across domains. We introduce an adaptive structure learning method to regularize the cooperation of semi-supervised learning (SSL) and domain adaptation (DA).
arXiv (2021-12-12T06:11:16Z)
Distributed Multi-agent Meta Learning for Trajectory Design in Wireless Drone Networks
This paper studies the problem of trajectory design for a group of energy-constrained drones operating in dynamic wireless network environments. A value decomposition based reinforcement learning (VDRL) solution and a meta-training mechanism are proposed.
arXiv (2020-12-06T01:30:12Z)
Deep Reinforcement Learning with Robust and Smooth Policy
We propose to learn a policy that behaves smoothly with respect to states. We develop a new framework, Smooth Regularized Reinforcement Learning (SR$^2$L), where the policy is trained with smoothness-inducing regularization. Such regularization effectively constrains the search space and enforces smoothness in the learned policy.
arXiv (2020-03-21T00:10:29Z)
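Smoothness-inducing regularization, as summarized, penalizes how much the policy's output changes when the state is perturbed within a small neighborhood. A minimal sketch under assumptions the summary does not give: a deterministic policy returning a list of floats, an L∞ perturbation box of radius `eps`, squared Euclidean distance as the divergence, and random search in place of whatever inner maximization SR$^2$L actually uses. `smoothness_penalty` is a hypothetical name.

```python
import math
import random


def smoothness_penalty(policy, state, eps, n_samples, rng):
    """Approximate max over ||delta||_inf <= eps of ||pi(s) - pi(s + delta)||^2.

    Random search stands in for the inner maximisation, so the result is only
    a lower bound on the true worst case.
    """
    base = policy(state)
    worst = 0.0
    for _ in range(n_samples):
        perturbed = [s + rng.uniform(-eps, eps) for s in state]
        worst = max(worst, math.dist(base, policy(perturbed)) ** 2)
    return worst
```

During training, such a penalty would be added to the ordinary policy loss with a weight, for example `loss = policy_loss + weight * penalty`, which pushes the optimizer toward policies whose actions change little under small state perturbations.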

This list is automatically generated from the titles and abstracts of the papers on this site.