An Advantage-based Optimization Method for Reinforcement Learning in Large Action Space
- URL: http://arxiv.org/abs/2412.12605v1
- Date: Tue, 17 Dec 2024 07:04:39 GMT
- Title: An Advantage-based Optimization Method for Reinforcement Learning in Large Action Space
- Authors: Hai Lin, Cheng Huang, Zhihong Chen,
- Abstract summary: We propose an advantage-based optimization method and an algorithm named Advantage Branching Dueling Q-network (ABQ)<n>ABQ incorporates a baseline mechanism to tune the action value of each dimension, leveraging the advantage relationship across different sub-actions.<n> Empirical results demonstrate that ABQ outperforms BDQ, achieving 3%, 171%, and 84% more cumulative rewards in HalfCheetah, Ant, and Humanoid environments.
- Score: 17.956561845984353
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning tasks in real-world scenarios often involve large, high-dimensional action spaces, leading to challenges such as convergence difficulties, instability, and high computational complexity. It is widely acknowledged that traditional value-based reinforcement learning algorithms struggle to address these issues effectively. A prevalent approach involves generating independent sub-actions within each dimension of the action space. However, this method introduces bias, hindering the learning of optimal policies. In this paper, we propose an advantage-based optimization method and an algorithm named Advantage Branching Dueling Q-network (ABQ). ABQ incorporates a baseline mechanism to tune the action value of each dimension, leveraging the advantage relationship across different sub-actions. With this approach, the learned policy can be optimized for each dimension. Empirical results demonstrate that ABQ outperforms BDQ, achieving 3%, 171%, and 84% more cumulative rewards in HalfCheetah, Ant, and Humanoid environments, respectively. Furthermore, ABQ exhibits competitive performance when compared against two continuous action benchmark algorithms, DDPG and TD3.
Related papers
- QoS-Aware Hierarchical Reinforcement Learning for Joint Link Selection and Trajectory Optimization in SAGIN-Supported UAV Mobility Management [52.15690855486153]
A space-air-ground integrated network (SAGIN) has emerged as an essential architecture for enabling ubiquitous UAV connectivity.<n>This paper formulates UAV mobility management in SAGIN as a constrained multiobjective joint optimization problem.
arXiv Detail & Related papers (2025-12-17T06:22:46Z) - Deep Reinforcement Learning for Dynamic Algorithm Configuration: A Case Study on Optimizing OneMax with the (1+($λ$,$λ$))-GA [3.5485296570255183]
We conduct a systematic analysis of controlling the population size parameter of the (1+($$,$$)-GA on OneMax instances.<n>Our investigation of DDQN and PPO reveals two fundamental challenges that limit their effectiveness in DAC.<n>We introduce an adaptive reward shifting mechanism that leverages reward distribution statistics to enhance DDQN agent exploration.
arXiv Detail & Related papers (2025-12-03T13:54:41Z) - Learning to Align, Aligning to Learn: A Unified Approach for Self-Optimized Alignment [24.296667264939515]
We propose GRAO (Group Relative Alignment Optimization), a unified framework that synergizes the strengths of SFT (supervised fine-tuning) and RL (reinforcement learning)<n>Our theoretical analysis establishes GRAO's convergence guarantees and sample efficiency advantages over conventional approaches.<n>This work provides both a theoretically grounded alignment framework and empirical evidence for efficient capability evolution in language models.
arXiv Detail & Related papers (2025-08-11T08:28:47Z) - COPO: Consistency-Aware Policy Optimization [17.328515578426227]
Reinforcement learning has significantly enhanced the reasoning capabilities of Large Language Models (LLMs) in complex problem-solving tasks.<n>Recently, the introduction of DeepSeek R1 has inspired a surge of interest in leveraging rule-based rewards as a low-cost alternative for computing advantage functions and guiding policy optimization.<n>We propose a consistency-aware policy optimization framework that introduces a structured global reward based on outcome consistency.
arXiv Detail & Related papers (2025-08-06T07:05:18Z) - NDCG-Consistent Softmax Approximation with Accelerated Convergence [67.10365329542365]
We propose novel loss formulations that align directly with ranking metrics.<n>We integrate the proposed RG losses with the highly efficient Alternating Least Squares (ALS) optimization method.<n> Empirical evaluations on real-world datasets demonstrate that our approach achieves comparable or superior ranking performance.
arXiv Detail & Related papers (2025-06-11T06:59:17Z) - Stochastic Primal-Dual Double Block-Coordinate for Two-way Partial AUC Maximization [56.805574957824135]
Two-way partial AUCAUC is a critical performance metric for binary classification with imbalanced data.<n>Existing algorithms for TPAUC optimization remain under-explored.<n>We introduce two innovative double-coordinate block-coordinate algorithms for TPAUC optimization.
arXiv Detail & Related papers (2025-05-28T03:55:05Z) - R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference [77.47238561728459]
R-Sparse is a training-free activation sparsity approach capable of achieving high sparsity levels in advanced LLMs.
Experiments on Llama-2/3 and Mistral models across ten diverse tasks demonstrate that R-Sparse achieves comparable performance at 50% model-level sparsity.
arXiv Detail & Related papers (2025-04-28T03:30:32Z) - AUCSeg: AUC-oriented Pixel-level Long-tail Semantic Segmentation [88.50256898176269]
We develop a pixel-level AUC loss function and conduct a dependency-graph-based theoretical analysis of the algorithm's generalization ability.
We also design a Tail-Classes Memory Bank to manage the significant memory demand.
arXiv Detail & Related papers (2024-09-30T15:31:02Z) - Analyzing design principles for competitive evolution strategies in constrained search spaces [0.0]
In the context of the 2018 IEEE Congress of Evolutionary Computation, the Matrix Adaptation Evolution Strategy for constrained optimization was notably successful.
The $epsilon$MAg-ES algorithm can be considered to be the most successful participant in high dimensions.
This paper presents an empirical analysis of the $epsilon$MAg-ES working principles that is expected to provide insights about the performance contribution of specific algorithmic components.
arXiv Detail & Related papers (2024-05-08T12:20:10Z) - Scalable Online Exploration via Coverability [45.66375686120087]
Exploration is a major challenge in reinforcement learning, especially for high-dimensional domains that require function approximation.
We introduce a new objective, $L_Coverage, which generalizes previous exploration schemes and supports three fundamental desideratas.
$L_Coverage enables the first computationally efficient model-based and model-free algorithms for online (reward-free or reward-driven) reinforcement learning in MDPs with low coverability.
arXiv Detail & Related papers (2024-03-11T10:14:06Z) - ReLU to the Rescue: Improve Your On-Policy Actor-Critic with Positive Advantages [37.12048108122337]
This paper proposes a step toward approximate Bayesian inference in on-policy actor-critic deep reinforcement learning.
It is implemented through three changes to the Asynchronous Advantage Actor-Critic (A3C) algorithm.
arXiv Detail & Related papers (2023-06-02T11:37:22Z) - Improved Algorithms for Neural Active Learning [74.89097665112621]
We improve the theoretical and empirical performance of neural-network(NN)-based active learning algorithms for the non-parametric streaming setting.
We introduce two regret metrics by minimizing the population loss that are more suitable in active learning than the one used in state-of-the-art (SOTA) related work.
arXiv Detail & Related papers (2022-10-02T05:03:38Z) - Federated Learning via Inexact ADMM [46.99210047518554]
In this paper, we develop an inexact alternating direction method of multipliers (ADMM)
It is both- and communication-efficient, capable of combating the stragglers' effect, and convergent under mild conditions.
It has a high numerical performance compared with several state-of-the-art algorithms for federated learning.
arXiv Detail & Related papers (2022-04-22T09:55:33Z) - An Improved Reinforcement Learning Algorithm for Learning to Branch [12.27934038849211]
Branch-and-bound (B&B) is a general and widely used method for optimization.
In this paper, we propose a novel reinforcement learning-based B&B algorithm.
We evaluate the performance of the proposed algorithm over three public research benchmarks.
arXiv Detail & Related papers (2022-01-17T04:50:11Z) - Causal Policy Gradients [6.123324869194195]
Causal policy gradients (CPGs) provide a common framework for analysing key state-of-the-art algorithms.
CPGs are shown to generalise traditional policy gradients, and yield a principled way of incorporating prior knowledge of a problem domain's generative processes.
arXiv Detail & Related papers (2021-02-20T14:51:12Z) - SUNRISE: A Simple Unified Framework for Ensemble Learning in Deep
Reinforcement Learning [102.78958681141577]
We present SUNRISE, a simple unified ensemble method, which is compatible with various off-policy deep reinforcement learning algorithms.
SUNRISE integrates two key ingredients: (a) ensemble-based weighted Bellman backups, which re-weight target Q-values based on uncertainty estimates from a Q-ensemble, and (b) an inference method that selects actions using the highest upper-confidence bounds for efficient exploration.
arXiv Detail & Related papers (2020-07-09T17:08:44Z) - Discrete Action On-Policy Learning with Action-Value Critic [72.20609919995086]
Reinforcement learning (RL) in discrete action space is ubiquitous in real-world applications, but its complexity grows exponentially with the action-space dimension.
We construct a critic to estimate action-value functions, apply it on correlated actions, and combine these critic estimated action values to control the variance of gradient estimation.
These efforts result in a new discrete action on-policy RL algorithm that empirically outperforms related on-policy algorithms relying on variance control techniques.
arXiv Detail & Related papers (2020-02-10T04:23:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.