Continuous-Time Value Iteration for Multi-Agent Reinforcement Learning
- URL: http://arxiv.org/abs/2509.09135v2
- Date: Wed, 17 Sep 2025 23:07:30 GMT
- Title: Continuous-Time Value Iteration for Multi-Agent Reinforcement Learning
- Authors: Xuefeng Wang, Lei Zhang, Henglin Pu, Ahmed H. Qureshi, Husheng Li,
- Abstract summary: We use physics-informed neural networks to approximate HJB-based value functions at scale. This improves gradient fidelity, in turn yielding more accurate values and stronger policy learning. Our results demonstrate that our approach consistently outperforms existing continuous-time baselines and scales to complex multi-agent dynamics.
- Score: 27.73410730631346
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing reinforcement learning (RL) methods struggle with complex dynamical systems that demand interactions at high frequencies or irregular time intervals. Continuous-time RL (CTRL) has emerged as a promising alternative by replacing discrete-time Bellman recursion with differential value functions defined as viscosity solutions of the Hamilton--Jacobi--Bellman (HJB) equation. While CTRL has shown promise, its applications have been largely limited to the single-agent domain. This limitation stems from two key challenges: (i) conventional solution methods for HJB equations suffer from the curse of dimensionality (CoD), making them intractable in high-dimensional systems; and (ii) even with HJB-based learning approaches, accurately approximating centralized value functions in multi-agent settings remains difficult, which in turn destabilizes policy training. In this paper, we propose a CT-MARL framework that uses physics-informed neural networks (PINNs) to approximate HJB-based value functions at scale. To ensure the value is consistent with its differential structure, we align value learning with value-gradient learning by introducing a Value Gradient Iteration (VGI) module that iteratively refines value gradients along trajectories. This improves gradient fidelity, in turn yielding more accurate values and stronger policy learning. We evaluate our method using continuous-time variants of standard benchmarks, including multi-agent particle environment (MPE) and multi-agent MuJoCo. Our results demonstrate that our approach consistently outperforms existing continuous-time RL baselines and scales to complex multi-agent dynamics.
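A minimal single-agent sketch of the two losses described above, assuming toy affine dynamics, a quadratic reward, and a discount rate rho; the paper's actual CT-MARL architecture and VGI update rule are not reproduced here, and the gradient-consistency term below is only a crude stand-in for VGI.

```python
# Hedged sketch: PINN-style HJB residual plus a value-gradient
# consistency term. Dynamics, reward, and discount are toy assumptions.
import torch
import torch.nn as nn

rho = 0.1  # assumed discount rate
value_net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))
grad_net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 2))

def dynamics(x, a):  # toy affine dynamics x_dot = A x + B a
    A = torch.tensor([[0.0, 1.0], [0.0, 0.0]])
    B = torch.tensor([[0.0], [1.0]])
    return x @ A.T + a @ B.T

def reward(x, a):  # toy quadratic reward
    return -(x ** 2).sum(-1, keepdim=True) - (a ** 2).sum(-1, keepdim=True)

def hjb_losses(x, a):
    x = x.clone().requires_grad_(True)
    v = value_net(x)
    dv = torch.autograd.grad(v.sum(), x, create_graph=True)[0]
    # HJB residual: rho*V - (r + grad(V) . f) should vanish
    residual = rho * v - (reward(x, a) + (dv * dynamics(x, a)).sum(-1, keepdim=True))
    pinn_loss = (residual ** 2).mean()
    # VGI-flavored consistency: keep the gradient head aligned with autograd(V)
    vgi_loss = ((grad_net(x) - dv.detach()) ** 2).mean()
    return pinn_loss, vgi_loss

pinn_loss, vgi_loss = hjb_losses(torch.randn(32, 2), torch.randn(32, 1))
(pinn_loss + vgi_loss).backward()
```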
Related papers
- Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning [88.42566960813438]
CalibRL is a hybrid-policy RLVR framework that supports controllable exploration with expert guidance. CalibRL increases policy entropy in a guided manner and clarifies the target distribution. Experiments across eight benchmarks, including both in-domain and out-of-domain settings, demonstrate consistent improvements.
arXiv Detail & Related papers (2026-02-22T07:23:36Z) - Dynamic Rank Reinforcement Learning for Adaptive Low-Rank Multi-Head Self Attention in Large Language Models [0.0]
We propose Dynamic Rank Reinforcement Learning (DR-RL), a novel framework that adaptively optimizes the low-rank factorization of Multi-Head Self-Attention (MHSA) in Large Language Models (LLMs). DR-RL maintains downstream accuracy statistically equivalent to full-rank attention while significantly reducing Floating Point Operations (FLOPs). This work bridges the gap between adaptive efficiency and theoretical rigor in MHSA, offering a principled, mathematically grounded alternative to rank reduction techniques in resource-constrained deep learning.
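As a rough illustration of the low-rank MHSA idea, the sketch below factors the query/key projections at a fixed rank; the DR-RL controller that adapts the rank per layer is elided, and all names and shapes are assumptions.

```python
# Hedged sketch: rank-r factorized attention projections.
import torch

d_model, rank, n = 64, 8, 10
x = torch.randn(n, d_model)

# Full d x d projections replaced by U (d x r) @ V (r x d) factors
U_q, V_q = torch.randn(d_model, rank), torch.randn(rank, d_model)
U_k, V_k = torch.randn(d_model, rank), torch.randn(rank, d_model)

q = (x @ U_q) @ V_q  # O(n*d*r) multiply-adds instead of O(n*d^2)
k = (x @ U_k) @ V_k
attn = torch.softmax(q @ k.T / d_model ** 0.5, dim=-1)
```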
arXiv Detail & Related papers (2025-12-17T21:09:19Z) - Beyond Monotonicity: Revisiting Factorization Principles in Multi-Agent Q-Learning [24.476713156225685]
Value decomposition is a central approach in multi-agent reinforcement learning (MARL). Existing methods either enforce monotonicity constraints, which limit expressive power, or adopt softer surrogates at the cost of algorithmic complexity. We show that unconstrained, non-monotonic factorization reliably recovers IGM-optimal solutions and consistently outperforms monotonic baselines.
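A toy contrast between a monotonicity-constrained mixer and the unconstrained factorization the summary refers to; the mixing form and shapes are illustrative, not the paper's networks.

```python
# Hedged sketch: monotonic vs. unconstrained linear mixing of agent utilities.
import torch
import torch.nn as nn

n_agents = 3
q_i = torch.randn(16, n_agents)       # per-agent utilities Q_i(s, a_i)
w = nn.Parameter(torch.randn(n_agents, 1))

monotonic_qtot = q_i @ torch.abs(w)   # enforces dQ_tot/dQ_i >= 0 (QMIX-style)
unconstrained_qtot = q_i @ w          # no monotonicity constraint
```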
arXiv Detail & Related papers (2025-11-12T22:49:35Z) - Physics-informed Value Learner for Offline Goal-Conditioned Reinforcement Learning [20.424372965054832]
We propose a Physics-informed (Pi) regularized loss for value learning, derived from the Eikonal Partial Differential Equation (PDE). Our formulation is grounded in continuous-time optimal control and encourages value functions to align with cost-to-go structures. When combined with Hierarchical Implicit Q-Learning (HIQL), the resulting method yields significant improvements in both performance and generalization.
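A minimal sketch of an Eikonal-style regularizer on a value network, assuming the unit-speed form ||grad V|| = 1 for the cost-to-go; the paper's exact weighting and its HIQL integration are not shown.

```python
# Hedged sketch: penalize the value gradient norm's deviation from 1.
import torch
import torch.nn as nn

value_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))

def eikonal_loss(states):
    states = states.clone().requires_grad_(True)
    v = value_net(states)
    dv = torch.autograd.grad(v.sum(), states, create_graph=True)[0]
    return ((dv.norm(dim=-1) - 1.0) ** 2).mean()

eikonal_loss(torch.randn(32, 4)).backward()
```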
arXiv Detail & Related papers (2025-09-08T15:08:42Z) - Stackelberg Coupling of Online Representation Learning and Reinforcement Learning [45.70357546589222]
We introduce SCORER, a framework for value-based RL that views representation learning and Q-learning as two strategic agents in a hierarchical game. This leads to a bi-level optimization problem whose solution is approximated by a two-timescale algorithm.
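A loose sketch of the two-timescale idea: the representation (leader) updates with a slower learning rate than the Q-head (follower). The losses and rates here are placeholders, not SCORER's actual objectives.

```python
# Hedged sketch: two optimizers on different timescales.
import torch
import torch.nn as nn

encoder = nn.Linear(8, 32)   # slow "leader"
q_head = nn.Linear(32, 4)    # fast "follower"
opt_slow = torch.optim.Adam(encoder.parameters(), lr=1e-4)
opt_fast = torch.optim.Adam(q_head.parameters(), lr=1e-3)

obs, td_target = torch.randn(16, 8), torch.randn(16, 4)
loss = ((q_head(encoder(obs)) - td_target) ** 2).mean()
opt_slow.zero_grad(); opt_fast.zero_grad()
loss.backward()
opt_fast.step(); opt_slow.step()
```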
arXiv Detail & Related papers (2025-08-10T18:36:54Z) - ROCM: RLHF on consistency models [8.905375742101707]
We propose a reward optimization framework for applying RLHF to consistency models. We investigate various $f$-divergences as regularization strategies, striking a balance between reward and model consistency.
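A generic sketch of reward maximization under a KL penalty, one member of the f-divergence family mentioned above; the consistency-model sampler and the paper's specific divergences are not reproduced.

```python
# Hedged sketch: reward term plus a KL-style regularizer toward a
# reference model, given per-sample log-probabilities.
import torch

def regularized_loss(logp_new, logp_ref, reward, beta=0.1):
    # maximize reward, penalize divergence from the reference model
    return (-reward + beta * (logp_new - logp_ref)).mean()

loss = regularized_loss(torch.randn(8), torch.randn(8), torch.randn(8))
```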
arXiv Detail & Related papers (2025-03-08T11:19:48Z) - Adaptive Multi-Scale Decomposition Framework for Time Series Forecasting [26.141054975797868]
We propose a novel Adaptive Multi-Scale Decomposition (AMD) framework for time series forecasting. Our framework decomposes time series into distinct temporal patterns at multiple scales, leveraging the Multi-Scale Decomposable Mixing (MDM) block. Our approach effectively models both temporal and channel dependencies and utilizes autocorrelation to refine multi-scale data integration.
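A toy decomposition via moving averages at several scales, in the spirit of extracting distinct temporal patterns; the paper's MDM block and channel modeling are not reproduced.

```python
# Hedged sketch: peel off trend components at increasing scales.
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 96)  # (batch, channel, time)
components, residual = [], x
for k in (4, 8, 16):
    trend = F.avg_pool1d(residual, kernel_size=k, stride=1, padding=k // 2)
    trend = trend[..., : x.shape[-1]]  # trim the extra padded step
    components.append(trend)
    residual = residual - trend        # finer patterns remain
```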
arXiv Detail & Related papers (2024-06-06T05:27:33Z) - Stochastic Q-learning for Large Discrete Action Spaces [79.1700188160944]
In complex environments with discrete action spaces, effective decision-making is critical in reinforcement learning (RL). We present value-based RL approaches which, as opposed to optimizing over the entire set of $n$ actions, only consider a variable set of actions, possibly as small as $\mathcal{O}(\log n)$. The presented value-based RL methods include, among others, Q-learning, StochDQN, and StochDDQN, all of which integrate this approach for both value-function updates and action selection.
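The core trick, in a minimal form: take the max over a random action subset of roughly log(n) actions instead of all n (the full StochDQN/StochDDQN machinery is omitted).

```python
# Hedged sketch: stochastic maximization over a sampled action subset.
import math
import random
import numpy as np

n_actions = 1024
q_values = np.random.randn(n_actions)  # stand-in for Q(s, .)

subset = random.sample(range(n_actions), k=max(1, int(math.log(n_actions))))
greedy_action = max(subset, key=lambda a: q_values[a])
```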
arXiv Detail & Related papers (2024-05-16T17:58:44Z) - Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement [67.1393112206885]
Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks.
We introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level.
We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks.
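A token-level sketch of an entropy-augmented policy-gradient loss; ETPO's exact decomposition of the sequence-level objective is not reproduced, and the shapes are assumptions.

```python
# Hedged sketch: per-token policy-gradient term plus an entropy bonus.
import torch

logits = torch.randn(5, 100, requires_grad=True)  # (tokens, vocab)
actions = torch.randint(0, 100, (5,))
advantages = torch.randn(5)

logp = torch.log_softmax(logits, dim=-1)
token_logp = logp[torch.arange(5), actions]
entropy = -(logp.exp() * logp).sum(-1)
loss = -(token_logp * advantages + 0.01 * entropy).mean()
loss.backward()
```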
arXiv Detail & Related papers (2024-02-09T07:45:26Z) - Beyond Exponentially Fast Mixing in Average-Reward Reinforcement Learning via Multi-Level Monte Carlo Actor-Critic [61.968469104271676]
We propose an RL methodology attuned to the mixing time by employing a multi-level Monte Carlo estimator for the critic, the actor, and the average reward embedded within an actor-critic (AC) algorithm.
We experimentally show that these alleviated restrictions on the technical conditions required for stability translate to superior performance in practice for RL problems with sparse rewards.
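A randomized multi-level Monte Carlo sketch for an average-reward estimate, treating rollout length as the level so that the telescoping differences correct short-rollout bias; the variance-reducing coupling and the actor-critic integration are elided.

```python
# Hedged sketch: randomized MLMC over rollout lengths 2^l.
import numpy as np

rng = np.random.default_rng(0)

def rollout_avg_reward(T):
    # toy AR(1) chain with reward = state; longer rollouts mix closer
    # to the stationary mean, shorter ones are biased
    x, total = 0.0, 0.0
    for _ in range(T):
        x = 0.9 * x + rng.normal(1.0, 1.0)
        total += x
    return total / T

def mlmc_avg_reward(max_level=8):
    probs = np.array([2.0 ** -l for l in range(1, max_level + 1)])
    probs /= probs.sum()
    l = rng.choice(np.arange(1, max_level + 1), p=probs)
    diff = rollout_avg_reward(2 ** l) - rollout_avg_reward(2 ** (l - 1))
    return rollout_avg_reward(1) + diff / probs[l - 1]

estimate = mlmc_avg_reward()
```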
arXiv Detail & Related papers (2023-01-28T04:12:56Z) - Residual Q-Networks for Value Function Factorizing in Multi-Agent Reinforcement Learning [0.0]
We propose a novel concept of Residual Q-Networks (RQNs) for Multi-Agent Reinforcement Learning (MARL). The RQN learns to transform the individual Q-value trajectories in a way that preserves the Individual-Global-Max (IGM) criterion. The proposed method converges faster, with increased stability, and shows robust performance in a wider family of environments.
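One simple way to shift per-agent Q-values without breaking IGM is an action-independent residual, sketched below; RQN's actual learned transformation is richer than this.

```python
# Hedged sketch: a state-conditioned, action-independent shift leaves
# each agent's argmax (and hence IGM consistency) intact.
import torch
import torch.nn as nn

n_agents, n_actions = 2, 5
q_i = torch.randn(n_agents, n_actions)  # per-agent Q(s, .)
residual = nn.Linear(8, n_agents)       # hypothetical residual head
state = torch.randn(8)

q_transformed = q_i + residual(state).unsqueeze(-1)
assert torch.equal(q_i.argmax(-1), q_transformed.argmax(-1))
```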
arXiv Detail & Related papers (2022-05-30T16:56:06Z) - Efficient Model-Based Multi-Agent Mean-Field Reinforcement Learning [89.31889875864599]
We propose an efficient model-based reinforcement learning algorithm for learning in multi-agent systems.
Our main theoretical contributions are the first general regret bounds for model-based reinforcement learning for mean-field control (MFC).
We provide a practical parametrization of the core optimization problem.
arXiv Detail & Related papers (2021-07-08T18:01:02Z) - SUNRISE: A Simple Unified Framework for Ensemble Learning in Deep Reinforcement Learning [102.78958681141577]
We present SUNRISE, a simple unified ensemble method, which is compatible with various off-policy deep reinforcement learning algorithms.
SUNRISE integrates two key ingredients: (a) ensemble-based weighted Bellman backups, which re-weight target Q-values based on uncertainty estimates from a Q-ensemble, and (b) an inference method that selects actions using the highest upper-confidence bounds for efficient exploration.
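A toy Q-ensemble sketch of the two ingredients; the sigmoid weighting follows the paper's published form with an assumed temperature of 1, and the UCB coefficient is likewise an assumption.

```python
# Hedged sketch: (a) uncertainty-weighted Bellman targets,
# (b) UCB action selection from a Q-ensemble.
import torch

ensemble_q = torch.randn(5, 32, 4)   # (ensemble, batch, actions)

std = ensemble_q.std(dim=0)          # ensemble disagreement per (s, a)
weights = torch.sigmoid(-std) + 0.5  # down-weight uncertain targets
# td_loss = (weights * (q_pred - td_target) ** 2).mean()  # placeholders

mean = ensemble_q.mean(dim=0)
ucb_action = (mean + 1.0 * std).argmax(dim=-1)  # optimistic exploration
```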
arXiv Detail & Related papers (2020-07-09T17:08:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.