GEPO: Group Expectation Policy Optimization for Stable Heterogeneous Reinforcement Learning
- URL: http://arxiv.org/abs/2508.17850v7
- Date: Thu, 16 Oct 2025 07:19:57 GMT
- Title: GEPO: Group Expectation Policy Optimization for Stable Heterogeneous Reinforcement Learning
- Authors: Han Zhang, Ruibin Zheng, Zexuan Yi, Zhuo Zhang, Hanyang Peng, Hui Wang, Zike Yuan, Cai Ke, Shiwei Chen, Jiacheng Yang, Yangning Li, Xiang Li, Jiangyue Yan, Yaoqi Liu, Liwen Jing, Jiayin Qi, Ruifeng Xu, Binxing Fang, Yue Yu
- Abstract summary: We propose HeteroRL, a heterogeneous RL architecture that decouples parameter learning and rollout sampling. The core component is Group Expectation Policy Optimization (GEPO), an asynchronous RL algorithm robust to latency. Experiments show GEPO achieves superior stability, with only a 3% performance drop from online training to 1800 s of latency.
- Score: 43.46954951944727
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As single-center computing approaches its power constraints, decentralized training becomes essential. However, traditional Reinforcement Learning (RL) methods, crucial for enhancing large-model post-training, cannot adapt to decentralized distributed training due to the tight coupling between parameter learning and rollout sampling. To address this, we propose HeteroRL, a heterogeneous RL architecture that decouples these processes, enabling stable training across geographically distributed nodes connected via the Internet. The core component is Group Expectation Policy Optimization (GEPO), an asynchronous RL algorithm robust to latency caused by network delays or heterogeneity in computational resources. Our study reveals that high latency significantly increases KL divergence, leading to higher variance of importance weights and training instability. GEPO mitigates this issue by using group expectation weighting to exponentially reduce the variance of importance weights, with theoretical guarantees. Experiments show that GEPO achieves superior stability (only a 3% performance drop from online training to 1800 s of latency) and reduces the best-to-last gap by 85% versus GSPO (1.8 vs. 12.0) while attaining the highest scores, highlighting its effectiveness in decentralized, resource-heterogeneous environments.
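The abstract does not spell out GEPO's exact estimator, but the variance-reduction idea it describes can be illustrated with a toy sketch. Everything below is an assumption made for illustration, not the paper's algorithm: ordinary per-sample importance weights p_new/p_old are compared against a hypothetical group-expectation variant that divides each new-policy probability by the group mean of the old-policy probabilities, which damps the extreme ratios that stale (high-latency) rollouts produce.

```python
import math
import random

def importance_weights(logp_new, logp_old):
    """Standard per-sample importance weights p_new / p_old."""
    return [math.exp(a - b) for a, b in zip(logp_new, logp_old)]

def group_expectation_weights(logp_new, logp_old):
    """Hypothetical group-expectation weighting: normalize each new-policy
    probability by the group mean of old-policy probabilities, so no single
    stale sample can blow up its own weight."""
    p_old_mean = sum(math.exp(b) for b in logp_old) / len(logp_old)
    return [math.exp(a) / p_old_mean for a in logp_new]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

random.seed(0)
# Simulate latency: the sampler's log-probs have drifted from the learner's.
logp_new = [random.uniform(-6.0, -1.0) for _ in range(256)]
logp_old = [lp + random.gauss(0.0, 1.5) for lp in logp_new]

w_std = importance_weights(logp_new, logp_old)
w_gepo = group_expectation_weights(logp_new, logp_old)
print(variance(w_std) > variance(w_gepo))
```

Under this drift model the per-sample weights are roughly log-normal and their variance explodes with the drift scale, while the group-normalized weights stay bounded by the new-policy probabilities alone, matching the abstract's motivation of taming importance-weight variance under latency.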
Related papers
- Harnessing Implicit Cooperation: A Multi-Agent Reinforcement Learning Approach Towards Decentralized Local Energy Markets [41.99844472131922]
Decentralized agents can approximate optimal coordination in local energy markets without explicit peer-to-peer communication. Stigmergic signaling provides sufficient context for complex grid coordination, offering a robust, privacy-preserving alternative to expensive centralized communication infrastructure.
arXiv Detail & Related papers (2026-02-17T22:22:32Z)
- Consolidation or Adaptation? PRISM: Disentangling SFT and RL Data via Gradient Concentration [56.074760766965085]
PRISM provides a dynamics-aware framework that arbitrates data based on its degree of cognitive conflict with the model's existing knowledge. Our findings suggest that disentangling data based on internal optimization regimes is crucial for scalable and robust agent alignment.
arXiv Detail & Related papers (2026-01-12T05:43:20Z)
- Multi-Agent Reinforcement Learning for Sample-Efficient Deep Neural Network Mapping [54.65536245955678]
We present a decentralized multi-agent reinforcement learning (MARL) framework designed to overcome the challenge of sample inefficiency. We introduce an agent clustering algorithm that assigns similar mapping parameters to the same agents based on correlation analysis. Experimental results show our MARL approach improves sample efficiency by 30-300x over standard single-agent RL.
arXiv Detail & Related papers (2025-07-22T05:51:07Z)
- Synergizing Reinforcement Learning and Genetic Algorithms for Neural Combinatorial Optimization [25.633698252033756]
We propose the Evolutionary Augmentation Mechanism (EAM) to synergize the learning efficiency of DRL with the global search power of GAs. EAM operates by generating solutions from a learned policy and refining them through domain-specific genetic operations such as crossover and mutation. EAM can be seamlessly integrated with state-of-the-art DRL solvers such as the Attention Model, POMO, and SymNCO.
arXiv Detail & Related papers (2025-06-11T05:17:30Z)
- DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization [55.06360285372418]
Group Relative Policy Optimization (GRPO) is a reinforcement learning method for large reasoning models (LRMs). In this work, we analyze the GRPO objective under a binary reward setting and reveal an inherent limitation: a question-level difficulty bias. We introduce a new Discriminative Constrained Optimization framework for reinforcing LRMs, grounded in the principle of discriminative learning.
arXiv Detail & Related papers (2025-05-18T11:08:32Z)
- Cluster-Aware Multi-Round Update for Wireless Federated Learning in Heterogeneous Environments [25.405210975577834]
This paper proposes a clustering strategy that leverages prior knowledge similarity to group devices with similar data and communication characteristics. A novel Cluster-Aware Multi-round Update (CAMU) strategy is proposed, which treats clusters as the basic units and adjusts the local update frequency based on the clustered contribution threshold.
arXiv Detail & Related papers (2025-05-06T02:48:48Z)
- OmniLearn: A Framework for Distributed Deep Learning over Heterogeneous Clusters [1.4131700241686853]
We develop an adaptive batch-scaling framework called OmniLearn to mitigate the effects of heterogeneous resources. Our approach is inspired by proportional controllers that balance load across heterogeneous servers, and it works under varying resource availability.
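The OmniLearn abstract only names proportional control; as a hypothetical sketch of that idea (the function name, gain, and update rule are all assumptions, not the paper's design), each worker's local batch size can be nudged a fraction of the way toward a share proportional to its measured throughput:

```python
def rebalance_batches(throughputs, total_batch, current=None, gain=0.5):
    """One proportional-control step: move each worker's batch size part of
    the way toward a share proportional to its throughput (samples/s)."""
    n = len(throughputs)
    if current is None:
        current = [total_batch // n] * n  # start from an even split
    total_tp = sum(throughputs)
    targets = [total_batch * tp / total_tp for tp in throughputs]
    # Proportional term: correct a fraction `gain` of the error each round.
    return [max(1, round(c + gain * (t - c))) for c, t in zip(current, targets)]

# A fast, a medium, and a slow server sharing a global batch of 512.
sizes = rebalance_batches([100, 50, 25], 512)
print(sizes)  # the fastest server receives the largest local batch
```

Calling this repeatedly with the previous `sizes` converges toward the throughput-proportional split; the gain trades convergence speed against oscillation when throughput measurements are noisy.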
arXiv Detail & Related papers (2025-03-21T18:26:24Z)
- You Are Your Own Best Teacher: Achieving Centralized-level Performance in Federated Learning under Heterogeneous and Long-tailed Data [54.56492110703343]
Data heterogeneity, stemming from local non-IID data and global long-tailed distributions, is a major challenge in federated learning (FL). We propose FedYoYo to improve representation learning by distilling knowledge between weakly and strongly augmented local samples. We show FedYoYo achieves state-of-the-art results, even surpassing centralized logit adjustment methods by 5.4% under global long-tailed settings.
arXiv Detail & Related papers (2025-03-10T04:57:20Z)
- SCALE: Self-regulated Clustered federAted LEarning in a Homogeneous Environment [4.925906256430176]
Federated Learning (FL) has emerged as a transformative approach for enabling distributed machine learning while preserving user privacy.
This paper presents a novel FL methodology that overcomes these limitations by eliminating the dependency on edge servers.
arXiv Detail & Related papers (2024-07-25T20:42:16Z)
- Distribution-Dependent Rates for Multi-Distribution Learning [26.38831409926518]
The recent multi-distribution learning (MDL) framework tackles this objective through dynamic interaction with the environment.
We provide distribution-dependent guarantees in the MDL regime, that scale with suboptimality gaps and result in superior dependence on the sample size.
We devise an adaptive optimistic algorithm, LCB-DR, that showcases enhanced dependence on the gaps, mirroring the contrast between uniform and optimistic allocation in the multi-armed bandit literature.
arXiv Detail & Related papers (2023-12-20T15:50:16Z)
- Beyond Reverse KL: Generalizing Direct Preference Optimization with Diverse Divergence Constraints [26.274786600234876]
The increasing capabilities of large language models (LLMs) raise opportunities for artificial general intelligence but amplify safety concerns.
RLHF has emerged as a promising pathway towards AI alignment but brings forth challenges due to its complexity and dependence on a separate reward model.
DPO has been proposed as an alternative, and it remains equivalent to RLHF under the reverse KL regularization constraint.
We show that under certain $f$-divergences, including the Jensen-Shannon divergence, forward KL divergence, and $\alpha$-divergences, the complex relationship between the reward and the optimal policy can also be simplified.
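For context, the reverse-KL case that this abstract generalizes is the standard DPO identity: the KL-regularized RLHF objective has a closed-form optimum, so the reward can be rewritten in terms of the optimal policy.

```latex
% Reverse-KL (standard DPO) case: the KL-regularized objective
%   \max_\pi \; \mathbb{E}[r(x,y)] - \beta\,\mathrm{KL}(\pi \,\|\, \pi_{\mathrm{ref}})
% is maximized in closed form by
\[
\pi^*(y \mid x) \;=\; \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,
\exp\!\bigl(r(x,y)/\beta\bigr),
\]
% so the reward is recoverable from the optimal policy up to a
% per-prompt constant:
\[
r(x,y) \;=\; \beta \log\frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
\;+\; \beta \log Z(x).
\]
```

The paper's claim is that analogous reward-policy simplifications hold for other $f$-divergences beyond this reverse-KL case.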
arXiv Detail & Related papers (2023-09-28T08:29:44Z)
- Efficient Parallel Split Learning over Resource-constrained Wireless Edge Networks [44.37047471448793]
In this paper, we advocate integrating the edge computing paradigm with parallel split learning (PSL).
We propose an innovative PSL framework, namely efficient parallel split learning (EPSL), to accelerate model training.
We show that the proposed EPSL framework significantly decreases the training latency needed to achieve a target accuracy.
arXiv Detail & Related papers (2023-03-26T16:09:48Z)
- Heterogeneous Federated Learning via Grouped Sequential-to-Parallel Training [60.892342868936865]
Federated learning (FL) is a rapidly growing privacy-preserving collaborative machine learning paradigm.
We propose a data heterogeneous-robust FL approach, FedGSP, to address this challenge.
We show that FedGSP improves the accuracy by 3.7% on average compared with seven state-of-the-art approaches.
arXiv Detail & Related papers (2022-01-31T03:15:28Z)
- False Correlation Reduction for Offline Reinforcement Learning [115.11954432080749]
We propose falSe COrrelation REduction (SCORE) for offline RL, a practically effective and theoretically provable algorithm.
We empirically show that SCORE achieves SoTA performance with a 3.1x acceleration on various tasks in a standard benchmark (D4RL).
arXiv Detail & Related papers (2021-10-24T15:34:03Z)
- Decentralized Local Stochastic Extra-Gradient for Variational Inequalities [125.62877849447729]
We consider distributed variational inequalities (VIs) on domains where the problem data is heterogeneous (non-IID) and distributed across many devices.
We make a very general assumption on the computational network that covers fully decentralized computation settings.
We theoretically analyze its convergence rate in the strongly-monotone, monotone, and non-monotone settings.
arXiv Detail & Related papers (2021-06-15T17:45:51Z)
- Combining Pessimism with Optimism for Robust and Efficient Model-Based Deep Reinforcement Learning [56.17667147101263]
In real-world tasks, reinforcement learning agents encounter situations that are not present during training time.
To ensure reliable performance, the RL agents need to exhibit robustness against worst-case situations.
We propose the Robust Hallucinated Upper-Confidence RL (RH-UCRL) algorithm to provably solve this problem.
arXiv Detail & Related papers (2021-03-18T16:50:17Z)
- A Decentralized Approach to Bayesian Learning [26.74338464389837]
Motivated by decentralized approaches to machine learning, we propose collaborative learning taking the form of decentralized Langevin dynamics.
Our analysis shows that the KL divergence between the Markov chain and the target posterior distribution is exponentially decreasing.
The performance of individual agents with locally available data is on par with the centralized setting, with considerable improvement in the rate.
arXiv Detail & Related papers (2020-07-14T03:59:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.