G-Core: A Simple, Scalable and Balanced RLHF Trainer
- URL: http://arxiv.org/abs/2507.22789v2
- Date: Thu, 31 Jul 2025 02:18:13 GMT
- Title: G-Core: A Simple, Scalable and Balanced RLHF Trainer
- Authors: Junyu Wu, Weiming Chang, Xiaotao Liu, Guanyou He, Haoqiang Hong, Boqi Liu, Hongtao Tian, Tao Yang, Yunsheng Shi, Feng Lin, Ting Yao,
- Abstract summary: Reinforcement Learning from Human Feedback (RLHF) has become an increasingly popular paradigm for training large language models. We present G-Core, a simple, scalable, and balanced RLHF training framework designed to address these challenges.
- Score: 35.65011046623611
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement Learning from Human Feedback (RLHF) has become an increasingly popular paradigm for training large language models (LLMs) and diffusion models. While existing RLHF training systems have enabled significant progress, they often face challenges in scaling to multi-modal and diffusion workflows and adapting to dynamic workloads. In particular, current approaches may encounter limitations in controller scalability, flexible resource placement, and efficient orchestration when handling complex RLHF pipelines, especially in scenarios involving dynamic sampling or generative reward modeling. In this paper, we present G-Core, a simple, scalable, and balanced RLHF training framework designed to address these challenges. G-Core introduces a parallel controller programming model, enabling flexible and efficient orchestration of complex RLHF workflows without the bottlenecks of a single centralized controller. Furthermore, we propose a dynamic placement schema that adaptively partitions resources and schedules workloads, significantly reducing hardware idle time and improving utilization, even under highly variable training conditions. G-Core has successfully trained models that support WeChat product features serving a large-scale user base, demonstrating its effectiveness and robustness in real-world scenarios. Our results show that G-Core advances the state of the art in RLHF training, providing a solid foundation for future research and deployment of large-scale, human-aligned models.
Related papers
- RLHFless: Serverless Computing for Efficient RLHF [13.743738615300662]
Reinforcement Learning from Human Feedback (RLHF) has been widely applied to Large Language Model (LLM) post-training. We present RLHFless, the first scalable training framework for synchronous RLHF, built on serverless computing environments.
arXiv Detail & Related papers (2026-02-26T07:45:37Z)
- SCALER: Synthetic Scalable Adaptive Learning Environment for Reasoning [24.80806018678682]
Reinforcement learning (RL) offers a principled way to enhance the reasoning capabilities of large language models. In practice, RL progress often slows when task difficulty becomes poorly aligned with model capability. We propose a framework that sustains effective learning signals through adaptive environment design.
arXiv Detail & Related papers (2026-01-08T10:42:04Z)
- Sample-Efficient Neurosymbolic Deep Reinforcement Learning [49.60927398960061]
We propose a neuro-symbolic Deep RL approach that integrates background symbolic knowledge to improve sample efficiency. Online reasoning is performed to guide the training process through two mechanisms. We show improved performance over a state-of-the-art reward machine baseline.
arXiv Detail & Related papers (2026-01-06T09:28:53Z)
- Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning [61.380634253724594]
Large-scale autoregressive models pretrained on next-token prediction and finetuned with reinforcement learning (RL). We show that it is possible to overcome this problem by acting and exploring within the internal representations of an autoregressive model.
arXiv Detail & Related papers (2025-12-23T18:51:50Z)
- Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models [71.9060068259379]
We propose cascaded domain-wise reinforcement learning to build general-purpose reasoning models. Our 14B model, after RL, outperforms its SFT teacher, DeepSeek-R1-0528, on LiveCodeBench v5/v6 Pro and silver-medal performance in the 2025 International Olympiad in Informatics (IOI).
arXiv Detail & Related papers (2025-12-15T18:02:35Z)
- Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter [52.111923076688505]
Training Large Language Models (LLMs) with strong reasoning capabilities marks a significant milestone, unlocking new frontiers in complex problem-solving. We propose TLT, a system that accelerates reasoning RL training losslessly by integrating adaptive speculative decoding.
arXiv Detail & Related papers (2025-11-20T18:59:25Z)
- Inference-Time Alignment Control for Diffusion Models with Reinforcement Learning Guidance [46.06527859746679]
We introduce Reinforcement Learning Guidance (RLG), an inference-time method that adapts Classifier-Free Guidance (CFG). RLG consistently improves the performance of RL fine-tuned models across various RL algorithms and downstream tasks, including human preferences, compositional control, compressibility, and text rendering. Our approach provides a practical and theoretically sound solution for enhancing and controlling diffusion model alignment at inference time.
arXiv Detail & Related papers (2025-08-28T17:18:31Z)
- WeChat-YATT: A Scalable, Simple, Efficient, and Production Ready Training Library [34.5103280294468]
We introduce WeChat-YATT (Yet Another Transformer Trainer in WeChat), a simple, scalable, and balanced RLHF training framework. YATT features a parallel controller programming model that enables flexible and efficient orchestration of complex RLHF workflows. We evaluate WeChat-YATT across diverse experimental scenarios, demonstrating its substantial throughput improvements over state-of-the-art RLHF training frameworks.
arXiv Detail & Related papers (2025-08-11T13:31:53Z)
- Scaling Offline RL via Efficient and Expressive Shortcut Models [13.050231036248338]
Offline reinforcement learning (RL) remains challenging due to the iterative nature of their noise sampling processes. We introduce Scalable Offline Reinforcement Learning (SORL), a new offline RL algorithm that leverages shortcut models to scale both training and inference. We demonstrate that SORL achieves strong performance across a range of offline RL tasks and exhibits positive scaling behavior with increased test-time compute.
arXiv Detail & Related papers (2025-05-28T20:59:22Z)
- Multi-fidelity Reinforcement Learning Control for Complex Dynamical Systems [42.2790464348673]
We propose a multi-fidelity reinforcement learning framework for controlling instabilities in complex systems. The effect of the proposed framework is demonstrated on two complex dynamics in physics.
arXiv Detail & Related papers (2025-04-08T00:50:15Z)
- Does RLHF Scale? Exploring the Impacts From Data, Model, and Method [83.53178716807776]
This study explores the scaling properties of Reinforcement Learning from Human Feedback in Large Language Models. We analyze key components in the RLHF framework (model size, data composition, and inference budget) and their impacts on performance.
arXiv Detail & Related papers (2024-12-08T17:19:48Z)
- Learning Reward and Policy Jointly from Demonstration and Preference Improves Alignment [58.049113055986375]
We develop a single-stage approach named Alignment with Integrated Human Feedback (AIHF) to train reward models and the policy. The proposed approach admits a suite of efficient algorithms, which can easily reduce to, and leverage, popular alignment algorithms. We demonstrate the efficiency of the proposed solutions with extensive experiments involving alignment problems in LLMs and robotic control problems in MuJoCo.
arXiv Detail & Related papers (2024-06-11T01:20:53Z)
- OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework [27.336483161388777]
We introduce OpenRLHF, a user-friendly, scalable, and easy-to-learn open-source RLHF framework built upon Ray, vLLM, DeepSpeed, and HuggingFace Transformers. Experimental results show that OpenRLHF achieves superior training efficiency with speedups ranging from 1.22x to 1.68x across different model sizes.
arXiv Detail & Related papers (2024-05-20T01:04:40Z)
- Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement [67.1393112206885]
Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks.
We introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level.
We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks.
arXiv Detail & Related papers (2024-02-09T07:45:26Z)
- An Adaptive Placement and Parallelism Framework for Accelerating RLHF Training [11.749347656959822]
We propose a flexible model placement framework that offers two general and agile model placement strategies.
Our framework provides a simple user interface and guidelines to easily and flexibly configure these strategies in various training scenarios.
arXiv Detail & Related papers (2023-12-19T03:24:55Z)
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model [119.65409513119963]
We introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form.
The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight.
Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods.
arXiv Detail & Related papers (2023-05-29T17:57:46Z)
- Learning a model is paramount for sample efficiency in reinforcement learning control of PDEs [5.488334211013093]
We show that learning an actuated model in parallel to training the RL agent significantly reduces the total amount of required data sampled from the real system.
We also show that iteratively updating the model is of major importance to avoid biases in the RL training.
arXiv Detail & Related papers (2023-02-14T16:14:39Z)
- Deep Reinforcement Learning for Computational Fluid Dynamics on HPC Systems [17.10464381844892]
Reinforcement learning (RL) is highly suitable for devising control strategies in the context of dynamical systems.
Recent research results indicate that RL-augmented computational fluid dynamics (CFD) solvers can exceed the current state of the art.
We present Relexi as a scalable RL framework that bridges the gap between machine learning and modern CFD solvers on HPC systems.
arXiv Detail & Related papers (2022-05-13T08:21:18Z)
- Reinforcement Learning as One Big Sequence Modeling Problem [84.84564880157149]
Reinforcement learning (RL) is typically concerned with estimating single-step policies or single-step models.
We view RL as a sequence modeling problem, with the goal being to predict a sequence of actions that leads to a sequence of high rewards.
arXiv Detail & Related papers (2021-06-03T17:58:51Z)
- Regularizing Generative Adversarial Networks under Limited Data [88.57330330305535]
This work proposes a regularization approach for training robust GAN models on limited data.
We show a connection between the regularized loss and an f-divergence called LeCam-divergence, which we find is more robust under limited training data.
arXiv Detail & Related papers (2021-04-07T17:59:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.