RLHFless: Serverless Computing for Efficient RLHF
- URL: http://arxiv.org/abs/2602.22718v1
- Date: Thu, 26 Feb 2026 07:45:37 GMT
- Title: RLHFless: Serverless Computing for Efficient RLHF
- Authors: Rui Wei, Hanfei Yu, Shubham Jain, Yogarajan Sivakumar, Devesh Tiwari, Jian Li, Seung-Jong Park, Hao Wang
- Abstract summary: Reinforcement Learning from Human Feedback (RLHF) has been widely applied to Large Language Model (LLM) post-training. We present RLHFless, the first scalable training framework for synchronous RLHF, built on serverless computing environments.
- Score: 13.743738615300662
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement Learning from Human Feedback (RLHF) has been widely applied to Large Language Model (LLM) post-training to align model outputs with human preferences. Recent models, such as DeepSeek-R1, have also shown RLHF's potential to improve LLM reasoning on complex tasks. In RL, inference and training coexist, creating dynamic resource demands throughout the workflow. Compared to traditional RL, RLHF further strains training efficiency because of growing model sizes and resource consumption. Several RLHF frameworks aim to balance flexible abstraction with efficient execution, but they rely on serverful infrastructures, which struggle with fine-grained resource variability. As a result, during synchronous RLHF training, idle time between or within RL components often causes overhead and wasted resources. To address these issues, we present RLHFless, the first scalable training framework for synchronous RLHF built on serverless computing environments. RLHFless adapts to dynamic resource demands throughout the RLHF pipeline, pre-computes shared prefixes to avoid repeated computation, and uses a cost-aware actor scaling strategy that accounts for response-length variation to find sweet spots with lower cost and higher speed. In addition, RLHFless assigns workloads efficiently to reduce intra-function imbalance and idle time. Experiments on both physical testbeds and a large-scale simulated cluster show that RLHFless achieves up to a 1.35x speedup and 44.8% cost reduction over the state-of-the-art baseline.
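The abstract does not spell out the scaling rule itself, so the Python sketch below is only a rough illustration of the trade-off a cost-aware actor scaling strategy navigates: response lengths are highly skewed, so beyond some actor count the longest responses dominate the makespan and extra actors only add billed idle time. The cost model, throughput constant, and all function names here are assumptions for illustration, not RLHFless's actual implementation.

```python
"""Hypothetical sketch of a cost-aware actor scaling strategy.

Not RLHFless's actual algorithm: the cost model, constants, and
function names below are illustrative assumptions.
"""
from typing import List

PRICE_PER_ACTOR_SECOND = 0.00012  # assumed serverless price (USD/s)
TOKENS_PER_SECOND = 40.0          # assumed per-actor decode throughput

def batch_makespan(response_lengths: List[int], num_actors: int) -> float:
    """Greedy longest-first assignment; the makespan is the slowest actor.

    Because response lengths vary widely, the longest responses dominate
    wall-clock time once actors outnumber the stragglers.
    """
    loads = [0.0] * num_actors
    for length in sorted(response_lengths, reverse=True):
        idx = loads.index(min(loads))       # least-loaded actor
        loads[idx] += length / TOKENS_PER_SECOND
    return max(loads)

def choose_actor_count(response_lengths: List[int], max_actors: int) -> int:
    """Scan actor counts and pick a sweet spot between cost and speed.

    'Cost' here is actor-seconds billed; the sweet spot is where adding
    actors stops shortening the makespan enough to pay for itself.
    """
    best_n, best_score = 1, float("inf")
    for n in range(1, max_actors + 1):
        t = batch_makespan(response_lengths, n)
        cost = n * t * PRICE_PER_ACTOR_SECOND
        score = cost * t                    # toy cost-times-latency objective
        if score < best_score:
            best_n, best_score = n, score
    return best_n

if __name__ == "__main__":
    # Skewed response lengths, as is typical for reasoning rollouts.
    lengths = [64] * 28 + [512] * 3 + [4096]
    print("chosen actors:", choose_actor_count(lengths, max_actors=16))
```

On the example batch above, the single 4096-token response caps how fast any actor count can finish, so the scan settles on a small pool rather than scaling out to one actor per prompt; a real system would fold in cold-start latency and prefix reuse as well.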
Related papers
- Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter [52.111923076688505]
Training Large Language Models (LLMs) with strong reasoning capabilities marks a significant milestone, unlocking new frontiers in complex problem-solving. We propose TLT, a system that accelerates reasoning RL training losslessly by integrating adaptive speculative decoding.
arXiv Detail & Related papers (2025-11-20T18:59:25Z)
- RLBoost: Harvesting Preemptible Resources for Cost-Efficient Reinforcement Learning on LLMs [48.94639777633359]
We present RLBoost, a systematic solution for cost-efficient RL training that harvests preemptible GPU resources. RLBoost increases training throughput by 1.51x-1.97x while improving cost efficiency by 28%-49% compared to using only on-demand GPU resources.
arXiv Detail & Related papers (2025-10-22T04:19:37Z)
- QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs [80.76334908639745]
We propose QeRL, a Quantization-enhanced Reinforcement Learning framework for large language models (LLMs). QeRL addresses the efficiency bottlenecks of RL training by combining NVFP4 quantization with Low-Rank Adaptation (LoRA); a minimal LoRA sketch appears after this list. Experiments demonstrate that QeRL delivers over a 1.5x speedup in the rollout phase.
arXiv Detail & Related papers (2025-10-13T17:55:09Z)
- G-Core: A Simple, Scalable and Balanced RLHF Trainer [35.65011046623611]
Reinforcement Learning from Human Feedback (RLHF) has become an increasingly popular paradigm for training large language models. We present G-Core, a simple, scalable, and balanced RLHF training framework designed to address the challenges of scaling RLHF.
arXiv Detail & Related papers (2025-07-30T15:55:08Z)
- StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation [55.75008325187133]
Reinforcement learning (RL) has become the core post-training technique for large language models (LLMs). StreamRL is designed with disaggregation from first principles to address two types of performance bottlenecks. Experiments show that StreamRL improves throughput by up to 2.66x compared to existing state-of-the-art systems.
arXiv Detail & Related papers (2025-04-22T14:19:06Z)
- Does RLHF Scale? Exploring the Impacts From Data, Model, and Method [83.53178716807776]
This study explores the scaling properties of Reinforcement Learning from Human Feedback (RLHF) in Large Language Models. We analyze key components of the RLHF framework (model size, data composition, and inference budget) and their impacts on performance.
arXiv Detail & Related papers (2024-12-08T17:19:48Z)
- Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models [11.624678008637623]
We propose separating generation and learning in RLHF. Asynchronous training relies on an underexplored regime: online but off-policy RLHF. Among the methods studied, online DPO is found to be the most robust to off-policy data.
arXiv Detail & Related papers (2024-10-23T19:59:50Z)
- OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework [29.16323641419201]
OpenRLHF is a user-friendly, scalable, and easy-to-learn open-source framework for fine-tuning Large Language Models (LLMs) via Reinforcement Learning from Human Feedback (RLHF), built upon Ray, vLLM, DeepSpeed, and HuggingFace Transformers.
arXiv Detail & Related papers (2024-05-20T01:04:40Z)
- RLHF Workflow: From Reward Modeling to Online RLHF [79.83927049253924]
We present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF) in this technical report.
Online iterative RLHF is widely reported to outperform its offline counterpart by a large margin in the recent large language model (LLM) literature.
We show that supervised fine-tuning (SFT) and iterative RLHF can obtain state-of-the-art performance with fully open-source datasets.
arXiv Detail & Related papers (2024-05-13T15:50:39Z)
- Parameter Efficient Reinforcement Learning from Human Feedback [27.687265760622918]
Reinforcement Learning from Human Feedback (RLHF) effectively aligns pretrained Large Language and Vision-Language Models with human preferences.
To alleviate some of the computational burden of fine-tuning, parameter-efficient methods such as LoRA were introduced, motivating the parameter-efficient RLHF (PE-RLHF) setup.
We benchmark the PE-RLHF setup on six diverse datasets spanning summarization, harmless/helpful response generation, UI automation, and visual question answering.
arXiv Detail & Related papers (2024-03-15T21:43:46Z)
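Two of the papers above (QeRL and PE-RLHF) build on Low-Rank Adaptation (LoRA), so a minimal NumPy sketch of the LoRA update rule is included here for reference. The rule itself follows the published LoRA technique (freeze the base weight W, learn a low-rank delta B A scaled by alpha/r); the shapes, hyperparameters, and function names below are illustrative assumptions, not taken from either paper.

```python
"""Minimal NumPy sketch of the LoRA update used by QeRL and PE-RLHF.

LoRA (Hu et al., 2021) freezes the pretrained weight W and learns a
low-rank delta B @ A, so only r * (d_in + d_out) parameters train.
Shapes and hyperparameters here are illustrative.
"""
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 256, 256, 8, 16

W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)  # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01               # trainable down-projection
B = np.zeros((d_out, r))                                # trainable up-projection, zero-init

def lora_forward(x: np.ndarray) -> np.ndarray:
    """y = W x + (alpha / r) * B (A x); the delta starts at zero because B = 0."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# At initialization the adapter is a no-op, so behavior matches the base model.
assert np.allclose(lora_forward(x), W @ x)
```

The zero initialization of B is what makes RL fine-tuning with LoRA stable at the start: the policy initially matches the frozen base model exactly, and only the small A and B matrices receive gradient updates.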