RLHF: A comprehensive Survey for Cultural, Multimodal and Low Latency Alignment Methods
- URL: http://arxiv.org/abs/2511.03939v1
- Date: Thu, 06 Nov 2025 00:35:17 GMT
- Title: RLHF: A comprehensive Survey for Cultural, Multimodal and Low Latency Alignment Methods
- Authors: Raghav Sharma, Manan Mehta, Sai Tiger Raina
- Abstract summary: This survey synthesizes the new frontier of alignment research by addressing critical gaps in multi-modal alignment, cultural fairness, and low-latency optimization. By providing a comparative synthesis of these techniques and outlining open challenges, this work serves as an essential roadmap for researchers building more robust, efficient, and equitable AI systems.
- Score: 0.09558392439655011
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement Learning from Human Feedback (RLHF) is the standard for aligning Large Language Models (LLMs), yet recent progress has moved beyond canonical text-based methods. This survey synthesizes the new frontier of alignment research by addressing critical gaps in multi-modal alignment, cultural fairness, and low-latency optimization. To systematically explore these domains, we first review foundational algorithms, including PPO, DPO, and GRPO, before presenting a detailed analysis of the latest innovations. By providing a comparative synthesis of these techniques and outlining open challenges, this work serves as an essential roadmap for researchers building more robust, efficient, and equitable AI systems.
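To make the foundational objectives named in the abstract concrete, here is a minimal, self-contained sketch of the DPO loss in PyTorch; the function and tensor names are illustrative assumptions for this listing, not code from the surveyed paper.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss on per-example sequence log-probs.

    Each argument is a 1-D tensor of summed log-probabilities of the chosen
    or rejected response under the trainable policy or the frozen reference.
    """
    # Implicit rewards: how much more the policy favors each response
    # than the reference model does, scaled by beta.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # DPO minimizes -log sigmoid of the chosen-vs-rejected reward margin.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with random log-probabilities for a batch of 4 preference pairs.
torch.manual_seed(0)
pc, pr = torch.randn(4), torch.randn(4)   # policy log-probs (chosen, rejected)
rc, rr = torch.randn(4), torch.randn(4)   # reference log-probs (chosen, rejected)
print(dpo_loss(pc, pr, rc, rr).item())
```

PPO and GRPO, by contrast, keep an explicit (or group-relative) reward signal and optimize a clipped policy-gradient objective rather than this closed-form preference loss.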
Related papers
- Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities [62.05713042908654]
This paper provides a review of advances in Large Language Models (LLMs) alignment through the lens of inverse reinforcement learning (IRL). We highlight the necessity of constructing neural reward models from human data and discuss the formal and practical implications of this paradigm shift.
arXiv Detail & Related papers (2025-07-17T14:22:24Z) - A Technical Survey of Reinforcement Learning Techniques for Large Language Models [33.38582292895673]
Reinforcement Learning (RL) has emerged as a transformative approach for aligning and enhancing Large Language Models (LLMs). RLHF remains dominant for alignment, and outcome-based RL such as RLVR significantly improves stepwise reasoning. Persistent challenges such as reward hacking, computational costs, and scalable feedback collection underscore the need for continued innovation.
arXiv Detail & Related papers (2025-07-05T19:13:00Z) - A Survey of Direct Preference Optimization [103.59317151002693]
Large Language Models (LLMs) have demonstrated unprecedented generative capabilities. Their alignment with human values remains critical for ensuring helpful and harmless deployments. Direct Preference Optimization (DPO) has recently gained prominence as a streamlined alternative.
arXiv Detail & Related papers (2025-03-12T08:45:15Z) - A Survey on Post-training of Large Language Models [185.51013463503946]
Large Language Models (LLMs) have fundamentally transformed natural language processing, making them indispensable across domains ranging from conversational systems to scientific exploration. Remaining challenges necessitate advanced post-training language models (PoLMs) to address shortcomings such as restricted reasoning capacities, ethical uncertainties, and suboptimal domain-specific performance. This paper presents the first comprehensive survey of PoLMs, systematically tracing their evolution across five core paradigms: Fine-tuning, which enhances task-specific accuracy; Alignment, which ensures ethical coherence and alignment with human preferences; Reasoning, which advances multi-step inference despite challenges in reward design; and Integration and Adaptation.
arXiv Detail & Related papers (2025-03-08T05:41:42Z) - Many of Your DPOs are Secretly One: Attempting Unification Through Mutual Information [5.655057078073446]
Post-alignment of large language models (LLMs) is critical in improving their utility, safety, and alignment with human intentions. Direct preference optimisation (DPO) has become one of the most widely used algorithms for achieving this alignment. This paper introduces a unifying framework inspired by mutual information, which proposes a new loss function with flexible priors.
arXiv Detail & Related papers (2025-01-02T21:31:38Z) - A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications [49.58110250828268]
Direct Preference Optimization (DPO) has emerged as a promising approach for alignment. Despite DPO's various advancements and inherent limitations, an in-depth review of these aspects is currently lacking in the literature.
arXiv Detail & Related papers (2024-10-21T02:27:24Z) - Towards Automated Machine Learning Research [4.169915659794567]
This paper explores a top-down approach to automating incremental advances in machine learning research through component-level innovation.
Our framework systematically generates novel components, validates their feasibility, and evaluates their performance against existing baselines.
By incorporating a reward model to prioritize promising hypotheses, we aim to improve the efficiency of the hypothesis generation and evaluation process.
arXiv Detail & Related papers (2024-09-09T00:47:30Z) - Towards a Unified View of Preference Learning for Large Language Models: A Survey [88.66719962576005]
Large Language Models (LLMs) exhibit remarkably powerful capabilities.
One of the crucial factors to achieve success is aligning the LLM's output with human preferences.
We decompose all the strategies in preference learning into four components: model, data, feedback, and algorithm.
arXiv Detail & Related papers (2024-09-04T15:11:55Z) - Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint [56.74058752955209]
This paper studies the alignment process of generative models with Reinforcement Learning from Human Feedback (RLHF).
We first identify the primary challenge of popular existing methods such as offline PPO and offline DPO: a lack of strategic exploration of the environment.
We then propose efficient algorithms with finite-sample theoretical guarantees. (The KL-regularized objective underlying this setting is sketched after this list.)
arXiv Detail & Related papers (2023-12-18T18:58:42Z)
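As referenced in the KL-constrained entry above, a standard formulation of the KL-regularized RLHF objective that offline PPO and offline DPO both approximate is written below; the notation is generic rather than taken from any single surveyed paper.

```latex
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
\left[ r(x, y) \right]
\;-\;
\beta\, \mathbb{D}_{\mathrm{KL}}\!\left[ \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right]
```

Here r is the learned reward model, \pi_{\mathrm{ref}} is the frozen reference (SFT) policy, and \beta trades off reward maximization against staying close to the reference distribution; DPO's implicit reward in the sketch above is derived from the closed-form optimal policy of this same objective.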