Beyond Monolithic Rewards: A Hybrid and Multi-Aspect Reward Optimization for MLLM Alignment
- URL: http://arxiv.org/abs/2510.05283v1
- Date: Mon, 06 Oct 2025 18:53:23 GMT
- Title: Beyond Monolithic Rewards: A Hybrid and Multi-Aspect Reward Optimization for MLLM Alignment
- Authors: Radha Gulhane, Sathish Reddy Indurthi
- Abstract summary: We propose a hybrid reward modeling framework that integrates complementary reward paradigms. We show consistent improvements across different multimodal benchmarks when applying hybrid and multi-aspect reward modeling. Our best performing model in the 3B family achieves an overall average improvement of 9.5% across general and math reasoning tasks.
- Score: 1.8552770604791606
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Aligning multimodal large language models (MLLMs) with human preferences often relies on single-signal, model-based reward methods. Such monolithic rewards often lack confidence calibration across domain-specific tasks, fail to capture diverse aspects of human preferences, and require extensive data annotation and reward model training. In this work, we propose a hybrid reward modeling framework that integrates complementary reward paradigms: (i) model-based rewards, where a learned reward model predicts scalar or vector scores from synthetic and human feedback, and (ii) rule-based rewards, where domain-specific heuristics provide explicit correctness signals with confidence. Beyond accuracy, we further incorporate multi-aspect rewards to enforce instruction adherence and introduce a generalized length-penalty reward to stabilize training and improve performance. The proposed framework provides a flexible and effective approach to aligning MLLMs through reinforcement learning policy optimization. Our experiments show consistent improvements across different multimodal benchmarks when applying hybrid and multi-aspect reward modeling. Our best performing model in the 3B family achieves an overall average improvement of ~9.5% across general and math reasoning tasks. Focusing specifically on mathematical benchmarks, the model achieves a significant average improvement of ~16%, highlighting its effectiveness in mathematical reasoning and problem solving.
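To make the hybrid design concrete, below is a minimal Python sketch of how a model-based score, a rule-based correctness signal, a multi-aspect instruction-adherence check, and a length penalty could be combined into a single scalar reward for policy optimization. The weights, function names, and the specific penalty form are illustrative assumptions and not the paper's actual implementation.

```python
# Illustrative sketch of a hybrid, multi-aspect reward.
# Weights, helper names, and the penalty form are assumptions,
# not the implementation described in the paper.
from dataclasses import dataclass


@dataclass
class RewardConfig:
    w_model: float = 0.5    # weight on the learned reward-model score
    w_rule: float = 0.5     # weight on the rule-based correctness signal
    w_format: float = 0.1   # weight on the instruction-adherence reward
    max_len: int = 1024     # soft length budget (in tokens/words)
    len_coeff: float = 0.05 # strength of the length penalty


def rule_based_reward(response: str, reference: str) -> float:
    """Domain-specific heuristic: exact-match correctness, e.g. for math answers."""
    return 1.0 if response.strip() == reference.strip() else 0.0


def format_reward(response: str) -> float:
    """Toy instruction-adherence check, e.g. the answer is wrapped in \\boxed{}."""
    return 1.0 if "\\boxed{" in response else 0.0


def length_penalty(response: str, cfg: RewardConfig) -> float:
    """Generalized length penalty: free within the budget, linear cost beyond it."""
    overflow = max(0, len(response.split()) - cfg.max_len)
    return -cfg.len_coeff * (overflow / cfg.max_len)


def hybrid_reward(model_score: float, response: str, reference: str,
                  cfg: RewardConfig = RewardConfig()) -> float:
    """Combine model-based, rule-based, and multi-aspect signals into one scalar
    that can be used as the reward in RL policy optimization."""
    return (cfg.w_model * model_score
            + cfg.w_rule * rule_based_reward(response, reference)
            + cfg.w_format * format_reward(response)
            + length_penalty(response, cfg))
```

In this sketch the rule-based term supplies a high-confidence correctness signal on verifiable domains (e.g., math answers), while the learned reward model covers open-ended preferences; the length penalty only activates once the response exceeds the budget, which is one simple way to stabilize training against verbosity.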
Related papers
- SetPO: Set-Level Policy Optimization for Diversity-Preserving LLM Reasoning [50.93295951454092]
We introduce a set-level diversity objective defined over sampled trajectories using kernelized similarity. Our approach derives a leave-one-out marginal contribution for each sampled trajectory and integrates this objective as a plug-in advantage-shaping term for policy optimization. Experiments across a range of model scales demonstrate the effectiveness of our proposed algorithm, consistently outperforming strong baselines in both Pass@1 and Pass@K across various benchmarks.
arXiv Detail & Related papers (2026-02-01T07:13:20Z) - Uncertainty Quantification for Large Language Model Reward Learning under Heterogeneous Human Feedback [8.538830579425147]
We study estimation and uncertainty quantification for reward models used in aligning large language models (LLMs). A key component of LLM alignment is reinforcement learning from human feedback.
arXiv Detail & Related papers (2025-12-02T20:22:25Z) - Probing Preference Representations: A Multi-Dimensional Evaluation and Analysis Method for Reward Models [63.00458229517523]
This work addresses the evaluation challenge of reward models by probing preference representations. We construct a Multi-dimensional Reward Model Benchmark (MRMBench), a collection of six probing tasks for different preference dimensions. We introduce an analysis method, inference-time probing, which identifies the dimensions used during reward prediction and enhances its interpretability.
arXiv Detail & Related papers (2025-11-16T05:29:29Z) - Activation Reward Models for Few-Shot Model Alignment [77.37511364793515]
We introduce Activation Reward Models (Activation RMs). Activation RMs leverage activation steering to construct well-aligned reward signals using minimal supervision and no additional model finetuning. We demonstrate the effectiveness of Activation RMs in mitigating reward hacking behaviors, highlighting their utility for safety-critical applications.
arXiv Detail & Related papers (2025-07-02T05:10:29Z) - MOSLIM: Align with diverse preferences in prompts through reward classification [6.6431471703308915]
We introduce a novel multi-objective alignment method, MOSLIM, which utilizes a single reward model and policy model to address diverse objectives. MOSLIM provides a flexible way to control these objectives through prompting and does not require preference training during the SFT phase. We demonstrate the efficacy of our proposed method across several multi-objective benchmarks and conduct ablation studies on various reward model sizes and policy optimization methods.
arXiv Detail & Related papers (2025-05-24T12:22:21Z) - Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems [54.4392552373835]
Reward models (RMs) are crucial for the training and inference-time scaling of large language models (LLMs). We propose agentic reward modeling, a reward system that combines reward models with verifiable correctness signals to provide reliable rewards. We conduct comprehensive experiments on existing reward model benchmarks and inference-time best-of-n searches on real-world downstream tasks.
arXiv Detail & Related papers (2025-02-26T17:19:12Z) - MPO: An Efficient Post-Processing Framework for Mixing Diverse Preference Alignment [14.541973333460149]
Mixing Preference Optimization (MPO) is a post-processing framework for aggregating single-objective policies. MPO achieves balanced performance across diverse preferences, outperforming existing models with significantly reduced computational costs.
arXiv Detail & Related papers (2025-02-25T23:22:12Z) - MM-RLHF: The Next Step Forward in Multimodal LLM Alignment [59.536850459059856]
We introduce MM-RLHF, a dataset containing $\mathbf{120k}$ fine-grained, human-annotated preference comparison pairs. We propose several key innovations to improve the quality of reward models and the efficiency of alignment algorithms. Our approach is rigorously evaluated across $\mathbf{10}$ distinct dimensions and $\mathbf{27}$ benchmarks.
arXiv Detail & Related papers (2025-02-14T18:59:51Z) - HAF-RM: A Hybrid Alignment Framework for Reward Model Training [51.59246299566669]
We propose a hybrid alignment framework HaF-RM for reward model training. It offers a principled and effective approach to enhancing the performance and alignment of reward models.
arXiv Detail & Related papers (2024-07-04T23:26:56Z) - ALaRM: Align Language Models via Hierarchical Rewards Modeling [41.79125107279527]
We introduce ALaRM, the first framework modeling hierarchical rewards in reinforcement learning from human feedback.
The framework addresses the limitations of current alignment approaches by integrating holistic rewards with aspect-specific rewards.
We validate our approach through applications in long-form question answering and machine translation tasks.
arXiv Detail & Related papers (2024-03-11T14:28:40Z)