Omni-RRM: Advancing Omni Reward Modeling via Automatic Rubric-Grounded Preference Synthesis
- URL: http://arxiv.org/abs/2602.00846v1
- Date: Sat, 31 Jan 2026 18:20:45 GMT
- Title: Omni-RRM: Advancing Omni Reward Modeling via Automatic Rubric-Grounded Preference Synthesis
- Authors: Zicheng Kong, Dehua Ma, Zhenbo Xu, Alven Yang, Yiwei Ru, Haoran Wang, Zixuan Zhou, Fuqing Bie, Liuyu Xiang, Huijia Wu, Jian Zhao, Zhaofeng He,
- Abstract summary: A critical bottleneck remains the lack of effective reward models (RMs). We introduce Omni-RRM, the first open-source rubric-grounded reward model. It produces structured, multi-dimensional preference judgments with dimension-wise justifications across text, image, video, and audio.
- Score: 22.55861092515539
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multimodal large language models (MLLMs) have shown remarkable capabilities, yet their performance is often capped by the coarse nature of existing alignment techniques. A critical bottleneck remains the lack of effective reward models (RMs): existing RMs are predominantly vision-centric, return opaque scalar scores, and rely on costly human annotations. We introduce \textbf{Omni-RRM}, the first open-source rubric-grounded reward model that produces structured, multi-dimensional preference judgments with dimension-wise justifications across \textbf{text, image, video, and audio}. At the core of our approach is \textbf{Omni-Preference}, a large-scale dataset built via a fully automated pipeline: we synthesize candidate response pairs by contrasting models of different capabilities, and use strong teacher models to \emph{reconcile and filter} preferences while providing a modality-aware \emph{rubric-grounded rationale} for each pair. This eliminates the need for human-labeled training preferences. Omni-RRM is trained in two stages: supervised fine-tuning to learn the rubric-grounded outputs, followed by reinforcement learning (GRPO) to sharpen discrimination on difficult, low-contrast pairs. Comprehensive evaluations show that Omni-RRM achieves state-of-the-art accuracy on video (80.2\% on ShareGPT-V) and audio (66.8\% on Audio-HH-RLHF) benchmarks, and substantially outperforms existing open-source RMs on image tasks, with a 17.7\% absolute gain over its base model on overall accuracy. Omni-RRM also improves downstream performance via Best-of-$N$ selection and transfers to text-only preference benchmarks. Our data, code, and models are available at https://anonymous.4open.science/r/Omni-RRM-CC08.
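The abstract describes the synthesis pipeline only at a high level; as a rough, hypothetical illustration, the rubric-grounded pair synthesis step might look like the following sketch (all callables, rubric dimensions, and the consistency filter below are placeholder assumptions, not the paper's actual implementation):

```python
# Hypothetical sketch of the rubric-grounded preference synthesis loop.
# `strong_model`, `weak_model`, and `teacher` are placeholder callables;
# the real models, rubric contents, and filtering rules are the paper's.

RUBRICS = {
    "image": ["faithfulness to the image", "instruction following", "clarity"],
    "video": ["temporal grounding", "instruction following", "clarity"],
    "audio": ["acoustic grounding", "instruction following", "clarity"],
}

def synthesize_pair(prompt, modality, strong_model, weak_model, teacher, n_votes=3):
    """Build one candidate pair with a rubric-grounded label, or None if filtered."""
    cand_a = strong_model(prompt)   # contrast models of different capabilities
    cand_b = weak_model(prompt)
    rubric = RUBRICS[modality]      # modality-aware rubric
    votes = [teacher(prompt, cand_a, cand_b, rubric) for _ in range(n_votes)]
    winners = {v["winner"] for v in votes}
    if len(winners) > 1:            # reconcile: drop pairs the teacher is unsure about
        return None
    winner = winners.pop()
    return {
        "prompt": prompt,
        "chosen": cand_a if winner == "A" else cand_b,
        "rejected": cand_b if winner == "A" else cand_a,
        "rationale": votes[0]["justifications"],  # dimension-wise justification
    }
```

Pairs that survive such a filter would then supply the SFT stage, with GRPO applied afterward to sharpen judgments on the remaining low-contrast pairs.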
Related papers
- SoliReward: Mitigating Susceptibility to Reward Hacking and Annotation Noise in Video Generation Reward Models [53.19726629537694]
Post-training alignment of video generation models with human preferences is a critical goal. Current data collection paradigms, reliant on in-prompt pairwise annotations, suffer from labeling noise. We propose SoliReward, a systematic framework for video RM training.
arXiv Detail & Related papers (2025-12-17T14:28:23Z)
- SparseRM: A Lightweight Preference Modeling with Sparse Autoencoder [54.31950189922548]
Reward models (RMs) are proxies for evaluating human preferences and guiding model alignment. We propose SparseRM, which leverages a Sparse Autoencoder (SAE) to extract preference-relevant information encoded in model representations. SparseRM achieves superior performance over most mainstream RMs while using less than 1% of trainable parameters.
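As a hedged illustration of the SAE idea (the layer shapes and ReLU sparsity below are assumptions; SparseRM's actual architecture and training objective are in the paper):

```python
import torch
import torch.nn as nn

class SparseRewardHead(nn.Module):
    """Illustrative SAE-style reward head over a frozen backbone's hidden
    states: only the sparse encoder and the scoring vector are trained,
    which keeps trainable parameters to a small fraction of the model."""

    def __init__(self, d_model: int, n_features: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)  # dense state -> sparse features
        self.scorer = nn.Linear(n_features, 1)         # features -> scalar reward

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        feats = torch.relu(self.encoder(hidden))       # ReLU sparsity, as in common SAEs
        return self.scorer(feats).squeeze(-1)
```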
arXiv Detail & Related papers (2025-11-11T06:51:56Z)
- Omni-Reward: Towards Generalist Omni-Modal Reward Modeling with Free-Form Preferences [38.99630864553283]
We propose Omni-Reward, a step toward generalist omni-modal reward modeling with support for free-form preferences. We construct a multimodal preference dataset comprising 248K general preference pairs and 69K instruction-tuning pairs for training generalist omni-modal RMs. We propose Omni-RewardModel, which includes both discriminative and generative RMs, and achieves strong performance on Omni-RewardBench as well as other widely used reward modeling benchmarks.
arXiv Detail & Related papers (2025-10-27T15:53:20Z)
- UniMRSeg: Unified Modality-Relax Segmentation via Hierarchical Self-Supervised Compensation [104.59740403500132]
Multi-modal image segmentation faces real-world deployment challenges from incomplete or corrupted modalities that degrade performance. We propose a unified modality-relax segmentation network (UniMRSeg) built on hierarchical self-supervised compensation (HSSC). Our approach hierarchically bridges representation gaps between complete and incomplete modalities across input, feature, and output levels.
arXiv Detail & Related papers (2025-09-19T17:29:25Z)
- Omni-DPO: A Dual-Perspective Paradigm for Dynamic Preference Learning of LLMs [28.41899655478021]
We propose Omni-DPO, a dual-perspective optimization framework that accounts for the inherent quality of each preference pair and the model's evolving performance on those pairs. Experimental results on various models and benchmarks demonstrate the superiority and generalization capabilities of Omni-DPO.
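A minimal sketch of what per-pair reweighting of the DPO objective could look like, assuming precomputed policy and reference log-probabilities (`quality_w` is a stand-in for Omni-DPO's dual-perspective weight, whose exact form is the paper's contribution):

```python
import torch.nn.functional as F

def weighted_dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected,
                      quality_w, beta=0.1):
    """Vanilla DPO loss with a per-pair weight. `quality_w` stands in for a
    dual-perspective term (pair quality and current model difficulty); its
    exact form here is an assumption, not the paper's derivation."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    per_pair = -F.logsigmoid(margin)       # standard DPO objective per pair
    return (quality_w * per_pair).mean()   # reweight before averaging
```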
arXiv Detail & Related papers (2025-06-11T17:58:05Z)
- MiCRo: Mixture Modeling and Context-aware Routing for Personalized Preference Learning [28.478879569025583]
We introduce MiCRo, a two-stage framework that enhances personalized preference learning by leveraging large-scale binary preference datasets. In the first stage, MiCRo introduces a context-aware mixture modeling approach to capture diverse human preferences. In the second stage, MiCRo integrates an online routing strategy that dynamically adapts mixture weights based on the specific context to resolve ambiguity.
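A toy sketch of the mixture-plus-routing idea, assuming pooled response and context embeddings as inputs (MiCRo's actual parameterization may differ):

```python
import torch
import torch.nn as nn

class MixtureRewardModel(nn.Module):
    """Toy two-stage setup: K linear reward heads capture distinct preference
    modes; a context router mixes them at inference time. Head count and
    router inputs are illustrative, not MiCRo's actual design."""

    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.heads = nn.Linear(d_model, n_heads)   # stage 1: mixture of reward heads
        self.router = nn.Linear(d_model, n_heads)  # stage 2: context-aware weights

    def forward(self, resp_emb: torch.Tensor, ctx_emb: torch.Tensor) -> torch.Tensor:
        rewards = self.heads(resp_emb)                         # (batch, n_heads)
        weights = torch.softmax(self.router(ctx_emb), dim=-1)  # (batch, n_heads)
        return (weights * rewards).sum(dim=-1)                 # personalized scalar reward
```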
arXiv Detail & Related papers (2025-05-30T17:44:28Z)
- Two Minds Better Than One: Collaborative Reward Modeling for LLM Alignment [35.80989342492335]
Noisy preferences in human feedback can lead to reward misgeneralization. This paper aims to identify how noisy preferences differ from human-aligned preferences in reward modeling. We propose an online Collaborative Reward Modeling framework to achieve robust preference learning.
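One hypothetical reading of the collaborative idea: two independently trained RMs cross-check each pair and drop those on which they disagree. A static version of such a filter, with `rm_a` and `rm_b` as assumed scoring callables:

```python
def peer_filtered_batch(batch, rm_a, rm_b, margin=0.0):
    """Keep a preference pair only when two independently trained RMs agree
    on the winner, treating disagreement as likely annotation noise. The
    paper's online strategy is richer than this static filter."""
    kept = []
    for ex in batch:
        d_a = rm_a(ex["prompt"], ex["chosen"]) - rm_a(ex["prompt"], ex["rejected"])
        d_b = rm_b(ex["prompt"], ex["chosen"]) - rm_b(ex["prompt"], ex["rejected"])
        if d_a > margin and d_b > margin:  # both RMs prefer the labeled winner
            kept.append(ex)
    return kept
```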
arXiv Detail & Related papers (2025-05-15T10:58:20Z)
- SHAPE: Self-Improved Visual Preference Alignment by Iteratively Generating Holistic Winner [35.843587407696006]
Large Visual Language Models (LVLMs) increasingly rely on preference alignment to ensure reliability. We present SHAPE, a self-supervised framework capable of transforming the already abundant supervised text-image pairs into holistic preference triplets.
arXiv Detail & Related papers (2025-03-06T08:33:11Z)
- MM-RLHF: The Next Step Forward in Multimodal LLM Alignment [59.536850459059856]
We introduce MM-RLHF, a dataset containing 120k fine-grained, human-annotated preference comparison pairs. We propose several key innovations to improve the quality of reward models and the efficiency of alignment algorithms. Our approach is rigorously evaluated across 10 distinct dimensions and 27 benchmarks.
arXiv Detail & Related papers (2025-02-14T18:59:51Z)
- EACO: Enhancing Alignment in Multimodal LLMs via Critical Observation [58.546205554954454]
We propose Enhancing Alignment in MLLMs via Critical Observation (EACO). EACO aligns MLLMs with self-generated preference data using only 5k images, at low cost. EACO reduces overall hallucinations by 65.6% on HallusionBench and improves reasoning ability by 21.8% on MME-Cognition.
arXiv Detail & Related papers (2024-12-06T09:59:47Z)
- OmnixR: Evaluating Omni-modality Language Models on Reasoning across Modalities [124.05360767047539]
We introduce OmnixR, an evaluation suite designed to benchmark SoTA Omni-modality Language Models.
Evaluating OLMs, which integrate multiple modalities such as text, vision, and audio, presents unique challenges.
Our experiments find that all state-of-the-art OLMs struggle with OmnixR questions that require integrating information from multiple modalities to answer.
arXiv Detail & Related papers (2024-10-16T04:29:46Z)
- OmniBench: Towards The Future of Universal Omni-Language Models [63.16606414452612]
We introduce OmniBench, a novel benchmark designed to evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. Our evaluation reveals that open-source OLMs show significant limitations in instruction-following and reasoning in tri-modal contexts. We advocate for developing more robust tri-modal integration techniques and training strategies to enhance OLM performance.
arXiv Detail & Related papers (2024-09-23T17:59:05Z)