Omni-Reward: Towards Generalist Omni-Modal Reward Modeling with Free-Form Preferences
- URL: http://arxiv.org/abs/2510.23451v1
- Date: Mon, 27 Oct 2025 15:53:20 GMT
- Title: Omni-Reward: Towards Generalist Omni-Modal Reward Modeling with Free-Form Preferences
- Authors: Zhuoran Jin, Hongbang Yuan, Kejian Zhu, Jiachun Li, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao,
- Abstract summary: We propose Omni-Reward, a step toward generalist omni-modal reward modeling with support for free-form preferences. We construct a multimodal preference dataset comprising 248K general preference pairs and 69K instruction-tuning pairs for training generalist omni-modal RMs. We propose Omni-RewardModel, which includes both discriminative and generative RMs, and achieves strong performance on Omni-RewardBench as well as other widely used reward modeling benchmarks.
- Score: 38.99630864553283
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reward models (RMs) play a critical role in aligning AI behaviors with human preferences, yet they face two fundamental challenges: (1) Modality Imbalance, where most RMs are mainly focused on text and image modalities, offering limited support for video, audio, and other modalities; and (2) Preference Rigidity, where training on fixed binary preference pairs fails to capture the complexity and diversity of personalized preferences. To address the above challenges, we propose Omni-Reward, a step toward generalist omni-modal reward modeling with support for free-form preferences, consisting of: (1) Evaluation: We introduce Omni-RewardBench, the first omni-modal RM benchmark with free-form preferences, covering nine tasks across five modalities including text, image, video, audio, and 3D; (2) Data: We construct Omni-RewardData, a multimodal preference dataset comprising 248K general preference pairs and 69K instruction-tuning pairs for training generalist omni-modal RMs; (3) Model: We propose Omni-RewardModel, which includes both discriminative and generative RMs, and achieves strong performance on Omni-RewardBench as well as other widely used reward modeling benchmarks.
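The "free-form preferences" setting means each comparison is conditioned on an explicit, user-stated criterion rather than a fixed binary label. As a rough illustration only (the schema, field names, and prompt template below are hypothetical, not taken from Omni-RewardData), a single preference record for a generative omni-modal RM could look like this:

```python
from dataclasses import dataclass
from typing import Literal

# Hypothetical schema for a free-form preference pair; the field names are
# illustrative and not drawn from the Omni-Reward paper.
@dataclass
class PreferencePair:
    modality: Literal["text", "image", "video", "audio", "3d"]
    prompt: str       # the user instruction or query
    criterion: str    # free-form preference, e.g. "prefer concise answers"
    chosen: str       # preferred response (or a path to a media asset)
    rejected: str     # dispreferred response (or a path to a media asset)

def build_judge_prompt(pair: PreferencePair) -> str:
    """Format a pairwise judgment prompt for a generative reward model."""
    return (
        f"Modality: {pair.modality}\n"
        f"Task: {pair.prompt}\n"
        f"Preference criterion: {pair.criterion}\n"
        f"Response A: {pair.chosen}\n"
        f"Response B: {pair.rejected}\n"
        "Which response better satisfies the criterion? Answer 'A' or 'B'."
    )

example = PreferencePair(
    modality="image",
    prompt="Generate a logo for a coffee shop.",
    criterion="Prefer minimalist designs with at most two colors.",
    chosen="candidate_1.png",
    rejected="candidate_2.png",
)
print(build_judge_prompt(example))
```

In practice the order of the two candidates would be randomized before judging to reduce position bias, and for non-text modalities the chosen/rejected fields would reference media files rather than inline strings.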
Related papers
- OmniRet: Efficient and High-Fidelity Omni Modality Retrieval [51.80205678389465]
We present OmniRet, the first retrieval model capable of handling complex, composed queries spanning three key modalities: text, vision, and audio. Our model demonstrates significant improvements on composed-query, audio, and video retrieval tasks, while achieving on-par performance with state-of-the-art models on others.
arXiv Detail & Related papers (2026-03-02T17:19:55Z) - OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention [31.594799790151345]
We propose OmniVideo-R1, a novel reinforced framework that improves mixed-modality reasoning. Experiments on multiple benchmarks demonstrate that OmniVideo-R1 consistently outperforms strong baselines.
arXiv Detail & Related papers (2026-02-05T16:35:19Z) - Omni-RRM: Advancing Omni Reward Modeling via Automatic Rubric-Grounded Preference Synthesis [22.55861092515539]
A critical bottleneck remains the lack of effective reward models (RMs). We introduce Omni-RRM, the first open-source rubric-grounded reward model. It produces structured, multi-dimensional preference judgments with dimension-wise justifications across text, image, video, and audio.
arXiv Detail & Related papers (2026-01-31T18:20:45Z) - UNO-Bench: A Unified Benchmark for Exploring the Compositional Law Between Uni-modal and Omni-modal in OmniModels [12.233067923710635]
Multimodal Large Language Models have been progressing from uni-modal understanding toward unifying visual, audio, and language modalities, collectively termed omni models. We propose a novel, high-quality, and UNified Omni model benchmark, UNO-Bench, which effectively assesses both UNi-modal and Omni-modal capabilities. The benchmark consists of curated human samples with 98% cross-modality solvability, across 44 task types, and an innovative multi-step open-ended question type for assessing complex reasoning.
arXiv Detail & Related papers (2025-10-21T06:14:40Z) - Omni-DPO: A Dual-Perspective Paradigm for Dynamic Preference Learning of LLMs [28.41899655478021]
We propose Omni-DPO, a dual-perspective optimization framework that accounts for the inherent quality of each preference pair and the model's evolving performance on those pairs (a generic sketch of such pair weighting is given after this list). Experimental results on various models and benchmarks demonstrate the superiority and generalization capabilities of Omni-DPO.
arXiv Detail & Related papers (2025-06-11T17:58:05Z) - RoboEgo System Card: An Omnimodal Model with Native Full Duplexity [48.52383812141669]
RoboEgo (alias: FLM-Ego) is a unified model system designed to address both challenges. FLM-Ego incorporates a backbone and algorithms that natively support full duplexity, achieving a theoretical duplex latency of 80 ms.
arXiv Detail & Related papers (2025-06-02T17:53:10Z) - Ola: Pushing the Frontiers of Omni-Modal Language Model [88.72389428177942]
We present Ola, an omni-modal language model that achieves competitive performance across image, video, and audio understanding. Ola incorporates advanced visual understanding and audio recognition capabilities through several critical and effective improvements. We aim to make Ola a fully open omni-modal understanding solution to advance future research in this emerging field.
arXiv Detail & Related papers (2025-02-06T18:59:55Z) - Baichuan-Omni-1.5 Technical Report [78.49101296394218]
Baichuan-Omni-1.5 is an omni-modal model that not only has omni-modal understanding capabilities but also provides end-to-end audio generation capabilities. First, we establish a comprehensive data cleaning and synthesis pipeline for multimodal data, obtaining about 500B high-quality data. Second, an audio tokenizer has been designed to capture both semantic and acoustic information from audio, enabling seamless integration and enhanced compatibility with the MLLM.
arXiv Detail & Related papers (2025-01-26T02:19:03Z) - OmnixR: Evaluating Omni-modality Language Models on Reasoning across Modalities [124.05360767047539]
We introduce OmnixR, an evaluation suite designed to benchmark SoTA Omni-modality Language Models.
Evaluating OLMs, which integrate multiple modalities such as text, vision, and audio, presents unique challenges.
Our experiments find that all state-of-the-art OLMs struggle with OmnixR questions that require integrating information from multiple modalities to answer.
arXiv Detail & Related papers (2024-10-16T04:29:46Z) - OmniBench: Towards The Future of Universal Omni-Language Models [63.16606414452612]
We introduce OmniBench, a novel benchmark designed to evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. Our evaluation reveals that open-source OLMs show significant limitations in instruction-following and reasoning in tri-modal contexts. We advocate for developing more robust tri-modal integration techniques and training strategies to enhance OLM performance.
arXiv Detail & Related papers (2024-09-23T17:59:05Z)
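Returning to the Omni-DPO entry above: its abstract describes weighting each preference pair by the pair's inherent quality and by the model's evolving performance on that pair, but gives no formula. The sketch below is a generic weighted DPO-style loss written to match that description; the quality signal, the difficulty factor, and all names are assumptions rather than the paper's actual method.

```python
import torch
import torch.nn.functional as F

def weighted_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                      ref_chosen_logps, ref_rejected_logps,
                      pair_quality, beta=0.1):
    """Sketch of a DPO-style loss with per-pair weighting.

    `pair_quality` in [0, 1] stands in for the "inherent quality" signal;
    the difficulty factor down-weights pairs the policy already separates
    well, a stand-in for "the model's evolving performance on those pairs".
    The exact Omni-DPO formulation is not specified in the abstract.
    """
    # Standard DPO margin: implicit reward gap relative to the reference model.
    margin = beta * ((policy_chosen_logps - policy_rejected_logps)
                     - (ref_chosen_logps - ref_rejected_logps))
    # Pairs the model already gets right contribute less gradient.
    difficulty = 1.0 - torch.sigmoid(margin).detach()
    weight = pair_quality * difficulty
    return (weight * (-F.logsigmoid(margin))).mean()

# Toy usage with random log-probabilities for a batch of 4 pairs.
lp = lambda: torch.randn(4)
loss = weighted_dpo_loss(lp(), lp(), lp(), lp(),
                         pair_quality=torch.tensor([1.0, 0.8, 0.5, 0.9]))
print(loss.item())
```

The detach() on the difficulty factor keeps the reweighting from feeding gradients back through the margin, a common choice for focal-style loss weighting; whether Omni-DPO does this is not stated in the abstract.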