OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention
- URL: http://arxiv.org/abs/2602.05847v1
- Date: Thu, 05 Feb 2026 16:35:19 GMT
- Title: OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention
- Authors: Zhangquan Chen, Jiale Tao, Ruihuang Li, Yihao Hu, Ruitao Chen, Zhantao Yang, Xinlei Yu, Haodong Jing, Manyuan Zhang, Shuai Shao, Biao Wang, Qinglin Lu, Ruqi Huang
- Abstract summary: We propose OmniVideo-R1, a novel reinforced framework that improves mixed-modality reasoning. Experiments on multiple benchmarks demonstrate that OmniVideo-R1 consistently outperforms strong baselines.
- Score: 31.594799790151345
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While humans perceive the world through diverse modalities that operate synergistically to support a holistic understanding of their surroundings, existing omnivideo models still face substantial challenges on audio-visual understanding tasks. In this paper, we propose OmniVideo-R1, a novel reinforced framework that improves mixed-modality reasoning. OmniVideo-R1 empowers models to "think with omnimodal cues" by two key strategies: (1) query-intention grounding based on self-supervised learning paradigms; and (2) modality-attentive fusion built upon contrastive learning paradigms. Extensive experiments on multiple benchmarks demonstrate that OmniVideo-R1 consistently outperforms strong baselines, highlighting its effectiveness and robust generalization capabilities.
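The abstract names the two strategies only at a high level. As a loose illustration of what a contrastive, modality-attentive fusion objective could look like (every function and variable name below is hypothetical, not taken from the paper's code):

```python
# Hypothetical sketch of a modality-attentive contrastive fusion objective.
# Nothing here comes from OmniVideo-R1's released code; names and shapes
# are assumptions for illustration only.
import torch
import torch.nn.functional as F

def modality_attentive_fusion(audio_emb, video_emb, query_emb, temperature=0.07):
    """Fuse audio/video features weighted by their attention to the query,
    then score the fused clip embeddings against the query contrastively."""
    # (B, D) embeddings, L2-normalized so dot products are cosine similarities.
    audio_emb = F.normalize(audio_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    query_emb = F.normalize(query_emb, dim=-1)

    # Per-sample attention over the two modalities, driven by the query.
    logits = torch.stack([
        (query_emb * audio_emb).sum(-1),
        (query_emb * video_emb).sum(-1),
    ], dim=-1)                                  # (B, 2)
    weights = logits.softmax(dim=-1)            # (B, 2)
    fused = weights[:, :1] * audio_emb + weights[:, 1:] * video_emb

    # InfoNCE: each query should match its own fused clip within the batch.
    sim = query_emb @ fused.T / temperature     # (B, B)
    targets = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, targets)
```

The attention step is what makes the fusion "modality-attentive": when audio carries the answer, its weight dominates the fused representation, and the contrastive term rewards that choice.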
Related papers
- OmniGAIA: Towards Native Omni-Modal AI Agents [103.79729735478924]
We introduce a benchmark designed to evaluate omni-modal agents on tasks requiring deep reasoning and multi-turn tool execution. We propose OmniAtlas, a native omni-modal foundation agent under a tool-integrated reasoning paradigm with active omni-modal perception.
arXiv Detail & Related papers (2026-02-26T11:35:04Z)
- OmniAgent: Audio-Guided Active Perception Agent for Omnimodal Audio-Video Understanding [23.176694412214157]
We introduce OmniAgent, a fully audio-guided active perception agent. This paper demonstrates a paradigm shift from passive response generation to active multimodal inquiry.
arXiv Detail & Related papers (2025-12-29T17:59:05Z)
- UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation [61.98887854225878]
We introduce UnityVideo, a unified framework for world-aware video generation. Our approach features two core components: (1) dynamic noising to unify heterogeneous training paradigms, and (2) a modality switcher with an in-context learner. We demonstrate that UnityVideo achieves superior video quality, consistency, and improved alignment with physical world constraints.
arXiv Detail & Related papers (2025-12-08T18:59:01Z)
- ViSS-R1: Self-Supervised Reinforcement Video Reasoning [84.1180294023835]
We introduce a novel self-supervised reinforcement learning GRPO algorithm (Pretext-GRPO) within the standard R1 pipeline. We also propose the ViSS-R1 framework, which streamlines and integrates pretext-task-based self-supervised learning directly into the MLLM's R1 post-training paradigm.
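The group-relative advantage at the heart of GRPO is public (it originates in DeepSeekMath): each rollout's reward is z-scored against the other rollouts sampled for the same prompt, removing the need for a learned value critic. A minimal sketch follows; how ViSS-R1 computes its pretext-task rewards is not specified in this summary, so only the generic normalization is shown.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6):
    """GRPO-style advantages: z-score each sampled response's reward
    against the other responses drawn for the same prompt.
    rewards: (num_prompts, group_size) scalar rewards per rollout,
    with group_size > 1 so the std is well-defined."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)
```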
arXiv Detail & Related papers (2025-11-17T07:00:42Z)
- OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs [72.425061028374]
We introduce OmniVideoBench, a benchmark dedicated to assessing synergistic audio-visual understanding. OmniVideoBench comprises 1000 high-quality question-answer (QA) pairs, each annotated with step-by-step reasoning traces. We will release OmniVideoBench to foster the development of MLLMs with stronger and more generalizable reasoning capabilities.
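The summary describes the annotations but not the released schema; one plausible way to represent a reasoning-trace-annotated QA item (field names are guesses, not the benchmark's actual format):

```python
from dataclasses import dataclass, field

@dataclass
class OmniVideoBenchItem:
    # Field names are guesses from the summary, not the released schema.
    video_id: str
    question: str
    options: list[str]
    answer: str
    reasoning_trace: list[str] = field(default_factory=list)  # step-by-step rationale
```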
arXiv Detail & Related papers (2025-10-12T16:34:00Z)
- OmniDPO: A Preference Optimization Framework to Address Omni-Modal Hallucination [32.43796002503023]
We propose OmniDPO, a preference-alignment framework designed to mitigate hallucinations in omni-modal large language models (OLLMs). By tackling both challenges, OmniDPO effectively improves multimodal grounding and reduces hallucination.
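OmniDPO builds on direct preference optimization; the standard DPO loss it presumably starts from is well known, though the omni-modal adaptations are not described in this summary. A minimal sketch of that baseline loss:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO objective over preferred/rejected response pairs.
    Inputs are per-example sequence log-probs under the policy and a
    frozen reference model; OmniDPO's omni-modal extensions are not
    shown here."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```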
arXiv Detail & Related papers (2025-08-31T07:19:32Z)
- Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration [50.38965090742822]
Long-horizon video-audio reasoning and fine-grained pixel understanding impose conflicting requirements on omnimodal models. We tackle this trade-off with a two-system architecture: a Global Reasoning System selects informative keyframes and rewrites the task at low spatial cost, while a Detail Understanding System performs pixel-level grounding. Because optimal keyframe selection and task reformulation are ambiguous and hard to supervise, we formulate them as a reinforcement learning (RL) problem and present Omni-R1, an end-to-end RL framework built on Group Relative Policy Optimization (GRPO).
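As a rough picture of how the two systems described above might be orchestrated (the interfaces below are invented for illustration and are not the paper's actual API):

```python
# Hypothetical orchestration of the two-system split described above.
# `global_system` and `detail_system` stand in for the paper's models;
# their method names and signatures are assumptions.
def answer_query(video_frames, audio, query, global_system, detail_system):
    # System 1: a cheap low-resolution pass over the whole clip picks the
    # few frames worth inspecting and restates the task for System 2.
    keyframe_ids, rewritten_task = global_system.plan(video_frames, audio, query)

    # System 2: expensive pixel-level grounding, but only on selected frames.
    selected = [video_frames[i] for i in keyframe_ids]
    return detail_system.ground(selected, rewritten_task)
```

The design point is that the expensive pixel-level model never sees the full clip, so long-horizon coverage and fine-grained resolution no longer compete for the same budget.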
arXiv Detail & Related papers (2025-05-26T17:34:06Z)
- EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning [108.73513190593232]
Multimodal large language models (MLLMs) have advanced perception across text, vision, and audio, yet struggle with structured cross-modal reasoning. We introduce EchoInk-R1, a reinforcement learning framework that enhances such reasoning in MLLMs.
arXiv Detail & Related papers (2025-05-07T17:59:49Z)
- OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts [46.77966058862399]
We introduce OmniMMI, a comprehensive multi-modal interaction benchmark tailored for OmniLLMs in streaming video contexts. We propose a novel framework, Multi-modal Multiplexing Modeling (M4), designed to enable an inference-efficient streaming model that can see and listen while generating.
arXiv Detail & Related papers (2025-03-29T02:46:58Z)
- video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model [33.70837005629285]
We propose video-SALMONN-o1, the first open-source reasoning-enhanced audio-visual LLM designed for general video understanding tasks. We develop a reasoning-intensive dataset featuring challenging audio-visual questions with step-by-step solutions. We also introduce RivaBench, the first reasoning-intensive video understanding benchmark, featuring over 4,000 high-quality, expert-curated question-answer pairs.
arXiv Detail & Related papers (2025-02-17T13:07:40Z)
- OmnixR: Evaluating Omni-modality Language Models on Reasoning across Modalities [124.05360767047539]
We introduce OmnixR, an evaluation suite designed to benchmark state-of-the-art omni-modality language models (OLMs). Evaluating OLMs, which integrate multiple modalities such as text, vision, and audio, presents unique challenges. Our experiments find that all state-of-the-art OLMs struggle with OmnixR questions that require integrating information from multiple modalities to answer.
arXiv Detail & Related papers (2024-10-16T04:29:46Z)