OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention
- URL: http://arxiv.org/abs/2602.05847v1
- Date: Thu, 05 Feb 2026 16:35:19 GMT
- Title: OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention
- Authors: Zhangquan Chen, Jiale Tao, Ruihuang Li, Yihao Hu, Ruitao Chen, Zhantao Yang, Xinlei Yu, Haodong Jing, Manyuan Zhang, Shuai Shao, Biao Wang, Qinglin Lu, Ruqi Huang
- Abstract summary: We propose OmniVideo-R1, a novel reinforced framework that improves mixed-modality reasoning. Experiments on multiple benchmarks demonstrate that OmniVideo-R1 consistently outperforms strong baselines.
- Score: 31.594799790151345
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While humans perceive the world through diverse modalities that operate synergistically to support a holistic understanding of their surroundings, existing omnivideo models still face substantial challenges on audio-visual understanding tasks. In this paper, we propose OmniVideo-R1, a novel reinforced framework that improves mixed-modality reasoning. OmniVideo-R1 empowers models to "think with omnimodal cues" by two key strategies: (1) query-intention grounding based on self-supervised learning paradigms; and (2) modality-attentive fusion built upon contrastive learning paradigms. Extensive experiments on multiple benchmarks demonstrate that OmniVideo-R1 consistently outperforms strong baselines, highlighting its effectiveness and robust generalization capabilities.
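The abstract names the two strategies only at a high level. As a loose illustration of what a contrastive, modality-attentive fusion objective could look like (every function and variable name below is hypothetical, not taken from the paper's code):

```python
# Hypothetical sketch of a modality-attentive contrastive fusion objective.
# Nothing here comes from OmniVideo-R1's released code; names and shapes
# are assumptions for illustration only.
import torch
import torch.nn.functional as F

def modality_attentive_fusion(audio_emb, video_emb, query_emb, temperature=0.07):
    """Fuse audio/video features weighted by their attention to the query,
    then score the fused clip embeddings against the query contrastively."""
    # (B, D) embeddings, L2-normalized so dot products are cosine similarities.
    audio_emb = F.normalize(audio_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    query_emb = F.normalize(query_emb, dim=-1)

    # Per-sample attention over the two modalities, driven by the query.
    logits = torch.stack([
        (query_emb * audio_emb).sum(-1),
        (query_emb * video_emb).sum(-1),
    ], dim=-1)                                  # (B, 2)
    weights = logits.softmax(dim=-1)            # (B, 2)
    fused = weights[:, :1] * audio_emb + weights[:, 1:] * video_emb

    # InfoNCE: each query should match its own fused clip within the batch.
    sim = query_emb @ fused.T / temperature     # (B, B)
    targets = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, targets)
```

The attention step is what makes the fusion "modality-attentive": when audio carries the answer, its weight dominates the fused representation, and the contrastive term rewards that choice.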
Related papers
- OmniGAIA: Towards Native Omni-Modal AI Agents [103.79729735478924]
We introduce a benchmark designed to evaluate omni-modal agents on tasks requiring deep reasoning and multi-turn tool execution. We propose OmniAtlas, a native omni-modal foundation agent under a tool-integrated reasoning paradigm with active omni-modal perception.
arXiv Detail & Related papers (2026-02-26T11:35:04Z)
- OmniAgent: Audio-Guided Active Perception Agent for Omnimodal Audio-Video Understanding [23.176694412214157]
We introduce OmniAgent, a fully audio-guided active perception agent. This paper demonstrates a paradigm shift from passive response generation to active multimodal inquiry.
arXiv Detail & Related papers (2025-12-29T17:59:05Z)
- UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation [61.98887854225878]
We introduce UnityVideo, a unified framework for world-aware video generation. Our approach features two core components: (1) dynamic noising to unify heterogeneous training paradigms, and (2) a modality switcher with an in-context learner. We demonstrate that UnityVideo achieves superior video quality, consistency, and improved alignment with physical world constraints.
arXiv Detail & Related papers (2025-12-08T18:59:01Z)
- ViSS-R1: Self-Supervised Reinforcement Video Reasoning [84.1180294023835]
We introduce a novel self-supervised reinforcement learning GRPO algorithm (Pretext-GRPO) within the standard R1 pipeline. We also propose the ViSS-R1 framework, which streamlines and integrates pretext-task-based self-supervised learning directly into the MLLM's R1 post-training paradigm.
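The group-relative advantage at the heart of GRPO is public (it originates in DeepSeekMath): each rollout's reward is z-scored against the other rollouts sampled for the same prompt, removing the need for a learned value critic. A minimal sketch follows; how ViSS-R1 computes its pretext-task rewards is not specified in this summary, so only the generic normalization is shown.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6):
    """GRPO-style advantages: z-score each sampled response's reward
    against the other responses drawn for the same prompt.
    rewards: (num_prompts, group_size) scalar rewards per rollout,
    with group_size > 1 so the std is well-defined."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)
```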
arXiv Detail & Related papers (2025-11-17T07:00:42Z)
- OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs [72.425061028374]
We introduce OmniVideoBench, a benchmark dedicated to assessing synergistic audio-visual understanding. OmniVideoBench comprises 1000 high-quality question-answer (QA) pairs, each annotated with step-by-step reasoning traces. We will release OmniVideoBench to foster the development of MLLMs with stronger and more generalizable reasoning capabilities.
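The summary describes the annotations but not the released schema; one plausible way to represent a reasoning-trace-annotated QA item (field names are guesses, not the benchmark's actual format):

```python
from dataclasses import dataclass, field

@dataclass
class OmniVideoBenchItem:
    # Field names are guesses from the summary, not the released schema.
    video_id: str
    question: str
    options: list[str]
    answer: str
    reasoning_trace: list[str] = field(default_factory=list)  # step-by-step rationale
```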
arXiv Detail & Related papers (2025-10-12T16:34:00Z)
- OmniDPO: A Preference Optimization Framework to Address Omni-Modal Hallucination [32.43796002503023]
We propose OmniDPO, a preference-alignment framework designed to mitigate hallucinations in omni-modal large language models (OLLMs). By tackling both challenges, OmniDPO effectively improves multimodal grounding and reduces hallucination.
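OmniDPO builds on direct preference optimization; the standard DPO loss it presumably starts from is well known, though the omni-modal adaptations are not described in this summary. A minimal sketch of that baseline loss:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO objective over preferred/rejected response pairs.
    Inputs are per-example sequence log-probs under the policy and a
    frozen reference model; OmniDPO's omni-modal extensions are not
    shown here."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```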
arXiv Detail & Related papers (2025-08-31T07:19:32Z)
- Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration [50.38965090742822]
Long-horizon video-audio reasoning and fine-grained pixel understanding impose conflicting requirements on omnimodal models. We tackle this trade-off with a two-system architecture: a Global Reasoning System selects informative keyframes and rewrites the task at low spatial cost, while a Detail Understanding System performs pixel-level grounding. Because optimal keyframe selection and task reformulation are ambiguous and hard to supervise, we formulate them as a reinforcement learning (RL) problem and present Omni-R1, an end-to-end RL framework built on Group Relative Policy Optimization (GRPO).
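As a rough picture of how the two systems described above might be orchestrated (the interfaces below are invented for illustration and are not the paper's actual API):

```python
# Hypothetical orchestration of the two-system split described above.
# `global_system` and `detail_system` stand in for the paper's models;
# their method names and signatures are assumptions.
def answer_query(video_frames, audio, query, global_system, detail_system):
    # System 1: a cheap low-resolution pass over the whole clip picks the
    # few frames worth inspecting and restates the task for System 2.
    keyframe_ids, rewritten_task = global_system.plan(video_frames, audio, query)

    # System 2: expensive pixel-level grounding, but only on selected frames.
    selected = [video_frames[i] for i in keyframe_ids]
    return detail_system.ground(selected, rewritten_task)
```

The design point is that the expensive pixel-level model never sees the full clip, so long-horizon coverage and fine-grained resolution no longer compete for the same budget.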
arXiv Detail & Related papers (2025-05-26T17:34:06Z)
- EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning [108.73513190593232]
Multimodal large language models (MLLMs) have advanced perception across text, vision, and audio, yet struggle with structured cross-modal reasoning. We introduce EchoInk-R1, a reinforcement learning framework that enhances such reasoning in MLLMs.
arXiv Detail & Related papers (2025-05-07T17:59:49Z)
- OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts [46.77966058862399]
We introduce OmniMMI, a comprehensive multi-modal interaction benchmark tailored for OmniLLMs in streaming video contexts. We propose a novel framework, Multi-modal Multiplexing Modeling (M4), designed to enable an inference-efficient streaming model that can see and listen while generating.
arXiv Detail & Related papers (2025-03-29T02:46:58Z)
- video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model [33.70837005629285]
We propose video-SALMONN-o1, the first open-source reasoning-enhanced audio-visual LLM designed for general video understanding tasks. We develop a reasoning-intensive dataset featuring challenging audio-visual questions with step-by-step solutions. We also introduce RivaBench, the first reasoning-intensive video understanding benchmark, featuring over 4,000 high-quality, expert-curated question-answer pairs.
arXiv Detail & Related papers (2025-02-17T13:07:40Z)
- OmnixR: Evaluating Omni-modality Language Models on Reasoning across Modalities [124.05360767047539]
We introduce OmnixR, an evaluation suite designed to benchmark state-of-the-art omni-modality language models (OLMs). Evaluating OLMs, which integrate multiple modalities such as text, vision, and audio, presents unique challenges. Our experiments find that all state-of-the-art OLMs struggle with OmnixR questions that require integrating information from multiple modalities to answer.
arXiv Detail & Related papers (2024-10-16T04:29:46Z)