OmniAgent: Audio-Guided Active Perception Agent for Omnimodal Audio-Video Understanding
- URL: http://arxiv.org/abs/2512.23646v1
- Date: Mon, 29 Dec 2025 17:59:05 GMT
- Title: OmniAgent: Audio-Guided Active Perception Agent for Omnimodal Audio-Video Understanding
- Authors: Keda Tao, Wenjie Du, Bohan Yu, Weiqiang Wang, Jian Liu, Huan Wang
- Abstract summary: We introduce OmniAgent, a fully audio-guided active perception agent. This paper demonstrates a paradigm shift from passive response generation to active multimodal inquiry.
- Score: 23.176694412214157
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Omnimodal large language models have made significant strides in unifying audio and visual modalities; however, they often lack fine-grained cross-modal understanding and struggle with multimodal alignment. To address these limitations, we introduce OmniAgent, a fully audio-guided active perception agent that dynamically orchestrates specialized tools to achieve more fine-grained audio-visual reasoning. Unlike previous works that rely on rigid, static workflows and dense frame captioning, this paper demonstrates a paradigm shift from passive response generation to active multimodal inquiry. OmniAgent employs dynamic planning to autonomously orchestrate tool invocation on demand, strategically concentrating perceptual attention on task-relevant cues. Central to our approach is a novel coarse-to-fine audio-guided perception paradigm, which leverages audio cues to localize temporal events and guide subsequent reasoning. Extensive empirical evaluations on three audio-video understanding benchmarks demonstrate that OmniAgent achieves state-of-the-art performance, surpassing leading open-source and proprietary models by substantial margins of 10%-20% in accuracy.
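To make the coarse-to-fine audio-guided perception paradigm more concrete, below is a minimal Python sketch of how such a loop could be structured, assuming audio events are localized first and frames are sampled only within the localized windows before reasoning. All class, function, and tool names here (AudioEvent, detect_audio_events, sample_frames, reason_over_evidence) are illustrative placeholders, not OmniAgent's actual tools or API.

```python
from dataclasses import dataclass


@dataclass
class AudioEvent:
    start: float  # seconds
    end: float    # seconds
    label: str    # coarse audio tag, e.g. "speech" or "door slam"


def detect_audio_events(audio):
    # Coarse stage: an audio-tagging/localization tool would run here.
    # Stub: pretend the clip contains a single speech segment.
    return [AudioEvent(start=3.0, end=8.5, label="speech")]


def sample_frames(video, start, end, fps=1.0):
    # Fine stage: a frame-extraction tool would pull frames from the window.
    # Stub: return the timestamps that would be sampled.
    timestamps, t = [], start
    while t <= end:
        timestamps.append(round(t, 2))
        t += 1.0 / fps
    return timestamps


def reason_over_evidence(question, evidence):
    # In the paper, an LLM planner would reason over the gathered cues;
    # here we simply report what was collected.
    spans = "; ".join(
        f"{e.label} [{e.start:.1f}s-{e.end:.1f}s], {len(frames)} frames"
        for e, frames in evidence
    )
    return f"Q: {question} | evidence: {spans}"


def answer_question(question, audio, video):
    events = detect_audio_events(audio)                     # 1. audio localizes candidate events
    evidence = [(e, sample_frames(video, e.start, e.end))   # 2. inspect only those windows
                for e in events]
    return reason_over_evidence(question, evidence)          # 3. reason over the gathered cues


if __name__ == "__main__":
    print(answer_question("Who is speaking, and what are they doing?", audio=None, video=None))
```

In the full agent, a dynamic planner would also decide which tools to invoke and could iterate this loop, but even this stubbed version captures the coarse (audio localization) to fine (targeted frame inspection) ordering described in the abstract.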
Related papers
- OmniRet: Efficient and High-Fidelity Omni Modality Retrieval [51.80205678389465]
We present OmniRet, the first retrieval model capable of handling complex, composed queries spanning three key modalities: text, vision, and audio. Our model demonstrates significant improvements on composed-query, audio, and video retrieval tasks, while achieving on-par performance with state-of-the-art models on others.
arXiv Detail & Related papers (2026-03-02T17:19:55Z) - OmniGAIA: Towards Native Omni-Modal AI Agents [103.79729735478924]
We introduce a benchmark designed to evaluate omni-modal agents on tasks requiring deep reasoning and multi-turn tool execution. We propose OmniAtlas, a native omni-modal foundation agent under a tool-integrated reasoning paradigm with active omni-modal perception.
arXiv Detail & Related papers (2026-02-26T11:35:04Z) - ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding [32.72568710955575]
We present ROMA, a real-time omni-multimodal assistant for unified reactive and proactive interaction. ROMA processes continuous inputs as synchronized multimodal units, aligning dense audio with discrete video frames to handle granularity mismatches. For online decision-making, we introduce a lightweight speak head that decouples response initiation from generation to ensure precise triggering.
arXiv Detail & Related papers (2026-01-15T12:09:04Z) - Apollo: Unified Multi-Task Audio-Video Joint Generation [15.004783109205666]
Non-commercial approaches still suffer from audio-visual asynchrony, poor lip-speech alignment, and unimodal degradation. We introduce Apollo and delve into three axes: model architecture, training strategy, and data curation. For datasets, we present the first large-scale audio-video dataset with dense captions.
arXiv Detail & Related papers (2026-01-07T18:03:45Z) - AMUSE: Audio-Visual Benchmark and Alignment Framework for Agentic Multi-Speaker Understanding [73.05946667683259]
Recent multimodal large language models (MLLMs) show strong perception but struggle in multi-speaker, dialogue-centric settings. We introduce AMUSE, a benchmark designed around tasks that are inherently agentic. We propose RAFT, a data-efficient agentic alignment framework that integrates reward optimization with intrinsic multimodal self-evaluation.
arXiv Detail & Related papers (2025-12-18T07:01:47Z) - Audio-Visual World Models: Towards Multisensory Imagination in Sight and Sound [5.591620304505415]
This work presents the first formal framework for Audio-Visual World Models (AVWM). It formulates multimodal environment simulation as a partially observable decision process with audio-visual observations, fine-grained actions, and task rewards. We propose an Audio-Visual Conditional Transformer with a novel modality-expert architecture that balances visual and auditory learning.
arXiv Detail & Related papers (2025-11-30T13:11:56Z) - Fork-Merge Decoding: Enhancing Multimodal Understanding in Audio-Visual Large Language Models [35.86252379746625]
The goal of this work is to enhance balanced multimodal understanding in audio-visual large language models (AV-LLMs). In current AV-LLMs, audio and video features are typically processed jointly in the decoder. We propose Fork-Merge Decoding (FMD), a simple yet effective inference-time strategy that requires no additional training or architectural modifications.
arXiv Detail & Related papers (2025-05-27T08:22:56Z) - AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection [57.649223695021114]
We present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection. Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy.
arXiv Detail & Related papers (2025-04-06T13:59:16Z) - Baichuan-Omni-1.5 Technical Report [78.49101296394218]
Baichuan-Omni-1.5 is an omni-modal model that not only has omni-modal understanding capabilities but also provides end-to-end audio generation capabilities. We establish a comprehensive data cleaning and synthesis pipeline for multimodal data, obtaining about 500B of high-quality data. An audio tokenizer has been designed to capture both semantic and acoustic information from audio, enabling seamless integration and enhanced compatibility with the MLLM.
arXiv Detail & Related papers (2025-01-26T02:19:03Z) - OmniBench: Towards The Future of Universal Omni-Language Models [63.16606414452612]
We introduce OmniBench, a novel benchmark designed to evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. Our evaluation reveals that open-source OLMs show significant limitations in instruction-following and reasoning in tri-modal contexts. We advocate for developing more robust tri-modal integration techniques and training strategies to enhance OLM performance.
arXiv Detail & Related papers (2024-09-23T17:59:05Z) - Efficient Multimodal Neural Networks for Trigger-less Voice Assistants [0.8209843760716959]
We propose a neural network based audio-gesture multimodal fusion system for smartwatches.
The system better understands temporal correlation between audio and gesture data, leading to precise invocations.
It is lightweight and deployable on low-power devices, such as smartwatches, with quick launch times.
arXiv Detail & Related papers (2023-05-20T02:52:02Z) - MAAS: Multi-modal Assignation for Active Speaker Detection [59.08836580733918]
We present a novel approach to active speaker detection that directly addresses the multi-modal nature of the problem.
Our experiments show that a small graph data structure built from a single frame allows us to approximate an instantaneous audio-visual assignment problem.
arXiv Detail & Related papers (2021-01-11T02:57:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality or accuracy of this information and is not responsible for any consequences of its use.