OmniAgent: Audio-Guided Active Perception Agent for Omnimodal Audio-Video Understanding
- URL: http://arxiv.org/abs/2512.23646v1
- Date: Mon, 29 Dec 2025 17:59:05 GMT
- Title: OmniAgent: Audio-Guided Active Perception Agent for Omnimodal Audio-Video Understanding
- Authors: Keda Tao, Wenjie Du, Bohan Yu, Weiqiang Wang, Jian Liu, Huan Wang
- Abstract summary: We introduce OmniAgent, a fully audio-guided active perception agent. This paper demonstrates a paradigm shift from passive response generation to active multimodal inquiry.
- Score: 23.176694412214157
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Omnimodal large language models have made significant strides in unifying audio and visual modalities; however, they often lack fine-grained cross-modal understanding and struggle with multimodal alignment. To address these limitations, we introduce OmniAgent, a fully audio-guided active perception agent that dynamically orchestrates specialized tools to achieve more fine-grained audio-visual reasoning. Unlike previous works that rely on rigid, static workflows and dense frame captioning, this paper demonstrates a paradigm shift from passive response generation to active multimodal inquiry. OmniAgent employs dynamic planning to autonomously orchestrate tool invocation on demand, strategically concentrating perceptual attention on task-relevant cues. Central to our approach is a novel coarse-to-fine audio-guided perception paradigm, which leverages audio cues to localize temporal events and guide subsequent reasoning. Extensive empirical evaluations on three audio-video understanding benchmarks demonstrate that OmniAgent achieves state-of-the-art performance, surpassing leading open-source and proprietary models by substantial margins of 10%-20% in accuracy.
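To make the coarse-to-fine audio-guided perception paradigm more concrete, below is a minimal Python sketch of how such a loop could be structured, assuming audio events are localized first and frames are sampled only within the localized windows before reasoning. All class, function, and tool names here (AudioEvent, detect_audio_events, sample_frames, reason_over_evidence) are illustrative placeholders, not OmniAgent's actual tools or API.

```python
from dataclasses import dataclass


@dataclass
class AudioEvent:
    start: float  # seconds
    end: float    # seconds
    label: str    # coarse audio tag, e.g. "speech" or "door slam"


def detect_audio_events(audio):
    # Coarse stage: an audio-tagging/localization tool would run here.
    # Stub: pretend the clip contains a single speech segment.
    return [AudioEvent(start=3.0, end=8.5, label="speech")]


def sample_frames(video, start, end, fps=1.0):
    # Fine stage: a frame-extraction tool would pull frames from the window.
    # Stub: return the timestamps that would be sampled.
    timestamps, t = [], start
    while t <= end:
        timestamps.append(round(t, 2))
        t += 1.0 / fps
    return timestamps


def reason_over_evidence(question, evidence):
    # In the paper, an LLM planner would reason over the gathered cues;
    # here we simply report what was collected.
    spans = "; ".join(
        f"{e.label} [{e.start:.1f}s-{e.end:.1f}s], {len(frames)} frames"
        for e, frames in evidence
    )
    return f"Q: {question} | evidence: {spans}"


def answer_question(question, audio, video):
    events = detect_audio_events(audio)                     # 1. audio localizes candidate events
    evidence = [(e, sample_frames(video, e.start, e.end))   # 2. inspect only those windows
                for e in events]
    return reason_over_evidence(question, evidence)          # 3. reason over the gathered cues


if __name__ == "__main__":
    print(answer_question("Who is speaking, and what are they doing?", audio=None, video=None))
```

In the full agent, a dynamic planner would also decide which tools to invoke and could iterate this loop, but even this stubbed version captures the coarse (audio localization) to fine (targeted frame inspection) ordering described in the abstract.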
Related papers
- OmniRet: Efficient and High-Fidelity Omni Modality Retrieval [51.80205678389465]
We present OmniRet, the first retrieval model capable of handling complex, composed queries spanning three key modalities: text, vision, and audio. Our model demonstrates significant improvements on composed-query, audio, and video retrieval tasks, while achieving on-par performance with state-of-the-art models on others.
arXiv Detail & Related papers (2026-03-02T17:19:55Z) - OmniGAIA: Towards Native Omni-Modal AI Agents [103.79729735478924]
We introduce a benchmark designed to evaluate omni-modal agents on tasks requiring deep reasoning and multi-turn tool execution. We propose OmniAtlas, a native omni-modal foundation agent under a tool-integrated reasoning paradigm with active omni-modal perception.
arXiv Detail & Related papers (2026-02-26T11:35:04Z) - ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding [32.72568710955575]
We present ROMA, a real-time omni-multimodal assistant for unified reactive and proactive interaction. ROMA processes continuous inputs as synchronized multimodal units, aligning dense audio with discrete video frames to handle granularity mismatches. For online decision-making, we introduce a lightweight speak head that decouples response initiation from generation to ensure precise triggering.
arXiv Detail & Related papers (2026-01-15T12:09:04Z) - Apollo: Unified Multi-Task Audio-Video Joint Generation [15.004783109205666]
Non-commercial approaches still suffer from audio-visual asynchrony, poor lip-speech alignment, and unimodal degradation. We introduce Apollo and delve into three axes: model architecture, training strategy, and data curation. For datasets, we present the first large-scale audio-video dataset with dense captions.
arXiv Detail & Related papers (2026-01-07T18:03:45Z) - AMUSE: Audio-Visual Benchmark and Alignment Framework for Agentic Multi-Speaker Understanding [73.05946667683259]
Recent multimodal large language models (MLLMs) show strong perception but struggle in multi-speaker, dialogue-centric settings. We introduce AMUSE, a benchmark designed around tasks that are inherently agentic. We propose RAFT, a data-efficient agentic alignment framework that integrates reward optimization with intrinsic multimodal self-evaluation.
arXiv Detail & Related papers (2025-12-18T07:01:47Z) - Audio-Visual World Models: Towards Multisensory Imagination in Sight and Sound [5.591620304505415]
This work presents the first formal framework for Audio-Visual World Models (AVWM). It formulates multimodal environment simulation as a partially observable decision process with audio-visual observations, fine-grained actions, and task rewards. We propose an Audio-Visual Conditional Transformer with a novel modality-expert architecture that balances visual and auditory learning.
arXiv Detail & Related papers (2025-11-30T13:11:56Z) - Fork-Merge Decoding: Enhancing Multimodal Understanding in Audio-Visual Large Language Models [35.86252379746625]
The goal of this work is to enhance balanced multimodal understanding in audio-visual large language models (AV-LLMs). In current AV-LLMs, audio and video features are typically processed jointly in the decoder. We propose Fork-Merge Decoding (FMD), a simple yet effective inference-time strategy that requires no additional training or architectural modifications.
arXiv Detail & Related papers (2025-05-27T08:22:56Z) - AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection [57.649223695021114]
We present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection. Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy.
arXiv Detail & Related papers (2025-04-06T13:59:16Z) - Baichuan-Omni-1.5 Technical Report [78.49101296394218]
Baichuan-Omni-1.5 is an omni-modal model that not only has omni-modal understanding capabilities but also provides end-to-end audio generation capabilities. We establish a comprehensive data cleaning and synthesis pipeline for multimodal data, obtaining about 500B of high-quality data. An audio tokenizer has been designed to capture both semantic and acoustic information from audio, enabling seamless integration and enhanced compatibility with the MLLM.
arXiv Detail & Related papers (2025-01-26T02:19:03Z) - OmniBench: Towards The Future of Universal Omni-Language Models [63.16606414452612]
We introduce OmniBench, a novel benchmark designed to evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. Our evaluation reveals that open-source OLMs show significant limitations in instruction-following and reasoning in tri-modal contexts. We advocate for developing more robust tri-modal integration techniques and training strategies to enhance OLM performance.
arXiv Detail & Related papers (2024-09-23T17:59:05Z) - Efficient Multimodal Neural Networks for Trigger-less Voice Assistants [0.8209843760716959]
We propose a neural network based audio-gesture multimodal fusion system for smartwatches.
The system better understands temporal correlation between audio and gesture data, leading to precise invocations.
It is lightweight and deployable on low-power devices, such as smartwatches, with quick launch times.
arXiv Detail & Related papers (2023-05-20T02:52:02Z) - MAAS: Multi-modal Assignation for Active Speaker Detection [59.08836580733918]
We present a novel approach to active speaker detection that directly addresses the multi-modal nature of the problem.
Our experiments show that a small graph data structure built from a single frame allows us to approximate an instantaneous audio-visual assignment problem.
arXiv Detail & Related papers (2021-01-11T02:57:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality or accuracy of this information and is not responsible for any consequences of its use.