Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception
- URL: http://arxiv.org/abs/2510.12720v1
- Date: Tue, 14 Oct 2025 17:00:09 GMT
- Title: Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception
- Authors: Ziyang Ma, Ruiyang Xu, Zhenghao Xing, Yunfei Chu, Yuxuan Wang, Jinzheng He, Jin Xu, Pheng-Ann Heng, Kai Yu, Junyang Lin, Eng Siong Chng, Xie Chen
- Abstract summary: We present a systematic and comprehensive investigation of omni detailed perception from the perspectives of the data pipeline, models, and benchmark. We propose Omni-Detective, an agentic data generation pipeline integrating tool-calling, to autonomously produce highly detailed yet minimally hallucinatory multimodal data. Based on the data generated with Omni-Detective, we train two captioning models: Audio-Captioner for audio-only detailed perception, and Omni-Captioner for audio-visual detailed perception.
- Score: 97.32606786622728
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fine-grained perception of multimodal information is critical for advancing human-AI interaction. With recent progress in audio-visual technologies, Omni Language Models (OLMs), capable of processing audio and video signals in parallel, have emerged as a promising paradigm for achieving richer understanding and reasoning. However, their capacity to capture and describe fine-grained details remains underexplored. In this work, we present a systematic and comprehensive investigation of omni detailed perception from the perspectives of the data pipeline, models, and benchmark. We first identify an inherent "co-growth" between detail and hallucination in current OLMs. To address this, we propose Omni-Detective, an agentic data generation pipeline that integrates tool-calling to autonomously produce highly detailed yet minimally hallucinatory multimodal data. Based on the data generated with Omni-Detective, we train two captioning models: Audio-Captioner for audio-only detailed perception, and Omni-Captioner for audio-visual detailed perception. Under the cascade evaluation protocol, Audio-Captioner achieves the best performance on MMAU and MMAR among all open-source models, surpassing Gemini 2.5 Flash and delivering performance comparable to Gemini 2.5 Pro. On existing detailed captioning benchmarks, Omni-Captioner sets a new state of the art on VDC and achieves the best trade-off between detail and hallucination on the video-SALMONN 2 test set. Given the absence of a dedicated benchmark for omni detailed perception, we design Omni-Cloze, a novel cloze-style evaluation for detailed audio, visual, and audio-visual captioning that ensures stable, efficient, and reliable assessment. Experimental results and analysis demonstrate the effectiveness of Omni-Detective in generating high-quality detailed captions, as well as the superiority of Omni-Cloze in evaluating such detailed captions.
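The abstract does not spell out how Omni-Cloze scores captions, but a cloze-style captioning evaluation can be pictured roughly as follows: key facts in a reference detailed caption are masked, each blank comes with candidate options, and a judge reading the candidate caption picks a filler for each blank; accuracy over blanks is the score. The snippet below is a minimal, hypothetical sketch of that idea, assuming a simple multiple-choice item format and a substring-matching judge; the `ClozeItem` structure, option format, and scoring rule are illustrative assumptions, not the paper's actual protocol.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ClozeItem:
    """One blank in a cloze-style caption test (hypothetical structure)."""
    context: str          # caption text with a single ___ blank
    options: List[str]    # candidate fillers, e.g. ["piano", "violin", "drums", "none of the above"]
    answer: int           # index of the correct option

def pick_option(candidate_caption: str, item: ClozeItem) -> int:
    """Naive judge: choose the option that appears in the candidate caption.

    A real evaluator would likely query an LLM judge or the captioning model itself;
    substring matching is only a stand-in to keep the sketch self-contained.
    """
    for idx, opt in enumerate(item.options):
        if opt.lower() in candidate_caption.lower():
            return idx
    return len(item.options) - 1  # fall back to the last ("none of the above"-style) option

def cloze_score(candidate_caption: str, items: List[ClozeItem]) -> float:
    """Fraction of blanks filled correctly; higher is better."""
    correct = sum(pick_option(candidate_caption, it) == it.answer for it in items)
    return correct / len(items) if items else 0.0

if __name__ == "__main__":
    items = [
        ClozeItem("A ___ plays a slow melody while rain falls in the background.",
                  ["piano", "violin", "drums", "none of the above"], 0),
        ClozeItem("The speaker sounds ___ while describing the scene.",
                  ["excited", "calm", "angry", "none of the above"], 1),
    ]
    caption = "A piano plays softly over the sound of rain; the calm narrator describes the scene."
    print(f"cloze score: {cloze_score(caption, items):.2f}")  # 1.00 for this toy example
```

Scoring by constrained choice over blanks, rather than free-form judging of whole captions, is presumably what underlies the "stable, efficient, and reliable assessment" claim, since it avoids the variance of open-ended LLM-judge comparisons.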
Related papers
- Omni-Judge: Can Omni-LLMs Serve as Human-Aligned Judges for Text-Conditioned Audio-Video Generation? [41.8825737520503]
We study whether omni-LLMs can serve as human-aligned judges for text-conditioned audio-video generation. Omni-LLMs naturally process audio, video, and text, support rich reasoning, and offer interpretable chain-of-thought feedback. Our findings highlight both the potential and current limitations of omni-LLMs as unified evaluators for multi-modal generation.
arXiv Detail & Related papers (2026-02-02T04:36:23Z) - JointAVBench: A Benchmark for Joint Audio-Visual Reasoning Evaluation [16.067014259345743]
We evaluate leading vision-only, audio-only, and Omni-LLMs on our dataset. Even the best-performing Omni-LLM achieves an average accuracy of only 62.6%, outperforming uni-modal baselines.
arXiv Detail & Related papers (2025-12-14T17:23:21Z) - OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM [146.029449832893]
We introduce OmniVinci, an initiative to build a strong, open-source, omni-modal LLM. For model architecture, we present three key innovations: (i) OmniAlignNet for strengthening alignment between vision and audio embeddings in a shared omni-modal latent space; (ii) Temporal Embedding Grouping for capturing temporal alignment between vision and audio signals; and (iii) Constrained Rotary Time Embedding for encoding absolute temporal information in omni-modal embeddings.
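The blurb does not describe how Constrained Rotary Time Embedding is formulated; the sketch below only illustrates the general idea of rotary-style encoding of absolute timestamps, where each embedding's feature pairs are rotated by angles proportional to the token's time stamp. The function name, frequency schedule, and shared time axis across modalities are assumptions for illustration, not OmniVinci's actual design.

```python
import numpy as np

def rotary_time_embedding(x: np.ndarray, t: np.ndarray, max_period: float = 10_000.0) -> np.ndarray:
    """Rotate feature pairs of x by angles proportional to absolute time t.

    x: (num_tokens, dim) embeddings, dim must be even.
    t: (num_tokens,) absolute timestamps in seconds.
    Returns embeddings of the same shape with time information mixed in.
    """
    num_tokens, dim = x.shape
    assert dim % 2 == 0, "embedding dimension must be even for pairwise rotation"
    half = dim // 2
    # one frequency per feature pair, geometrically spaced (RoPE-style schedule, assumed here)
    freqs = 1.0 / (max_period ** (np.arange(half) / half))   # (half,)
    angles = t[:, None] * freqs[None, :]                     # (num_tokens, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # standard 2D rotation applied to each (x1_i, x2_i) pair
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Toy usage: video frames at 2 fps and audio tokens at 10 Hz share one absolute time axis,
# so embeddings from both streams can be rotated consistently before fusion.
video_tokens, video_times = np.random.randn(4, 8), np.arange(4) / 2.0
audio_tokens, audio_times = np.random.randn(20, 8), np.arange(20) / 10.0
print(rotary_time_embedding(video_tokens, video_times).shape,
      rotary_time_embedding(audio_tokens, audio_times).shape)  # (4, 8) (20, 8)
```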
arXiv Detail & Related papers (2025-10-17T17:59:59Z) - OmniDPO: A Preference Optimization Framework to Address Omni-Modal Hallucination [32.43796002503023]
We propose OmniDPO, a preference-alignment framework designed to mitigate hallucinations in Omni-modal large language models (OLLMs). By tackling both challenges, OmniDPO effectively improves multimodal grounding and reduces hallucination.
arXiv Detail & Related papers (2025-08-31T07:19:32Z) - OmniGenBench: A Benchmark for Omnipotent Multimodal Generation across 50+ Tasks [77.19223035769248]
Recent breakthroughs in large multimodal models (LMMs) have demonstrated remarkable proficiency in following general-purpose instructions for image generation. We introduce OmniGenBench, a novel benchmark meticulously designed to assess the instruction-following abilities of state-of-the-art LMMs. Our OmniGenBench includes 57 diverse sub-tasks grounded in real-world scenarios, systematically categorized according to the specific model capabilities they demand.
arXiv Detail & Related papers (2025-05-24T16:29:34Z) - OmniVox: Zero-Shot Emotion Recognition with Omni-LLMs [6.365802395342737]
We present OmniVox, the first systematic evaluation of four omni-LLMs on the zero-shot emotion recognition task. We evaluate on two widely used multimodal emotion benchmarks, IEMOCAP and MELD, and find zero-shot omni-LLMs outperform or are competitive with fine-tuned audio models. We present acoustic prompting, an audio-specific prompting strategy for omni-LLMs which focuses on acoustic feature analysis, conversation context analysis, and step-by-step reasoning.
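The summary describes acoustic prompting only at a high level (acoustic feature analysis, conversation context analysis, step-by-step reasoning). Below is a hypothetical prompt template in that spirit, written as a small Python helper; the wording, structure, and label set are illustrative assumptions, not the prompt used in the OmniVox paper.

```python
# Hypothetical acoustic-prompting template for zero-shot emotion recognition with an omni-LLM.
# The wording and step structure are illustrative assumptions, not the OmniVox prompt.
ACOUSTIC_PROMPT = """You will hear one utterance from a conversation.
Step 1 - Acoustic feature analysis: describe pitch, energy, speaking rate, and voice quality.
Step 2 - Conversation context analysis: consider the preceding turns given below.
Step 3 - Step-by-step reasoning: combine the acoustic evidence and the context.
Finally, answer with exactly one label from: {labels}.

Previous turns:
{context}
"""

def build_acoustic_prompt(labels, context_turns):
    """Fill the template with the label set and prior dialogue turns."""
    return ACOUSTIC_PROMPT.format(
        labels=", ".join(labels),
        context="\n".join(context_turns) if context_turns else "(none)",
    )

if __name__ == "__main__":
    prompt = build_acoustic_prompt(
        labels=["angry", "happy", "sad", "neutral"],
        context_turns=["A: Did you hear back about the job?", "B: I just got the call this morning."],
    )
    print(prompt)
```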
arXiv Detail & Related papers (2025-03-27T13:12:49Z) - Nexus: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision [50.23246260804145]
This work proposes an industry-level omni-modal large language model (LLM) pipeline that integrates auditory, visual, and linguistic modalities. Our pipeline consists of three main components: first, a modular framework enabling flexible configuration of various encoder-LLM-decoder architectures; second, a lightweight training strategy that pre-trains audio-language alignment on the state-of-the-art vision-language model Qwen2.5-VL; and third, an audio synthesis pipeline that generates high-quality audio-text data from diverse real-world scenarios.
arXiv Detail & Related papers (2025-02-26T17:26:36Z) - Baichuan-Omni-1.5 Technical Report [78.49101296394218]
Baichuan-Omni-1.5 is an omni-modal model that not only has omni-modal understanding capabilities but also provides end-to-end audio generation capabilities. We establish a comprehensive data cleaning and synthesis pipeline for multimodal data, obtaining about 500B of high-quality data. In addition, an audio tokenizer has been designed to capture both semantic and acoustic information from audio, enabling seamless integration and enhanced compatibility with the MLLM.
arXiv Detail & Related papers (2025-01-26T02:19:03Z) - AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information? [65.49972312524724]
Multimodal large language models (MLLMs) have expanded their capabilities to include vision and audio modalities. Our proposed DeafTest reveals that MLLMs often struggle with simple tasks humans find trivial. We introduce AV-Odyssey Bench, a comprehensive audio-visual benchmark designed to assess whether these MLLMs can truly understand audio-visual information.
arXiv Detail & Related papers (2024-12-03T17:41:23Z) - OmnixR: Evaluating Omni-modality Language Models on Reasoning across Modalities [124.05360767047539]
We introduce OmnixR, an evaluation suite designed to benchmark SoTA Omni-modality Language Models.
Evaluating OLMs, which integrate multiple modalities such as text, vision, and audio, presents unique challenges.
Our experiments find that all state-of-the-art OLMs struggle with OmnixR questions that require integrating information from multiple modalities to answer.
arXiv Detail & Related papers (2024-10-16T04:29:46Z)