See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models
- URL: http://arxiv.org/abs/2512.02231v1
- Date: Mon, 01 Dec 2025 21:57:26 GMT
- Title: See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models
- Authors: Le Thien Phuc Nguyen, Zhuoran Yu, Samuel Low Yu Hang, Subin An, Jeongik Lee, Yohan Ban, SeungEun Chung, Thanh-Huy Nguyen, JuWan Maeng, Soochahn Lee, Yong Jae Lee
- Abstract summary: AV-SpeakerBench is a curated benchmark of 3,212 multiple-choice questions focused on speaker-centric audiovisual reasoning in real-world videos. It features: (1) a speaker-centered formulation that treats speakers, not scenes, as the core reasoning unit; (2) fusion-grounded question design embedding audiovisual dependencies into question semantics; and (3) expert-curated annotations ensuring temporal precision and cross-modal validity.
- Score: 24.851643680674474
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multimodal large language models (MLLMs) are expected to jointly interpret vision, audio, and language, yet existing video benchmarks rarely assess fine-grained reasoning about human speech. Many tasks remain visually solvable or only coarsely evaluate speech, offering limited insight into whether models can align who speaks, what is said, and when it occurs. We introduce AV-SpeakerBench, a curated benchmark of 3,212 multiple-choice questions focused on speaker-centric audiovisual reasoning in real-world videos. It features: (1) a speaker-centered formulation that treats speakers, not scenes, as the core reasoning unit; (2) fusion-grounded question design embedding audiovisual dependencies into question semantics; and (3) expert-curated annotations ensuring temporal precision and cross-modal validity. Comprehensive evaluations show that the Gemini family consistently outperforms open-source systems, with Gemini 2.5 Pro achieving the best results. Among open models, Qwen3-Omni-30B approaches Gemini 2.0 Flash but remains far behind Gemini 2.5 Pro, primarily due to weaker audiovisual fusion rather than visual perception. We believe AV-SpeakerBench establishes a rigorous foundation for advancing fine-grained audiovisual reasoning in future multimodal systems.
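Since AV-SpeakerBench is a multiple-choice benchmark, scoring reduces to exact-match accuracy over predicted option letters. The sketch below assumes a hypothetical JSON layout (one {question_id: option} mapping per file); the released benchmark's actual format may differ.

```python
import json

def mcq_accuracy(pred_path: str, gold_path: str) -> float:
    """Exact-match accuracy over multiple-choice items keyed by question id.

    Assumes each file holds one JSON object mapping question_id to an
    option letter ("A".."D"). Illustrative schema, not the official one.
    """
    with open(pred_path) as f:
        preds = json.load(f)
    with open(gold_path) as f:
        gold = json.load(f)
    correct = sum(preds.get(qid) == answer for qid, answer in gold.items())
    return correct / len(gold)

# e.g. print(f"accuracy: {mcq_accuracy('preds.json', 'answers.json'):.2%}")
```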
Related papers
- AudioCapBench: Quick Evaluation on Audio Captioning across Sound, Music, and Speech [56.08149157180447]
We introduce AudioCapBench, a benchmark for evaluating audio captioning capabilities of large multimodal models. We evaluate 13 models across two providers (OpenAI, Google Gemini) using both reference-based metrics (METEOR, BLEU, ROUGE-L) and an LLM-as-Judge framework.
arXiv Detail & Related papers (2026-02-27T03:33:37Z)
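The reference-based metrics named above are all computable with off-the-shelf libraries; the sketch below uses nltk and Google's rouge-score package, which are assumptions here rather than AudioCapBench's own pipeline. METEOR follows the same pattern via nltk.translate.meteor_score once WordNet data is installed.

```python
# pip install nltk rouge-score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "a dog barks twice while rain falls in the background"
candidate = "a dog is barking while it rains"

# Smoothed sentence-level BLEU: short captions often have zero
# higher-order n-gram overlap, which unsmoothed BLEU scores as 0.
bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-L F-measure (longest common subsequence overlap).
rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, candidate)
print(f"BLEU={bleu:.3f}  ROUGE-L={rouge_l['rougeL'].fmeasure:.3f}")
```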
- JointAVBench: A Benchmark for Joint Audio-Visual Reasoning Evaluation [16.067014259345743]
We evaluate leading vision-only, audio-only, and Omni-LLMs on our dataset. Even the best-performing Omni-LLM achieves an average accuracy of only 62.6%, though it still outperforms uni-modal baselines.
arXiv Detail & Related papers (2025-12-14T17:23:21Z)
- ELEGANCE: Efficient LLM Guidance for Audio-Visual Target Speech Extraction [88.41471266579333]
We propose ELEGANCE, a novel framework that incorporates linguistic knowledge from large language models (LLMs) into AV-TSE models. Comprehensive experiments with RoBERTa, Qwen3-0.6B, and Qwen3-4B on two AV-TSE backbones show significant improvements.
arXiv Detail & Related papers (2025-11-09T08:50:11Z)
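The summary above does not spell out how the linguistic knowledge is injected; one generic pattern for conditioning an extraction backbone on LLM hidden states is cross-attention from audio frames to projected text features. The PyTorch sketch below is purely illustrative of that pattern, not the ELEGANCE architecture; all dimensions and module names are invented.

```python
import torch
import torch.nn as nn

class TextConditionedExtractor(nn.Module):
    """Illustrative only: a generic cross-attention fusion of LLM text
    features into a speech-extraction backbone (not ELEGANCE itself)."""

    def __init__(self, d_audio: int = 256, d_text: int = 1024, n_heads: int = 4):
        super().__init__()
        self.text_proj = nn.Linear(d_text, d_audio)   # LLM dim -> audio dim
        self.cross_attn = nn.MultiheadAttention(d_audio, n_heads, batch_first=True)
        self.mask_head = nn.Linear(d_audio, d_audio)  # per-frame mask logits

    def forward(self, audio_feats: torch.Tensor, text_feats: torch.Tensor):
        # audio_feats: (B, T_audio, d_audio); text_feats: (B, T_text, d_text)
        text = self.text_proj(text_feats)
        fused, _ = self.cross_attn(audio_feats, text, text)  # audio queries text
        return torch.sigmoid(self.mask_head(audio_feats + fused))

# Dummy shapes only; real features would come from audio/text encoders.
mask = TextConditionedExtractor()(torch.randn(2, 100, 256), torch.randn(2, 12, 1024))
```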
- M3-SLU: Evaluating Speaker-Attributed Reasoning in Multimodal Large Language Models [15.324265847938813]
We present M3-SLU, a new multimodal large language model (MLLM) benchmark for evaluating multi-speaker, multi-turn spoken language understanding. M3-SLU is built from four open corpora (CHiME-6, MELD, MultiDialog, and AMI) and comprises over 12,000 validated instances with paired audio, transcripts, and metadata. Results show that while models can capture what was said, they often fail to identify who said it, revealing a key gap in speaker-aware dialogue understanding.
arXiv Detail & Related papers (2025-10-22T08:28:43Z)
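The what-versus-who gap M3-SLU reports is easy to surface by splitting accuracy by question type. The sketch below assumes a hypothetical per-item record with a "type" field distinguishing content questions from speaker-attribution questions; the benchmark's released schema may differ.

```python
from collections import defaultdict

# Hypothetical records; M3-SLU's actual schema may differ.
records = [
    {"type": "content", "pred": "B", "gold": "B"},
    {"type": "speaker", "pred": "A", "gold": "C"},
    {"type": "speaker", "pred": "D", "gold": "D"},
]

hits, totals = defaultdict(int), defaultdict(int)
for r in records:
    totals[r["type"]] += 1
    hits[r["type"]] += int(r["pred"] == r["gold"])

for qtype, total in totals.items():
    print(f"{qtype} accuracy: {hits[qtype] / total:.2%}")
```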
- VCB Bench: An Evaluation Benchmark for Audio-Grounded Large Language Model Conversational Agents [53.33704332801441]
Large audio language models (LALMs) have greatly enhanced multimodal conversational systems. Existing benchmarks are mainly English-centric, rely on synthetic speech, and lack comprehensive, discriminative evaluation. We present Voice Chat Bot Bench (VCB Bench), a high-quality Chinese benchmark built entirely on real human speech.
arXiv Detail & Related papers (2025-10-13T07:45:52Z)
- VoiceAssistant-Eval: Benchmarking AI Assistants across Listening, Speaking, and Viewing [45.15289852736435]
VoiceAssistant-Eval comprises 10,497 curated examples spanning 13 task categories. To demonstrate its utility, we evaluate 21 open-source models and GPT-4o-Audio. Among the key findings: proprietary models do not universally outperform open-source models.
arXiv Detail & Related papers (2025-09-26T17:59:59Z)
- MultiVox: A Benchmark for Evaluating Voice Assistants for Multimodal Interactions [70.93364531054273]
We introduce MultiVox, the first benchmark to evaluate the ability of voice assistants to integrate spoken and visual cues. Specifically, MultiVox includes 1,000 human-annotated and recorded speech dialogues that encompass diverse paralinguistic features. Our evaluation of 10 state-of-the-art models reveals that, although humans excel at these tasks, current models consistently struggle to produce contextually grounded responses.
arXiv Detail & Related papers (2025-07-14T23:20:42Z)
- AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information? [65.49972312524724]
Multimodal large language models (MLLMs) have expanded their capabilities to include vision and audio modalities. Our proposed DeafTest reveals that MLLMs often struggle with simple tasks humans find trivial. We introduce AV-Odyssey Bench, a comprehensive audio-visual benchmark designed to assess whether these MLLMs can truly understand audio-visual information.
arXiv Detail & Related papers (2024-12-03T17:41:23Z)
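As a concrete illustration of the kind of elementary check a DeafTest-style probe can pose, the sketch below synthesizes a two-clip "which sound is louder?" item with numpy. The task choice, tone parameters, and output schema are all assumptions for illustration; the actual DeafTest stimuli are defined in the AV-Odyssey paper.

```python
import numpy as np

def make_loudness_item(sr=16000, dur=1.0, freq=440.0, gap_db=6.0, louder="A"):
    """Synthesize a two-clip 'which sound is louder?' probe.

    Illustrative only; the real DeafTest stimuli and task mix are
    specified in the AV-Odyssey paper, not reproduced here.
    """
    t = np.linspace(0.0, dur, int(sr * dur), endpoint=False)
    tone = 0.05 * np.sin(2 * np.pi * freq * t)   # quiet reference tone
    loud = tone * 10 ** (gap_db / 20)            # same tone, gap_db louder
    a, b = (loud, tone) if louder == "A" else (tone, loud)
    return {"clip_a": a, "clip_b": b, "answer": louder, "sample_rate": sr}

item = make_loudness_item()  # a model should answer item["answer"]
```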
- Speaker Independent and Multilingual/Mixlingual Speech-Driven Talking Head Generation Using Phonetic Posteriorgrams [58.617181880383605]
In this work, we propose a novel approach using phonetic posteriorgrams.
Our method does not need hand-crafted features and is more robust to noise than recent approaches.
Our model is the first to support multilingual/mixlingual speech as input with convincing results.
arXiv Detail & Related papers (2020-06-20T16:32:43Z)
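A phonetic posteriorgram, the representation the entry above builds on, is the frame-by-frame posterior distribution over phoneme classes produced by an ASR acoustic model. A minimal numpy sketch follows; the acoustic model itself is assumed, and the 40-phoneme inventory is an invented placeholder.

```python
import numpy as np

def phonetic_posteriorgram(logits: np.ndarray) -> np.ndarray:
    """Convert per-frame phoneme logits from an ASR acoustic model into a
    PPG: a (frames x phonemes) matrix whose rows are posterior
    distributions (each row sums to 1)."""
    z = logits - logits.max(axis=1, keepdims=True)  # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

logits = np.random.randn(200, 40)  # dummy: 200 frames, 40 phoneme classes
ppg = phonetic_posteriorgram(logits)
assert np.allclose(ppg.sum(axis=1), 1.0)
```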
- "Notic My Speech" -- Blending Speech Patterns With Multimedia [65.91370924641862]
We propose a view-temporal attention mechanism to model both view dependence and visemic importance in speech recognition and understanding.
Our proposed method outperformed the existing work by 4.99% in terms of the viseme error rate.
We show that there is a strong correlation between our model's understanding of multi-view speech and human perception.
arXiv Detail & Related papers (2020-06-12T06:51:55Z)
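The viseme error rate quoted in the entry above is, in the standard formulation, edit distance over viseme sequences normalized by reference length, exactly analogous to word error rate. A minimal sketch is below; the paper's exact scoring protocol may differ.

```python
def viseme_error_rate(ref: list, hyp: list) -> float:
    """Levenshtein distance between viseme sequences divided by the
    reference length, analogous to WER. Standard formulation; the
    paper's exact protocol may differ."""
    d = list(range(len(hyp) + 1))  # DP row for the empty reference prefix
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution or match
    return d[len(hyp)] / len(ref)

print(viseme_error_rate(list("pat"), list("pet")))  # 1 edit / 3 -> 0.333...
```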