Bridging Ears and Eyes: Analyzing Audio and Visual Large Language Models to Humans in Visible Sound Recognition and Reducing Their Sensory Gap via Cross-Modal Distillation
- URL: http://arxiv.org/abs/2505.06803v1
- Date: Sun, 11 May 2025 01:01:44 GMT
- Title: Bridging Ears and Eyes: Analyzing Audio and Visual Large Language Models to Humans in Visible Sound Recognition and Reducing Their Sensory Gap via Cross-Modal Distillation
- Authors: Xilin Jiang, Junkai Wu, Vishal Choudhari, Nima Mesgarani
- Abstract summary: We evaluate audio, visual, and audio-visual large language models (LLMs) against humans in recognizing sound objects. We uncover a performance gap between Qwen2-Audio and Qwen2-VL that parallels the sensory discrepancy between human ears and eyes.
- Score: 13.137446396934102
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Audio large language models (LLMs) are considered experts at recognizing sound objects, yet their performance relative to LLMs in other sensory modalities, such as visual or audio-visual LLMs, and to humans using their ears, eyes, or both remains unexplored. To investigate this, we systematically evaluate audio, visual, and audio-visual LLMs, specifically Qwen2-Audio, Qwen2-VL, and Qwen2.5-Omni, against humans in recognizing sound objects of different classes from audio-only, silent video, or sounded video inputs. We uncover a performance gap between Qwen2-Audio and Qwen2-VL that parallels the sensory discrepancy between human ears and eyes. To reduce this gap, we introduce a cross-modal distillation framework, where an LLM in one modality serves as the teacher and another as the student, with knowledge transfer in sound classes predicted as more challenging to the student by a heuristic model. Distillation in both directions, from Qwen2-VL to Qwen2-Audio and vice versa, leads to notable improvements, particularly in challenging classes. This work highlights the sensory gap in LLMs from a human-aligned perspective and proposes a principled approach to enhancing modality-specific perception in multimodal LLMs.
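The distillation framework described in the abstract amounts to a selective knowledge-distillation objective: the student is trained on the teacher's soft predictions only for sound classes that a heuristic model flags as hard for the student. The sketch below is a minimal illustration of that idea, not the authors' implementation; the names (`hard_class_mask`, `temperature`, `alpha`) and the specific loss weighting are assumptions for the example.

```python
# Hedged sketch of selective cross-modal distillation, assuming the student
# and teacher (e.g., Qwen2-Audio and Qwen2-VL) have been reduced to
# per-sample class logits over the same label set. All names and weights
# here are illustrative, not taken from the paper.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      hard_class_mask: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Cross-entropy on ground-truth labels plus KL distillation from the
    teacher, applied only to samples whose class a heuristic model predicts
    to be challenging for the student."""
    # Standard supervised loss on the ground-truth sound class.
    ce = F.cross_entropy(student_logits, labels)

    # Soft targets from the teacher, softened by the temperature.
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    kl = F.kl_div(log_student, soft_teacher, reduction="none").sum(-1)

    # hard_class_mask[i] = 1.0 if sample i's class is flagged as hard for
    # the student; distillation is restricted to those samples.
    kl = (kl * hard_class_mask).sum() / hard_class_mask.sum().clamp(min=1)

    return alpha * ce + (1 - alpha) * (t ** 2) * kl
```

Running the distillation in both directions, as the abstract describes, would simply swap which model provides `teacher_logits` and which is updated through `student_logits`.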
Related papers
- EgoSound: Benchmarking Sound Understanding in Egocentric Videos [68.1897133235638]
We introduce EgoSound, the first benchmark designed to evaluate egocentric sound understanding in MLLMs. EgoSound unifies data from Ego4D and EgoBlind, encompassing both sighted and sound-dependent experiences. It defines a seven-task taxonomy spanning intrinsic sound perception, spatial localization, causal inference, and cross-modal reasoning.
arXiv Detail & Related papers (2026-02-15T12:46:35Z) - When Eyes and Ears Disagree: Can MLLMs Discern Audio-Visual Confusion? [41.579901082251254]
Experimental results reveal that Multimodal Large Language Models (MLLMs) struggle to discriminate non-existent audio due to visually dominated reasoning. Motivated by this observation, we introduce RL-CoMM, a Reinforcement Learning-based Collaborative Multi-MLLM. RL-CoMM includes two stages: 1) To alleviate visually dominated ambiguities, we introduce an external model, a Large Audio Language Model (LALM), as the reference model to generate audio-only reasoning.
arXiv Detail & Related papers (2025-11-13T07:59:41Z) - SightSound-R1: Cross-Modal Reasoning Distillation from Vision to Audio Language Models [18.802543558300044]
We present SightSound-R1, a cross-modal distillation framework that transfers advanced reasoning from a stronger LVLM teacher to a weaker LALM student. Results show that SightSound-R1 improves LALM reasoning performance both on the in-domain AVQA test set and on unseen auditory scenes and questions.
arXiv Detail & Related papers (2025-09-19T06:39:39Z) - SpA2V: Harnessing Spatial Auditory Cues for Audio-driven Spatially-aware Video Generation [50.03810359300705]
SpA2V decomposes the generation process into two stages: audio-guided video planning and layout-grounded video generation. We show that SpA2V excels in generating realistic videos with semantic and spatial alignment to the input audio.
arXiv Detail & Related papers (2025-08-01T17:05:04Z) - FlowDubber: Movie Dubbing with LLM-based Semantic-aware Learning and Flow Matching based Voice Enhancing [78.83988199306901]
Movie Dubbing aims to convert scripts into speech that aligns with the given movie clip in both temporal and emotional aspects. Existing methods focus primarily on reducing the word error rate while ignoring the importance of lip-sync and acoustic quality. We propose FlowDubber, which achieves high-quality audio-visual sync and pronunciation by incorporating a large speech language model and dual contrastive aligning.
arXiv Detail & Related papers (2025-05-02T13:30:19Z) - AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models [27.430040932849018]
We introduce AVHBench, the first comprehensive benchmark specifically designed to evaluate the perception and comprehension capabilities of audio-visual models. Our results reveal that most existing audio-visual LLMs struggle with hallucinations caused by cross-interactions between modalities. We demonstrate that simple training with our AVHBench improves the robustness of audio-visual LLMs against hallucinations.
arXiv Detail & Related papers (2024-10-23T23:36:06Z) - With Ears to See and Eyes to Hear: Sound Symbolism Experiments with Multimodal Large Language Models [16.583370726582356]
We show that Vision Language Models (VLMs) can implicitly understand sound-based phenomena via abstract reasoning from orthography and imagery alone.
We perform experiments including replicating the classic Kiki-Bouba and Mil-Mal shape and magnitude symbolism tasks.
Our results show that VLMs demonstrate varying levels of agreement with human labels, and that VLMs may require more task information than their human counterparts for in silico experimentation.
arXiv Detail & Related papers (2024-09-23T11:13:25Z) - Beyond Silent Letters: Amplifying LLMs in Emotion Recognition with Vocal Nuances [3.396456345114466]
We propose SpeechCueLLM, a method that translates speech characteristics into natural language descriptions. We evaluate SpeechCueLLM on two datasets, IEMOCAP and MELD, showing significant improvements in emotion recognition accuracy.
arXiv Detail & Related papers (2024-07-31T03:53:14Z) - Understanding Sounds, Missing the Questions: The Challenge of Object Hallucination in Large Audio-Language Models [49.87432626548563]
We introduce methods to assess the extent of object hallucination of publicly available LALMs.
Our findings reveal that LALMs are comparable to specialized audio captioning models in their understanding of audio content.
We explore the potential of prompt engineering to enhance LALMs' performance on discriminative questions.
arXiv Detail & Related papers (2024-06-12T16:51:54Z) - Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos [69.79632907349489]
We propose a self-supervised method for learning representations based on spatial audio-visual correspondences in egocentric videos.
Our method uses a masked auto-encoding framework to synthesize masked (multi-channel) audio through the synergy of audio and vision.
arXiv Detail & Related papers (2023-07-10T17:58:17Z) - MAViL: Masked Audio-Video Learners [68.61844803682145]
We present Masked Audio-Video learners (MAViL) to train audio-visual representations.
Pre-training with MAViL enables the model to perform well in audio-visual classification and retrieval tasks.
For the first time, a self-supervised audio-visual model outperforms ones that use external supervision on benchmarks.
arXiv Detail & Related papers (2022-12-15T18:59:59Z) - MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound [90.1857707251566]
We introduce MERLOT Reserve, a model that jointly represents videos over time from audio, subtitles, and video frames.
We replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet.
Our objective learns faster than alternatives, and performs well at scale.
arXiv Detail & Related papers (2022-01-07T19:00:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.