Listening without Looking: Modality Bias in Audio-Visual Captioning
- URL: http://arxiv.org/abs/2510.24024v1
- Date: Tue, 28 Oct 2025 03:06:28 GMT
- Title: Listening without Looking: Modality Bias in Audio-Visual Captioning
- Authors: Yuchi Ishikawa, Toranosuke Manabe, Tatsuya Komatsu, Yoshimitsu Aoki
- Abstract summary: We conduct modality robustness tests on LAVCap, a state-of-the-art audio-visual captioning model. The analysis reveals a pronounced bias toward the audio stream in LAVCap. We augment AudioCaps with textual annotations that jointly describe the audio and visual streams. The results indicate that LAVCap trained on AudioVisualCaps exhibits less modality bias than when trained on AudioCaps.
- Score: 26.155364752676167
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio-visual captioning aims to generate holistic scene descriptions by jointly modeling sound and vision. While recent methods have improved performance through sophisticated modality fusion, it remains unclear to what extent the two modalities are complementary in current audio-visual captioning models and how robust these models are when one modality is degraded. We address these questions by conducting systematic modality robustness tests on LAVCap, a state-of-the-art audio-visual captioning model, in which we selectively suppress or corrupt the audio or visual streams to quantify sensitivity and complementarity. The analysis reveals a pronounced bias toward the audio stream in LAVCap. To evaluate how balanced audio-visual captioning models are in their use of both modalities, we augment AudioCaps with textual annotations that jointly describe the audio and visual streams, yielding the AudioVisualCaps dataset. In our experiments, we report LAVCap baseline results on AudioVisualCaps and evaluate the model under the same modality robustness tests; the results indicate that LAVCap trained on AudioVisualCaps exhibits less modality bias than when trained on AudioCaps.
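The abstract does not come with code, but the protocol it describes is straightforward to picture. Below is a minimal sketch of a modality robustness test in the spirit of the one described: the same clip is captioned with both streams intact, then with the audio or visual features suppressed (zeroed) or corrupted (replaced with noise). The `model.generate` interface and the particular corruption settings are assumptions for illustration, not LAVCap's actual API.

```python
import torch

def corrupt(x: torch.Tensor, mode: str) -> torch.Tensor:
    """Suppress or degrade one modality's input features."""
    if mode == "clean":
        return x
    if mode == "zero":            # full suppression
        return torch.zeros_like(x)
    if mode == "noise":           # replace with Gaussian noise
        return torch.randn_like(x)
    raise ValueError(f"unknown mode: {mode}")

@torch.no_grad()
def modality_robustness_test(model, audio, video):
    """Caption one clip under clean and degraded modality conditions."""
    conditions = [
        ("clean", "clean"),   # both modalities intact
        ("zero",  "clean"),   # audio suppressed
        ("clean", "zero"),    # video suppressed
        ("noise", "clean"),   # audio corrupted
        ("clean", "noise"),   # video corrupted
    ]
    return {
        (a, v): model.generate(corrupt(audio, a), corrupt(video, v))
        for a, v in conditions
    }
```

Comparing per-condition caption quality (e.g., how much CIDEr drops when video is zeroed versus when audio is zeroed) is one way to quantify the modality bias the authors report.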
Related papers
- ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing [128.8346376825612]
Key challenges of high-quality image captioning lie in the inherent biases of LVLMs.
We propose a scalable debiased captioning strategy, which continuously enriches and calibrates the caption with an increased inference budget.
Annotating 450K images with ScaleCap and using them for LVLM pretraining leads to consistent performance gains across 11 widely used benchmarks.
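As a loose illustration of the enrich-then-calibrate loop this summary describes, the sketch below grows a caption by one verified detail per round of inference budget. `propose_detail` and `verify_claim` are hypothetical stand-ins for the paper's LVLM prompting steps, not its actual interface.

```python
def scalable_debiased_caption(image, caption, budget,
                              propose_detail, verify_claim):
    """Iteratively enrich a caption, keeping only claims that pass verification."""
    for _ in range(budget):                       # more budget -> richer caption
        claim = propose_detail(image, caption)    # ask the LVLM for one new detail
        if claim and verify_claim(image, claim):  # calibrate: drop unsupported claims
            caption = f"{caption} {claim}"
    return caption
```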
arXiv Detail & Related papers (2025-06-24T17:59:55Z)
- Mitigating Audiovisual Mismatch in Visual-Guide Audio Captioning [37.17910848101769]
Current vision-guided audio captioning systems fail to address audiovisual misalignment in real-world scenarios.
We present an entropy-aware gated fusion framework that dynamically modulates visual information flow through cross-modal uncertainty quantification.
We also develop a batch-wise audiovisual shuffling technique that generates synthetic mismatched training pairs.
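One plausible reading of entropy-aware gating, sketched below: measure the entropy of the cross-modal attention over visual tokens and let a learned gate downweight the visual contribution when that attention is diffuse. The dimensions, gating form, and module layout are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class EntropyGatedFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.gate = nn.Linear(1, 1)  # maps normalized entropy to a gate logit

    def forward(self, audio_tokens, visual_tokens):
        # Audio queries attend over visual tokens.
        fused, attn_w = self.attn(audio_tokens, visual_tokens, visual_tokens,
                                  average_attn_weights=True)
        # attn_w: (B, Tq, Tk); entropy over the visual (key) axis.
        ent = -(attn_w.clamp_min(1e-8).log() * attn_w).sum(-1)       # (B, Tq)
        ent = ent / torch.log(torch.tensor(float(attn_w.size(-1))))  # scale to [0, 1]
        g = torch.sigmoid(self.gate(ent.unsqueeze(-1)))              # (B, Tq, 1)
        # The gate is trained so high-entropy (uncertain) attention
        # suppresses the visual contribution.
        return audio_tokens + g * fused

# Example: batch of 2, 10 audio tokens, 16 visual tokens, dim 64.
if __name__ == "__main__":
    m = EntropyGatedFusion(64)
    out = m(torch.randn(2, 10, 64), torch.randn(2, 16, 64))
    print(out.shape)  # torch.Size([2, 10, 64])
```

The batch-wise shuffling step could then pair each audio with a mismatched video from the same batch (e.g., `video[torch.randperm(batch_size)]`) to give the gate explicit negative examples.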
arXiv Detail & Related papers (2025-05-28T07:08:17Z)
- LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport [16.108957027494604]
LAVCap is a large language model (LLM)-based audio-visual captioning framework.
It integrates visual information with audio to improve audio captioning performance.
It outperforms existing state-of-the-art methods on the AudioCaps dataset.
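The summary does not spell out how optimal transport enters the fusion; a common formulation is to softly match audio tokens to visual tokens via a Sinkhorn-computed transport plan over a cosine-distance cost, roughly as sketched below. The hyperparameters and the pooling step are illustrative assumptions, not LAVCap's published settings.

```python
import torch
import torch.nn.functional as F

def sinkhorn_plan(audio, video, eps: float = 0.05, iters: int = 50):
    """audio: (Ta, D), video: (Tv, D) -> transport plan of shape (Ta, Tv)."""
    cost = 1 - F.normalize(audio, dim=-1) @ F.normalize(video, dim=-1).T
    K = torch.exp(-cost / eps)                 # Gibbs kernel
    a = torch.full((audio.size(0),), 1.0 / audio.size(0))  # uniform marginals
    b = torch.full((video.size(0),), 1.0 / video.size(0))
    u = torch.ones_like(a)
    for _ in range(iters):                     # Sinkhorn-Knopp iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]         # rows sum to 1/Ta, cols to 1/Tv

# The plan can pool visual tokens into audio-aligned visual features:
#   aligned_video = audio.size(0) * sinkhorn_plan(A, V) @ V
```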
arXiv Detail & Related papers (2025-01-16T04:53:29Z)
- AVCap: Leveraging Audio-Visual Features as Text Tokens for Captioning [24.608569008975497]
We propose AVCap, an Audio-Visual Captioning framework.
AVCap utilizes audio-visual features as text tokens.
Our method outperforms existing audio-visual captioning methods across all metrics.
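The "features as text tokens" idea amounts to projecting encoder outputs into the caption decoder's embedding space and prepending them as a prefix; a minimal sketch follows, with module names and sizes assumed for illustration rather than taken from AVCap.

```python
import torch
import torch.nn as nn

class AVPrefix(nn.Module):
    def __init__(self, a_dim: int, v_dim: int, lm_dim: int):
        super().__init__()
        self.a_proj = nn.Linear(a_dim, lm_dim)  # audio encoder -> LM space
        self.v_proj = nn.Linear(v_dim, lm_dim)  # video encoder -> LM space

    def forward(self, audio_feats, visual_feats, text_embeds):
        # audio_feats: (B, Ta, a_dim); visual_feats: (B, Tv, v_dim)
        prefix = torch.cat([self.a_proj(audio_feats),
                            self.v_proj(visual_feats)], dim=1)
        # The decoder then attends over [audio | video | text] as one sequence.
        return torch.cat([prefix, text_embeds], dim=1)
```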
arXiv Detail & Related papers (2024-07-10T16:17:49Z)
- Improving Text-To-Audio Models with Synthetic Captions [51.19111942748637]
We propose an audio captioning pipeline that uses an audio language model to synthesize accurate and diverse captions for audio at scale.
We leverage this pipeline to produce a dataset of synthetic captions for AudioSet, named AF-AudioSet, and then evaluate the benefit of pre-training text-to-audio models on these synthetic captions.
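A hedged sketch of such a pipeline: caption each clip several times with the audio language model, keep the candidate with the best audio-text agreement, and drop low-fidelity pairs. `audio_lm_caption` and `clap_similarity` are placeholder names, and the CLAP-style filtering threshold is an assumption, not the paper's released recipe.

```python
def build_synthetic_caption_set(clips, audio_lm_caption, clap_similarity,
                                n_candidates=5, threshold=0.45):
    kept = []
    for clip in clips:
        # Sample several candidate captions from the audio language model.
        candidates = [audio_lm_caption(clip) for _ in range(n_candidates)]
        # Rank by audio-text agreement and keep only high-fidelity pairs.
        best = max(candidates, key=lambda c: clap_similarity(clip, c))
        if clap_similarity(clip, best) >= threshold:
            kept.append((clip, best))
    return kept  # pairs for pre-training the text-to-audio model
```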
arXiv Detail & Related papers (2024-06-18T00:02:15Z)
- AdVerb: Visually Guided Audio Dereverberation [49.958724234969445]
We present AdVerb, a novel audio-visual dereverberation framework.
It uses visual cues in addition to the reverberant sound to estimate clean audio.
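A much-simplified sketch of the idea, with all architecture details assumed: a mask estimator conditioned on a visual embedding of the room predicts a clean-speech mask over the reverberant spectrogram.

```python
import torch
import torch.nn as nn

class VisualDereverb(nn.Module):
    def __init__(self, n_freq: int = 257, v_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.visual_proj = nn.Linear(v_dim, hidden)
        self.rnn = nn.GRU(n_freq + hidden, hidden, batch_first=True)
        self.mask = nn.Linear(hidden, n_freq)

    def forward(self, rev_spec, visual_emb):
        # rev_spec: (B, T, F) magnitude spectrogram; visual_emb: (B, v_dim)
        v = self.visual_proj(visual_emb)               # room-appearance cue
        v = v.unsqueeze(1).expand(-1, rev_spec.size(1), -1)
        h, _ = self.rnn(torch.cat([rev_spec, v], dim=-1))
        return rev_spec * torch.sigmoid(self.mask(h))  # masked clean estimate
```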
arXiv Detail & Related papers (2023-08-23T18:20:59Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
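One plausible reading of "local-global fusion", sketched below: local exchange via token-level cross-attention in both directions, plus a global clip-level summary broadcast back to every token. This is purely illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class LocalGlobalFusion(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.glob = nn.Linear(2 * dim, dim)

    def forward(self, a, v):
        # Local: token-level cross-attention in both directions.
        a_loc, _ = self.v2a(a, v, v)   # audio attends to video
        v_loc, _ = self.a2v(v, a, a)   # video attends to audio
        # Global: broadcast a fused clip-level summary to every token.
        g = self.glob(torch.cat([a.mean(1), v.mean(1)], dim=-1)).unsqueeze(1)
        return a + a_loc + g, v + v_loc + g
```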
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research [82.42802570171096]
We introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions.
Online-harvested raw descriptions are highly noisy and unsuitable for direct use in tasks such as automated audio captioning.
We propose a three-stage processing pipeline for filtering noisy data and generating high-quality captions, where ChatGPT, a large language model, is leveraged to filter and transform raw descriptions automatically.
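Schematically, the three stages could be wired as below; the stage functions are placeholders, and the paper's actual heuristics and ChatGPT prompts differ.

```python
def wavcaps_style_pipeline(raw_items, rule_filter, llm_rewrite, quality_check):
    """raw_items: iterable of (audio_id, raw_web_description) pairs."""
    captions = {}
    for audio_id, desc in raw_items:
        if not rule_filter(desc):     # stage 1: heuristic filtering of noisy text
            continue
        caption = llm_rewrite(desc)   # stage 2: LLM rewrites raw text as a caption
        if quality_check(caption):    # stage 3: keep only fluent, faithful captions
            captions[audio_id] = caption
    return captions
```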
arXiv Detail & Related papers (2023-03-30T14:07:47Z)
- Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention [54.4258176885084]
Accurately recognizing ambiguous sounds is a major challenge for audio captioning.
We propose visually-aware audio captioning, which makes use of visual information to help the description of ambiguous sounding objects.
Our proposed method achieves state-of-the-art results on machine translation metrics.
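A minimal sketch of one way to make the audio-visual attention "adaptive": a gate predicted from the audio features decides, per time step, how much visual context to mix in when the sound alone is ambiguous. Sizes and the gating form are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class AdaptiveAVAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, audio, video):
        # Audio tokens query the visual tokens for disambiguating context.
        visual_ctx, _ = self.cross(audio, video, video)
        # Per-step gate: how much should this sound rely on vision?
        g = self.gate(audio)              # (B, Ta, 1)
        return audio + g * visual_ctx
```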
arXiv Detail & Related papers (2022-10-28T22:45:41Z)