What You See is What You Ask: Evaluating Audio Descriptions
- URL: http://arxiv.org/abs/2510.00808v1
- Date: Wed, 01 Oct 2025 12:14:15 GMT
- Title: What You See is What You Ask: Evaluating Audio Descriptions
- Authors: Divy Kala, Eshika Khandelwal, Makarand Tapaswi
- Abstract summary: We propose ADQA, a QA benchmark that evaluates ADs at the level of few-minute long, coherent video segments. We show that current AD generation methods lag far behind human-authored ADs.
- Score: 27.76958202277314
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Audio descriptions (ADs) narrate important visual details in movies, enabling Blind and Low Vision (BLV) users to understand narratives and appreciate visual details. Existing works in automatic AD generation mostly focus on few-second trimmed clips, and evaluate them by comparing against a single ground-truth reference AD. However, writing ADs is inherently subjective. Through alignment and analysis of two independent AD tracks for the same movies, we quantify the subjectivity in when and whether to describe, and what and how to highlight. Thus, we show that working with trimmed clips is inadequate. We propose ADQA, a QA benchmark that evaluates ADs at the level of few-minute long, coherent video segments, testing whether they would help BLV users understand the story and appreciate visual details. ADQA features visual appreciation (VA) questions about visual facts and narrative understanding (NU) questions based on the plot. Through ADQA, we show that current AD generation methods lag far behind human-authored ADs. We conclude with several recommendations for future work and introduce a public leaderboard for benchmarking.
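To make the evaluation idea concrete, here is a minimal sketch of QA-based AD evaluation in the spirit of ADQA. The data layout, the `answer_fn` interface, and the scoring loop are assumptions for illustration; the paper's actual prompts, question format, and protocol may differ.

```python
# Hypothetical sketch of ADQA-style evaluation (not the paper's actual code):
# score an AD track by checking how many multiple-choice questions a
# text-only QA model can answer given only the ADs and dialogue for a
# few-minute segment.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Question:
    kind: str            # "VA" (visual appreciation) or "NU" (narrative understanding)
    text: str
    options: List[str]
    answer_idx: int      # index of the correct option


@dataclass
class Segment:
    ads: List[str]       # audio descriptions covering this segment
    dialogue: str        # transcript a BLV viewer would hear
    questions: List[Question]


def adqa_accuracy(segments: List[Segment],
                  answer_fn: Callable[[str, Question], int]) -> dict:
    """Per-category accuracy; answer_fn is any text QA model (e.g. an LLM
    wrapper) that returns the index of its chosen option."""
    correct = {"VA": 0, "NU": 0}
    total = {"VA": 0, "NU": 0}
    for seg in segments:
        context = seg.dialogue + "\n" + "\n".join(seg.ads)
        for q in seg.questions:
            total[q.kind] += 1
            if answer_fn(context, q) == q.answer_idx:
                correct[q.kind] += 1
    return {k: correct[k] / total[k] for k in total if total[k]}
```

Under this framing, an AD track is better the more VA and NU questions it lets a purely text-based reader answer, which is what separates segment-level evaluation from reference-matching on trimmed clips.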
Related papers
- MCAD: Multimodal Context-Aware Audio Description Generation For Soccer [8.83668236549788]
We present an end-to-end pipeline, MCAD, that extends AD generation beyond movies to the domain of sports. We fine-tune a Video Large Language Model on publicly available movie AD datasets so that it learns the narrative structure and conventions of AD. We introduce a new evaluation metric, ARGE-AD, designed to accurately assess the quality of generated AD.
arXiv Detail & Related papers (2025-11-12T16:05:05Z)
- More than a Moment: Towards Coherent Sequences of Audio Descriptions [88.14731697642098]
Audio Descriptions (ADs) convey essential on-screen information, allowing visually impaired audiences to follow videos. Most automatic methods generate each AD independently, often resulting in repetitive, incoherent descriptions. We propose a training-free method, CoherentAD, that first generates multiple candidate descriptions for each AD time interval, and then performs auto-regressive selection across the sequence.
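A minimal sketch of the candidate-then-select idea described above; CoherentAD's actual candidate generator and selection model are not specified here, so both are passed in as caller-supplied callables.

```python
# Greedy auto-regressive selection over per-interval candidates: at each AD
# time interval, keep the candidate that scores highest given the
# descriptions chosen so far. The scorer below is a toy stand-in.
from typing import Callable, List


def select_coherent_sequence(
    candidates_per_interval: List[List[str]],
    coherence_score: Callable[[List[str], str], float],
) -> List[str]:
    chosen: List[str] = []
    for candidates in candidates_per_interval:
        best = max(candidates, key=lambda c: coherence_score(chosen, c))
        chosen.append(best)
    return chosen


def toy_score(history: List[str], candidate: str) -> float:
    """Stand-in scorer that penalises repeating words already used."""
    seen = {w for ad in history for w in ad.lower().split()}
    words = candidate.lower().split()
    return -sum(w in seen for w in words) / max(len(words), 1)
```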
arXiv Detail & Related papers (2025-10-29T12:06:42Z)
- Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation [110.79299467093006]
We propose a two-stage framework that leverages "shots" as the fundamental units of video understanding. This includes extending temporal context to neighbouring shots and incorporating film grammar devices, such as shot scales and thread structures. Our method is compatible with both open-source and proprietary Visual-Language Models.
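A tiny illustrative helper showing what "extending temporal context to neighbouring shots" could look like in practice; the shot representation and window size here are assumptions, not the paper's implementation.

```python
# Build a context window of neighbouring shots around a target shot before
# prompting a VLM, so the model sees more than a single isolated shot.
from typing import List, Tuple

Shot = Tuple[int, int]  # (start_frame, end_frame), assumed representation


def shot_context_window(shots: List[Shot], idx: int, radius: int = 1) -> List[Shot]:
    """Return the target shot plus up to `radius` neighbours on each side."""
    lo = max(0, idx - radius)
    hi = min(len(shots), idx + radius + 1)
    return shots[lo:hi]
```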
arXiv Detail & Related papers (2025-04-01T17:59:57Z)
- DistinctAD: Distinctive Audio Description Generation in Contexts [62.58375366359421]
We propose DistinctAD, a two-stage framework for generating Audio Descriptions that emphasize distinctiveness to produce better narratives. In Stage-I, a CLIP-AD adaptation strategy addresses the domain gap without requiring additional AD corpora. In Stage-II, DistinctAD incorporates two key innovations: (i) a Contextual Expectation-Maximization Attention (EMA) module that reduces redundancy by extracting common bases from consecutive video clips, and (ii) an explicit distinctive word prediction loss that filters out repeated words in the context.
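For intuition, here is a generic Expectation-Maximization attention step of the kind the Contextual EMA module builds on: a small set of bases is iteratively fit to tokens pooled from consecutive clips, summarising their shared content. Shapes, iteration count, and normalization are illustrative assumptions, not DistinctAD's exact design.

```python
# Generic EM attention (in the style of EMANet): alternate soft assignment
# of tokens to bases (E-step) and weighted re-estimation of bases (M-step).
import numpy as np


def em_attention_bases(features: np.ndarray, num_bases: int = 8,
                       iters: int = 3, tau: float = 1.0) -> np.ndarray:
    """features: (N, D) tokens pooled from consecutive clips.
    Returns (num_bases, D) bases capturing their common content."""
    rng = np.random.default_rng(0)
    mu = rng.standard_normal((num_bases, features.shape[1]))
    mu /= np.linalg.norm(mu, axis=1, keepdims=True)
    for _ in range(iters):
        # E-step: soft-assign every token to the bases.
        logits = features @ mu.T / tau                  # (N, K)
        logits -= logits.max(axis=1, keepdims=True)
        z = np.exp(logits)
        z /= z.sum(axis=1, keepdims=True)
        # M-step: update each basis as the weighted mean of its tokens.
        mu = z.T @ features / (z.sum(axis=0)[:, None] + 1e-8)
        mu /= np.linalg.norm(mu, axis=1, keepdims=True) + 1e-8
    return mu
```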
arXiv Detail & Related papers (2024-11-27T09:54:59Z)
- AutoAD III: The Prequel -- Back to the Pixels [96.27059234129788]
We propose two approaches for constructing AD datasets with aligned video data, and build training and evaluation datasets using these.
We develop a Q-former-based architecture which ingests raw video and generates AD, using frozen pre-trained visual encoders and large language models (see the sketch below).
We provide new evaluation metrics to benchmark AD quality that are well-matched to human performance.
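A stripped-down sketch of the Q-former-style bridge referenced above: learnable queries cross-attend to frozen visual features and are projected into the LLM's embedding space as prefix tokens. Dimensions and layer counts are placeholders, not AutoAD III's actual configuration.

```python
# Minimal Q-Former-style bridge: only this module would be trained; the
# visual encoder producing vis_feats and the LLM consuming the output
# prefix embeddings both stay frozen.
import torch
import torch.nn as nn


class QFormerBridge(nn.Module):
    def __init__(self, vis_dim=1024, llm_dim=4096, n_queries=32, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, vis_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(vis_dim, n_heads, batch_first=True)
        self.to_llm = nn.Linear(vis_dim, llm_dim)  # project into LLM token space

    def forward(self, vis_feats: torch.Tensor) -> torch.Tensor:
        """vis_feats: (B, T, vis_dim) from a frozen visual encoder.
        Returns (B, n_queries, llm_dim) prefix embeddings for the LLM."""
        q = self.queries.unsqueeze(0).expand(vis_feats.size(0), -1, -1)
        attended, _ = self.cross_attn(q, vis_feats, vis_feats)
        return self.to_llm(attended)
```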
arXiv Detail & Related papers (2024-04-22T17:59:57Z)
- AutoAD: Movie Description in Context [91.98603496476215]
This paper presents an automatic Audio Description (AD) model that ingests movies and outputs AD in text form.
We leverage the power of pretrained foundation models, such as GPT and CLIP, and only train a mapping network that bridges the two models for visually-conditioned text generation.
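A minimal mapping-network sketch in the ClipCap style that this description suggests: a frozen CLIP embedding is expanded into a short prefix of GPT input embeddings, and only this mapper is trained. All sizes below are illustrative assumptions.

```python
# Trainable bridge between frozen CLIP and frozen GPT: map one image
# embedding to prefix_len pseudo-token embeddings prepended to the GPT
# input sequence for visually-conditioned text generation.
import torch
import torch.nn as nn


class ClipToGPTPrefix(nn.Module):
    def __init__(self, clip_dim=512, gpt_dim=768, prefix_len=10):
        super().__init__()
        self.prefix_len = prefix_len
        self.gpt_dim = gpt_dim
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, gpt_dim * prefix_len // 2),
            nn.Tanh(),
            nn.Linear(gpt_dim * prefix_len // 2, gpt_dim * prefix_len),
        )

    def forward(self, clip_emb: torch.Tensor) -> torch.Tensor:
        """clip_emb: (B, clip_dim) -> (B, prefix_len, gpt_dim) prefix."""
        return self.mlp(clip_emb).view(-1, self.prefix_len, self.gpt_dim)
```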
arXiv Detail & Related papers (2023-03-29T17:59:58Z)
- Learning to Answer Questions in Dynamic Audio-Visual Scenarios [81.19017026999218]
We focus on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos.
Our dataset contains more than 45K question-answer pairs spanning different modalities and question types.
Our results demonstrate that AVQA benefits from multisensory perception and our model outperforms recent A-SIC, V-SIC, and AVQA approaches.
arXiv Detail & Related papers (2022-03-26T13:03:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.