More than a Moment: Towards Coherent Sequences of Audio Descriptions
- URL: http://arxiv.org/abs/2510.25440v1
- Date: Wed, 29 Oct 2025 12:06:42 GMT
- Title: More than a Moment: Towards Coherent Sequences of Audio Descriptions
- Authors: Eshika Khandelwal, Junyu Xie, Tengda Han, Max Bain, Arsha Nagrani, Andrew Zisserman, Gül Varol, Makarand Tapaswi
- Abstract summary: Audio Descriptions (ADs) convey essential on-screen information, allowing visually impaired audiences to follow videos. Most automatic methods generate each AD independently, often resulting in repetitive, incoherent descriptions. We propose a training-free method, CoherentAD, that first generates multiple candidate descriptions for each AD time interval, and then performs auto-regressive selection across the sequence.
- Score: 88.14731697642098
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Audio Descriptions (ADs) convey essential on-screen information, allowing visually impaired audiences to follow videos. To be effective, ADs must form a coherent sequence that helps listeners to visualise the unfolding scene, rather than describing isolated moments. However, most automatic methods generate each AD independently, often resulting in repetitive, incoherent descriptions. To address this, we propose a training-free method, CoherentAD, that first generates multiple candidate descriptions for each AD time interval, and then performs auto-regressive selection across the sequence to form a coherent and informative narrative. To evaluate AD sequences holistically, we introduce a sequence-level metric, StoryRecall, which measures how well the predicted ADs convey the ground truth narrative, alongside repetition metrics that capture the redundancy across consecutive AD outputs. Our method produces coherent AD sequences with enhanced narrative understanding, outperforming prior approaches that rely on independent generations.
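The approach described in the abstract, generate several candidates per AD interval, then pick one candidate at a time conditioned on the narrative so far, can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: the greedy loop and the repetition-penalising `coherence_score` below are placeholder assumptions standing in for whatever scoring the actual method uses.

```python
def coherence_score(history, candidate):
    # Toy stand-in scorer: reward candidates whose words are new
    # relative to the narrative selected so far (penalise repetition).
    seen = set(w for ad in history for w in ad.lower().split())
    words = candidate.lower().split()
    new = sum(1 for w in words if w not in seen)
    return new / max(len(words), 1)

def select_sequence(candidates_per_interval):
    """Greedy auto-regressive selection: for each AD time interval,
    keep the candidate that best extends the narrative built so far."""
    narrative = []
    for candidates in candidates_per_interval:
        best = max(candidates, key=lambda c: coherence_score(narrative, c))
        narrative.append(best)
    return narrative

candidates = [
    ["A man walks in.", "A man enters the dim hallway."],
    ["A man walks in.", "He pauses by the staircase."],
]
print(select_sequence(candidates))
# The repeated "A man walks in." is scored down at the second
# interval, so the selected sequence avoids the redundancy.
```

Selecting jointly across the sequence (rather than scoring each interval in isolation) is what lets the output read as one unfolding scene instead of disconnected captions.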
Related papers
- MCAD: Multimodal Context-Aware Audio Description Generation For Soccer [8.83668236549788]
We present an end-to-end pipeline, MCAD, that extends AD generation beyond movies to the domain of sports. We fine-tune a Video Large Language Model on publicly available movie AD datasets so that it learns the narrative structure and conventions of AD. We introduce a new evaluation metric, ARGE-AD, designed to accurately assess the quality of generated AD.
arXiv Detail & Related papers (2025-11-12T16:05:05Z)
- What You See is What You Ask: Evaluating Audio Descriptions [27.76958202277314]
We propose ADQA, a QA benchmark that evaluates ADs at the level of few-minute-long, coherent video segments. We show that current AD generation methods lag far behind human-authored ADs.
arXiv Detail & Related papers (2025-10-01T12:14:15Z)
- Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation [110.79299467093006]
We propose a two-stage framework that leverages "shots" as the fundamental units of video understanding. This includes extending temporal context to neighbouring shots and incorporating film grammar devices, such as shot scales and thread structures. Our method is compatible with both open-source and proprietary Visual-Language Models.
arXiv Detail & Related papers (2025-04-01T17:59:57Z)
- DistinctAD: Distinctive Audio Description Generation in Contexts [62.58375366359421]
We propose DistinctAD, a framework for generating Audio Descriptions that emphasize distinctiveness to produce better narratives. To address the domain gap, we introduce a CLIP-AD adaptation strategy that does not require additional AD corpora. In Stage-II, DistinctAD incorporates two key innovations: (i) a Contextual Expectation-Maximization Attention (EMA) module that reduces redundancy by extracting common bases from consecutive video clips, and (ii) an explicit distinctive word prediction loss that filters out repeated words in the context.
arXiv Detail & Related papers (2024-11-27T09:54:59Z)
- LLM-AD: Large Language Model based Audio Description System [5.319096768490139]
This paper introduces an automated AD generation pipeline that harnesses the potent multimodal and instruction-following capacities of GPT-4V(ision).
It produces ADs that comply with established natural language AD production standards and maintain contextually consistent character information across frames.
A thorough analysis of the MAD dataset reveals that our approach achieves performance on par with learning-based methods in automated AD production, as substantiated by a CIDEr score of 20.5.
arXiv Detail & Related papers (2024-05-02T03:38:58Z)
- Contextual AD Narration with Interleaved Multimodal Sequence [50.240534605090396]
The task aims to generate descriptions of visual elements for visually impaired individuals to help them access long-form video content, like movies. With video features, text, a character bank, and context information as inputs, the generated ADs are able to refer to the characters by name. We propose to leverage pre-trained foundation models through a simple and unified framework to generate ADs.
arXiv Detail & Related papers (2024-03-19T17:27:55Z)
- MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning [120.95150400119705]
We present MM-Narrator, a novel system leveraging GPT-4 with multimodal in-context learning for the generation of audio descriptions (AD).
MM-Narrator excels in generating precise audio descriptions for videos of extensive lengths, even beyond hours, in an autoregressive manner.
We introduce the first segment-based evaluator for recurrent text generation.
arXiv Detail & Related papers (2023-11-29T08:27:00Z)
- AutoAD: Movie Description in Context [91.98603496476215]
This paper presents an automatic Audio Description (AD) model that ingests movies and outputs AD in text form.
We leverage the power of pretrained foundation models, such as GPT and CLIP, and only train a mapping network that bridges the two models for visually-conditioned text generation.
arXiv Detail & Related papers (2023-03-29T17:59:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.