MCAD: Multimodal Context-Aware Audio Description Generation For Soccer
- URL: http://arxiv.org/abs/2511.09448v1
- Date: Thu, 13 Nov 2025 01:55:15 GMT
- Title: MCAD: Multimodal Context-Aware Audio Description Generation For Soccer
- Authors: Lipisha Chaudhary, Trisha Mittal, Subhadra Gopalakrishnan, Ifeoma Nwogu, Jaclyn Pytlarz
- Abstract summary: We present an end-to-end pipeline, MCAD, that extends AD generation beyond movies to the domain of sports. We fine-tune a Video Large Language Model on publicly available movie AD datasets so that it learns the narrative structure and conventions of AD. We introduce a new evaluation metric, ARGE-AD, designed to accurately assess the quality of generated AD.
- Score: 8.83668236549788
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Audio Descriptions (AD) are essential for making visual content accessible to individuals with visual impairments. Recent works have shown a promising step towards automating AD, but they have been limited to describing high-quality movie content using human-annotated ground truth AD in the process. In this work, we present an end-to-end pipeline, MCAD, that extends AD generation beyond movies to the domain of sports, with a focus on soccer games, without relying on ground truth AD. To address the absence of domain-specific AD datasets, we fine-tune a Video Large Language Model on publicly available movie AD datasets so that it learns the narrative structure and conventions of AD. During inference, MCAD incorporates multimodal contextual cues such as player identities, soccer events and actions, and commentary from the game. These cues, combined with input prompts to the fine-tuned VideoLLM, allow the system to produce complete AD text for each video segment. We further introduce a new evaluation metric, ARGE-AD, designed to accurately assess the quality of generated AD. ARGE-AD evaluates the generated AD for the presence of five characteristics: (i) usage of people's names, (ii) mention of actions and events, (iii) appropriate length of AD, (iv) absence of pronouns, and (v) overlap from commentary or subtitles. We present an in-depth analysis of our approach on both movie and soccer datasets. We also validate the use of this metric to quantitatively assess the quality of generated AD across domains. Additionally, we contribute audio descriptions for 100 soccer game clips annotated by two AD experts.
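The five characteristics that ARGE-AD checks lend themselves to simple per-segment heuristics. The sketch below is a hypothetical approximation of such checks, not the paper's actual implementation; the name list, action-keyword list, length cap, and overlap threshold are all illustrative assumptions.

```python
import re

# Hypothetical heuristic checks for the five AD characteristics listed in
# the abstract. The paper's actual ARGE-AD scoring method is not shown here.

PRONOUNS = {"he", "she", "him", "her", "his", "hers", "they", "them", "their"}

def score_ad(ad_text: str, names: set, commentary: str,
             max_words: int = 25) -> dict:
    """Score one generated AD segment on five illustrative criteria."""
    words = re.findall(r"[A-Za-z']+", ad_text)
    lowered = [w.lower() for w in words]

    # (i) usage of people's names (e.g. identified players)
    uses_names = any(n.lower() in lowered for n in names)
    # (ii) mention of actions/events -- approximated by a small keyword list
    action_words = {"passes", "shoots", "scores", "tackles", "saves", "runs"}
    mentions_action = any(w in action_words for w in lowered)
    # (iii) appropriate length of AD (assumed cap of `max_words` words)
    good_length = 0 < len(words) <= max_words
    # (iv) absence of pronouns
    no_pronouns = not any(w in PRONOUNS for w in lowered)
    # (v) low word overlap with the commentary/subtitles
    comm_words = set(re.findall(r"[a-z']+", commentary.lower()))
    overlap = len(set(lowered) & comm_words) / max(len(set(lowered)), 1)
    low_overlap = overlap < 0.5

    return {"names": uses_names, "action": mentions_action,
            "length": good_length, "no_pronouns": no_pronouns,
            "low_overlap": low_overlap}
```

For example, `score_ad("Messi passes the ball to Suarez.", {"Messi", "Suarez"}, "What a great game tonight")` passes all five checks, while an AD that opens with "He runs..." would fail the name and pronoun criteria.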
Related papers
- More than a Moment: Towards Coherent Sequences of Audio Descriptions [88.14731697642098]
Audio Descriptions (ADs) convey essential on-screen information, allowing visually impaired audiences to follow videos. Most automatic methods generate each AD independently, often resulting in repetitive, incoherent descriptions. We propose a training-free method, CoherentAD, that first generates multiple candidate descriptions for each AD time interval and then performs auto-regressive selection across the sequence.
arXiv Detail & Related papers (2025-10-29T12:06:42Z)
- What You See is What You Ask: Evaluating Audio Descriptions [27.76958202277314]
We propose ADQA, a QA benchmark that evaluates ADs at the level of few-minute-long, coherent video segments. We show that current AD generation methods lag far behind human-authored ADs.
arXiv Detail & Related papers (2025-10-01T12:14:15Z)
- DistinctAD: Distinctive Audio Description Generation in Contexts [62.58375366359421]
We propose DistinctAD, a framework for generating Audio Descriptions that emphasize distinctiveness to produce better narratives. To address the domain gap, we introduce a CLIP-AD adaptation strategy that does not require additional AD corpora. In Stage-II, DistinctAD incorporates two key innovations: (i) a Contextual Expectation-Maximization Attention (EMA) module that reduces redundancy by extracting common bases from consecutive video clips, and (ii) an explicit distinctive word prediction loss that filters out repeated words in the context.
arXiv Detail & Related papers (2024-11-27T09:54:59Z)
- AutoAD III: The Prequel -- Back to the Pixels [96.27059234129788]
We propose two approaches for constructing AD datasets with aligned video data, and use them to build training and evaluation datasets.
We develop a Q-former-based architecture which ingests raw video and generates AD, using frozen pre-trained visual encoders and large language models.
We provide new evaluation metrics to benchmark AD quality that are well-matched to human performance.
arXiv Detail & Related papers (2024-04-22T17:59:57Z)
- Contextual AD Narration with Interleaved Multimodal Sequence [50.240534605090396]
The task aims to generate descriptions of visual elements for visually impaired individuals to help them access long-form video content, like movies. With video features, text, a character bank, and context information as inputs, the generated ADs can refer to the characters by name. We propose to leverage pre-trained foundation models through a simple and unified framework to generate ADs.
arXiv Detail & Related papers (2024-03-19T17:27:55Z) - AutoAD II: The Sequel -- Who, When, and What in Movie Audio Description [95.70092272297704]
We develop a new model for automatically generating movie AD, given CLIP visual features of the frames, the cast list, and the temporal locations of the speech.
We demonstrate how this improves over previous architectures for AD text generation in an apples-to-apples comparison.
arXiv Detail & Related papers (2023-10-10T17:59:53Z) - AutoAD: Movie Description in Context [91.98603496476215]
This paper presents an automatic Audio Description (AD) model that ingests movies and outputs AD in text form.
We leverage the power of pretrained foundation models, such as GPT and CLIP, and only train a mapping network that bridges the two models for visually-conditioned text generation.
arXiv Detail & Related papers (2023-03-29T17:59:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.