DistinctAD: Distinctive Audio Description Generation in Contexts
- URL: http://arxiv.org/abs/2411.18180v1
- Date: Wed, 27 Nov 2024 09:54:59 GMT
- Title: DistinctAD: Distinctive Audio Description Generation in Contexts
- Authors: Bo Fang, Wenhao Wu, Qiangqiang Wu, Yuxin Song, Antoni B. Chan,
- Abstract summary: We propose DistinctAD, a framework for generating Audio Descriptions that emphasize distinctiveness to produce better narratives.
To address the domain gap, we introduce a CLIP-AD adaptation strategy that does not require additional AD corpora.
In Stage-II, DistinctAD incorporates two key innovations: (i) a Contextual Expectation-Maximization Attention (EMA) module that reduces redundancy by extracting common bases from consecutive video clips, and (ii) an explicit distinctive word prediction loss that filters out repeated words in the context.
- Score: 62.58375366359421
- License:
- Abstract: Audio Descriptions (ADs) aim to provide a narration of a movie in text form, describing non-dialogue-related narratives, such as characters, actions, or scene establishment. Automatic generation of ADs remains challenging due to: i) the domain gap between movie-AD data and existing data used to train vision-language models, and ii) the issue of contextual redundancy arising from highly similar neighboring visual clips in a long movie. In this work, we propose DistinctAD, a novel two-stage framework for generating ADs that emphasize distinctiveness to produce better narratives. To address the domain gap, we introduce a CLIP-AD adaptation strategy that does not require additional AD corpora, enabling more effective alignment between movie and AD modalities at both global and fine-grained levels. In Stage-II, DistinctAD incorporates two key innovations: (i) a Contextual Expectation-Maximization Attention (EMA) module that reduces redundancy by extracting common bases from consecutive video clips, and (ii) an explicit distinctive word prediction loss that filters out repeated words in the context, ensuring the prediction of unique terms specific to the current AD. Comprehensive evaluations on MAD-Eval, CMD-AD, and TV-AD benchmarks demonstrate the superiority of DistinctAD, with the model consistently outperforming baselines, particularly in Recall@k/N, highlighting its effectiveness in producing high-quality, distinctive ADs.
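To make the Stage-II ideas above more concrete, here is a minimal sketch of (i) an Expectation-Maximization Attention pass that compresses features from consecutive clips into a small set of shared bases, and (ii) a word-level loss that up-weights tokens absent from the neighboring context ADs. It is based only on the abstract; the function names, the number of bases, the iteration count, and the weighting scheme are assumptions rather than the authors' released implementation.

```python
# Minimal sketch of the two Stage-II components described in the abstract.
# Names (contextual_ema, distinctive_word_loss), the basis count, and the
# weighting scheme are illustrative assumptions, not the authors' code.
import torch
import torch.nn.functional as F


def contextual_ema(clip_feats: torch.Tensor, num_bases: int = 64, iters: int = 3):
    """EM Attention over consecutive clip features.

    clip_feats: (N, T, D) features from N neighboring clips with T tokens each.
    Returns features re-estimated from a small set of bases that capture the
    content shared across the neighboring clips.
    """
    N, T, D = clip_feats.shape
    x = clip_feats.reshape(1, N * T, D)           # pool all context tokens
    bases = F.normalize(torch.randn(1, num_bases, D, device=x.device), dim=-1)

    for _ in range(iters):
        # E-step: responsibility of each token for each basis.
        attn = torch.softmax(x @ bases.transpose(1, 2), dim=-1)   # (1, N*T, K)
        # M-step: bases become responsibility-weighted token averages.
        bases = F.normalize(attn.transpose(1, 2) @ x, dim=-1)     # (1, K, D)

    # Re-estimate every token from the compact basis set.
    recon = torch.softmax(x @ bases.transpose(1, 2), dim=-1) @ bases
    return recon.reshape(N, T, D), bases


def distinctive_word_loss(logits: torch.Tensor, target_ids: torch.Tensor,
                          context_ids: set) -> torch.Tensor:
    """Cross-entropy that up-weights words absent from the context ADs.

    logits: (L, V) decoder outputs, target_ids: (L,) ground-truth AD tokens,
    context_ids: token ids appearing in the neighboring (context) ADs.
    """
    per_token = F.cross_entropy(logits, target_ids, reduction="none")
    distinctive = torch.tensor(
        [tid.item() not in context_ids for tid in target_ids],
        dtype=per_token.dtype, device=per_token.device)
    weights = 1.0 + distinctive        # emphasize tokens unique to this AD
    return (weights * per_token).sum() / weights.sum()
```

The intuition, following the abstract, is that whatever the shared bases capture is content common to the neighboring clips, so the decoder can be steered toward words and details specific to the current clip.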
Related papers
- AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description [92.72058446133468]
Our objective is to generate Audio Descriptions (ADs) for both movies and TV series in a training-free manner.
We use the power of off-the-shelf Visual-Language Models (VLMs) and Large Language Models (LLMs)
Our approach, named AutoAD-Zero, demonstrates outstanding performance (even competitive with some models fine-tuned on ground truth ADs) in AD generation for both movies and TV series, achieving state-of-the-art CRITIC scores.
arXiv Detail & Related papers (2024-07-22T17:59:56Z)
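The AutoAD-Zero entry above only states that off-the-shelf VLMs and LLMs are chained without any training. A minimal sketch of such a training-free pipeline might look like the following; the function signatures, prompts, and two-step split are illustrative assumptions, not AutoAD-Zero's actual prompts or models.

```python
# Hypothetical training-free AD pipeline in the spirit of the entry above:
# a VLM describes the sampled frames, then an LLM condenses the description
# into a single AD sentence. `vlm` and `llm` are any off-the-shelf model
# clients; their interfaces and the prompts are assumptions.
from typing import Callable, List


def generate_ad_zero_shot(frames: List[bytes], characters: List[str],
                          vlm: Callable[[List[bytes], str], str],
                          llm: Callable[[str], str]) -> str:
    # Step 1: dense visual description, with known character names supplied
    # so the VLM can reference people directly.
    vlm_prompt = ("Describe the visible actions, characters and setting. "
                  f"Known characters: {', '.join(characters)}.")
    description = vlm(frames, vlm_prompt)

    # Step 2: compress the description into one present-tense AD sentence.
    llm_prompt = ("Rewrite the following as a single concise audio "
                  f"description sentence:\n{description}")
    return llm(llm_prompt)
```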
- LLM-AD: Large Language Model based Audio Description System [5.319096768490139]
This paper introduces an automated AD generation pipeline that harnesses the potent multimodal and instruction-following capacities of GPT-4V(ision).
It produces ADs that comply with established natural language AD production standards and maintain contextually consistent character information across frames.
A thorough analysis on the MAD dataset reveals that our approach achieves a performance on par with learning-based methods in automated AD production, as substantiated by a CIDEr score of 20.5.
arXiv Detail & Related papers (2024-05-02T03:38:58Z)
- AutoAD III: The Prequel -- Back to the Pixels [96.27059234129788]
We propose two approaches for constructing AD datasets with aligned video data, and build training and evaluation datasets using these.
We develop a Q-former-based architecture which ingests raw video and generates AD, using frozen pre-trained visual encoders and large language models.
We provide new evaluation metrics to benchmark AD quality that are well-matched to human performance.
arXiv Detail & Related papers (2024-04-22T17:59:57Z)
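The AutoAD III summary mentions a Q-former-based bridge between frozen visual encoders and a large language model. Below is a rough sketch of what such a bridge can look like; the dimensions, layer counts, and module names are assumptions rather than the AutoAD III configuration.

```python
# Rough sketch of a Q-Former-style bridge: a small set of learnable queries
# cross-attends to frozen visual features and is projected into the frozen
# LLM's embedding space as a soft visual prefix. Sizes and layer counts are
# illustrative, not the AutoAD III configuration.
import torch
import torch.nn as nn


class VisualQueryBridge(nn.Module):
    def __init__(self, vis_dim: int = 768, llm_dim: int = 4096,
                 num_queries: int = 32, depth: int = 2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vis_dim) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=vis_dim, nhead=8,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=depth)
        self.to_llm = nn.Linear(vis_dim, llm_dim)   # only the bridge is trained

    def forward(self, vis_feats: torch.Tensor) -> torch.Tensor:
        # vis_feats: (B, N, vis_dim) frame/patch features from a frozen encoder.
        batch = vis_feats.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        q = self.decoder(tgt=q, memory=vis_feats)    # queries attend to video
        return self.to_llm(q)                        # (B, num_queries, llm_dim)
```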
- Contextual AD Narration with Interleaved Multimodal Sequence [50.240534605090396]
The task aims to generate descriptions of visual elements for visually impaired individuals, helping them access long-form video content such as movies.
With video features, text, a character bank, and context information as inputs, the generated ADs can refer to characters by name.
We propose to leverage pre-trained foundation models through a simple and unified framework to generate ADs.
arXiv Detail & Related papers (2024-03-19T17:27:55Z)
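The entry above describes feeding video features, text, a character bank, and context information to a foundation model as one interleaved sequence. A minimal sketch of assembling such a sequence is shown below; the ordering, the projection layer, and the tokenizer/embedding interface are assumptions, not the paper's implementation.

```python
# Illustrative assembly of an interleaved multimodal input sequence from the
# ingredients named above (video features, character bank, context ADs).
# The ordering, the projection layer, and the tokenizer/embedding interface
# are assumptions, not the paper's implementation.
import torch
import torch.nn as nn


def build_interleaved_sequence(video_feats: torch.Tensor,   # (N_frames, D_vis)
                               character_bank: str,
                               context_ads: str,
                               tokenizer,                    # has .encode(str) -> list[int]
                               embed: nn.Embedding,
                               vis_proj: nn.Linear) -> torch.Tensor:
    """Concatenate embedded text segments with projected visual tokens."""
    def embed_text(s: str) -> torch.Tensor:
        ids = torch.tensor(tokenizer.encode(s))
        return embed(ids)                                    # (L, D_txt)

    visual_tokens = vis_proj(video_feats)                    # (N_frames, D_txt)
    pieces = [
        embed_text("Characters: " + character_bank),
        embed_text("Context: " + context_ads),
        visual_tokens,
        embed_text("Describe the current clip:"),
    ]
    return torch.cat(pieces, dim=0)   # one interleaved sequence for the LM
```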
- Domain-Agnostic Mutual Prompting for Unsupervised Domain Adaptation [27.695825570272874]
Conventional Unsupervised Domain Adaptation (UDA) strives to minimize distribution discrepancy between domains.
We propose Domain-Agnostic Mutual Prompting (DAMP) to exploit domain-invariant semantics.
Experiments on three UDA benchmarks demonstrate the superiority of DAMP over state-of-the-art approaches.
arXiv Detail & Related papers (2024-03-05T12:06:48Z)
- AutoAD II: The Sequel -- Who, When, and What in Movie Audio Description [95.70092272297704]
We develop a new model for automatically generating movie AD, given CLIP visual features of the frames, the cast list, and the temporal locations of the speech.
We demonstrate how this improves over previous architectures for AD text generation in an apples-to-apples comparison.
arXiv Detail & Related papers (2023-10-10T17:59:53Z)
- AutoAD: Movie Description in Context [91.98603496476215]
This paper presents an automatic Audio Description (AD) model that ingests movies and outputs AD in text form.
We leverage the power of pretrained foundation models, such as GPT and CLIP, and only train a mapping network that bridges the two models for visually-conditioned text generation.
arXiv Detail & Related papers (2023-03-29T17:59:58Z)
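The AutoAD entry describes training only a mapping network that bridges frozen CLIP and GPT models. A prefix-style mapping is one common way to realize this; the sketch below is a minimal illustration under that assumption, with the MLP shape and prefix length chosen arbitrarily.

```python
# Minimal sketch of a visual-prefix mapping network in the spirit of the
# entry above: a pooled CLIP feature is mapped to a short sequence of
# pseudo-token embeddings that a frozen GPT-style decoder consumes as a
# prefix. The MLP shape and prefix length are assumptions.
import torch
import torch.nn as nn


class ClipToGPTPrefix(nn.Module):
    def __init__(self, clip_dim: int = 512, gpt_dim: int = 768,
                 prefix_len: int = 10):
        super().__init__()
        self.prefix_len, self.gpt_dim = prefix_len, gpt_dim
        hidden = (clip_dim + gpt_dim * prefix_len) // 2
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, gpt_dim * prefix_len),
        )

    def forward(self, clip_feat: torch.Tensor) -> torch.Tensor:
        # clip_feat: (B, clip_dim) pooled CLIP feature of a movie clip.
        return self.mlp(clip_feat).view(-1, self.prefix_len, self.gpt_dim)
```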