AutoAD II: The Sequel -- Who, When, and What in Movie Audio Description
- URL: http://arxiv.org/abs/2310.06838v1
- Date: Tue, 10 Oct 2023 17:59:53 GMT
- Title: AutoAD II: The Sequel -- Who, When, and What in Movie Audio Description
- Authors: Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman
- Abstract summary: We develop a new model for automatically generating movie AD, given CLIP visual features of the frames, the cast list, and the temporal locations of the speech.
We demonstrate how this improves over previous architectures for AD text generation in an apples-to-apples comparison.
- Score: 95.70092272297704
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio Description (AD) is the task of generating descriptions of visual
content, at suitable time intervals, for the benefit of visually impaired
audiences. For movies, this presents notable challenges -- AD must occur only
during existing pauses in dialogue, should refer to characters by name, and
ought to aid understanding of the storyline as a whole. To this end, we develop
a new model for automatically generating movie AD, given CLIP visual features
of the frames, the cast list, and the temporal locations of the speech;
addressing all three of the 'who', 'when', and 'what' questions: (i) who -- we
introduce a character bank consisting of the character's name, the actor that
played the part, and a CLIP feature of their face, for the principal cast of
each movie, and demonstrate how this can be used to improve naming in the
generated AD; (ii) when -- we investigate several models for determining
whether an AD should be generated for a time interval or not, based on the
visual content of the interval and its neighbours; and (iii) what -- we
implement a new vision-language model for this task, that can ingest the
proposals from the character bank, whilst conditioning on the visual features
using cross-attention, and demonstrate how this improves over previous
architectures for AD text generation in an apples-to-apples comparison.
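Since the abstract lays out a three-part architecture (a character bank for 'who', a temporal classifier for 'when', and a cross-attention text generator for 'what'), a minimal sketch may help make the structure concrete. The following PyTorch code is an illustrative assumption, not the authors' implementation: the class names (CharacterBankEntry, WhenClassifier, WhatGenerator), the module choices, and the 512-d feature size (CLIP ViT-B/32) are all hypothetical.

```python
# Minimal sketch of the who/when/what components described in the abstract.
# All names, dimensions, and module choices are illustrative assumptions,
# not the authors' actual implementation.
from dataclasses import dataclass

import torch
import torch.nn as nn


@dataclass
class CharacterBankEntry:
    """'Who': one principal cast member -- name, actor, and a CLIP face feature."""
    character_name: str
    actor_name: str
    face_feature: torch.Tensor  # e.g. a 512-d CLIP embedding of an exemplar face


class WhenClassifier(nn.Module):
    """'When': score whether an AD should be generated for a time interval,
    given CLIP features of the interval and its temporal neighbours."""

    def __init__(self, dim: int = 512):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, 1)

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:
        # clip_feats: (batch, num_intervals, dim); score the centre interval
        # in the context of its neighbours.
        ctx = self.temporal(clip_feats)
        centre = ctx[:, ctx.shape[1] // 2]
        return self.head(centre).squeeze(-1)  # logit: generate AD here or not


class WhatGenerator(nn.Module):
    """'What': condition a text decoder on visual features via cross-attention,
    while ingesting character proposals from the bank as extra prompt tokens."""

    def __init__(self, dim: int = 512, vocab: int = 50257):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, text_embeds, frame_feats, char_feats):
        # Character proposals are prepended to the text stream; frame features
        # enter through the decoder's cross-attention (the `memory` argument).
        tgt = torch.cat([char_feats, text_embeds], dim=1)
        hidden = self.decoder(tgt=tgt, memory=frame_feats)
        return self.lm_head(hidden[:, char_feats.shape[1]:])  # logits over vocab
```

The key design point is visible in WhatGenerator: character proposals join the token stream (so the decoder can copy names into the output), while frame features condition generation only through cross-attention, mirroring the 'who'/'what' split described in the abstract.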
Related papers
- AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description [92.72058446133468]
Our objective is to generate Audio Descriptions (ADs) for both movies and TV series in a training-free manner.
We use the power of off-the-shelf Visual-Language Models (VLMs) and Large Language Models (LLMs).
Our approach, named AutoAD-Zero, demonstrates outstanding performance (even competitive with some models fine-tuned on ground truth ADs) in AD generation for both movies and TV series, achieving state-of-the-art CRITIC scores.
arXiv Detail & Related papers (2024-07-22T17:59:56Z)
- AutoAD III: The Prequel -- Back to the Pixels [96.27059234129788]
We propose two approaches for constructing AD datasets with aligned video data, and build training and evaluation datasets using these.
We develop a Q-former-based architecture which ingests raw video and generates AD, using frozen pre-trained visual encoders and large language models.
We provide new evaluation metrics to benchmark AD quality that are well-matched to human performance.
arXiv Detail & Related papers (2024-04-22T17:59:57Z)
- Contextual AD Narration with Interleaved Multimodal Sequence [50.240534605090396]
The task aims to generate descriptions of visual elements for visually impaired individuals, helping them access long-form video content such as movies.
With video features, text, the character bank, and context information as inputs, the generated ADs can refer to the characters by name.
We propose to leverage pre-trained foundation models through a simple and unified framework to generate ADs.
arXiv Detail & Related papers (2024-03-19T17:27:55Z)
- AutoAD: Movie Description in Context [91.98603496476215]
This paper presents an automatic Audio Description (AD) model that ingests movies and outputs AD in text form.
We leverage the power of pretrained foundation models, such as GPT and CLIP, and only train a mapping network that bridges the two models for visually-conditioned text generation.
arXiv Detail & Related papers (2023-03-29T17:59:58Z)
- Identity-Aware Multi-Sentence Video Description [105.13845996039277]
We introduce an auxiliary task, Fill-in the Identity, which aims to predict person IDs consistently within a set of clips.
One of the key components is a gender-aware textual representation, as well as an additional gender-prediction objective in the main model.
Experiments show that our proposed Fill-in the Identity model is superior to several baselines and recent works.
arXiv Detail & Related papers (2020-08-22T09:50:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.