AutoAD: Movie Description in Context
- URL: http://arxiv.org/abs/2303.16899v1
- Date: Wed, 29 Mar 2023 17:59:58 GMT
- Title: AutoAD: Movie Description in Context
- Authors: Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman
- Abstract summary: This paper presents an automatic Audio Description (AD) model that ingests movies and outputs AD in text form.
We leverage the power of pretrained foundation models, such as GPT and CLIP, and only train a mapping network that bridges the two models for visually-conditioned text generation.
- Score: 91.98603496476215
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The objective of this paper is an automatic Audio Description (AD) model that
ingests movies and outputs AD in text form. Generating high-quality movie AD is
challenging due to the dependency of the descriptions on context, and the
limited amount of training data available. In this work, we leverage the power
of pretrained foundation models, such as GPT and CLIP, and only train a mapping
network that bridges the two models for visually-conditioned text generation.
In order to obtain high-quality AD, we make the following four contributions:
(i) we incorporate context from the movie clip, AD from previous clips, as well
as the subtitles; (ii) we address the lack of training data by pretraining on
large-scale datasets, where visual or contextual information is unavailable,
e.g. text-only AD without movies or visual captioning datasets without context;
(iii) we improve on the currently available AD datasets, by removing label
noise in the MAD dataset, and adding character naming information; and (iv) we
obtain strong results on the movie AD task compared with previous methods.
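
As a concrete illustration of the bridging setup described in the abstract, here is a minimal, hedged sketch: a frozen CLIP encoder supplies per-frame features, a frozen GPT-2 generates text, and only a small mapping network is trained to turn the visual features into prefix embeddings that condition the language model, with previous AD and subtitles supplied as ordinary context text. All module names, dimensions, and the prompt format are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer


class VisualMappingNetwork(nn.Module):
    """Maps frozen CLIP frame features to a sequence of GPT-2 prefix embeddings."""

    def __init__(self, clip_dim=512, gpt_dim=768, prefix_len=10, n_layers=2):
        super().__init__()
        self.prefix_len = prefix_len
        self.gpt_dim = gpt_dim
        self.proj = nn.Linear(clip_dim, gpt_dim * prefix_len)
        layer = nn.TransformerEncoderLayer(d_model=gpt_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, clip_feats):                      # (B, T, clip_dim) per-frame CLIP features
        pooled = clip_feats.mean(dim=1)                 # simple temporal pooling (an assumption)
        prefix = self.proj(pooled).view(-1, self.prefix_len, self.gpt_dim)
        return self.encoder(prefix)                     # (B, prefix_len, gpt_dim)


tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt = GPT2LMHeadModel.from_pretrained("gpt2")
for p in gpt.parameters():                              # the language model stays frozen
    p.requires_grad = False

mapper = VisualMappingNetwork()                         # the only trainable component

# Context (previous AD + subtitles) enters as plain text; visual information enters
# as the mapped prefix appended after the context embeddings.
clip_feats = torch.randn(1, 8, 512)                     # placeholder features for 8 frames
context = "Previous AD: She opens the door. Subtitle: Who's there? Next AD:"
ctx_emb = gpt.transformer.wte(tokenizer(context, return_tensors="pt").input_ids)
inputs = torch.cat([ctx_emb, mapper(clip_feats)], dim=1)

# Greedy decoding with a manual loop (avoids assumptions about generate() with embeddings).
generated = []
with torch.no_grad():
    for _ in range(20):
        logits = gpt(inputs_embeds=inputs).logits[:, -1, :]
        next_id = logits.argmax(dim=-1, keepdim=True)   # (1, 1)
        generated.append(next_id.item())
        inputs = torch.cat([inputs, gpt.transformer.wte(next_id)], dim=1)
print(tokenizer.decode(generated))
```

In this sketch, only `VisualMappingNetwork` would receive gradients during training; the choice of temporal pooling, prefix length, and prompt wording are placeholders for whatever the actual model uses.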
Related papers
- AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description [92.72058446133468]
Our objective is to generate Audio Descriptions (ADs) for both movies and TV series in a training-free manner.
We leverage the power of off-the-shelf Visual-Language Models (VLMs) and Large Language Models (LLMs).
Our approach, named AutoAD-Zero, demonstrates outstanding performance (even competitive with some models fine-tuned on ground truth ADs) in AD generation for both movies and TV series, achieving state-of-the-art CRITIC scores.
arXiv Detail & Related papers (2024-07-22T17:59:56Z)
- AutoAD III: The Prequel -- Back to the Pixels [96.27059234129788]
We propose two approaches for constructing AD datasets with aligned video data, and build training and evaluation datasets using these.
We develop a Q-former-based architecture which ingests raw video and generates AD, using frozen pre-trained visual encoders and large language models (a generic sketch of this bridging pattern appears after this list).
We provide new evaluation metrics to benchmark AD quality that are well-matched to human performance.
arXiv Detail & Related papers (2024-04-22T17:59:57Z)
- Contextual AD Narration with Interleaved Multimodal Sequence [50.240534605090396]
The task aims to generate descriptions of visual elements that help visually impaired individuals access long-form video content, such as movies.
With video features, text, a character bank, and context information as inputs, the generated ADs can refer to characters by name.
We propose to leverage pre-trained foundation models through a simple and unified framework to generate ADs.
arXiv Detail & Related papers (2024-03-19T17:27:55Z)
- AutoAD II: The Sequel -- Who, When, and What in Movie Audio Description [95.70092272297704]
We develop a new model for automatically generating movie AD, given CLIP visual features of the frames, the cast list, and the temporal locations of the speech.
We demonstrate how this improves over previous architectures for AD text generation in an apples-to-apples comparison.
arXiv Detail & Related papers (2023-10-10T17:59:53Z)
- "Let's not Quote out of Context": Unified Vision-Language Pretraining for Context Assisted Image Captioning [40.01197694624958]
We propose a new unified Vision-Language (VL) model based on the One For All (OFA) model.
Our approach aims to overcome the context-independent nature of existing approaches, in which the image and text are treated independently.
Our system achieves state-of-the-art results with an improvement of up to 8.34 CIDEr score on the benchmark news image captioning datasets.
arXiv Detail & Related papers (2023-06-01T17:34:25Z)
- Exploiting Auxiliary Caption for Video Grounding [66.77519356911051]
Video grounding aims to locate a moment of interest matching a given query sentence from an untrimmed video.
Previous works ignore the sparsity of video annotations, which fail to provide contextual information linking potential events to query sentences in the dataset.
We propose an Auxiliary Caption Network (ACNet) for video grounding. Specifically, we first introduce dense video captioning to generate dense captions and then obtain auxiliary captions by Non-Auxiliary Caption Suppression (NACS).
To capture the potential information in auxiliary captions, we propose Caption Guided Attention (CGA) to project the semantic relations between auxiliary captions and ...
arXiv Detail & Related papers (2023-01-15T02:04:02Z)
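
The AutoAD III entry above mentions a Q-former-based bridge between frozen visual encoders and a large language model. The following is a generic, hedged sketch of that bridging pattern (learnable query tokens cross-attending to frozen visual features, then projected into the LLM embedding space); it is not that paper's implementation, and all sizes and names are assumptions.

```python
import torch
import torch.nn as nn


class TinyQFormer(nn.Module):
    """Generic Q-Former-style bridge: learnable queries read from frozen visual tokens."""

    def __init__(self, vis_dim=1024, hidden=768, llm_dim=4096, n_queries=32, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, n_queries, hidden) * 0.02)
        self.vis_proj = nn.Linear(vis_dim, hidden)            # align visual features to query width
        self.cross_attn = nn.MultiheadAttention(hidden, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(hidden, hidden * 4), nn.GELU(), nn.Linear(hidden * 4, hidden))
        self.to_llm = nn.Linear(hidden, llm_dim)               # map queries into the LLM token space

    def forward(self, vis_feats):                              # (B, N_patches, vis_dim) from a frozen encoder
        kv = self.vis_proj(vis_feats)
        q = self.queries.expand(vis_feats.size(0), -1, -1)
        attended, _ = self.cross_attn(q, kv, kv)               # queries attend over visual tokens
        attended = attended + self.ffn(attended)
        return self.to_llm(attended)                           # (B, n_queries, llm_dim) soft prompt for the LLM


# Example: 8 frames x 196 patches of frozen ViT features -> 32 soft-prompt tokens.
vis = torch.randn(2, 8 * 196, 1024)
soft_prompt = TinyQFormer()(vis)
print(soft_prompt.shape)                                       # torch.Size([2, 32, 4096])
```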