AutoAD III: The Prequel -- Back to the Pixels
- URL: http://arxiv.org/abs/2404.14412v1
- Date: Mon, 22 Apr 2024 17:59:57 GMT
- Title: AutoAD III: The Prequel -- Back to the Pixels
- Authors: Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman
- Abstract summary: We propose two approaches for constructing AD datasets with aligned video data, and build training and evaluation datasets using these.
We develop a Q-former-based architecture which ingests raw video and generates AD, using frozen pre-trained visual encoders and large language models.
We provide new evaluation metrics to benchmark AD quality that are well-matched to human performance.
- Score: 96.27059234129788
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating Audio Description (AD) for movies is a challenging task that requires fine-grained visual understanding and an awareness of the characters and their names. Currently, visual language models for AD generation are limited by a lack of suitable training data, and also their evaluation is hampered by using performance measures not specialized to the AD domain. In this paper, we make three contributions: (i) We propose two approaches for constructing AD datasets with aligned video data, and build training and evaluation datasets using these. These datasets will be publicly released; (ii) We develop a Q-former-based architecture which ingests raw video and generates AD, using frozen pre-trained visual encoders and large language models; and (iii) We provide new evaluation metrics to benchmark AD quality that are well-matched to human performance. Taken together, we improve the state of the art on AD generation.
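As a concrete illustration of contribution (ii), here is a minimal sketch of a Q-Former-style bridge that turns frozen visual-encoder features into soft prompt tokens for a frozen LLM. All dimensions, layer counts, and names are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch of a Q-Former-style bridge between a frozen visual
# encoder and a frozen LLM. Dimensions and layer counts are illustrative
# assumptions, not the configuration used in the paper.
import torch
import torch.nn as nn

class QFormerBridge(nn.Module):
    def __init__(self, num_queries=32, vis_dim=1024, hidden_dim=768, llm_dim=4096):
        super().__init__()
        # Learnable query tokens that will "read" the video features.
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim) * 0.02)
        self.vis_proj = nn.Linear(vis_dim, hidden_dim)   # frozen-encoder features -> hidden
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        self.self_attn = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=8, dim_feedforward=4 * hidden_dim, batch_first=True
        )
        self.llm_proj = nn.Linear(hidden_dim, llm_dim)   # hidden -> LLM embedding space

    def forward(self, vis_feats):
        # vis_feats: (B, T, vis_dim) features from a frozen visual encoder.
        B = vis_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)  # (B, Q, hidden)
        kv = self.vis_proj(vis_feats)                    # (B, T, hidden)
        q, _ = self.cross_attn(q, kv, kv)                # queries attend to the video
        q = self.self_attn(q)                            # queries interact with each other
        return self.llm_proj(q)                          # (B, Q, llm_dim) soft prompt

bridge = QFormerBridge()
video = torch.randn(2, 64, 1024)   # e.g. 64 frame features per clip
prefix = bridge(video)             # prepend to the LLM's input embeddings
print(prefix.shape)                # torch.Size([2, 32, 4096])
```

Only the bridge is trained; the visual encoder and the LLM stay frozen, which keeps the number of trainable parameters small.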
Related papers
- AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description [92.72058446133468]
Our objective is to generate Audio Descriptions (ADs) for both movies and TV series in a training-free manner.
We use the power of off-the-shelf Visual-Language Models (VLMs) and Large Language Models (LLMs).
Our approach, named AutoAD-Zero, demonstrates outstanding performance (even competitive with some models fine-tuned on ground truth ADs) in AD generation for both movies and TV series, achieving state-of-the-art CRITIC scores.
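Based only on this abstract, a training-free VLM-then-LLM pipeline might be organised as below; the `vlm` and `llm` callables and the prompt wording are hypothetical stand-ins, not AutoAD-Zero's actual implementation.

```python
# Hypothetical sketch of a training-free VLM+LLM pipeline for AD generation.
# `vlm` and `llm` are stand-ins for any off-the-shelf models; the prompts and
# the two-stage split are assumptions based on the abstract, not the method.
def generate_ad(frames, characters, vlm, llm):
    # Stage 1: an off-the-shelf VLM describes what is visible in the clip.
    description = vlm(
        images=frames,
        prompt="Describe the actions and interactions in these movie frames. "
               f"Possible characters: {', '.join(characters)}."
    )
    # Stage 2: an LLM compresses the description into a single AD sentence.
    return llm(
        prompt="Rewrite the following scene description as one concise audio "
               "description sentence, naming characters where possible:\n"
               f"{description}"
    )

# Usage, with any concrete models wired in as `vlm` / `llm`:
# ad = generate_ad(clip_frames, ["Rick", "Ilsa"], vlm=my_vlm, llm=my_llm)
```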
arXiv Detail & Related papers (2024-07-22T17:59:56Z)
- Contextual AD Narration with Interleaved Multimodal Sequence [50.240534605090396]
The task aims to generate descriptions of visual elements that help visually impaired individuals access long-form video content, such as movies.
With video features, text, a character bank, and context information as inputs, the generated ADs can refer to characters by name.
We propose to leverage pre-trained foundation models through a simple and unified framework to generate ADs.
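One plausible reading of "interleaved multimodal sequence" is sketched below: context ADs, character name/exemplar pairs, and video features concatenated into a single decoder prefix. The ordering and shapes are assumptions, not the paper's design.

```python
# Minimal sketch of assembling an interleaved multimodal input sequence
# (context ADs, a character bank, and video features) for an AD decoder.
# Embedding dimensions and the ordering are illustrative assumptions.
import torch

def build_interleaved_sequence(context_ad_emb, char_bank, video_emb):
    """
    context_ad_emb: (Nc, D) token embeddings of previous AD sentences
    char_bank:      list of (name_emb (Ln, D), face_emb (1, D)) pairs
    video_emb:      (T, D) visual features of the current clip
    Returns one (L, D) sequence to feed a text decoder as a prefix.
    """
    parts = [context_ad_emb]
    for name_emb, face_emb in char_bank:
        # Interleave each character's name tokens with its visual exemplar,
        # so the decoder can link on-screen faces to names.
        parts.append(name_emb)
        parts.append(face_emb)
    parts.append(video_emb)
    return torch.cat(parts, dim=0)

D = 768
seq = build_interleaved_sequence(
    context_ad_emb=torch.randn(20, D),
    char_bank=[(torch.randn(2, D), torch.randn(1, D))],
    video_emb=torch.randn(32, D),
)
print(seq.shape)  # torch.Size([55, 768])
```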
arXiv Detail & Related papers (2024-03-19T17:27:55Z)
- Video Annotator: A framework for efficiently building video classifiers using vision-language models and active learning [0.0]
Video Annotator (VA) is a framework for annotating, managing, and iterating on video classification datasets.
VA allows for a continuous annotation process, seamlessly integrating data collection and model training.
VA achieves a median 6.8 point improvement in Average Precision relative to the most competitive baseline.
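A minimal sketch of the annotate-train-select loop such a framework implies, assuming frozen vision-language embeddings, a light classifier, and margin-based uncertainty sampling (all assumptions, not VA's actual implementation):

```python
# Hypothetical sketch of the annotate -> train -> select loop suggested by
# the abstract: a light classifier on frozen vision-language embeddings
# surfaces the most uncertain clips for the next round of human labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_round(embeddings, labels, labeled_idx, budget=10):
    # Train on everything labeled so far (embeddings from a frozen VLM).
    clf = LogisticRegression(max_iter=1000)
    clf.fit(embeddings[labeled_idx], labels[labeled_idx])

    # Score the unlabeled pool; probability closest to 0.5 = most uncertain.
    unlabeled = np.setdiff1d(np.arange(len(embeddings)), labeled_idx)
    proba = clf.predict_proba(embeddings[unlabeled])[:, 1]
    order = np.argsort(np.abs(proba - 0.5))
    return clf, unlabeled[order[:budget]]  # next clips to show the annotator

# Each round: the annotator labels the returned clips, their indices are
# appended to labeled_idx, and the loop repeats until precision is acceptable.
```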
arXiv Detail & Related papers (2024-02-09T17:19:05Z)
- Genie: Achieving Human Parity in Content-Grounded Datasets Generation [15.535753443076002]
We propose Genie, a novel method for automatically generating high-quality content-grounded data.
We showcase this methodology by generating three large-scale synthetic datasets.
In a human evaluation, our generated data was found to be natural and of high quality.
arXiv Detail & Related papers (2024-01-25T18:14:57Z)
- AutoAD II: The Sequel -- Who, When, and What in Movie Audio Description [95.70092272297704]
We develop a new model for automatically generating movie AD, given CLIP visual features of the frames, the cast list, and the temporal locations of the speech.
We demonstrate how this improves over previous architectures for AD text generation in an apples-to-apples comparison.
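The "when" sub-problem can be illustrated with a small helper that, given the temporal locations of speech, finds the silent gaps long enough to hold an AD; the minimum-gap threshold is an illustrative assumption.

```python
# Sketch of the "when" sub-problem: given speech intervals, find the
# silent gaps where an AD can be inserted. min_gap is an assumption.
def find_ad_slots(speech_intervals, clip_end, min_gap=1.0):
    """speech_intervals: sorted list of (start, end) seconds with dialogue."""
    slots, cursor = [], 0.0
    for start, end in speech_intervals:
        if start - cursor >= min_gap:   # long enough silence before speech
            slots.append((cursor, start))
        cursor = max(cursor, end)
    if clip_end - cursor >= min_gap:    # trailing silence after the last line
        slots.append((cursor, clip_end))
    return slots

print(find_ad_slots([(2.0, 5.5), (6.0, 9.0)], clip_end=15.0))
# [(0.0, 2.0), (9.0, 15.0)]
```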
arXiv Detail & Related papers (2023-10-10T17:59:53Z)
- AutoAD: Movie Description in Context [91.98603496476215]
This paper presents an automatic Audio Description (AD) model that ingests movies and outputs AD in text form.
We leverage the power of pretrained foundation models, such as GPT and CLIP, and only train a mapping network that bridges the two models for visually-conditioned text generation.
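A minimal sketch of such a trainable mapping network, projecting a frozen CLIP feature into prefix embeddings for a frozen GPT; the sizes and layer shapes are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of a trainable mapping network bridging frozen CLIP
# features into a frozen GPT's embedding space (prefix-style conditioning).
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    def __init__(self, clip_dim=512, gpt_dim=768, prefix_len=10):
        super().__init__()
        self.prefix_len, self.gpt_dim = prefix_len, gpt_dim
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, gpt_dim * prefix_len // 2),
            nn.Tanh(),
            nn.Linear(gpt_dim * prefix_len // 2, gpt_dim * prefix_len),
        )

    def forward(self, clip_feats):
        # clip_feats: (B, clip_dim) pooled features from a frozen CLIP encoder.
        B = clip_feats.size(0)
        prefix = self.mlp(clip_feats)                          # (B, prefix_len * gpt_dim)
        return prefix.view(B, self.prefix_len, self.gpt_dim)   # prepend to GPT inputs

mapper = MappingNetwork()
prefix = mapper(torch.randn(4, 512))
print(prefix.shape)  # torch.Size([4, 10, 768])
```

Only the mapping network's parameters receive gradients, which is what makes the approach cheap to train.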
arXiv Detail & Related papers (2023-03-29T17:59:58Z)
- AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators [98.11286353828525]
GPT-3.5 series models have demonstrated remarkable few-shot and zero-shot abilities across various NLP tasks.
We propose AnnoLLM, which adopts a two-step approach, explain-then-annotate.
We build the first conversation-based information retrieval dataset employing AnnoLLM.
arXiv Detail & Related papers (2023-03-29T17:03:21Z)
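The explain-then-annotate recipe can be sketched as two LLM calls; `llm` is a hypothetical stand-in for a GPT-3.5-class completion function, and the prompt wording is an assumption, not the paper's exact prompts.

```python
# Hypothetical sketch of the two-step explain-then-annotate recipe from
# the abstract. `llm` stands in for any GPT-3.5-class completion function.
def explain_then_annotate(examples, item, llm):
    # Step 1: have the LLM explain the gold label of each demonstration.
    demos = []
    for text, label in examples:
        explanation = llm(
            f"Text: {text}\nLabel: {label}\n"
            "Explain briefly why this label is correct."
        )
        demos.append(f"Text: {text}\nReasoning: {explanation}\nLabel: {label}")

    # Step 2: annotate the new item with the explained demos in-context.
    prompt = "\n\n".join(demos) + f"\n\nText: {item}\nReasoning:"
    return llm(prompt)

# annotation = explain_then_annotate(gold_examples, new_text, llm=my_llm)
```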
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.