Related papers: AutoAD III: The Prequel -- Back to the Pixels

URL: http://arxiv.org/abs/2404.14412v1
Date: Mon, 22 Apr 2024 17:59:57 GMT
Title: AutoAD III: The Prequel -- Back to the Pixels
Authors: Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman
Abstract summary: We propose two approaches for constructing AD datasets with aligned video data, and build training and evaluation datasets using these. We develop a Q-former-based architecture which ingests raw video and generates AD, using frozen pre-trained visual encoders and large language models. We provide new evaluation metrics to benchmark AD quality that are well-matched to human performance.
Score: 96.27059234129788
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Generating Audio Description (AD) for movies is a challenging task that requires fine-grained visual understanding and an awareness of the characters and their names. Currently, visual language models for AD generation are limited by a lack of suitable training data, and also their evaluation is hampered by using performance measures not specialized to the AD domain. In this paper, we make three contributions: (i) We propose two approaches for constructing AD datasets with aligned video data, and build training and evaluation datasets using these. These datasets will be publicly released; (ii) We develop a Q-former-based architecture which ingests raw video and generates AD, using frozen pre-trained visual encoders and large language models; and (iii) We provide new evaluation metrics to benchmark AD quality that are well-matched to human performance. Taken together, we improve the state of the art on AD generation.

Related papers

Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation [94.23160400824969]
We propose a two-stage framework that leverages "shots" as the fundamental units of video understanding. This includes extending temporal context to neighbouring shots and incorporating film grammar devices, such as shot scales and thread structures. Our method is compatible with both open-source and proprietary Visual-Language Models.
arXiv Detail & Related papers (2025-04-01T17:59:57Z)
NowYouSee Me: Context-Aware Automatic Audio Description [19.232338111340148]
We introduce $\mathrm{CA3D}$, the pioneering unified Context-Aware Automatic Audio Description system. The proposed $\mathrm{CA3D}$ is the first end-to-end trainable system that uses only visual cues.
arXiv Detail & Related papers (2024-12-13T09:40:37Z)
DistinctAD: Distinctive Audio Description Generation in Contexts [62.58375366359421]
We propose DistinctAD, a framework for generating Audio Descriptions that emphasize distinctiveness to produce better narratives. To address the domain gap, we introduce a CLIP-AD adaptation strategy that does not require additional AD corpora. In Stage-II, DistinctAD incorporates two key innovations: (i) a Contextual Expectation-Maximization Attention (EMA) module that reduces redundancy by extracting common bases from consecutive video clips, and (ii) an explicit distinctive word prediction loss that filters out repeated words in the context.
arXiv Detail & Related papers (2024-11-27T09:54:59Z)
AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description [92.72058446133468]
Our objective is to generate Audio Descriptions (ADs) for both movies and TV series in a training-free manner. We use the power of off-the-shelf Visual-Language Models (VLMs) and Large Language Models (LLMs). Our approach, named AutoAD-Zero, demonstrates outstanding performance (even competitive with some models fine-tuned on ground truth ADs) in AD generation for both movies and TV series, achieving state-of-the-art CRITIC scores.
arXiv Detail & Related papers (2024-07-22T17:59:56Z)
Contextual AD Narration with Interleaved Multimodal Sequence [50.240534605090396]
The task aims to generate descriptions of visual elements for visually impaired individuals, to help them access long-form video content such as movies. With video features, text, a character bank, and context information as inputs, the generated ADs can refer to the characters by name. We propose to leverage pre-trained foundation models through a simple and unified framework to generate ADs.
arXiv Detail & Related papers (2024-03-19T17:27:55Z)
Video Annotator: A framework for efficiently building video classifiers using vision-language models and active learning [0.0]
Video Annotator (VA) is a framework for annotating, managing, and iterating on video classification datasets. VA allows for a continuous annotation process, seamlessly integrating data collection and model training. VA achieves a median 6.8 point improvement in Average Precision relative to the most competitive baseline.
arXiv Detail & Related papers (2024-02-09T17:19:05Z)
Genie: Achieving Human Parity in Content-Grounded Datasets Generation [15.535753443076002]
We propose Genie, a novel method for automatically generating high-quality content-grounded data. We showcase this methodology by generating three large-scale synthetic datasets. In a human evaluation, our generated data was found to be natural and of high quality.
arXiv Detail & Related papers (2024-01-25T18:14:57Z)
AutoAD II: The Sequel -- Who, When, and What in Movie Audio Description [95.70092272297704]
We develop a new model for automatically generating movie AD, given CLIP visual features of the frames, the cast list, and the temporal locations of the speech. We demonstrate how this improves over previous architectures for AD text generation in an apples-to-apples comparison.
arXiv Detail & Related papers (2023-10-10T17:59:53Z)
AutoAD: Movie Description in Context [91.98603496476215]
This paper presents an automatic Audio Description (AD) model that ingests movies and outputs AD in text form. We leverage the power of pretrained foundation models, such as GPT and CLIP, and only train a mapping network that bridges the two models for visually-conditioned text generation.
arXiv Detail & Related papers (2023-03-29T17:59:58Z)
AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators [98.11286353828525]
GPT-3.5 series models have demonstrated remarkable few-shot and zero-shot ability across various NLP tasks. We propose AnnoLLM, which adopts a two-step approach, explain-then-annotate. We build the first conversation-based information retrieval dataset employing AnnoLLM.
arXiv Detail & Related papers (2023-03-29T17:03:21Z)

This list is automatically generated from the titles and abstracts of the papers on this site.