Related papers: NowYouSee Me: Context-Aware Automatic Audio Description

NowYouSee Me: Context-Aware Automatic Audio Description

URL: http://arxiv.org/abs/2412.10002v1
Date: Fri, 13 Dec 2024 09:40:37 GMT
Title: NowYouSee Me: Context-Aware Automatic Audio Description
Authors: Seon-Ho Lee, Jue Wang, David Fan, Zhikang Zhang, Linda Liu, Xiang Hao, Vimal Bhat, Xinyu Li,
Abstract summary: We introduce $mathrmCA3D$, the pioneering unified Context-Aware Automatic Audio Description system.<n>The proposed $mathrmCA3D$ is the first end-to-end trainable system that only uses visual cue.
Score: 19.232338111340148
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Audio Description (AD) plays a pivotal role as an application system aimed at guaranteeing accessibility in multimedia content, which provides additional narrations at suitable intervals to describe visual elements, catering specifically to the needs of visually impaired audiences. In this paper, we introduce $\mathrm{CA^3D}$, the pioneering unified Context-Aware Automatic Audio Description system that provides AD event scripts with precise locations in the long cinematic content. Specifically, $\mathrm{CA^3D}$ system consists of: 1) a Temporal Feature Enhancement Module to efficiently capture longer term dependencies, 2) an anchor-based AD event detector with feature suppression module that localizes the AD events and extracts discriminative feature for AD generation, and 3) a self-refinement module that leverages the generated output to tweak AD event boundaries from coarse to fine. Unlike conventional methods which rely on metadata and ground truth AD timestamp for AD detection and generation tasks, the proposed $\mathrm{CA^3D}$ is the first end-to-end trainable system that only uses visual cue. Extensive experiments demonstrate that the proposed $\mathrm{CA^3D}$ improves existing architectures for both AD event detection and script generation metrics, establishing the new state-of-the-art performances in the AD automation.

Related papers

Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation [94.23160400824969]
We propose a two-stage framework that leverages "shots" as the fundamental units of video understanding. This includes extending temporal context to neighbouring shots and incorporating film grammar devices, such as shot scales and thread structures. Our method is compatible with both open-source and proprietary Visual-Language Models.
arXiv Detail & Related papers (2025-04-01T17:59:57Z)
DistinctAD: Distinctive Audio Description Generation in Contexts [62.58375366359421]
We propose DistinctAD, a framework for generating Audio Descriptions that emphasize distinctiveness to produce better narratives.<n>To address the domain gap, we introduce a CLIP-AD adaptation strategy that does not require additional AD corpora.<n>In Stage-II, DistinctAD incorporates two key innovations: (i) a Contextual Expectation-Maximization Attention (EMA) module that reduces redundancy by extracting common bases from consecutive video clips, and (ii) an explicit distinctive word prediction loss that filters out repeated words in the context.
arXiv Detail & Related papers (2024-11-27T09:54:59Z)
AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description [92.72058446133468]
Our objective is to generate Audio Descriptions (ADs) for both movies and TV series in a training-free manner. We use the power of off-the-shelf Visual-Language Models (VLMs) and Large Language Models (LLMs) Our approach, named AutoAD-Zero, demonstrates outstanding performance (even competitive with some models fine-tuned on ground truth ADs) in AD generation for both movies and TV series, achieving state-of-the-art CRITIC scores.
arXiv Detail & Related papers (2024-07-22T17:59:56Z)
AutoAD III: The Prequel -- Back to the Pixels [96.27059234129788]
We propose two approaches for constructing AD datasets with aligned video data, and build training and evaluation datasets using these. We develop a Q-former-based architecture which ingests raw video and generates AD, using frozen pre-trained visual encoders and large language models. We provide new evaluation metrics to benchmark AD quality that are well-matched to human performance.
arXiv Detail & Related papers (2024-04-22T17:59:57Z)
Contextual AD Narration with Interleaved Multimodal Sequence [50.240534605090396]
The task aims to generate descriptions of visual elements for visually impaired individuals to help them access long-form video contents, like movie. With video feature, text, character bank and context information as inputs, the generated ADs are able to correspond to the characters by name. We propose to leverage pre-trained foundation models through a simple and unified framework to generate ADs.
arXiv Detail & Related papers (2024-03-19T17:27:55Z)
AutoAD II: The Sequel -- Who, When, and What in Movie Audio Description [95.70092272297704]
We develop a new model for automatically generating movie AD, given CLIP visual features of the frames, the cast list, and the temporal locations of the speech. We demonstrate how this improves over previous architectures for AD text generation in an apples-to-apples comparison.
arXiv Detail & Related papers (2023-10-10T17:59:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.