LLM-AD: Large Language Model based Audio Description System
- URL: http://arxiv.org/abs/2405.00983v1
- Date: Thu, 2 May 2024 03:38:58 GMT
- Title: LLM-AD: Large Language Model based Audio Description System
- Authors: Peng Chu, Jiang Wang, Andre Abrantes
- Abstract summary: This paper introduces an automated AD generation pipeline that harnesses the potent multimodal and instruction-following capacities of GPT-4V(ision).
It produces ADs that comply with established natural language AD production standards and maintain contextually consistent character information across frames.
A thorough analysis on the MAD dataset reveals that our approach achieves a performance on par with learning-based methods in automated AD production, as substantiated by a CIDEr score of 20.5.
- Score: 5.319096768490139
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The development of Audio Description (AD) has been a pivotal step forward in making video content more accessible and inclusive. Traditionally, AD production has demanded a considerable amount of skilled labor, while existing automated approaches still necessitate extensive training to integrate multimodal inputs and tailor the output from a captioning style to an AD style. In this paper, we introduce an automated AD generation pipeline that harnesses the potent multimodal and instruction-following capacities of GPT-4V(ision). Notably, our methodology employs readily available components, eliminating the need for additional training. It produces ADs that not only comply with established natural language AD production standards but also maintain contextually consistent character information across frames, courtesy of a tracking-based character recognition module. A thorough analysis on the MAD dataset reveals that our approach achieves a performance on par with learning-based methods in automated AD production, as substantiated by a CIDEr score of 20.5.
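Although the paper's pipeline is not reproduced here, its two key steps, prompting an instruction-following vision model for AD-style sentences and scoring the output against reference ADs with CIDEr, can be sketched in a few lines. The sketch below is illustrative only: the prompt wording, the `gpt-4o` model name, the character-list format, and the use of `pycocoevalcap` for scoring are assumptions, not details taken from the paper.

```python
# Minimal sketch of a training-free AD pipeline in the spirit of LLM-AD.
# Assumptions (not from the paper): the OpenAI Python SDK with a
# vision-capable model standing in for GPT-4V(ision), an illustrative
# prompt, and pycocoevalcap for the CIDEr evaluation.
import base64

from openai import OpenAI
from pycocoevalcap.cider.cider import Cider

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def describe_clip(frame_paths, characters):
    """Ask an instruction-following VLM for one AD-style sentence."""
    prompt = (
        "You are a professional audio describer. In one present-tense "
        "sentence, describe the key visual action for a blind audience. "
        f"Refer to these on-screen characters by name: {', '.join(characters)}."
    )
    content = [{"type": "text", "text": prompt}]
    for path in frame_paths:  # a few sampled frames per clip
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
        })
    resp = client.chat.completions.create(
        model="gpt-4o",  # hypothetical stand-in for GPT-4V(ision)
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content


def corpus_cider(references, candidates):
    """Corpus-level CIDEr, the metric behind the 20.5 reported on MAD."""
    gts = {i: refs for i, refs in enumerate(references)}    # refs: list[str] per clip
    res = {i: [cand] for i, cand in enumerate(candidates)}  # one candidate per clip
    score, _ = Cider().compute_score(gts, res)
    return 100.0 * score  # AD papers conventionally report CIDEr scaled by 100
```

In the paper, the character list fed to the prompt is maintained by the tracking-based character recognition module so that names stay consistent across frames; in this sketch it is simply an input argument.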
Related papers
- DistinctAD: Distinctive Audio Description Generation in Contexts [62.58375366359421]
We propose DistinctAD, a framework for generating Audio Descriptions that emphasize distinctiveness to produce better narratives.
To address the domain gap, Stage-I introduces a CLIP-AD adaptation strategy that does not require additional AD corpora.
In Stage-II, DistinctAD incorporates two key innovations: (i) a Contextual Expectation-Maximization Attention (EMA) module that reduces redundancy by extracting common bases from consecutive video clips, and (ii) an explicit distinctive word prediction loss that filters out repeated words in the context.
arXiv Detail & Related papers (2024-11-27T09:54:59Z)
- LLMs Can Evolve Continually on Modality for X-Modal Reasoning [62.2874638875554]
Existing methods rely heavily on modal-specific pretraining and joint-modal tuning, leading to significant computational burdens when expanding to new modalities.
We propose PathWeave, a flexible and scalable framework with modal-Path sWitching and ExpAnsion abilities.
PathWeave performs comparably to state-of-the-art MLLMs while concurrently reducing parameter training burdens by 98.73%.
arXiv Detail & Related papers (2024-10-26T13:19:57Z)
- Audio Description Generation in the Era of LLMs and VLMs: A Review of Transferable Generative AI Technologies [3.6481982339272925]
Audio descriptions (ADs) function as acoustic commentaries designed to assist blind persons and persons with visual impairments in accessing digital media content.
Recent advancements in natural language processing (NLP) and computer vision (CV) have brought automatic AD generation a step closer.
This paper reviews the technologies pertinent to AD generation in the era of large language models (LLMs) and vision-language models (VLMs).
arXiv Detail & Related papers (2024-10-11T14:40:51Z)
- Large Language Models for Human-like Autonomous Driving: A Survey [7.125039718268125]
Large Language Models (LLMs) are AI models trained on massive text corpora with remarkable language understanding and generation capabilities.
This survey provides a review of progress in leveraging LLMs for Autonomous Driving.
It focuses on their applications in modular AD pipelines and end-to-end AD systems.
arXiv Detail & Related papers (2024-07-27T15:24:11Z)
- AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description [92.72058446133468]
Our objective is to generate Audio Descriptions (ADs) for both movies and TV series in a training-free manner.
We use the power of off-the-shelf Visual-Language Models (VLMs) and Large Language Models (LLMs).
Our approach, named AutoAD-Zero, demonstrates outstanding performance (even competitive with some models fine-tuned on ground truth ADs) in AD generation for both movies and TV series, achieving state-of-the-art CRITIC scores.
arXiv Detail & Related papers (2024-07-22T17:59:56Z)
- AutoAD III: The Prequel -- Back to the Pixels [96.27059234129788]
We propose two approaches for constructing AD datasets with aligned video data, and build training and evaluation datasets using these.
We develop a Q-former-based architecture which ingests raw video and generates AD, using frozen pre-trained visual encoders and large language models.
We provide new evaluation metrics to benchmark AD quality that are well-matched to human performance.
arXiv Detail & Related papers (2024-04-22T17:59:57Z)
- Contextual AD Narration with Interleaved Multimodal Sequence [50.240534605090396]
The task aims to generate descriptions of visual elements that help visually impaired individuals access long-form video content, such as movies.
With video features, text, a character bank, and context information as inputs, the generated ADs can refer to the characters by name.
We propose to leverage pre-trained foundation models through a simple and unified framework to generate ADs.
arXiv Detail & Related papers (2024-03-19T17:27:55Z)
- AutoAD: Movie Description in Context [91.98603496476215]
This paper presents an automatic Audio Description (AD) model that ingests movies and outputs AD in text form.
We leverage the power of pretrained foundation models, such as GPT and CLIP, and only train a mapping network that bridges the two models for visually-conditioned text generation.
arXiv Detail & Related papers (2023-03-29T17:59:58Z)
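The "mapping network" in the AutoAD entry above follows a recipe popularized by ClipCap: project frozen CLIP features into a short sequence of prefix embeddings that a frozen language model consumes as visual conditioning. Below is a minimal PyTorch sketch under stated assumptions; the dimensions (512-d CLIP, 768-d GPT-2) and the MLP design are chosen purely for illustration and are not the paper's exact architecture.

```python
# Hedged sketch of a visual "mapping network" bridging a frozen CLIP encoder
# and a frozen language model, as described in the AutoAD entry above.
# Dimensions and the two-layer MLP are illustrative assumptions.
import torch
import torch.nn as nn


class MappingNetwork(nn.Module):
    """Projects one CLIP image feature into `prefix_len` LM prefix embeddings."""

    def __init__(self, clip_dim=512, lm_dim=768, prefix_len=10):
        super().__init__()
        self.prefix_len = prefix_len
        self.lm_dim = lm_dim
        self.proj = nn.Sequential(
            nn.Linear(clip_dim, lm_dim * prefix_len // 2),
            nn.Tanh(),
            nn.Linear(lm_dim * prefix_len // 2, lm_dim * prefix_len),
        )

    def forward(self, clip_feats):          # (batch, clip_dim)
        prefix = self.proj(clip_feats)      # (batch, lm_dim * prefix_len)
        return prefix.view(-1, self.prefix_len, self.lm_dim)


# Usage: the prefix would be concatenated in front of the token embeddings
# of a frozen LM; only the mapping network's parameters receive gradients.
mapper = MappingNetwork()
dummy = torch.randn(4, 512)                 # a batch of CLIP image features
print(mapper(dummy).shape)                  # torch.Size([4, 10, 768])
```

Training only this small projector, rather than the vision encoder or the language model, is what keeps approaches like AutoAD comparatively cheap next to end-to-end multimodal training.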
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.