Related papers: Audio Description Generation in the Era of LLMs and VLMs: A Review of Transferable Generative AI Technologies

Audio Description Generation in the Era of LLMs and VLMs: A Review of Transferable Generative AI Technologies

URL: http://arxiv.org/abs/2410.08860v1
Date: Fri, 11 Oct 2024 14:40:51 GMT
Title: Audio Description Generation in the Era of LLMs and VLMs: A Review of Transferable Generative AI Technologies
Authors: Yingqiang Gao, Lukas Fischer, Alexa Lintner, Sarah Ebling,
Abstract summary: Audio descriptions (ADs) function as acoustic commentaries designed to assist blind persons and persons with visual impairments in accessing digital media content. Recent advancements in natural language processing (NLP) and computer vision (CV) have allowed for getting a step closer to automatic AD generation. This paper reviews the technologies pertinent to AD generation in the era of large language models (LLMs) and vision-language models (VLMs)
Score: 3.6481982339272925
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Audio descriptions (ADs) function as acoustic commentaries designed to assist blind persons and persons with visual impairments in accessing digital media content on television and in movies, among other settings. As an accessibility service typically provided by trained AD professionals, the generation of ADs demands significant human effort, making the process both time-consuming and costly. Recent advancements in natural language processing (NLP) and computer vision (CV), particularly in large language models (LLMs) and vision-language models (VLMs), have allowed for getting a step closer to automatic AD generation. This paper reviews the technologies pertinent to AD generation in the era of LLMs and VLMs: we discuss how state-of-the-art NLP and CV technologies can be applied to generate ADs and identify essential research directions for the future.

Related papers

Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation [94.23160400824969]
We propose a two-stage framework that leverages "shots" as the fundamental units of video understanding. This includes extending temporal context to neighbouring shots and incorporating film grammar devices, such as shot scales and thread structures. Our method is compatible with both open-source and proprietary Visual-Language Models.
arXiv Detail & Related papers (2025-04-01T17:59:57Z)
Retrieval Augmented Generation and Understanding in Vision: A Survey and New Outlook [85.43403500874889]
Retrieval-augmented generation (RAG) has emerged as a pivotal technique in artificial intelligence (AI) Recent advancements in RAG for embodied AI, with a particular focus on applications in planning, task execution, multimodal perception, interaction, and specialized domains.
arXiv Detail & Related papers (2025-03-23T10:33:28Z)
Reflection on Data Storytelling Tools in the Generative AI Era from the Human-AI Collaboration Perspective [54.152639172274]
Large-scale generative AI techniques have the potential to enhance data storytelling with their power in visual and narration generation.<n>We compare the collaboration patterns of the latest tools with those of earlier ones using a dedicated framework for understanding human-AI collaboration in data storytelling.<n>The benefits of these AI techniques and implications to human-AI collaboration are also revealed.
arXiv Detail & Related papers (2025-03-04T13:56:18Z)
AD-LLM: Benchmarking Large Language Models for Anomaly Detection [50.57641458208208]
This paper introduces AD-LLM, the first benchmark that evaluates how large language models can help with anomaly detection. We examine three key tasks: zero-shot detection, using LLMs' pre-trained knowledge to perform AD without tasks-specific training; data augmentation, generating synthetic data and category descriptions to improve AD models; and model selection, using LLMs to suggest unsupervised AD models.
arXiv Detail & Related papers (2024-12-15T10:22:14Z)
Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition [110.8431434620642]
We introduce the generative speech transcription error correction (GenSEC) challenge. This challenge comprises three post-ASR language modeling tasks: (i) post-ASR transcription correction, (ii) speaker tagging, and (iii) emotion recognition. We discuss insights from baseline evaluations, as well as lessons learned for designing future evaluations.
arXiv Detail & Related papers (2024-09-15T16:32:49Z)
Large Language Models for Human-like Autonomous Driving: A Survey [7.125039718268125]
Large Language Models (LLMs) are AI models trained on massive text corpora with remarkable language understanding and generation capabilities. This survey provides a review of progress in leveraging LLMs for Autonomous Driving. It focuses on their applications in modular AD pipelines and end-to-end AD systems.
arXiv Detail & Related papers (2024-07-27T15:24:11Z)
AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description [92.72058446133468]
Our objective is to generate Audio Descriptions (ADs) for both movies and TV series in a training-free manner. We use the power of off-the-shelf Visual-Language Models (VLMs) and Large Language Models (LLMs) Our approach, named AutoAD-Zero, demonstrates outstanding performance (even competitive with some models fine-tuned on ground truth ADs) in AD generation for both movies and TV series, achieving state-of-the-art CRITIC scores.
arXiv Detail & Related papers (2024-07-22T17:59:56Z)
A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models [71.25225058845324]
Large Language Models (LLMs) have demonstrated revolutionary abilities in language understanding and generation. Retrieval-Augmented Generation (RAG) can offer reliable and up-to-date external knowledge. RA-LLMs have emerged to harness external and authoritative knowledge bases, rather than relying on the model's internal knowledge.
arXiv Detail & Related papers (2024-05-10T02:48:45Z)
LLM-AD: Large Language Model based Audio Description System [5.319096768490139]
This paper introduces an automated AD generation pipeline that harnesses the potent multimodal and instruction-following capacities of GPT-4V(ision) It produces ADs that comply with established natural language AD production standards and maintain contextually consistent character information across frames. A thorough analysis on the MAD dataset reveals that our approach achieves a performance on par with learning-based methods in automated AD production, as substantiated by a CIDEr score of 20.5.
arXiv Detail & Related papers (2024-05-02T03:38:58Z)
AutoAD III: The Prequel -- Back to the Pixels [96.27059234129788]
We propose two approaches for constructing AD datasets with aligned video data, and build training and evaluation datasets using these. We develop a Q-former-based architecture which ingests raw video and generates AD, using frozen pre-trained visual encoders and large language models. We provide new evaluation metrics to benchmark AD quality that are well-matched to human performance.
arXiv Detail & Related papers (2024-04-22T17:59:57Z)
A Review of Multi-Modal Large Language and Vision Models [1.9685736810241874]
Large Language Models (LLMs) have emerged as a focal point of research and application. Recently, LLMs have been extended into multi-modal large language models (MM-LLMs) This paper provides an extensive review of the current state of those LLMs with multi-modal capabilities as well as the very recent MM-LLMs.
arXiv Detail & Related papers (2024-03-28T15:53:45Z)
A Survey on Generative AI and LLM for Video Generation, Understanding, and Streaming [26.082980156232086]
Top-trending AI technologies, i.e., generative artificial intelligence (Generative AI) and large language models (LLMs), are reshaping the field of video technology. The paper highlights the innovative use of these technologies in producing highly realistic videos. In the realm of video streaming, the paper discusses how LLMs contribute to more efficient and user-centric streaming experiences.
arXiv Detail & Related papers (2024-01-30T14:37:10Z)
Combatting Human Trafficking in the Cyberspace: A Natural Language Processing-Based Methodology to Analyze the Language in Online Advertisements [55.2480439325792]
This project tackles the pressing issue of human trafficking in online C2C marketplaces through advanced Natural Language Processing (NLP) techniques. We introduce a novel methodology for generating pseudo-labeled datasets with minimal supervision, serving as a rich resource for training state-of-the-art NLP models. A key contribution is the implementation of an interpretability framework using Integrated Gradients, providing explainable insights crucial for law enforcement.
arXiv Detail & Related papers (2023-11-22T02:45:01Z)
Recommender Systems in the Era of Large Language Models (LLMs) [62.0129013439038]
Large Language Models (LLMs) have revolutionized the fields of Natural Language Processing (NLP) and Artificial Intelligence (AI) We conduct a comprehensive review of LLM-empowered recommender systems from various aspects including Pre-training, Fine-tuning, and Prompting.
arXiv Detail & Related papers (2023-07-05T06:03:40Z)

This list is automatically generated from the titles and abstracts of the papers in this site.