Related papers: Zero-Shot Audio Captioning via Audibility Guidance

Zero-Shot Audio Captioning via Audibility Guidance

URL: http://arxiv.org/abs/2309.03884v1
Date: Thu, 7 Sep 2023 17:45:58 GMT
Title: Zero-Shot Audio Captioning via Audibility Guidance
Authors: Tal Shaharabany, Ariel Shaulov and Lior Wolf
Abstract summary: We propose three desiderata for captioning audio -- (i) fluency of the generated text, (ii) faithfulness of the generated text to the input audio, and (iii) audibility. Our method is a zero-shot method, i.e., we do not learn to perform captioning. We present our results on the AudioCap dataset, demonstrating that audibility guidance significantly enhances performance compared to the baseline.
Score: 57.70351255180495
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The task of audio captioning is similar in essence to tasks such as image and video captioning. However, it has received much less attention. We propose three desiderata for captioning audio -- (i) fluency of the generated text, (ii) faithfulness of the generated text to the input audio, and the somewhat related (iii) audibility, which is the quality of being able to be perceived based only on audio. Our method is a zero-shot method, i.e., we do not learn to perform captioning. Instead, captioning occurs as an inference process that involves three networks that correspond to the three desired qualities: (i) A Large Language Model, in our case, for reasons of convenience, GPT-2, (ii) A model that provides a matching score between an audio file and a text, for which we use a multimodal matching network called ImageBind, and (iii) A text classifier, trained using a dataset we collected automatically by instructing GPT-4 with prompts designed to direct the generation of both audible and inaudible sentences. We present our results on the AudioCap dataset, demonstrating that audibility guidance significantly enhances performance compared to the baseline, which lacks this objective.

Related papers

Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos [16.213708405651644]
LG-CAV-MAE integrates a pretrained text encoder into contrastive audio-visual masked autoencoders.<n>To train LG-CAV-MAE, we introduce an automatic method to generate audio-visual-text triplets from unlabeled videos.<n>This approach yields high-quality audio-visual-text triplets without requiring manual annotations.
arXiv Detail & Related papers (2025-07-16T06:58:14Z)
Solla: Towards a Speech-Oriented LLM That Hears Acoustic Context [45.56363286769136]
We introduce Solla, a novel framework designed to understand speech-based questions and hear the acoustic context concurrently. Solla incorporates an audio tagging module to effectively identify and represent audio events, as well as an ASR-assisted prediction method to improve comprehension of spoken content. We propose a new benchmark dataset called SA-Eval, which includes three tasks: audio event classification, audio captioning, and audio question answering.
arXiv Detail & Related papers (2025-03-19T15:34:21Z)
ADIFF: Explaining audio difference using natural language [31.963783032080993]
This paper comprehensively studies the task of explaining audio differences and then propose benchmark, baselines for the task. We present two new datasets for audio difference explanation derived from the AudioCaps and Clotho audio captioning datasets. We propose ADIFF, which introduces a cross-projection module, position captioning, and a three-step training process to enhance the model's ability to produce detailed explanations.
arXiv Detail & Related papers (2025-02-06T20:00:43Z)
Classifier-Guided Captioning Across Modalities [69.75111271002137]
We introduce a method to adapt captioning networks to the semantics of alternative settings, such as capturing audibility in audio captioning. Our framework consists of two main components: (i) a frozen captioning system incorporating a language model (LM), and (ii) a text classifier that guides the captioning system. Notably, when combined with an existing zero-shot audio captioning system, our framework improves its quality and sets state-of-the-art performance in zero-shot audio captioning.
arXiv Detail & Related papers (2025-01-03T18:09:26Z)
Improving Text-To-Audio Models with Synthetic Captions [51.19111942748637]
We propose an audio captioning pipeline that uses an textitaudio language model to synthesize accurate and diverse captions for audio at scale. We leverage this pipeline to produce a dataset of synthetic captions for AudioSet, named textttAF-AudioSet, and then evaluate the benefit of pre-training text-to-audio models on these synthetic captions.
arXiv Detail & Related papers (2024-06-18T00:02:15Z)
Zero-shot audio captioning with audio-language model guidance and audio context keywords [59.58331215337357]
We propose ZerAuCap, a novel framework for summarising general audio signals in a text caption without requiring task-specific training. Our framework exploits a pre-trained large language model (LLM) for generating the text which is guided by a pre-trained audio-language model to produce captions. Our proposed framework achieves state-of-the-art results in zero-shot audio captioning on the AudioCaps and Clotho datasets.
arXiv Detail & Related papers (2023-11-14T18:55:48Z)
Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment [16.304894187743013]
TEFAL is a TExt-conditioned Feature ALignment method that produces both audio and video representations conditioned on the text query. Our approach employs two independent cross-modal attention blocks that enable the text to attend to the audio and video representations separately.
arXiv Detail & Related papers (2023-07-24T17:43:13Z)
Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning. We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge. We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining model.
arXiv Detail & Related papers (2023-06-16T05:42:01Z)
Language-Guided Audio-Visual Source Separation via Trimodal Consistency [64.0580750128049]
A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform. We adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions. We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets.
arXiv Detail & Related papers (2023-03-28T22:45:40Z)
Audio Captioning using Gated Recurrent Units [1.3960152426268766]
VGGish audio embedding model is used to explore the usability of audio embeddings in the audio captioning task. The proposed architecture encodes audio and text input modalities separately and combines them before the decoding stage. Our experimental results show that the proposed BiGRU-based deep model outperforms the state of the art results.
arXiv Detail & Related papers (2020-06-05T12:03:12Z)

This list is automatically generated from the titles and abstracts of the papers in this site.