Zero-Shot Audio Captioning via Audibility Guidance
- URL: http://arxiv.org/abs/2309.03884v1
- Date: Thu, 7 Sep 2023 17:45:58 GMT
- Title: Zero-Shot Audio Captioning via Audibility Guidance
- Authors: Tal Shaharabany, Ariel Shaulov and Lior Wolf
- Abstract summary: We propose three desiderata for captioning audio -- (i) fluency of the generated text, (ii) faithfulness of the generated text to the input audio, and (iii) audibility.
Our method is a zero-shot method, i.e., we do not learn to perform captioning.
We present our results on the AudioCaps dataset, demonstrating that audibility guidance significantly enhances performance compared to the baseline.
- Score: 57.70351255180495
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The task of audio captioning is similar in essence to tasks such as image and
video captioning. However, it has received much less attention. We propose
three desiderata for captioning audio -- (i) fluency of the generated text,
(ii) faithfulness of the generated text to the input audio, and the somewhat
related (iii) audibility, which is the quality of being able to be perceived
based only on audio. Our method is a zero-shot method, i.e., we do not learn to
perform captioning. Instead, captioning occurs as an inference process that
involves three networks that correspond to the three desired qualities: (i) A
Large Language Model, in our case, for reasons of convenience, GPT-2, (ii) A
model that provides a matching score between an audio file and a text, for
which we use a multimodal matching network called ImageBind, and (iii) A text
classifier, trained using a dataset we collected automatically by instructing
GPT-4 with prompts designed to direct the generation of both audible and
inaudible sentences. We present our results on the AudioCaps dataset,
demonstrating that audibility guidance significantly enhances performance
compared to the baseline, which lacks this objective.
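As a rough illustration of this inference-time guidance, the sketch below re-ranks GPT-2's top-k next-token candidates with a weighted sum of the LM log-probability, an audio-text matching score, and an audibility score. The two scoring callables (`audio_match_score`, `audibility_score`) are placeholders standing in for ImageBind audio-text similarity and the audibility classifier, and the simple candidate re-ranking scheme is a simplification for illustration, not the paper's exact guidance procedure.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def guided_caption(audio_match_score, audibility_score,
                   prompt="A sound of", max_new_tokens=15, top_k=20,
                   alpha=1.0, beta=1.0, gamma=1.0):
    """Greedy decoding where each step re-ranks the LM's top-k next tokens
    by a weighted sum of LM log-probability, audio-text match, and audibility."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        logprobs = torch.log_softmax(lm(ids).logits[0, -1], dim=-1)
        top = torch.topk(logprobs, top_k)
        best_ids, best_score = None, float("-inf")
        for lp, tok_id in zip(top.values, top.indices):
            cand = torch.cat([ids, tok_id.view(1, 1)], dim=1)
            text = tokenizer.decode(cand[0], skip_special_tokens=True)
            # Combined objective: fluency (LM), faithfulness (audio match), audibility.
            score = (alpha * lp.item()
                     + beta * audio_match_score(text)
                     + gamma * audibility_score(text))
            if score > best_score:
                best_ids, best_score = cand, score
        ids = best_ids
        if ids[0, -1].item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0], skip_special_tokens=True)

# Toy run with neutral scorers; in practice audio_match_score would wrap
# ImageBind audio-text similarity for the input clip and audibility_score
# the trained audibility classifier's probability.
print(guided_caption(lambda t: 0.0, lambda t: 0.0, max_new_tokens=5))
```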
Related papers
- ADIFF: Explaining audio difference using natural language [31.963783032080993]
This paper comprehensively studies the task of explaining audio differences and proposes a benchmark and baselines for the task.
We present two new datasets for audio difference explanation derived from the AudioCaps and Clotho audio captioning datasets.
We propose ADIFF, which introduces a cross-projection module, position captioning, and a three-step training process to enhance the model's ability to produce detailed explanations.
arXiv Detail & Related papers (2025-02-06T20:00:43Z)
- Classifier-Guided Captioning Across Modalities [69.75111271002137]
We introduce a method to adapt captioning networks to the semantics of alternative settings, such as capturing audibility in audio captioning.
Our framework consists of two main components: (i) a frozen captioning system incorporating a language model (LM), and (ii) a text classifier that guides the captioning system.
Notably, when combined with an existing zero-shot audio captioning system, our framework improves its quality and sets state-of-the-art performance in zero-shot audio captioning.
arXiv Detail & Related papers (2025-01-03T18:09:26Z)
- Improving Text-To-Audio Models with Synthetic Captions [51.19111942748637]
We propose an audio captioning pipeline that uses an audio language model to synthesize accurate and diverse captions for audio at scale.
We leverage this pipeline to produce a dataset of synthetic captions for AudioSet, named AF-AudioSet, and then evaluate the benefit of pre-training text-to-audio models on these synthetic captions.
arXiv Detail & Related papers (2024-06-18T00:02:15Z)
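The synthetic-captioning pipeline described above lends itself to a short sketch: sample several candidate captions per clip from an audio language model and keep the best-scoring one under an audio-text similarity model. The captioner, the similarity scorer, and the filtering threshold below are placeholders, not the paper's reported configuration.

```python
import numpy as np

def synthesize_captions(clips, caption_fn, similarity_fn,
                        n_candidates=5, threshold=0.45):
    """Sketch of a synthetic-captioning pipeline: draw several candidate
    captions per audio clip (caption_fn), keep the candidate with the highest
    audio-text similarity (similarity_fn), and drop clips whose best score
    falls below a threshold. The threshold and filtering rule are assumptions."""
    dataset = []
    for clip in clips:
        candidates = [caption_fn(clip) for _ in range(n_candidates)]
        scores = np.array([similarity_fn(clip, c) for c in candidates])
        best = int(scores.argmax())
        if scores[best] >= threshold:
            dataset.append({"audio": clip,
                            "caption": candidates[best],
                            "score": float(scores[best])})
    return dataset

# Toy usage with a dummy captioner and a random similarity scorer.
clips = ["clip_0.wav", "clip_1.wav"]
data = synthesize_captions(clips,
                           caption_fn=lambda c: f"a sound recorded in {c}",
                           similarity_fn=lambda c, t: np.random.rand(),
                           n_candidates=3, threshold=0.2)
print(len(data))
```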
- Zero-shot audio captioning with audio-language model guidance and audio context keywords [59.58331215337357]
We propose ZerAuCap, a novel framework for summarising general audio signals in a text caption without requiring task-specific training.
Our framework uses a pre-trained large language model (LLM) to generate text, guided by a pre-trained audio-language model, to produce captions.
Our proposed framework achieves state-of-the-art results in zero-shot audio captioning on the AudioCaps and Clotho datasets.
arXiv Detail & Related papers (2023-11-14T18:55:48Z)
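A minimal sketch of the audio-context-keyword idea from the ZerAuCap entry above: rank a keyword vocabulary by cosine similarity between the audio embedding and each keyword's text embedding from a pretrained audio-language model, then seed the LLM prompt with the top matches. The vocabulary, embeddings, and prompt template here are illustrative assumptions.

```python
import numpy as np

def select_audio_keywords(audio_emb, keyword_texts, keyword_embs, top_k=5):
    """Rank a keyword vocabulary (e.g., sound-event class names) by cosine
    similarity between the audio embedding and each keyword's text embedding,
    and return the top-k keywords. The embedding model is a placeholder."""
    a = audio_emb / np.linalg.norm(audio_emb)
    t = keyword_embs / np.linalg.norm(keyword_embs, axis=1, keepdims=True)
    sims = t @ a
    top = np.argsort(-sims)[:top_k]
    return [keyword_texts[i] for i in top]

def keyword_prompt(keywords):
    # The prompt template is an illustrative assumption, not ZerAuCap's exact prompt.
    return ("This audio contains sounds of " + ", ".join(keywords)
            + ". A caption describing the audio:")

# Toy usage with random embeddings in place of a real audio-language model.
rng = np.random.default_rng(0)
vocab = ["dog barking", "rain", "engine idling", "speech", "applause"]
kw = select_audio_keywords(rng.normal(size=256), vocab,
                           rng.normal(size=(5, 256)), top_k=2)
print(keyword_prompt(kw))
```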
- Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment [16.304894187743013]
TEFAL is a TExt-conditioned Feature ALignment method that produces both audio and video representations conditioned on the text query.
Our approach employs two independent cross-modal attention blocks that enable the text to attend to the audio and video representations separately.
arXiv Detail & Related papers (2023-07-24T17:43:13Z)
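TEFAL's two independent cross-modal attention blocks can be sketched in a few lines of PyTorch: the text tokens act as queries over the audio tokens and over the video tokens separately, producing text-conditioned audio and video representations. How the pooled representations are combined for retrieval scoring is an assumption here, not TEFAL's exact head.

```python
import torch
import torch.nn as nn

class TextConditionedAlignment(nn.Module):
    """Schematic of TEFAL-style alignment: the text query attends to the audio
    and video token sequences through two independent cross-attention blocks."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.text_audio_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.text_video_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, text_tokens, audio_tokens, video_tokens):
        # Text attends to audio tokens and to video tokens in separate blocks.
        audio_cond, _ = self.text_audio_attn(text_tokens, audio_tokens, audio_tokens)
        video_cond, _ = self.text_video_attn(text_tokens, video_tokens, video_tokens)
        # Mean-pool each text-conditioned sequence; the fusion/scoring step
        # downstream is an assumption for this sketch.
        return audio_cond.mean(dim=1), video_cond.mean(dim=1)

# Toy usage with random 512-d token features (batch of 2).
text = torch.randn(2, 16, 512)
audio = torch.randn(2, 40, 512)
video = torch.randn(2, 12, 512)
a_rep, v_rep = TextConditionedAlignment()(text, audio, video)
print(a_rep.shape, v_rep.shape)
```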
- CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge.
We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining model.
arXiv Detail & Related papers (2023-06-16T05:42:01Z)
- Language-Guided Audio-Visual Source Separation via Trimodal Consistency [64.0580750128049]
A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform.
We adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions.
We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets.
arXiv Detail & Related papers (2023-03-28T22:45:40Z)