CL4AC: A Contrastive Loss for Audio Captioning
- URL: http://arxiv.org/abs/2107.09990v1
- Date: Wed, 21 Jul 2021 10:13:02 GMT
- Title: CL4AC: A Contrastive Loss for Audio Captioning
- Authors: Xubo Liu, Qiushi Huang, Xinhao Mei, Tom Ko, H Lilian Tang, Mark D.
Plumbley and Wenwu Wang
- Abstract summary: We propose a novel encoder-decoder framework called Contrastive Loss for Audio Captioning (CL4AC).
In CL4AC, self-supervision signals derived from the original audio-text paired data are used to exploit the correspondences between audio and text.
Experiments are performed on the Clotho dataset to show the effectiveness of our proposed approach.
- Score: 43.83939284740561
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automated audio captioning (AAC) is a cross-modal translation task that aims
to use natural language to describe the content of an audio clip. As shown in
the submissions received for Task 6 of the DCASE 2021 Challenge, this problem
has received increasing interest in the community. Existing AAC systems are
usually based on an encoder-decoder architecture, where the audio signal is
encoded into a latent representation and aligned with its corresponding text
descriptions, and then a decoder is used to generate the captions. However,
training an AAC system often encounters the problem of data scarcity, which
may lead to inaccurate latent representations and poor audio-text alignment. To
address this problem, we propose a novel encoder-decoder framework called
Contrastive Loss for Audio Captioning (CL4AC). In CL4AC, self-supervision
signals derived from the original audio-text paired data are used to exploit
the correspondences between audio and text by contrasting samples, which can
improve the quality of the latent representation and the alignment between
audio and text, even when trained with limited data. Experiments are performed
on the Clotho dataset to show the effectiveness of our proposed approach.
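A minimal sketch of the general idea behind contrastive audio-text training: matched audio-caption pairs in a batch are pulled together while mismatched pairs act as negatives. This is a generic InfoNCE-style formulation in PyTorch, not the exact CL4AC objective; the tensors `audio_emb` and `text_emb` are hypothetical stand-ins for the encoder's audio representation and the corresponding caption representation.

```python
# Hypothetical, generic contrastive (InfoNCE-style) audio-text loss.
# Not the exact CL4AC objective; it only illustrates how matched and
# mismatched audio-caption pairs can be contrasted to improve alignment.
import torch
import torch.nn.functional as F


def contrastive_audio_text_loss(audio_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """audio_emb, text_emb: (batch, dim) embeddings of paired clips and captions."""
    # L2-normalise so dot products become cosine similarities.
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Similarity matrix: entry (i, j) compares audio clip i with caption j.
    logits = audio_emb @ text_emb.t() / temperature

    # Matched pairs lie on the diagonal; all other captions (and clips)
    # in the batch serve as negatives.
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)
    loss_audio_to_text = F.cross_entropy(logits, targets)
    loss_text_to_audio = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_audio_to_text + loss_text_to_audio)


if __name__ == "__main__":
    # Toy usage with random embeddings standing in for model outputs.
    audio = torch.randn(8, 256)
    text = torch.randn(8, 256)
    print(contrastive_audio_text_loss(audio, text))
```

In practice, an auxiliary alignment loss of this kind would typically be added to the usual caption cross-entropy loss with a weighting factor, which is one common way to exploit audio-text correspondences when paired data are scarce.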
Related papers
- DRCap: Decoding CLAP Latents with Retrieval-augmented Generation for Zero-shot Audio Captioning [13.601154787754046]
DRCap is a data-efficient and flexible zero-shot audio captioning system.
It requires text-only data for training and can quickly adapt to new domains without additional fine-tuning.
arXiv Detail & Related papers (2024-10-12T10:21:00Z)
- Improving Text-To-Audio Models with Synthetic Captions [51.19111942748637]
We propose an audio captioning pipeline that uses an audio language model to synthesize accurate and diverse captions for audio at scale.
We leverage this pipeline to produce a dataset of synthetic captions for AudioSet, named AF-AudioSet, and then evaluate the benefit of pre-training text-to-audio models on these synthetic captions.
arXiv Detail & Related papers (2024-06-18T00:02:15Z)
- Zero-shot audio captioning with audio-language model guidance and audio context keywords [59.58331215337357]
We propose ZerAuCap, a novel framework for summarising general audio signals in a text caption without requiring task-specific training.
Our framework exploits a pre-trained large language model (LLM) to generate the caption text, guided by a pre-trained audio-language model.
Our proposed framework achieves state-of-the-art results in zero-shot audio captioning on the AudioCaps and Clotho datasets.
arXiv Detail & Related papers (2023-11-14T18:55:48Z)
- RECAP: Retrieval-Augmented Audio Captioning [46.27383142898749]
We present RECAP, a novel and effective audio captioning system that generates captions conditioned on an input audio clip.
Our proposed method can transfer to any domain without the need for any additional fine-tuning.
To promote research in this space, we also release 150,000+ new weakly labeled captions for AudioSet, AudioCaps, and Clotho.
arXiv Detail & Related papers (2023-09-18T14:53:08Z)
- Zero-Shot Audio Captioning via Audibility Guidance [57.70351255180495]
We propose three desiderata for captioning audio -- (i) fluency of the generated text, (ii) faithfulness of the generated text to the input audio, and (iii) audibility.
Our method is a zero-shot method, i.e., we do not learn to perform captioning.
We present our results on the AudioCap dataset, demonstrating that audibility guidance significantly enhances performance compared to the baseline.
arXiv Detail & Related papers (2023-09-07T17:45:58Z)
- Efficient Audio Captioning Transformer with Patchout and Text Guidance [74.59739661383726]
We propose a full Transformer architecture that utilizes Patchout as proposed in [1], significantly reducing the computational complexity and avoiding overfitting.
The caption generation is partly conditioned on textual AudioSet tags extracted by a pre-trained classification model.
Our proposed method received the Judges' Award in Task 6A of the DCASE Challenge 2022.
arXiv Detail & Related papers (2023-04-06T07:58:27Z)
- Interactive Audio-text Representation for Automated Audio Captioning with Contrastive Learning [25.06635361326706]
We propose a novel AAC system called CLIP-AAC to learn interactive cross-modality representation.
The proposed CLIP-AAC introduces an audio-head and a text-head in the pre-trained encoder to extract audio-text information.
We also apply contrastive learning to narrow the domain difference by learning the correspondence between the audio signal and its paired captions.
arXiv Detail & Related papers (2022-03-29T13:06:46Z)
- Local Information Assisted Attention-free Decoder for Audio Captioning [52.191658157204856]
We present an AAC method with an attention-free decoder, where an encoder based on PANNs is employed for audio feature extraction.
The proposed method enables the effective use of both global and local information from audio signals.
arXiv Detail & Related papers (2022-01-10T08:55:52Z)
- Audio Captioning with Composition of Acoustic and Semantic Information [1.90365714903665]
We present a novel encoder-decoder architecture using bi-directional Gated Recurrent Units (BiGRU) with audio and semantic embeddings.
To extract audio features, we use log Mel energy features, VGGish embeddings, and pretrained audio neural network (PANN) embeddings (a minimal log-mel extraction sketch follows this entry).
Our proposed model outperforms state-of-the-art audio captioning models across different evaluation metrics.
arXiv Detail & Related papers (2021-05-13T15:30:14Z)
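As a rough illustration of the log Mel energy features mentioned in the entry above, here is a minimal, hypothetical extraction sketch using librosa. The sampling rate, FFT size, hop length, and number of mel bands are illustrative assumptions, not values taken from the cited paper.

```python
# Hypothetical log-mel feature extraction; parameter values are assumptions.
import librosa
import numpy as np


def log_mel_features(wav_path: str,
                     sr: int = 32000,
                     n_fft: int = 1024,
                     hop_length: int = 320,
                     n_mels: int = 64) -> np.ndarray:
    """Return a (frames, n_mels) array of log-mel energies for one audio clip."""
    audio, _ = librosa.load(wav_path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)  # convert power spectrogram to dB
    return log_mel.T                    # time-major: (frames, n_mels)
```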