Weakly-supervised Automated Audio Captioning via text only training
- URL: http://arxiv.org/abs/2309.12242v1
- Date: Thu, 21 Sep 2023 16:40:46 GMT
- Title: Weakly-supervised Automated Audio Captioning via text only training
- Authors: Theodoros Kouzelis and Vassilis Katsouros
- Abstract summary: We propose a weakly-supervised approach to train an AAC model assuming only text data and a pre-trained CLAP model.
We evaluate our proposed method on the Clotho and AudioCaps datasets, demonstrating its ability to achieve a relative performance of up to $83\%$ compared to fully supervised approaches.
- Score: 1.504795651143257
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, datasets of paired audio and captions have enabled
remarkable success in automatically generating descriptions for audio clips,
namely Automated Audio Captioning (AAC). However, it is labor-intensive and
time-consuming to collect a sufficient number of paired audio and captions.
Motivated by the recent advances in Contrastive Language-Audio Pretraining
(CLAP), we propose a weakly-supervised approach to train an AAC model assuming
only text data and a pre-trained CLAP model, alleviating the need for paired
target data. Our approach leverages the similarity between audio and text
embeddings in CLAP. During training, we learn to reconstruct the text from the
CLAP text embedding, and during inference, we decode using the audio
embeddings. To mitigate the modality gap between the audio and text embeddings,
we employ strategies to bridge the gap during both the training and inference stages. We
evaluate our proposed method on the Clotho and AudioCaps datasets, demonstrating
its ability to achieve a relative performance of up to $83\%$ compared to fully
supervised approaches trained with paired target data.
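The training-inference asymmetry is the core of the method: the decoder only ever sees frozen CLAP text embeddings during training, and the shared CLAP space lets the audio embedding stand in at test time. A minimal PyTorch sketch of that loop follows; the stub `clap_text_embed` (a fixed random projection in place of a real frozen CLAP text encoder), the small GRU decoder, and Gaussian noise injection as the gap-bridging strategy are all illustrative assumptions rather than the paper's exact recipe.
```python
# Text-only AAC training sketch (PyTorch). A frozen CLAP text encoder is
# stubbed out with a fixed random projection; only the caption decoder trains.
import torch
import torch.nn as nn

EMB_DIM, VOCAB, HID = 512, 5000, 512

def clap_text_embed(token_ids):
    # Stand-in for a frozen CLAP text encoder: deterministic random projection.
    gen = torch.Generator().manual_seed(0)
    proj = torch.randn(VOCAB, EMB_DIM, generator=gen)
    return nn.functional.normalize(proj[token_ids].mean(dim=1), dim=-1)

class CaptionDecoder(nn.Module):
    """Autoregressive GRU decoder conditioned on a CLAP-space embedding."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HID)
        self.init_h = nn.Linear(EMB_DIM, HID)  # CLAP embedding -> initial state
        self.gru = nn.GRU(HID, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def forward(self, cond_emb, tokens):
        h0 = torch.tanh(self.init_h(cond_emb)).unsqueeze(0)
        x, _ = self.gru(self.embed(tokens), h0)
        return self.out(x)

decoder = CaptionDecoder()
opt = torch.optim.Adam(decoder.parameters(), lr=1e-4)
captions = torch.randint(0, VOCAB, (8, 12))  # toy tokenised captions

for step in range(3):  # text-only training: no audio anywhere in this loop
    e = clap_text_embed(captions)
    e = e + 0.015 * torch.randn_like(e)  # noise injection to bridge the modality gap
    logits = decoder(e, captions[:, :-1])
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB), captions[:, 1:].reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()

# Inference (sketch): feed the CLAP *audio* embedding of a clip as `cond_emb`
# and decode greedily; no paired audio-caption data was ever used in training.
```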
Related papers
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge.
We train a conditional diffusion model to generate the audio track of a video, conditioned on a video frame encoded by a pretrained contrastive language-image pretraining (CLIP) model; a minimal sketch of this conditioning scheme follows below.
arXiv Detail & Related papers (2023-06-16T05:42:01Z)
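The bridge sits entirely at the conditioning input: the denoiser needs no paired text-audio data because the CLIP frame embedding carries the semantics. Below is a minimal sketch of conditional epsilon-prediction diffusion training in PyTorch; the tiny MLP denoiser, the dimensions, and the noise schedule are illustrative assumptions, not CLIPSonic's actual model.
```python
# Conditional diffusion training sketch: predict the noise added to a
# mel-spectrogram target, conditioned on a pretrained image embedding.
import torch
import torch.nn as nn

T, MEL_DIM, IMG_DIM = 1000, 80, 512
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

denoiser = nn.Sequential(  # predicts noise from [x_t, condition, t]
    nn.Linear(MEL_DIM + IMG_DIM + 1, 256), nn.SiLU(), nn.Linear(256, MEL_DIM))
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

mel = torch.randn(16, MEL_DIM)        # toy audio targets (one mel frame each)
frame_emb = torch.randn(16, IMG_DIM)  # stand-in for CLIP image embeddings

for step in range(3):
    t = torch.randint(0, T, (16,))
    eps = torch.randn_like(mel)
    a = alphas_bar[t].unsqueeze(1)
    x_t = a.sqrt() * mel + (1 - a).sqrt() * eps  # forward diffusion q(x_t | x_0)
    pred = denoiser(torch.cat([x_t, frame_emb, t.unsqueeze(1) / T], dim=1))
    loss = nn.functional.mse_loss(pred, eps)     # standard epsilon-prediction loss
    opt.zero_grad(); loss.backward(); opt.step()
```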
- Language-Guided Audio-Visual Source Separation via Trimodal Consistency [64.0580750128049]
A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform.
We adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions.
We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets.
arXiv Detail & Related papers (2023-03-28T22:45:40Z)
- Interactive Audio-text Representation for Automated Audio Captioning with Contrastive Learning [25.06635361326706]
We propose a novel AAC system, CLIP-AAC, that learns an interactive cross-modality representation.
CLIP-AAC introduces an audio head and a text head in the pre-trained encoder to extract audio-text information.
We also apply contrastive learning to narrow the domain difference by learning the correspondence between an audio signal and its paired captions; a sketch of this objective follows below.
arXiv Detail & Related papers (2022-03-29T13:06:46Z)
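The contrastive objective mentioned above is the standard symmetric InfoNCE loss popularised by CLIP; a minimal sketch follows, where `audio_emb` and `text_emb` stand in for the outputs of the audio head and text head (the batch size, embedding width, and temperature here are assumptions).
```python
# Symmetric InfoNCE over a batch of paired audio and caption embeddings.
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    # Normalise so the dot product is cosine similarity.
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature     # (B, B): row i vs every caption
    targets = torch.arange(a.size(0))  # the i-th audio pairs with the i-th caption
    # Cross-entropy in both directions pulls matched pairs together and
    # pushes mismatched pairs apart.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```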
- Joint Speech Recognition and Audio Captioning [37.205642807313545]
Speech samples recorded in both indoor and outdoor environments are often contaminated with secondary audio sources.
We aim to bring together the growing field of automated audio captioning (AAC) and the thoroughly studied field of automatic speech recognition (ASR).
We propose several approaches for end-to-end joint modeling of ASR and AAC tasks.
arXiv Detail & Related papers (2022-02-03T04:42:43Z)
- Learning music audio representations via weak language supervision [14.335950077921435]
We design a multimodal architecture for music and language pre-training (MuLaP) optimised via a set of proxy tasks.
Weak supervision is provided in the form of noisy natural language descriptions that convey the overall musical content of the track.
We demonstrate the usefulness of our approach by comparing the performance of audio representations produced by the same audio backbone with different training strategies.
arXiv Detail & Related papers (2021-12-08T10:30:52Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- Unsupervised Audiovisual Synthesis via Exemplar Autoencoders [59.13989658692953]
We present an unsupervised approach that converts the input speech of any individual into audiovisual streams of potentially infinitely many output speakers.
We use Exemplar Autoencoders to learn the voice, stylistic prosody, and visual appearance of a specific target speech exemplar.
arXiv Detail & Related papers (2020-01-13T18:56:45Z)