AudioCLIP: Extending CLIP to Image, Text and Audio
- URL: http://arxiv.org/abs/2106.13043v1
- Date: Thu, 24 Jun 2021 14:16:38 GMT
- Title: AudioCLIP: Extending CLIP to Image, Text and Audio
- Authors: Andrey Guzhov, Federico Raue, Jörn Hees, Andreas Dengel
- Abstract summary: We present an extension of the CLIP model that handles audio in addition to text and images.
Our proposed model incorporates the ESResNeXt audio model into the CLIP framework using the AudioSet dataset.
AudioCLIP achieves new state-of-the-art results in the Environmental Sound Classification (ESC) task.
- Score: 6.585049648605185
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In the past, the rapidly evolving field of sound classification greatly
benefited from the application of methods from other domains. Today, we observe
a trend toward fusing domain-specific tasks and approaches, which provides
the community with outstanding new models.
In this work, we present an extension of the CLIP model that handles audio in
addition to text and images. Our proposed model incorporates the ESResNeXt
audio model into the CLIP framework using the AudioSet dataset. Such a
combination enables the proposed model to perform bimodal and unimodal
classification and querying, while keeping CLIP's ability to generalize to
unseen datasets in a zero-shot inference fashion.
AudioCLIP achieves new state-of-the-art results in the Environmental Sound
Classification (ESC) task, outperforming other approaches by reaching
accuracies of 90.07% on the UrbanSound8K and 97.15% on the ESC-50 datasets.
Furthermore, it sets new baselines in the zero-shot ESC task on the same
datasets (68.78% and 69.40%, respectively).
Finally, we also assess the cross-modal querying performance of the proposed
model as well as the influence of full and partial training on the results. For
the sake of reproducibility, our code is published.
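To make the zero-shot inference concrete, the sketch below shows how a CLIP-style tri-modal model scores an audio clip against text prompts, one per class; the embedding size, temperature, and random stand-ins for the encoder outputs are assumptions for illustration, not AudioCLIP's actual weights or API.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(audio_emb: torch.Tensor,
                       text_embs: torch.Tensor,
                       temperature: float = 0.07) -> torch.Tensor:
    """Score one audio embedding against N class-prompt embeddings.

    audio_emb: (D,) output of the audio head (ESResNeXt in AudioCLIP)
    text_embs: (N, D) outputs of the text head for prompts such as
               "a sound of a dog barking", one per class
    Returns an (N,) vector of class probabilities.
    """
    # CLIP-style scoring: cosine similarity between L2-normalized
    # embeddings, scaled by a temperature, then softmax over classes.
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)
    logits = text_embs @ audio_emb / temperature
    return logits.softmax(dim=-1)

# Usage with random stand-ins for real encoder outputs (D=512 assumed):
probs = zero_shot_classify(torch.randn(512), torch.randn(50, 512))
print(probs.argmax().item())  # index of the best-matching ESC-50 class
```

Because no labeled audio from the target dataset enters this loop, the same procedure runs unchanged on unseen datasets, which is exactly the zero-shot setting evaluated above.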
Related papers
- Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark [65.79402756995084]
Real Acoustic Fields (RAF) is a new dataset that captures real acoustic room data from multiple modalities.
RAF is the first dataset to provide densely captured room acoustic data.
arXiv Detail & Related papers (2024-03-27T17:59:56Z)
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models [92.92233932921741]
We propose the AV-SUPERB benchmark that enables general-purpose evaluation of unimodal audio/visual and bimodal fusion representations.
We evaluate 5 recent self-supervised models and show that none of these models generalize to all tasks.
We show that representations may be improved with intermediate-task fine-tuning, and that audio event classification on AudioSet serves as a strong intermediate task.
arXiv Detail & Related papers (2023-09-19T17:35:16Z)
- Self-Supervised Visual Acoustic Matching [63.492168778869726]
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment.
We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio.
Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric.
arXiv Detail & Related papers (2023-07-27T17:59:59Z)
- SCRAPS: Speech Contrastive Representations of Acoustic and Phonetic Spaces [10.895310812568084]
We train a CLIP-based model with the aim to learn shared representations of phonetic and acoustic spaces.
Results show that the proposed model is sensitive to phonetic changes.
We provide empirical evidence showing that the resulting embeddings are useful for a variety of downstream applications.
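As a rough illustration of the CLIP-style objective such models train with, here is a minimal symmetric contrastive (InfoNCE) loss over a batch of paired phonetic/acoustic embeddings; the variable names, batch layout, and temperature are assumptions, not the SCRAPS implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(phon: torch.Tensor,
                          acou: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss for B paired embeddings of shape (B, D).

    Row i of `phon` and row i of `acou` come from the same utterance,
    so the (B, B) similarity matrix should peak on its diagonal.
    """
    phon = F.normalize(phon, dim=-1)
    acou = F.normalize(acou, dim=-1)
    logits = phon @ acou.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Cross-entropy in both directions (phonetic->acoustic and back).
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```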
arXiv Detail & Related papers (2023-07-23T22:18:47Z)
- Continual Learning for On-Device Speech Recognition using Disentangled Conformers [54.32320258055716]
We introduce a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks.
We propose a novel compute-efficient continual learning algorithm called DisentangledCL.
Our experiments show that the DisConformer models significantly outperform baselines on general ASR.
arXiv Detail & Related papers (2022-12-02T18:58:51Z)
- A study on joint modeling and data augmentation of multi-modalities for audio-visual scene classification [64.59834310846516]
We propose two techniques, namely joint modeling and data augmentation, to improve system performance for audio-visual scene classification (AVSC).
Our final system achieves the best accuracy, 94.2%, among all single AVSC systems submitted to DCASE 2021 Task 1b.
arXiv Detail & Related papers (2022-03-07T07:29:55Z)
- SoundCLR: Contrastive Learning of Representations For Improved Environmental Sound Classification [0.6767885381740952]
SoundCLR is a supervised contrastive learning method for effective environment sound classification with state-of-the-art performance.
Due to the comparatively small sizes of the available environmental sound datasets, we propose and exploit a transfer learning and strong data augmentation pipeline.
Our experiments show that our masking-based augmentation technique on log-mel spectrograms can significantly improve recognition performance.
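The summary does not spell out the masking scheme, but a minimal SpecAugment-style sketch of such log-mel masking, with mask widths chosen arbitrarily here, looks like this:

```python
import numpy as np

def mask_log_mel(spec: np.ndarray,
                 max_freq_width: int = 8,
                 max_time_width: int = 16,
                 rng: np.random.Generator | None = None) -> np.ndarray:
    """Zero out one random mel band and one random time span of a
    log-mel spectrogram shaped (n_mels, n_frames)."""
    rng = rng or np.random.default_rng()
    out = spec.copy()
    n_mels, n_frames = out.shape
    # Frequency mask: hide a contiguous band of mel bins.
    f = int(rng.integers(1, max_freq_width + 1))
    f0 = int(rng.integers(0, n_mels - f + 1))
    out[f0:f0 + f, :] = 0.0
    # Time mask: hide a contiguous run of frames.
    t = int(rng.integers(1, max_time_width + 1))
    t0 = int(rng.integers(0, n_frames - t + 1))
    out[:, t0:t0 + t] = 0.0
    return out
```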
arXiv Detail & Related papers (2021-03-02T18:42:45Z)
- PSLA: Improving Audio Event Classification with Pretraining, Sampling, Labeling, and Aggregation [19.09439093130855]
We present PSLA, a collection of training techniques that can noticeably boost the model accuracy.
We obtain a model that achieves a new state-of-the-art mean average precision (mAP) of 0.474 on AudioSet, outperforming the previous best system of 0.439.
arXiv Detail & Related papers (2021-02-02T01:00:38Z)
- ESResNet: Environmental Sound Classification Based on Visual Domain Models [4.266320191208303]
We present a model that is inherently compatible with mono and stereo sound inputs.
We investigate the influence of cross-domain pre-training, architectural changes, and evaluate our model on standard datasets.
arXiv Detail & Related papers (2020-04-15T19:07:55Z)
- Acoustic Scene Classification Using Bilinear Pooling on Time-liked and Frequency-liked Convolution Neural Network [4.131608702779222]
This paper explores the use of harmonic and percussive source separation (HPSS) to split the audio into a harmonic stream and a percussive stream.
The deep features extracted from the two CNNs (the time-liked and frequency-liked networks of the title) are then combined using bilinear pooling.
The model is evaluated on the DCASE 2019 subtask 1a dataset and scores an average of 65% on the development set.
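For readers unfamiliar with HPSS, the snippet below shows the split using librosa's stock implementation together with the outer-product form of bilinear pooling; the file path, feature shapes, and the use of librosa itself are illustrative assumptions rather than this paper's pipeline.

```python
import librosa
import numpy as np

# Split a clip into harmonic and percussive components
# ("audio.wav" is a placeholder path).
y, sr = librosa.load("audio.wav", sr=None)
y_harm, y_perc = librosa.effects.hpss(y)

# Log-mel inputs that the two CNNs would consume, one per stream.
mel_harm = librosa.power_to_db(librosa.feature.melspectrogram(y=y_harm, sr=sr))
mel_perc = librosa.power_to_db(librosa.feature.melspectrogram(y=y_perc, sr=sr))

def bilinear_pool(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Bilinear pooling of two feature vectors: the flattened outer
    product, giving an (M*N,) joint descriptor."""
    return np.outer(a, b).ravel()

# Usage with random stand-ins for the two CNNs' feature vectors:
fused = bilinear_pool(np.random.rand(128), np.random.rand(128))
```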
arXiv Detail & Related papers (2020-02-14T04:06:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.