Large Scale Audiovisual Learning of Sounds with Weakly Labeled Data
- URL: http://arxiv.org/abs/2006.01595v1
- Date: Fri, 29 May 2020 01:30:14 GMT
- Title: Large Scale Audiovisual Learning of Sounds with Weakly Labeled Data
- Authors: Haytham M. Fayek and Anurag Kumar
- Abstract summary: We present an audiovisual fusion model that learns to recognize sounds from weakly labeled video recordings.
Experiments on the large-scale sound events dataset, AudioSet, demonstrate the efficacy of the proposed model.
- Score: 9.072124914105325
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recognizing sounds is a key aspect of computational audio scene analysis and
machine perception. In this paper, we advocate that sound recognition is
inherently a multi-modal audiovisual task in that it is easier to differentiate
sounds using both the audio and visual modalities as opposed to one or the
other. We present an audiovisual fusion model that learns to recognize sounds
from weakly labeled video recordings. The proposed fusion model utilizes an
attention mechanism to dynamically combine the outputs of the individual audio
and visual models. Experiments on the large-scale sound events dataset,
AudioSet, demonstrate the efficacy of the proposed model, which outperforms the
single-modal models as well as state-of-the-art fusion and multi-modal models. We
achieve a mean Average Precision (mAP) of 46.16 on AudioSet, outperforming the
prior state of the art by approximately +4.35 mAP (a relative improvement of 10.4%).
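As a rough illustration of the attention-based fusion described above, the sketch below combines per-class audio and visual predictions with learned per-class modality weights. The layer sizes, the form of the attention network, and the 527-class AudioSet output are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch of attention-based fusion of audio and visual predictions.
# Embedding sizes and the gating form are assumptions for illustration only.
import torch
import torch.nn as nn


class AttentionFusion(nn.Module):
    """Dynamically weights per-class audio and visual predictions."""

    def __init__(self, embed_dim: int = 512, num_classes: int = 527):
        super().__init__()
        # Stand-ins for the classifier heads of the individual audio/visual models.
        self.audio_head = nn.Linear(embed_dim, num_classes)
        self.visual_head = nn.Linear(embed_dim, num_classes)
        # Attention network: produces a weight per modality and per class,
        # normalized with a softmax over the two modalities.
        self.attention = nn.Linear(2 * embed_dim, 2 * num_classes)
        self.num_classes = num_classes

    def forward(self, audio_emb: torch.Tensor, visual_emb: torch.Tensor) -> torch.Tensor:
        # Independent per-modality predictions (sigmoid for multi-label tagging).
        p_audio = torch.sigmoid(self.audio_head(audio_emb))     # (B, C)
        p_visual = torch.sigmoid(self.visual_head(visual_emb))  # (B, C)
        # Per-class attention weights over the two modalities.
        attn = self.attention(torch.cat([audio_emb, visual_emb], dim=-1))
        attn = attn.view(-1, 2, self.num_classes).softmax(dim=1)  # (B, 2, C)
        # Weighted combination of the modality-specific predictions.
        return attn[:, 0] * p_audio + attn[:, 1] * p_visual


# Example: fuse clip-level embeddings from pretrained audio and visual models.
fusion = AttentionFusion()
audio_emb = torch.randn(4, 512)   # e.g. pooled audio-network features
visual_emb = torch.randn(4, 512)  # e.g. pooled video-frame features
scores = fusion(audio_emb, visual_emb)  # (4, 527) multi-label probabilities
```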
Related papers
- Beyond Single-Audio: Advancing Multi-Audio Processing in Audio Large Language Models [56.776580717999806]
Real-world applications often involve processing multiple audio streams simultaneously.
We propose the first multi-audio evaluation benchmark that consists of 20 datasets from 11 multi-audio tasks.
We propose a novel multi-audio-LLM (MALLM) to capture audio context among multiple similar audios.
arXiv Detail & Related papers (2024-09-27T12:06:53Z)
- Audio Mamba: Selective State Spaces for Self-Supervised Audio Representations [16.269123889392343]
This work proposes Audio Mamba, a selective state space model for learning general-purpose audio representations.
Empirical results on ten diverse audio recognition downstream tasks show that the proposed models consistently outperform comparable self-supervised audio spectrogram transformer baselines.
arXiv Detail & Related papers (2024-06-04T10:19:14Z)
- AdVerb: Visually Guided Audio Dereverberation [49.958724234969445]
We present AdVerb, a novel audio-visual dereverberation framework.
It uses visual cues in addition to the reverberant sound to estimate clean audio.
arXiv Detail & Related papers (2023-08-23T18:20:59Z)
- AudioFormer: Audio Transformer learns audio feature representations from discrete acoustic codes [6.375996974877916]
We propose a method named AudioFormer, which learns audio feature representations through the acquisition of discrete acoustic codes.
Our results demonstrate that AudioFormer significantly outperforms prevailing monomodal audio classification models.
arXiv Detail & Related papers (2023-08-14T15:47:25Z)
- AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR [79.21857972093332]
We present AVFormer, a method for augmenting audio-only models with visual information while performing lightweight domain adaptation.
We show that the added visual components can be trained on a small amount of weakly labelled video data with minimal additional training time and parameters.
We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively.
arXiv Detail & Related papers (2023-03-29T07:24:28Z)
- MAViL: Masked Audio-Video Learners [68.61844803682145]
We present Masked Audio-Video learners (MAViL) to train audio-visual representations.
Pre-training with MAViL enables the model to perform well in audio-visual classification and retrieval tasks.
For the first time, a self-supervised audio-visual model outperforms models that use external supervision on these benchmarks.
arXiv Detail & Related papers (2022-12-15T18:59:59Z)
- Contrastive Audio-Visual Masked Autoencoder [85.53776628515561]
We propose the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE), which combines contrastive learning with masked data modeling to learn joint audio-visual representations.
Our fully self-supervised pretrained CAV-MAE achieves a new SOTA accuracy of 65.9% on VGGSound.
arXiv Detail & Related papers (2022-10-02T07:29:57Z)
- Contrastive Environmental Sound Representation Learning [6.85316573653194]
We exploit a self-supervised contrastive technique and a shallow 1D CNN to extract distinctive audio features (audio representations) without using any explicit annotations.
We generate representations of a given audio clip from both its raw waveform and its spectrogram and evaluate whether the proposed learner is agnostic to the type of audio input (a generic contrastive-loss sketch follows this list).
arXiv Detail & Related papers (2022-07-18T16:56:30Z)
- A Single Self-Supervised Model for Many Speech Modalities Enables Zero-Shot Modality Transfer [31.028408352051684]
We present u-HuBERT, a self-supervised pre-training framework that can leverage both multimodal and unimodal speech.
Our single model yields 1.2%/1.4%/27.2% speech recognition word error rate on LRS3 with audio-visual/audio/visual input.
arXiv Detail & Related papers (2022-07-14T16:21:33Z)
- Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio Representation [51.37980448183019]
We propose Audio ALBERT, a lite version of the self-supervised speech representation model.
We show that Audio ALBERT achieves performance competitive with those larger models on downstream tasks.
In probing experiments, we find that the latent representations encode richer phoneme and speaker information than the last layer does.
arXiv Detail & Related papers (2020-05-18T10:42:44Z)
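Several of the self-supervised entries above, including the Contrastive Environmental Sound Representation Learning paper, train encoders with a contrastive objective between two views of the same clip (for example, its raw waveform and its spectrogram). The sketch below shows a generic InfoNCE-style loss under those assumptions; the embedding size and temperature are placeholders, and the code is not taken from any of the listed papers.

```python
# Generic two-view contrastive (InfoNCE-style) loss; embedding size and
# temperature are illustrative placeholders, not any listed paper's settings.
import torch
import torch.nn.functional as F


def info_nce(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Pulls together embeddings of two views of the same clip.

    z_a, z_b: (batch, dim) embeddings, e.g. from a waveform encoder and a
    spectrogram encoder applied to the same audio clip.
    """
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature   # (B, B) scaled cosine similarities
    targets = torch.arange(z_a.size(0))    # matching views sit on the diagonal
    # Symmetric cross-entropy: each clip's two views are each other's positive.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


# Example with random stand-in embeddings for a batch of 8 clips.
loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
```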
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.