Self-supervised Contrastive Learning for Audio-Visual Action Recognition
- URL: http://arxiv.org/abs/2204.13386v2
- Date: Mon, 20 Mar 2023 07:33:09 GMT
- Title: Self-supervised Contrastive Learning for Audio-Visual Action Recognition
- Authors: Yang Liu, Ying Tan, Haoyuan Lan
- Abstract summary: The underlying correlation between audio and visual modalities can be utilized to learn supervised information for unlabeled videos.
We propose an end-to-end self-supervised framework named Audio-Visual Contrastive Learning (AVCL), to learn discriminative audio-visual representations for action recognition.
- Score: 7.188231323934023
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The underlying correlation between audio and visual modalities can be
utilized to learn supervised information for unlabeled videos. In this paper,
we propose an end-to-end self-supervised framework named Audio-Visual
Contrastive Learning (AVCL), to learn discriminative audio-visual
representations for action recognition. Specifically, we design an attention
based multi-modal fusion module (AMFM) to fuse audio and visual modalities. To
align heterogeneous audio-visual modalities, we construct a novel
co-correlation guided representation alignment module (CGRA). To learn
supervised information from unlabeled videos, we propose a novel
self-supervised contrastive learning module (SelfCL). Furthermore, we build a
new audio-visual action recognition dataset named Kinetics-Sounds100.
Experimental results on the Kinetics-Sounds32 and Kinetics-Sounds100 datasets
demonstrate the superiority of our AVCL over state-of-the-art methods on
large-scale action recognition benchmarks.
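The abstract does not spell out the contrastive objective used by the SelfCL module, but self-supervised audio-visual contrastive learning of this kind is typically trained with a symmetric InfoNCE loss: each clip's audio embedding is pulled toward its own visual embedding and pushed away from the other clips in the batch. A minimal NumPy sketch of that generic loss (function name, batch shape, and temperature value are illustrative assumptions, not details from the paper):

```python
import numpy as np

def info_nce_loss(audio_emb, visual_emb, temperature=0.1):
    """Symmetric InfoNCE over a batch of paired audio/visual embeddings.

    audio_emb, visual_emb: (B, D) arrays where row i of each array comes
    from the same video clip (the positive pair); all other rows in the
    batch serve as negatives.
    """
    # L2-normalize so dot products are cosine similarities.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)

    logits = a @ v.T / temperature      # (B, B); positives on the diagonal
    labels = np.arange(len(a))

    def cross_entropy(lg, lb):
        # Numerically stable log-softmax per row.
        shifted = lg - lg.max(axis=1, keepdims=True)
        log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(lb)), lb].mean()

    # Symmetric: audio-to-visual and visual-to-audio retrieval directions.
    return 0.5 * (cross_entropy(logits, labels)
                  + cross_entropy(logits.T, labels))
```

In practice the embeddings would come from the audio and visual encoders (after the paper's fusion and alignment modules), and the loss would be minimized with a standard gradient-based optimizer; perfectly aligned pairs drive the loss toward zero while mismatched pairs keep it high.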
Related papers
- SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos [77.55518265996312]
We propose a novel self-supervised embedding to learn how actions sound from narrated in-the-wild egocentric videos.
Our multimodal contrastive-consensus coding (MC3) embedding reinforces the associations between audio, language, and vision when all modality pairs agree.
arXiv Detail & Related papers (2024-04-08T05:19:28Z) - Unsupervised Modality-Transferable Video Highlight Detection with Representation Activation Sequence Learning [7.908887001497406]
We propose a novel model with cross-modal perception for unsupervised highlight detection.
The proposed model learns representations with visual-audio level semantics from image-audio pair data via a self-reconstruction task.
The experimental results show that the proposed framework achieves superior performance compared to other state-of-the-art approaches.
arXiv Detail & Related papers (2024-03-14T13:52:03Z) - Contrastive Audio-Visual Masked Autoencoder [85.53776628515561]
Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE)
Our fully self-supervised pretrained CAV-MAE achieves a new SOTA accuracy of 65.9% on VGGSound.
arXiv Detail & Related papers (2022-10-02T07:29:57Z) - Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos [10.478479158063982]
We propose a novel framework to effectively leverage the audio modality in vision-specific annotated videos for action recognition.
We build a semantic audio-video label dictionary (SAVLD) that maps each video label to its K most relevant audio labels.
We also present a new two-stream video Transformer for efficiently modeling the visual modalities.
arXiv Detail & Related papers (2022-03-06T17:31:06Z) - Contrastive Learning of General-Purpose Audio Representations [33.15189569532155]
We introduce COLA, a self-supervised pre-training approach for learning a general-purpose representation of audio.
We build on recent advances in contrastive learning for computer vision and reinforcement learning to design a lightweight, easy-to-implement model of audio.
arXiv Detail & Related papers (2020-10-21T11:56:22Z) - Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning [17.6311804187027]
An underlying correlation between audio and visual events can be utilized as free supervised information to train a neural network.
We propose a novel self-supervised framework with co-attention mechanism to learn generic cross-modal representations from unlabelled videos.
Experiments show that our model achieves state-of-the-art performance on the pretext task while having fewer parameters compared with existing methods.
arXiv Detail & Related papers (2020-08-13T10:08:12Z) - Self-Supervised Learning of Audio-Visual Objects from Video [108.77341357556668]
We introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time.
We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks.
arXiv Detail & Related papers (2020-08-10T16:18:01Z) - Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform.
We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio).
Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z) - Curriculum Audiovisual Learning [113.20920928789867]
We present a flexible audiovisual model that introduces a soft-clustering module as the audio and visual content detector.
To ease the difficulty of audiovisual learning, we propose a novel learning strategy that trains the model from simple to complex scenes.
We show that our localization model significantly outperforms existing methods, based on which we achieve comparable performance in sound separation without referring to external visual supervision.
arXiv Detail & Related papers (2020-01-26T07:08:47Z) - Visually Guided Self Supervised Learning of Speech Representations [62.23736312957182]
We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech.
We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment.
We achieve state of the art results for emotion recognition and competitive results for speech recognition.
arXiv Detail & Related papers (2020-01-13T14:53:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.