Learning Audio-Visual Correlations from Variational Cross-Modal
Generation
- URL: http://arxiv.org/abs/2102.03424v1
- Date: Fri, 5 Feb 2021 21:27:00 GMT
- Title: Learning Audio-Visual Correlations from Variational Cross-Modal
Generation
- Authors: Ye Zhu, Yu Wu, Hugo Latapie, Yi Yang, Yan Yan
- Abstract summary: We learn the audio-visual correlations from the perspective of cross-modal generation in a self-supervised manner.
The learned correlations can be readily applied in multiple downstream tasks such as the audio-visual cross-modal localization and retrieval.
- Score: 35.07257471319274
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: People can easily imagine the potential sound while seeing an event. This
natural synchronization between audio and visual signals reveals their
intrinsic correlations. To this end, we propose to learn the audio-visual
correlations from the perspective of cross-modal generation in a
self-supervised manner, the learned correlations can be then readily applied in
multiple downstream tasks such as the audio-visual cross-modal localization and
retrieval. We introduce a novel Variational AutoEncoder (VAE) framework that
consists of Multiple encoders and a Shared decoder (MS-VAE) with an additional
Wasserstein distance constraint to tackle the problem. Extensive experiments
demonstrate that the optimized latent representation of the proposed MS-VAE can
effectively learn the audio-visual correlations and can be readily applied in
multiple audio-visual downstream tasks to achieve competitive performance even
without any given label information during training.
Related papers
- LINK: Adaptive Modality Interaction for Audio-Visual Video Parsing [26.2873961811614]
We introduce a Learning Interaction method for Non-aligned Knowledge (Link)
Link equilibrate the contributions of distinct modalities by dynamically adjusting their input during event prediction.
We leverage the semantic information of pseudo-labels as a priori knowledge to mitigate noise from other modalities.
arXiv Detail & Related papers (2024-12-30T11:23:15Z) - Locality-aware Cross-modal Correspondence Learning for Dense Audio-Visual Events Localization [50.122441710500055]
We present LoCo, a Locality-aware cross-modal Correspondence learning framework for Audio-Visual Events (DAVE)
LoCo applies Locality-aware Correspondence Correction (LCC) to unimodal features via leveraging cross-modal local-correlated properties.
We further customize Cross-modal Dynamic Perception layer (CDP) in cross-modal feature pyramid to understand local temporal patterns of audio-visual events.
arXiv Detail & Related papers (2024-09-12T11:54:25Z) - Label-anticipated Event Disentanglement for Audio-Visual Video Parsing [61.08434062821899]
We introduce a new decoding paradigm, underlinelabel sunderlineemunderlineantic-based underlineprojection (LEAP)
LEAP works by iteratively projecting encoded latent features of audio/visual segments onto semantically independent label embeddings.
To facilitate the LEAP paradigm, we propose a semantic-aware optimization strategy, which includes a novel audio-visual semantic similarity loss function.
arXiv Detail & Related papers (2024-07-11T01:57:08Z) - Sequential Contrastive Audio-Visual Learning [12.848371604063168]
We propose sequential contrastive audio-visual learning (SCAV), which contrasts examples based on their non-aggregated representation space using sequential distances.
Retrieval experiments with the VGGSound and Music datasets demonstrate the effectiveness of SCAV.
We also show that models trained with SCAV exhibit a high degree of flexibility regarding the metric employed for retrieval, allowing them to operate on a spectrum of efficiency-accuracy trade-offs.
arXiv Detail & Related papers (2024-07-08T09:45:20Z) - Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues [75.73217916395386]
We propose a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges.
This interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations.
We also present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD.
arXiv Detail & Related papers (2024-02-04T03:02:35Z) - Improving Audio-Visual Segmentation with Bidirectional Generation [40.78395709407226]
We introduce a bidirectional generation framework for audio-visual segmentation.
This framework establishes robust correlations between an object's visual characteristics and its associated sound.
We also introduce an implicit volumetric motion estimation module to handle temporal dynamics.
arXiv Detail & Related papers (2023-08-16T11:20:23Z) - Improving Audio-Visual Speech Recognition by Lip-Subword Correlation
Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z) - Cross-Modal Global Interaction and Local Alignment for Audio-Visual
Speech Recognition [21.477900473255264]
We propose a cross-modal global interaction and local alignment (GILA) approach for audio-visual speech recognition (AVSR)
Specifically, we design a global interaction model to capture the A-V complementary relationship on modality level, as well as a local alignment approach to model the A-V temporal consistency on frame level.
Our GILA outperforms the supervised learning state-of-the-art on public benchmarks LRS3 and LRS2.
arXiv Detail & Related papers (2023-05-16T06:41:25Z) - Cross-modal Audio-visual Co-learning for Text-independent Speaker
Verification [55.624946113550195]
This paper proposes a cross-modal speech co-learning paradigm.
Two cross-modal boosters are introduced based on an audio-visual pseudo-siamese structure to learn the modality-transformed correlation.
Experimental results on the LRSLip3, GridLip, LomGridLip, and VoxLip datasets demonstrate that our proposed method achieves 60% and 20% average relative performance improvement.
arXiv Detail & Related papers (2023-02-22T10:06:37Z) - Look, Listen, and Attend: Co-Attention Network for Self-Supervised
Audio-Visual Representation Learning [17.6311804187027]
An underlying correlation between audio and visual events can be utilized as free supervised information to train a neural network.
We propose a novel self-supervised framework with co-attention mechanism to learn generic cross-modal representations from unlabelled videos.
Experiments show that our model achieves state-of-the-art performance on the pretext task while having fewer parameters compared with existing methods.
arXiv Detail & Related papers (2020-08-13T10:08:12Z) - Curriculum Audiovisual Learning [113.20920928789867]
We present a flexible audiovisual model that introduces a soft-clustering module as the audio and visual content detector.
To ease the difficulty of audiovisual learning, we propose a novel learning strategy that trains the model from simple to complex scene.
We show that our localization model significantly outperforms existing methods, based on which we show comparable performance in sound separation without referring external visual supervision.
arXiv Detail & Related papers (2020-01-26T07:08:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.