Speed Co-Augmentation for Unsupervised Audio-Visual Pre-training
- URL: http://arxiv.org/abs/2309.13942v1
- Date: Mon, 25 Sep 2023 08:22:30 GMT
- Title: Speed Co-Augmentation for Unsupervised Audio-Visual Pre-training
- Authors: Jiangliu Wang, Jianbo Jiao, Yibing Song, Stephen James, Zhan Tong,
Chongjian Ge, Pieter Abbeel, Yun-hui Liu
- Abstract summary: We propose a speed co-augmentation method that randomly changes the playback speeds of both audio and video data.
Experimental results show that the proposed method significantly improves the learned representations when compared to vanilla audio-visual contrastive learning.
- Score: 102.18680666349806
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work aims to improve unsupervised audio-visual pre-training. Inspired by
the efficacy of data augmentation in visual contrastive learning, we propose a
novel speed co-augmentation method that randomly changes the playback speeds of
both audio and video data. Despite its simplicity, the speed co-augmentation
method possesses two compelling attributes: (1) it increases the diversity of
audio-visual pairs and doubles the size of negative pairs, resulting in a
significant enhancement in the learned representations, and (2) it changes the
strict correlation between audio-visual pairs but introduces a partial
relationship between the augmented pairs, which is modeled by our proposed
SoftInfoNCE loss to further boost the performance. Experimental results show
that the proposed method significantly improves the learned representations
when compared to vanilla audio-visual contrastive learning.
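
To make the two properties above concrete, here is a minimal PyTorch-style sketch of the idea, written from the abstract alone rather than from the authors' code: the candidate speed set, the independent sampling of audio and video speeds, and the exponential down-weighting of the positive target by the audio-video speed gap are all assumptions for illustration; the paper's exact SoftInfoNCE formulation may differ.

import random

import torch
import torch.nn.functional as F


def speed_co_augment(video, audio, speeds=(0.5, 1.0, 1.5, 2.0)):
    """Randomly change the playback speed of a video clip and its audio track
    by temporal resampling. Speeds are sampled independently for the two
    modalities (an assumption here), so the augmented pair may only partially
    correspond. video: (C, T, H, W); audio: (1, T_wav)."""
    sv, sa = random.choice(speeds), random.choice(speeds)
    t_v = max(1, int(video.shape[1] / sv))
    t_a = max(1, int(audio.shape[1] / sa))
    video_aug = F.interpolate(video.unsqueeze(0), size=(t_v, *video.shape[2:]),
                              mode="trilinear", align_corners=False).squeeze(0)
    audio_aug = F.interpolate(audio.unsqueeze(0), size=t_a,
                              mode="linear", align_corners=False).squeeze(0)
    return video_aug, audio_aug, sv, sa


def soft_info_nce(v_emb, a_emb, speed_gap, tau=0.07, alpha=1.0):
    """Soft-weighted InfoNCE over a batch of L2-normalised embeddings.
    v_emb, a_emb: (B, D) visual/audio embeddings of co-augmented pairs.
    speed_gap: (B,) tensor, e.g. |log(sv) - log(sa)| per pair; a larger gap
    means the pair is only partially corresponding."""
    logits = v_emb @ a_emb.t() / tau  # (B, B) cross-modal similarities
    # Down-weight the positive target of partially corresponding pairs and
    # spread the leftover probability mass over the batch (illustrative
    # choice; the paper's SoftInfoNCE weighting may differ).
    w = torch.exp(-alpha * speed_gap)                       # in (0, 1]
    targets = torch.eye(len(v_emb), device=logits.device) * w.unsqueeze(1)
    targets = targets + (1.0 - w).unsqueeze(1) / len(v_emb)
    log_prob = F.log_softmax(logits, dim=1)
    return -(targets * log_prob).sum(dim=1).mean()

In practice both the original-speed and the speed-changed clips can be kept in the same batch, which is one way to read the abstract's point that the augmentation doubles the number of negative pairs.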
Related papers
- Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion [93.32354378820648]
We introduce MVSD, a mutual learning framework based on diffusion models.
MVSD considers the two tasks symmetrically, exploiting the reciprocal relationship to facilitate learning from inverse tasks.
Our framework can improve the performance of the reverberator and dereverberator.
arXiv Detail & Related papers (2024-07-15T00:47:56Z)
- EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning [36.012107899738524]
We introduce EquiAV, a novel framework that leverages equivariance for audio-visual contrastive learning.
Our approach begins with extending equivariance to audio-visual learning, facilitated by a shared attention-based transformation predictor.
It enables the aggregation of features from diverse augmentations into a representative embedding, providing robust supervision.
arXiv Detail & Related papers (2024-03-14T15:44:19Z)
- Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues [75.73217916395386]
We propose a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges.
This interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations.
We also present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD.
arXiv Detail & Related papers (2024-02-04T03:02:35Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- Looking Similar, Sounding Different: Leveraging Counterfactual Cross-Modal Pairs for Audiovisual Representation Learning [3.6204417068568424]
We use dubbed versions of movies and television shows to augment cross-modal contrastive learning.
Our approach learns to represent alternate audio tracks, differing only in speech, similarly to the same video.
arXiv Detail & Related papers (2023-04-12T04:17:45Z)
- Cross-modal Audio-visual Co-learning for Text-independent Speaker Verification [55.624946113550195]
This paper proposes a cross-modal speech co-learning paradigm.
Two cross-modal boosters are introduced based on an audio-visual pseudo-siamese structure to learn the modality-transformed correlation.
Experimental results on the LRSLip3, GridLip, LomGridLip, and VoxLip datasets demonstrate that our proposed method achieves 60% and 20% average relative performance improvement.
arXiv Detail & Related papers (2023-02-22T10:06:37Z)
- Multi-Modal Multi-Correlation Learning for Audio-Visual Speech Separation [38.75352529988137]
We propose a multi-modal multi-correlation learning framework targeting the task of audio-visual speech separation.
We define two key correlations which are: (1) identity correlation (between timbre and facial attributes); (2) phonetic correlation.
For implementation, a contrastive learning or adversarial training approach is applied to maximize these two correlations.
arXiv Detail & Related papers (2022-07-04T04:53:39Z)
- The Impact of Spatiotemporal Augmentations on Self-Supervised Audiovisual Representation Learning [2.28438857884398]
We present a contrastive framework to learn audiovisual representations from unlabeled videos.
We find that lossy temporal transformations that do not corrupt the temporal coherency of videos are the most effective.
Compared to self-supervised models pre-trained with only sampling-based temporal augmentation, models pre-trained with our temporal augmentations achieve an approximately 6.5% gain in linear evaluation performance on the AVE dataset.
arXiv Detail & Related papers (2021-10-13T23:48:58Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- How to Teach DNNs to Pay Attention to the Visual Modality in Speech Recognition [10.74796391075403]
This study investigates the inner workings of AV Align and visualises the audio-visual alignment patterns.
We find that AV Align learns to align acoustic and visual representations of speech at the frame level on TCD-TIMIT in a generally monotonic pattern.
We propose a regularisation method which involves predicting lip-related Action Units from visual representations.
arXiv Detail & Related papers (2020-04-17T13:59:19Z)
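
The regularisation idea in the last entry, predicting lip-related Action Units from visual representations, can be pictured with a short generic sketch. This is not the AV Align implementation; the head architecture, feature dimension, number of Action Units, and the way the auxiliary loss is added to the recognition loss are assumptions made only to illustrate an auxiliary-prediction regulariser.

import torch
import torch.nn as nn


class AUAuxiliaryHead(nn.Module):
    """Generic auxiliary head that predicts facial Action Units (AUs) from
    frame-level visual speech features, used as a regulariser next to the
    main recognition loss. Feature size, number of AUs, and the loss weight
    below are illustrative assumptions, not values from the cited paper."""

    def __init__(self, feat_dim=256, num_aus=8):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_aus)
        self.criterion = nn.BCEWithLogitsLoss()  # multi-label AU activations

    def forward(self, visual_feats, au_labels):
        # visual_feats: (batch, time, feat_dim); au_labels: (batch, time, num_aus)
        return self.criterion(self.classifier(visual_feats), au_labels)


# Combined objective (sketch): total = recognition_loss + lambda_au * au_loss,
# where au_loss = AUAuxiliaryHead(...)(visual_feats, au_labels).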
This list is automatically generated from the titles and abstracts of the papers on this site.