Learning Representations from Audio-Visual Spatial Alignment
- URL: http://arxiv.org/abs/2011.01819v1
- Date: Tue, 3 Nov 2020 16:20:04 GMT
- Title: Learning Representations from Audio-Visual Spatial Alignment
- Authors: Pedro Morgado, Yi Li and Nuno Vasconcelos
- Abstract summary: We introduce a novel self-supervised pretext task for learning representations from audio-visual content.
The advantages of the proposed pretext task are demonstrated on a variety of audio and visual downstream tasks.
- Score: 76.29670751012198
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a novel self-supervised pretext task for learning
representations from audio-visual content. Prior work on audio-visual
representation learning leverages correspondences at the video level.
Approaches based on audio-visual correspondence (AVC) predict whether audio and
video clips originate from the same or different video instances. Audio-visual
temporal synchronization (AVTS) further discriminates negative pairs originating
from the same video instance but at different moments in time. While these
approaches learn high-quality representations for downstream tasks such as
action recognition, their training objectives disregard spatial cues naturally
occurring in audio and visual signals. To learn from these spatial cues, we
tasked a network to perform contrastive audio-visual spatial alignment of
360° video and spatial audio. The ability to perform spatial alignment is
enhanced by reasoning over the full spatial content of the 360° video
using a transformer architecture to combine representations from multiple
viewpoints. The advantages of the proposed pretext task are demonstrated on a
variety of audio and visual downstream tasks, including audio-visual
correspondence, spatial alignment, action recognition, and video semantic
segmentation.
Related papers
- From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation [17.95017332858846]
We introduce a novel framework called Vision to Audio and Beyond (VAB) to bridge the gap between audio-visual representation learning and vision-to-audio generation.
VAB uses a pre-trained audio tokenizer and an image encoder to obtain audio tokens and visual features, respectively.
Our experiments showcase the efficiency of VAB in producing high-quality audio from video, and its capability to acquire semantic audio-visual features.
arXiv Detail & Related papers (2024-09-27T20:26:34Z) - Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos [69.79632907349489]
We propose a self-supervised method for learning representations based on spatial audio-visual correspondences in egocentric videos.
Our method uses a masked auto-encoding framework to synthesize masked (multi-channel) audio through the synergy of audio and vision.
arXiv Detail & Related papers (2023-07-10T17:58:17Z) - A Unified Audio-Visual Learning Framework for Localization, Separation,
and Recognition [26.828874753756523]
We propose a unified audio-visual learning framework (dubbed OneAVM) that integrates audio and visual cues for joint localization, separation, and recognition.
OneAVM comprises a shared audio-visual encoder and task-specific decoders trained with three objectives.
Experiments on MUSIC, VGG-Instruments, VGG-Music, and VGGSound datasets demonstrate the effectiveness of OneAVM for all three tasks.
arXiv Detail & Related papers (2023-05-30T23:53:12Z) - Geometry-Aware Multi-Task Learning for Binaural Audio Generation from
Video [94.42811508809994]
We propose an audio spatialization method that draws on visual information in videos to convert their monaural (single-channel) audio to binaural audio.
Whereas existing approaches leverage visual features extracted directly from video frames, our approach explicitly disentangles the geometric cues present in the visual stream to guide the learning process.
arXiv Detail & Related papers (2021-11-21T19:26:45Z) - VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency [111.55430893354769]
Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers.
Our approach jointly learns audio-visual speech separation and cross-modal speaker embeddings from unlabeled video.
It yields state-of-the-art results on five benchmark datasets for audio-visual speech separation and enhancement.
arXiv Detail & Related papers (2021-01-08T18:25:24Z) - Learning Speech Representations from Raw Audio by Joint Audiovisual
Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform.
We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio).
Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z) - Telling Left from Right: Learning Spatial Correspondence of Sight and
Sound [16.99266133458188]
We propose a novel self-supervised task to leverage a principle: matching spatial information in the audio stream to the positions of sound sources in the visual stream.
We train a model to determine whether the left and right audio channels have been flipped, forcing it to reason about spatial localization across the visual and audio streams (see the sketch after this list).
We demonstrate that understanding spatial correspondence enables models to perform better on three audio-visual tasks, achieving quantitative gains over supervised and self-supervised baselines.
arXiv Detail & Related papers (2020-06-11T04:00:24Z) - Visually Guided Self Supervised Learning of Speech Representations [62.23736312957182]
We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech.
We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment.
We achieve state-of-the-art results for emotion recognition and competitive results for speech recognition.
arXiv Detail & Related papers (2020-01-13T14:53:22Z)