Self-Supervised Learning of Audio-Visual Objects from Video
- URL: http://arxiv.org/abs/2008.04237v1
- Date: Mon, 10 Aug 2020 16:18:01 GMT
- Title: Self-Supervised Learning of Audio-Visual Objects from Video
- Authors: Triantafyllos Afouras, Andrew Owens, Joon Son Chung, Andrew Zisserman
- Abstract summary: We introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time.
We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks.
- Score: 108.77341357556668
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Our objective is to transform a video into a set of discrete audio-visual
objects using self-supervised learning. To this end, we introduce a model that
uses attention to localize and group sound sources, and optical flow to
aggregate information over time. We demonstrate the effectiveness of the
audio-visual object embeddings that our model learns by using them for four
downstream speech-oriented tasks: (a) multi-speaker sound source separation,
(b) localizing and tracking speakers, (c) correcting misaligned audio-visual
data, and (d) active speaker detection. Using our representation, these tasks
can be solved entirely by training on unlabeled video, without the aid of
object detectors. We also demonstrate the generality of our method by applying
it to non-human speakers, including cartoons and puppets. Our model
significantly outperforms other self-supervised approaches, and obtains
performance competitive with methods that use supervised face detection.
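The abstract's core mechanism, using attention over visual features to localize a sound source, can be illustrated with a minimal sketch. This is not the authors' actual architecture: the function name `av_attention_map`, the embedding dimensions, and the cosine-similarity-plus-softmax formulation are illustrative assumptions, showing only how an audio embedding can be compared against a spatial grid of visual features to produce a localization heatmap.

```python
import numpy as np

def av_attention_map(audio_emb, visual_feats):
    """Illustrative attention: cosine similarity between one audio
    embedding (D,) and a spatial grid of visual features (H, W, D),
    normalized with a softmax into a heatmap (H, W) that sums to 1."""
    a = audio_emb / np.linalg.norm(audio_emb)
    v = visual_feats / np.linalg.norm(visual_feats, axis=-1, keepdims=True)
    sim = v @ a                       # (H, W) cosine similarities
    e = np.exp(sim - sim.max())       # stable softmax over all positions
    return e / e.sum()

# Toy demo: make the audio embedding correlate with one grid cell,
# then check that the attention heatmap peaks there.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 4, 8))
audio = feats[2, 1] + 0.1 * rng.normal(size=8)
heat = av_attention_map(audio, feats)
peak = np.unravel_index(heat.argmax(), heat.shape)
```

In the paper the heatmap would be produced by learned audio and visual encoders trained with a self-supervised objective; here random features stand in so the localization step itself is visible.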
Related papers
- You Only Speak Once to See [24.889319740761827]
Grounding objects in images using visual cues is a well-established approach in computer vision.
We introduce YOSS, "You Only Speak Once to See," to leverage audio for grounding objects in visual scenes.
Experimental results indicate that audio guidance can be effectively applied to object grounding.
arXiv Detail & Related papers (2024-09-27T01:16:15Z)
- AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR [79.21857972093332]
We present AVFormer, a method for augmenting audio-only models with visual information, at the same time performing lightweight domain adaptation.
We show that these can be trained on a small amount of weakly labelled video data with minimum additional training time and parameters.
We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively.
arXiv Detail & Related papers (2023-03-29T07:24:28Z)
- Language-Guided Audio-Visual Source Separation via Trimodal Consistency [64.0580750128049]
A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform.
We adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions.
We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets.
arXiv Detail & Related papers (2023-03-28T22:45:40Z)
- Self-supervised Contrastive Learning for Audio-Visual Action Recognition [7.188231323934023]
The underlying correlation between audio and visual modalities can be exploited as supervisory information for unlabeled videos.
We propose an end-to-end self-supervised framework named Audio-Visual Contrastive Learning (A) to learn discriminative audio-visual representations for action recognition.
arXiv Detail & Related papers (2022-04-28T10:01:36Z)
- Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning [17.6311804187027]
The underlying correlation between audio and visual events can serve as free supervisory information for training a neural network.
We propose a novel self-supervised framework with co-attention mechanism to learn generic cross-modal representations from unlabelled videos.
Experiments show that our model achieves state-of-the-art performance on the pretext task with fewer parameters than existing methods.
arXiv Detail & Related papers (2020-08-13T10:08:12Z)
- Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform.
We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio).
Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z)
- Unsupervised Audiovisual Synthesis via Exemplar Autoencoders [59.13989658692953]
We present an unsupervised approach that converts the input speech of any individual into audiovisual streams of potentially infinitely many output speakers.
We use Exemplar Autoencoders to learn the voice, stylistic prosody, and visual appearance of a specific target speech exemplar.
arXiv Detail & Related papers (2020-01-13T18:56:45Z)
- Visually Guided Self Supervised Learning of Speech Representations [62.23736312957182]
We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech.
We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment.
We achieve state-of-the-art results for emotion recognition and competitive results for speech recognition.
arXiv Detail & Related papers (2020-01-13T14:53:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences.