Unsupervised Audiovisual Synthesis via Exemplar Autoencoders
- URL: http://arxiv.org/abs/2001.04463v3
- Date: Sat, 3 Jul 2021 05:24:44 GMT
- Title: Unsupervised Audiovisual Synthesis via Exemplar Autoencoders
- Authors: Kangle Deng and Aayush Bansal and Deva Ramanan
- Abstract summary: We present an unsupervised approach that converts the input speech of any individual into audiovisual streams of potentially infinitely many output speakers.
We use Exemplar Autoencoders to learn the voice, stylistic prosody, and visual appearance of a specific target speech exemplar.
- Score: 59.13989658692953
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present an unsupervised approach that converts the input speech of any
individual into audiovisual streams of potentially infinitely many output
speakers. Our approach builds on simple autoencoders that project out-of-sample
data onto the distribution of the training set. We use Exemplar Autoencoders to
learn the voice, stylistic prosody, and visual appearance of a specific target
speech exemplar. In contrast to existing methods, the proposed approach can be
easily extended to an arbitrarily large number of speakers and styles using
only 3 minutes of target audio-video data, without requiring any training
data for the input speaker. To do so, we learn audiovisual bottleneck
representations that capture the structured linguistic content of speech. We
outperform prior approaches on both audio and video synthesis, and provide
extensive qualitative analysis on our project page --
https://www.cs.cmu.edu/~exemplar-ae/.
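The core mechanism lends itself to a short sketch. Below is a minimal, illustrative PyTorch version of the idea, assuming mel-spectrogram features; the architecture, loss, and hyperparameters are stand-ins rather than the authors' implementation.

```python
# Minimal sketch of an exemplar autoencoder for voice conversion (illustrative,
# not the paper's implementation). It is trained to reconstruct ONLY the target
# speaker's mel-spectrograms; at test time, out-of-sample speech from any input
# speaker is "projected" onto the target speaker's training distribution.
import torch
import torch.nn as nn

class ExemplarAutoencoder(nn.Module):
    def __init__(self, n_mels=80, bottleneck=64):
        super().__init__()
        # Encoder compresses speech to a narrow bottleneck that (ideally)
        # retains linguistic content while discarding speaker identity.
        self.encoder = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, bottleneck, kernel_size=5, padding=2),
        )
        # Decoder maps bottleneck codes back to the target speaker's voice.
        self.decoder = nn.Sequential(
            nn.Conv1d(bottleneck, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, n_mels, kernel_size=5, padding=2),
        )

    def forward(self, mel):               # mel: (batch, n_mels, time)
        return self.decoder(self.encoder(mel))

model = ExemplarAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Training: reconstruct only the ~3 minutes of target-speaker audio.
target_mel = torch.randn(8, 80, 200)      # stand-in for real target features
loss = nn.functional.l1_loss(model(target_mel), target_mel)
loss.backward()
opt.step()

# Inference: any speaker's mel passes through, but the decoder can only
# produce target-speaker-like output -- this is what performs the conversion.
converted = model(torch.randn(1, 80, 200))
```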
Related papers
- CLIP-VAD: Exploiting Vision-Language Models for Voice Activity Detection [2.110168344647122]
Voice Activity Detection (VAD) is the process of automatically determining whether a person is speaking and identifying the timing of their speech.
We introduce a novel approach leveraging Contrastive Language-Image Pretraining (CLIP) models.
Our approach outperforms several audio-visual methods despite its simplicity, and without requiring pre-training on extensive audio-visual datasets.
arXiv Detail & Related papers (2024-10-18T14:43:34Z)
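As a rough illustration of the CLIP-VAD idea, the snippet below scores a single video frame against speaking/not-speaking text prompts with Hugging Face's CLIP. Zero-shot prompting is an assumption made here for brevity; the paper presumably trains on top of CLIP features, and the prompts and frame path are hypothetical.

```python
# Illustrative zero-shot flavor of CLIP-based voice activity detection.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = ["a photo of a person speaking",
           "a photo of a person with a closed mouth"]
frame = Image.open("frame_0001.jpg")      # one video frame (hypothetical path)

inputs = processor(text=prompts, images=frame, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)
is_speaking = probs[0, 0].item() > 0.5    # per-frame speech/no-speech decision
```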
- Large-scale unsupervised audio pre-training for video-to-speech synthesis [64.86087257004883]
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker.
In this paper we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24kHz.
We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task.
arXiv Detail & Related papers (2023-06-27T13:31:33Z)
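The initialization strategy described above is easy to sketch: pre-train a decoder on audio alone, then copy its weights into the video-to-speech model before fine-tuning. The module names and shapes below are hypothetical.

```python
# Sketch of decoder-weight transfer from audio-only pre-training to a
# video-to-speech model (illustrative architecture, not the paper's).
import torch.nn as nn

def make_decoder():
    # Shared decoder architecture used in both stages.
    return nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 80))

# 1) Audio-only pre-training stage produces a trained decoder.
pretrained_decoder = make_decoder()   # imagine this was trained on 3,500 hours

# 2) Video-to-speech model with a video encoder and an audio decoder.
class VideoToSpeech(nn.Module):
    def __init__(self):
        super().__init__()
        self.video_encoder = nn.Linear(512, 128)  # stand-in for a real encoder
        self.audio_decoder = make_decoder()

v2s = VideoToSpeech()
# Initialize the audio decoder from pre-trained weights before fine-tuning.
v2s.audio_decoder.load_state_dict(pretrained_decoder.state_dict())
```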
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework that aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
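A "local-global fusion" block could plausibly look like the sketch below, where each modality cross-attends to the other's tokens plus a mean-pooled global summary. This is one reading of the phrase, not the paper's exact design.

```python
# Illustrative local-global audio-video fusion via cross-attention.
import torch
import torch.nn as nn

class LocalGlobalFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.audio_from_video = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.video_from_audio = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, video):      # (batch, tokens, dim) each
        # Append a mean-pooled global token so queries see coarse context
        # from the other modality in addition to its local tokens.
        a_ctx = torch.cat([video, video.mean(1, keepdim=True)], dim=1)
        v_ctx = torch.cat([audio, audio.mean(1, keepdim=True)], dim=1)
        audio2, _ = self.audio_from_video(audio, a_ctx, a_ctx)
        video2, _ = self.video_from_audio(video, v_ctx, v_ctx)
        return audio + audio2, video + video2   # residual fusion

fusion = LocalGlobalFusion()
a_fused, v_fused = fusion(torch.randn(2, 50, 256), torch.randn(2, 30, 256))
```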
- Language-Guided Audio-Visual Source Separation via Trimodal Consistency [64.0580750128049]
A key challenge in this task is learning to associate the linguistic description of a sound-emitting object with its visual features and the corresponding components of the audio waveform.
We adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions.
We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets.
arXiv Detail & Related papers (2023-03-28T22:45:40Z)
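The pseudo-target supervision could take the form of a consistency objective that pulls audio, visual, and text embeddings toward each other; the formulation below is an illustrative stand-in for the paper's two loss functions.

```python
# Illustrative trimodal consistency loss: maximize pairwise cosine similarity
# across the audio, visual, and text embeddings of the same source.
import torch
import torch.nn.functional as F

def trimodal_consistency(audio_emb, visual_emb, text_emb):
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    # Loss is minimized (0) when all three pairwise similarities reach 1.
    sims = (a * v).sum(-1) + (a * t).sum(-1) + (v * t).sum(-1)
    return 3.0 - sims.mean()

loss = trimodal_consistency(torch.randn(4, 512),
                            torch.randn(4, 512),
                            torch.randn(4, 512))
```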
- Audiovisual Masked Autoencoders [93.22646144125457]
We show that we can achieve significant improvements on audiovisual downstream classification tasks.
We additionally demonstrate the transferability of our representations, achieving state-of-the-art audiovisual results on Epic Kitchens.
arXiv Detail & Related papers (2022-12-09T17:34:53Z)
- Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform.
We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio).
Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z)
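The joint objective described above can be sketched as the sum of an audio-attribute prediction loss and a talking-face generation loss computed over a shared raw-audio encoder. The tiny modules and random targets below are placeholders, not the paper's architecture.

```python
# Illustrative joint audiovisual self-supervision over a raw-waveform encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Conv1d(1, 128, kernel_size=400, stride=160)  # raw-waveform encoder
audio_head = nn.Linear(128, 2)        # predicts e.g. energy and pitch proxies
face_head = nn.Linear(128, 64 * 64)   # generates a (flattened) face frame

wave = torch.randn(8, 1, 16000)                    # 1 s of 16 kHz audio
feats = encoder(wave).mean(dim=-1)                 # (batch, 128) utterance code

audio_targets = torch.randn(8, 2)                  # stand-in acoustic attributes
face_targets = torch.randn(8, 64 * 64)             # stand-in ground-truth frames

# Joint objective: audio-only self-supervision + visual self-supervision.
loss = F.mse_loss(audio_head(feats), audio_targets) \
     + F.l1_loss(face_head(feats), face_targets)
loss.backward()
```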
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.