A multimodal dynamical variational autoencoder for audiovisual speech
representation learning
- URL: http://arxiv.org/abs/2305.03582v3
- Date: Tue, 20 Feb 2024 16:18:45 GMT
- Title: A multimodal dynamical variational autoencoder for audiovisual speech
representation learning
- Authors: Samir Sadok, Simon Leglaive, Laurent Girin, Xavier Alameda-Pineda,
Renaud Séguier
- Abstract summary: A multimodal and dynamical VAE (MDVAE) is applied to unsupervised audio-visual speech representation learning.
Experiments include manipulating audiovisual speech, audiovisual facial image denoising, and audiovisual speech emotion recognition.
- Score: 23.748108659645844
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present a multimodal and dynamical VAE (MDVAE) applied to
unsupervised audio-visual speech representation learning. The latent space is
structured to dissociate the latent dynamical factors that are shared between
the modalities from those that are specific to each modality. A static latent
variable is also introduced to encode the information that is constant over
time within an audiovisual speech sequence. The model is trained in an
unsupervised manner on an audiovisual emotional speech dataset, in two stages.
In the first stage, a vector quantized VAE (VQ-VAE) is learned independently
for each modality, without temporal modeling. The second stage consists in
learning the MDVAE model on the intermediate representation of the VQ-VAEs
before quantization. The disentanglement between static and dynamical
information, and between modality-specific and modality-common information,
occurs during this second training stage. Extensive experiments are conducted
to investigate how
audiovisual speech latent factors are encoded in the latent space of MDVAE.
These experiments include manipulating audiovisual speech, audiovisual facial
image denoising, and audiovisual speech emotion recognition. The results show
that MDVAE effectively combines the audio and visual information in its latent
space. They also show that the learned static representation of audiovisual
speech can be used for emotion recognition with little labeled data, and with
better accuracy than unimodal baselines and a state-of-the-art supervised
model based on an audiovisual transformer architecture.
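To make the latent-space structure concrete, the following is a minimal sketch
of the factorization described above, written in PyTorch under stated
assumptions: it is not the authors' implementation, and the feature dimensions,
the GRU encoders, the simple linear decoders, and the variable names (w for the
static variable, z_av for the shared dynamical variable, z_a and z_v for the
modality-specific ones) are illustrative choices. The actual MDVAE inference
and generative models, temporal dependencies, and training objective are those
specified in the paper.

# Minimal sketch (not the authors' code) of the MDVAE latent factorization:
# a static variable w shared over the whole sequence, a dynamical variable
# z_av shared between modalities, and modality-specific dynamical variables
# z_a and z_v. All sizes and sub-networks are illustrative assumptions.
import torch
import torch.nn as nn

class MDVAESketch(nn.Module):
    def __init__(self, feat_a=64, feat_v=64, dim_w=32, dim_zav=16, dim_za=8, dim_zv=8):
        super().__init__()
        # Encoders act on the pre-quantization VQ-VAE features of each modality.
        self.enc_w = nn.GRU(feat_a + feat_v, 2 * dim_w, batch_first=True)
        self.enc_zav = nn.GRU(feat_a + feat_v, 2 * dim_zav, batch_first=True)
        self.enc_za = nn.GRU(feat_a, 2 * dim_za, batch_first=True)
        self.enc_zv = nn.GRU(feat_v, 2 * dim_zv, batch_first=True)
        # Each modality is reconstructed from (w, z_av, its own z).
        self.dec_a = nn.Linear(dim_w + dim_zav + dim_za, feat_a)
        self.dec_v = nn.Linear(dim_w + dim_zav + dim_zv, feat_v)

    @staticmethod
    def sample(stats):
        # Reparameterization trick: stats holds the concatenated mean and log-variance.
        mu, logvar = stats.chunk(2, dim=-1)
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

    def forward(self, x_a, x_v):
        # x_a: (batch, time, feat_a), x_v: (batch, time, feat_v)
        x_av = torch.cat([x_a, x_v], dim=-1)
        _, h_w = self.enc_w(x_av)
        w = self.sample(h_w[-1])                          # static: one sample per sequence
        z_av = self.sample(self.enc_zav(x_av)[0])         # shared dynamics: one per time step
        z_a = self.sample(self.enc_za(x_a)[0])            # audio-specific dynamics
        z_v = self.sample(self.enc_zv(x_v)[0])            # visual-specific dynamics
        w_t = w.unsqueeze(1).expand(-1, x_a.size(1), -1)  # broadcast w over time
        rec_a = self.dec_a(torch.cat([w_t, z_av, z_a], dim=-1))
        rec_v = self.dec_v(torch.cat([w_t, z_av, z_v], dim=-1))
        return rec_a, rec_v

In the paper's two-stage procedure, the inputs x_a and x_v of such a model
would be the pre-quantization intermediate representations of the
modality-specific VQ-VAEs learned in the first stage.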
Related papers
- Robust Audiovisual Speech Recognition Models with Mixture-of-Experts [67.75334989582709]
We introduce EVA, which leverages a mixture-of-experts for audiovisual ASR to perform robust speech recognition on "in-the-wild" videos.
We first encode visual information into a sequence of visual tokens and map them into the speech space with a lightweight projection.
Experiments show our model achieves state-of-the-art results on three benchmarks.
arXiv Detail & Related papers (2024-09-19T00:08:28Z)
- A vector quantized masked autoencoder for audiovisual speech emotion recognition [5.8641712963450825]
The paper proposes the VQ-MAE-AV model, a vector quantized masked autoencoder (MAE) designed for audiovisual speech self-supervised representation learning.
A multimodal MAE with self- or cross-attention mechanisms is proposed to fuse the audio and visual speech modalities and to learn local and global representations of the audiovisual speech sequence (a minimal cross-attention fusion sketch is given after this list).
Experimental results show that the proposed approach, which is pre-trained on the VoxCeleb2 database and fine-tuned on standard emotional audiovisual speech datasets, outperforms the state-of-the-art audiovisual SER methods.
arXiv Detail & Related papers (2023-05-05T14:19:46Z)
- VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset [53.46019570679092]
We propose a Vision-Audio-Language Omni-peRception pretraining model (VALOR) for multi-modal understanding and generation.
VALOR jointly models relationships of vision, audio and language in an end-to-end manner.
It achieves new state-of-the-art performance on a series of public cross-modality benchmarks.
arXiv Detail & Related papers (2023-04-17T15:08:15Z)
- Accommodating Audio Modality in CLIP for Multimodal Processing [48.83906067348211]
We extend the Vision-Language model CLIP to accommodate the audio modality for Vision-Language-Audio multimodal processing.
Specifically, we apply inter-modal and intra-modal contrastive learning to explore the correlation between audio and the other modalities (an inter-modal contrastive loss sketch is given after this list).
Our proposed CLIP4VLA model is validated in different downstream tasks including video retrieval and video captioning.
arXiv Detail & Related papers (2023-03-12T06:57:01Z)
- VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose a unified cross-modal representation learning framework, VATLM (Visual-Audio-Text Language Model).
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
arXiv Detail & Related papers (2022-11-21T09:10:10Z)
- Where and When: Space-Time Attention for Audio-Visual Explanations [42.093794819606444]
We propose a novel space-time attention network that uncovers the synergistic dynamics of audio and visual data over both space and time.
Our model is capable of predicting the audio-visual video events, while justifying its decision by localizing where the relevant visual cues appear.
arXiv Detail & Related papers (2021-05-04T14:16:55Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze an input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
- Deep Variational Generative Models for Audio-visual Speech Separation [33.227204390773316]
We propose an unsupervised technique based on audio-visual generative modeling of clean speech.
To better utilize the visual information, the posteriors of the latent variables are inferred from mixed speech.
Our experiments show that the proposed unsupervised VAE-based method yields better separation performance than NMF-based approaches.
arXiv Detail & Related papers (2020-08-17T10:12:33Z)
- Visually Guided Self Supervised Learning of Speech Representations [62.23736312957182]
We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech.
We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment.
We achieve state-of-the-art results for emotion recognition and competitive results for speech recognition.
arXiv Detail & Related papers (2020-01-13T14:53:22Z)
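As referenced in the VQ-MAE-AV entry above, a common way to fuse audio and
visual token sequences is cross-attention. The sketch below is an illustrative
assumption written in PyTorch, not the VQ-MAE-AV implementation: the layer
sizes, the symmetric audio-to-visual and visual-to-audio attention, and the
residual-plus-normalization layout are choices made here for clarity.

# Minimal sketch (illustrative assumption, not the VQ-MAE-AV implementation)
# of cross-attention fusion between audio and visual token sequences.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        # Audio tokens attend to visual tokens and vice versa.
        self.attn_a2v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_v2a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, tokens_a, tokens_v):
        # tokens_a: (batch, T_a, dim), tokens_v: (batch, T_v, dim)
        fused_a, _ = self.attn_a2v(tokens_a, tokens_v, tokens_v)  # audio queries, visual keys/values
        fused_v, _ = self.attn_v2a(tokens_v, tokens_a, tokens_a)  # visual queries, audio keys/values
        return self.norm_a(tokens_a + fused_a), self.norm_v(tokens_v + fused_v)

# Usage with random token sequences.
audio = torch.randn(2, 50, 256)
video = torch.randn(2, 25, 256)
fused_audio, fused_video = CrossModalFusion()(audio, video)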
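Similarly, for the CLIP4VLA entry above, inter-modal contrastive learning
typically amounts to an InfoNCE-style objective on paired embeddings. The
sketch below is a generic, hedged illustration: the function name, temperature
value, and symmetric formulation are assumptions, not details from the paper.

# Minimal sketch (illustrative assumption, not the CLIP4VLA implementation) of
# a symmetric inter-modal contrastive (InfoNCE-style) loss between audio and
# visual embeddings of the same clips.
import torch
import torch.nn.functional as F

def inter_modal_contrastive_loss(emb_a, emb_v, temperature=0.07):
    # emb_a, emb_v: (batch, dim) embeddings of paired audio/visual clips.
    emb_a = F.normalize(emb_a, dim=-1)
    emb_v = F.normalize(emb_v, dim=-1)
    logits = emb_a @ emb_v.t() / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(emb_a.size(0), device=emb_a.device)
    # Matching pairs lie on the diagonal; attract them and repel the rest, in both directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Usage with random embeddings.
loss = inter_modal_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))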