Related papers: A vector quantized masked autoencoder for audiovisual speech emotion recognition

URL: http://arxiv.org/abs/2305.03568v2
Date: Wed, 15 May 2024 13:54:49 GMT
Title: A vector quantized masked autoencoder for audiovisual speech emotion recognition
Authors: Samir Sadok, Simon Leglaive, Renaud Séguier
Abstract summary: The paper proposes the VQ-MAE-AV model, a vector quantized masked autoencoder (MAE) designed for audiovisual speech self-supervised representation learning. A multimodal MAE with self- or cross-attention mechanisms is proposed to fuse the audio and visual speech modalities and to learn local and global representations of the audiovisual speech sequence. Experimental results show that the proposed approach, which is pre-trained on the VoxCeleb2 database and fine-tuned on standard emotional audiovisual speech datasets, outperforms the state-of-the-art audiovisual SER methods.
Score: 5.8641712963450825
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The limited availability of labeled data is a major challenge in audiovisual speech emotion recognition (SER). Self-supervised learning approaches have recently been proposed to mitigate the need for labeled data in various applications. This paper proposes the VQ-MAE-AV model, a vector quantized masked autoencoder (MAE) designed for audiovisual speech self-supervised representation learning and applied to SER. Unlike previous approaches, the proposed method employs a self-supervised paradigm based on discrete audio and visual speech representations learned by vector quantized variational autoencoders. A multimodal MAE with self- or cross-attention mechanisms is proposed to fuse the audio and visual speech modalities and to learn local and global representations of the audiovisual speech sequence, which are then used for an SER downstream task. Experimental results show that the proposed approach, which is pre-trained on the VoxCeleb2 database and fine-tuned on standard emotional audiovisual speech datasets, outperforms the state-of-the-art audiovisual SER methods. Extensive ablation experiments are also provided to assess the contribution of the different model components.
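
The abstract describes two ingredients: discrete audio and visual tokens produced by vector quantization, and a masked autoencoder trained to recover masked tokens. The following is a minimal NumPy sketch of those two steps only, under stated assumptions. The function names, codebook sizes, feature shapes, and the 50% mask ratio are hypothetical and are not taken from the paper, random vectors stand in for learned encoder outputs, and the attention-based fusion and decoder are omitted.

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    # Squared Euclidean distance between every feature (T, D) and code (K, D).
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)  # shape (T,), integer token indices

def mask_tokens(tokens, mask_ratio, mask_id, rng):
    """Replace a random subset of tokens with mask_id.

    Returns the corrupted sequence and a boolean array marking masked positions.
    """
    n_masked = int(round(mask_ratio * len(tokens)))
    positions = rng.choice(len(tokens), size=n_masked, replace=False)
    mask = np.zeros(len(tokens), dtype=bool)
    mask[positions] = True
    corrupted = np.where(mask, mask_id, tokens)
    return corrupted, mask

rng = np.random.default_rng(0)

# Hypothetical codebooks, one per modality (8 codes of dimension 4 each).
audio_codebook = rng.normal(size=(8, 4))
visual_codebook = rng.normal(size=(8, 4))

# Random stand-ins for encoder outputs over 10 time steps (assumption).
audio_feats = rng.normal(size=(10, 4))
visual_feats = rng.normal(size=(10, 4))

audio_tokens = quantize(audio_feats, audio_codebook)
visual_tokens = quantize(visual_feats, visual_codebook)

# Use an index just past the codebook as the mask token (assumption).
MASK_ID = 8
audio_in, audio_mask = mask_tokens(audio_tokens, 0.5, MASK_ID, rng)
visual_in, visual_mask = mask_tokens(visual_tokens, 0.5, MASK_ID, rng)

# A masked autoencoder would be trained to predict the original tokens
# at the masked positions from the corrupted audio and visual sequences.
targets_audio = audio_tokens[audio_mask]
targets_visual = visual_tokens[visual_mask]
```

In a sketch like this, the reconstruction targets are integer indices, so the training objective at masked positions can be a classification loss over the codebook rather than a regression on raw signals.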

Related papers

CLIP-VAD: Exploiting Vision-Language Models for Voice Activity Detection [2.110168344647122]
Voice Activity Detection (VAD) is the process of automatically determining whether a person is speaking and identifying the timing of their speech. We introduce a novel approach leveraging Contrastive Language-Image Pretraining (CLIP) models. Our approach outperforms several audio-visual methods despite its simplicity, and without requiring pre-training on extensive audio-visual datasets.
arXiv Detail & Related papers (2024-10-18T14:43:34Z)
Robust Audiovisual Speech Recognition Models with Mixture-of-Experts [67.75334989582709]
We introduce EVA, leveraging the mixture-of-Experts for audioVisual ASR to perform robust speech recognition for "in-the-wild" videos. We first encode visual information into a sequence of visual tokens and map them into the speech space with a lightweight projection. Experiments show our model achieves state-of-the-art results on three benchmarks.
arXiv Detail & Related papers (2024-09-19T00:08:28Z)
EnCodecMAE: Leveraging neural codecs for universal audio representation learning [16.590638305972632]
We propose masking representations of the audio signal and training an MAE to reconstruct the masked segments. We evaluate this approach, which we call EnCodecMAE, on a wide range of tasks involving speech, music and environmental sounds.
arXiv Detail & Related papers (2023-09-14T02:21:53Z)
A multimodal dynamical variational autoencoder for audiovisual speech representation learning [23.748108659645844]
We present a multimodal and dynamical VAE (MDVAE) applied to unsupervised audio-visual speech representation learning. Experiments include manipulating audiovisual speech, audiovisual facial image denoising, and audiovisual speech emotion recognition.
arXiv Detail & Related papers (2023-05-05T14:37:26Z)
A vector quantized masked autoencoder for speech emotion recognition [3.985839436158186]
We propose the vector quantized masked autoencoder for speech (VQ-MAE-S), a self-supervised model that is fine-tuned to recognize emotions from speech signals. Experimental results show that the proposed VQ-MAE-S model, pre-trained on the VoxCeleb2 dataset, outperforms an MAE working on the raw spectrogram representation.
arXiv Detail & Related papers (2023-04-21T16:37:57Z)
Audiovisual Masked Autoencoders [93.22646144125457]
We show that we can achieve significant improvements on audiovisual downstream classification tasks. We additionally demonstrate the transferability of our representations, achieving state-of-the-art audiovisual results on Epic Kitchens.
arXiv Detail & Related papers (2022-12-09T17:34:53Z)
VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose a unified cross-modal representation learning framework, VATLM (Visual-Audio-Text Language Model). The proposed VATLM employs a unified backbone network to model the modality-independent information. To integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
arXiv Detail & Related papers (2022-11-21T09:10:10Z)
Contrastive Audio-Visual Masked Autoencoder [85.53776628515561]
We propose the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE). Our fully self-supervised pretrained CAV-MAE achieves a new SOTA accuracy of 65.9% on VGGSound.
arXiv Detail & Related papers (2022-10-02T07:29:57Z)
Self-supervised Learning with Random-projection Quantizer for Speech Recognition [51.24368930992091]
We present a simple and effective self-supervised learning approach for speech recognition. The approach learns a model to predict masked speech signals in the form of discrete labels. It achieves word error rates similar to those of previous work using self-supervised learning with non-streaming models.
arXiv Detail & Related papers (2022-02-03T21:29:04Z)
VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement. We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training. Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
Fast End-to-End Speech Recognition via a Non-Autoregressive Model and Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once). The model consists of an encoder, a decoder, and a position-dependent summarizer (PDS).
arXiv Detail & Related papers (2021-02-15T15:18:59Z)
Deep Variational Generative Models for Audio-visual Speech Separation [33.227204390773316]
We propose an unsupervised technique based on audio-visual generative modeling of clean speech. To better utilize the visual information, the posteriors of the latent variables are inferred from mixed speech. Our experiments show that the proposed unsupervised VAE-based method yields better separation performance than NMF-based approaches.
arXiv Detail & Related papers (2020-08-17T10:12:33Z)
Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform. We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio). Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z)
Unsupervised Audiovisual Synthesis via Exemplar Autoencoders [59.13989658692953]
We present an unsupervised approach that converts the input speech of any individual into audiovisual streams of potentially-infinitely many output speakers. We use Exemplar Autoencoders to learn the voice, stylistic prosody, and visual appearance of a specific target speech exemplar.
arXiv Detail & Related papers (2020-01-13T18:56:45Z)
Visually Guided Self Supervised Learning of Speech Representations [62.23736312957182]
We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech. We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment. We achieve state-of-the-art results for emotion recognition and competitive results for speech recognition.
arXiv Detail & Related papers (2020-01-13T14:53:22Z)

This list is automatically generated from the titles and abstracts of the papers on this site.