Speech Prediction in Silent Videos using Variational Autoencoders
- URL: http://arxiv.org/abs/2011.07340v1
- Date: Sat, 14 Nov 2020 17:09:03 GMT
- Title: Speech Prediction in Silent Videos using Variational Autoencoders
- Authors: Ravindra Yadav, Ashish Sardana, Vinay P Namboodiri, Rajesh M Hegde
- Abstract summary: We present a model for generating speech in a silent video.
The proposed model combines recurrent neural networks and variational deep generative models to learn the auditory signal's conditional distribution given the visual signal.
We demonstrate the performance of our model on the GRID dataset based on standard benchmarks.
- Score: 29.423462898526605
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Understanding the relationship between the auditory and visual signals is
crucial for many different applications ranging from computer-generated imagery
(CGI) and video editing automation to assisting people with hearing or visual
impairments. However, this is challenging since the distributions of both the
audio and visual modalities are inherently multimodal. Therefore, most of the
existing methods ignore the multimodal aspect and assume that there exists only
a deterministic one-to-one mapping between the two modalities. This can lead to
low-quality predictions as the model collapses to optimizing the average
behavior rather than learning the full data distributions. In this paper, we
present a stochastic model for generating speech in a silent video. The
proposed model combines recurrent neural networks and variational deep
generative models to learn the auditory signal's conditional distribution given
the visual signal. We demonstrate the performance of our model on the GRID
dataset based on standard benchmarks.
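To make the idea concrete, the sketch below shows one common way to wire a video-conditioned recurrent VAE: a GRU summarizes the frame features, per-timestep prior and posterior Gaussians provide the stochastic latent, and a recurrent decoder emits audio features (e.g., mel-spectrogram frames). This is only an illustrative sketch of the general technique under assumed layer sizes and names, not the authors' exact architecture.

```python
# Illustrative video-conditioned recurrent VAE (not the paper's exact model).
import torch
import torch.nn as nn

class VideoToSpeechCVAE(nn.Module):
    def __init__(self, vid_dim=512, aud_dim=80, hid=256, z_dim=64):
        super().__init__()
        self.vid_rnn = nn.GRU(vid_dim, hid, batch_first=True)   # visual context per frame
        self.prior = nn.Linear(hid, 2 * z_dim)                   # p(z_t | video)
        self.post = nn.Linear(hid + aud_dim, 2 * z_dim)          # q(z_t | video, audio)
        self.dec = nn.GRU(z_dim + hid, hid, batch_first=True)
        self.out = nn.Linear(hid, aud_dim)                       # predicted audio features

    def forward(self, video, audio):
        # video: (B, T, vid_dim) frame features; audio: (B, T, aud_dim) target features
        h, _ = self.vid_rnn(video)
        mu_p, logvar_p = self.prior(h).chunk(2, dim=-1)
        mu_q, logvar_q = self.post(torch.cat([h, audio], dim=-1)).chunk(2, dim=-1)
        z = mu_q + torch.randn_like(mu_q) * (0.5 * logvar_q).exp()   # reparameterization
        dec_h, _ = self.dec(torch.cat([z, h], dim=-1))
        recon = self.out(dec_h)
        # ELBO terms: reconstruction error plus KL(q || p) between diagonal Gaussians
        kl = 0.5 * (logvar_p - logvar_q
                    + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp() - 1)
        return recon, (recon - audio).pow(2).mean() + kl.mean()
```

At generation time the latent would instead be sampled from the video-conditioned prior, since no audio is available for a silent clip; this is what keeps the mapping stochastic rather than collapsing to a single averaged prediction.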
Related papers
- MULDE: Multiscale Log-Density Estimation via Denoising Score Matching for Video Anomaly Detection [15.72443573134312]
We treat feature vectors extracted from videos as realizations of a random variable with a fixed distribution.
We train our video anomaly detector using a modification of denoising score matching.
Our experiments on five popular video anomaly detection benchmarks demonstrate state-of-the-art performance.
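For readers unfamiliar with the objective, the snippet below is a minimal, generic denoising score matching loss, not MULDE's multiscale formulation; the network name and noise level are placeholder assumptions.

```python
# Generic denoising score matching: train score_net so that its output on a
# noise-perturbed feature vector matches the score of the perturbation kernel,
# i.e. s(x + sigma*eps, sigma) ≈ -eps / sigma. A low estimated density (or a
# large score norm) on test features can then serve as an anomaly signal.
import torch

def dsm_loss(score_net, x, sigma=0.1):
    eps = torch.randn_like(x)
    x_noisy = x + sigma * eps
    target = -eps / sigma
    pred = score_net(x_noisy, sigma)          # score_net: callable (x, sigma) -> score
    return ((pred - target) ** 2).sum(dim=-1).mean()
```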
arXiv Detail & Related papers (2024-03-21T15:46:19Z)
- Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks [55.36987468073152]
This paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention mechanism.
The DG-SCT module incorporates trainable cross-modal interaction layers into pre-trained audio-visual encoders.
Our proposed model achieves state-of-the-art results across multiple downstream tasks, including AVE, AVVP, AVS, and AVQA.
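The summary does not spell out how the interaction layers work; purely as a guess at the general pattern, one trainable cross-modal adapter could gate the channels of one (frozen) encoder's features using the other modality, as in the hypothetical sketch below.

```python
# Hypothetical cross-modal adapter (illustrative only, not the DG-SCT design):
# an audio clip embedding produces channel-wise gates for frozen visual tokens.
import torch
import torch.nn as nn

class CrossModalChannelGate(nn.Module):
    def __init__(self, aud_dim=768, vis_dim=768):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(aud_dim, vis_dim), nn.Sigmoid())

    def forward(self, vis_tokens, aud_emb):
        # vis_tokens: (B, N, vis_dim) from a frozen visual encoder
        # aud_emb: (B, aud_dim) from a frozen audio encoder
        return vis_tokens * self.gate(aud_emb).unsqueeze(1)
```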
arXiv Detail & Related papers (2023-11-09T05:24:20Z)
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
A non-autoregressive framework enhances controllability, and a duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding [57.42429912884543]
We propose Diff-LM-Speech, Tetra-Diff-Speech and Tri-Diff-Speech to solve high dimensionality and waveform distortion problems.
We also introduce a prompt encoder structure based on a variational autoencoder and a prosody bottleneck to improve prompt representation ability.
Experimental results show that our proposed methods outperform baseline methods.
arXiv Detail & Related papers (2023-07-28T11:20:23Z)
- Towards Robust FastSpeech 2 by Modelling Residual Multimodality [4.4904382374090765]
State-of-the-art non-autoregressive text-to-speech models based on FastSpeech 2 can efficiently synthesise high-fidelity and natural speech.
However, we observe characteristic audio distortions in expressive speech datasets.
TVC-GMM reduces spectrogram smoothness and improves perceptual audio quality, in particular for expressive datasets.
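The underlying fix is to replace the usual unimodal regression loss with a mixture-density output so the model can represent residual multimodality. The sketch below shows only the generic per-bin Gaussian-mixture negative log-likelihood, with made-up dimensions; it does not reproduce TVC-GMM's specific factorization.

```python
# Generic mixture-density head: predict K mixture weights, means and log-variances
# per spectrogram bin and train with the GMM negative log-likelihood instead of
# a unimodal L1/L2 loss.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class GMMHead(nn.Module):
    def __init__(self, hid=256, n_mels=80, k=5):
        super().__init__()
        self.k = k
        self.proj = nn.Linear(hid, n_mels * k * 3)

    def nll(self, h, target):
        # h: (B, T, hid) decoder states; target: (B, T, n_mels) ground-truth bins
        B, T, n_mels = target.shape
        w, mu, logvar = self.proj(h).view(B, T, n_mels, self.k, 3).unbind(-1)
        log_w = F.log_softmax(w, dim=-1)
        log_prob = -0.5 * (logvar + (target.unsqueeze(-1) - mu) ** 2 / logvar.exp()
                           + math.log(2 * math.pi))
        return -torch.logsumexp(log_w + log_prob, dim=-1).mean()
```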
arXiv Detail & Related papers (2023-06-02T11:03:26Z)
- MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation [70.74377373885645]
We propose the first joint audio-video generation framework that brings engaging watching and listening experiences simultaneously.
MM-Diffusion consists of a sequential multi-modal U-Net for a joint denoising process by design.
Experiments show superior results in unconditional audio-video generation, and zero-shot conditional tasks.
arXiv Detail & Related papers (2022-12-19T14:11:52Z)
- Attention Bottlenecks for Multimodal Fusion [90.75885715478054]
Machine perception models are typically modality-specific and optimised for unimodal benchmarks.
We introduce a novel transformer-based architecture that uses 'fusion bottlenecks' for modality fusion at multiple layers.
We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks.
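As a rough illustration of the bottleneck idea (layer sizes and the exact token-update rule are assumptions, not the paper's): the two modalities never attend to each other directly, only to a handful of shared fusion tokens.

```python
# Illustrative bottleneck fusion layer: audio and video token streams exchange
# information only through a few learned fusion tokens.
import torch
import torch.nn as nn

class BottleneckFusionLayer(nn.Module):
    def __init__(self, dim=768, heads=8, n_bottleneck=4):
        super().__init__()
        self.fsn = nn.Parameter(0.02 * torch.randn(1, n_bottleneck, dim))
        self.audio_blk = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.video_blk = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, audio_tok, video_tok):
        n = self.fsn.size(1)
        fsn = self.fsn.expand(audio_tok.size(0), -1, -1)
        # Each modality self-attends over its own tokens plus the shared bottleneck.
        a = self.audio_blk(torch.cat([audio_tok, fsn], dim=1))
        audio_tok, fsn = a[:, :-n], a[:, -n:]
        v = self.video_blk(torch.cat([video_tok, fsn], dim=1))
        video_tok, fsn = v[:, :-n], v[:, -n:]
        return audio_tok, video_tok, fsn
```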
arXiv Detail & Related papers (2021-06-30T22:44:12Z)
- Diffusion-Based Representation Learning [65.55681678004038]
We augment the denoising score matching framework to enable representation learning without any supervised signal.
In contrast to prior approaches, the introduced diffusion-based representation learning relies on a new formulation of the denoising score matching objective.
Using the same approach, we propose to learn an infinite-dimensional latent code that achieves improvements of state-of-the-art models on semi-supervised image classification.
arXiv Detail & Related papers (2021-05-29T09:26:02Z)
- Switching Variational Auto-Encoders for Noise-Agnostic Audio-visual Speech Enhancement [26.596930749375474]
We introduce the use of a latent sequential variable with Markovian dependencies to switch between different VAE architectures through time.
We derive the corresponding variational expectation-maximization algorithm to estimate the parameters of the model and enhance the speech signal.
arXiv Detail & Related papers (2021-02-08T11:45:02Z)
- Multi Modal Adaptive Normalization for Audio to Video Generation [18.812696623555855]
We propose a multi-modal adaptive normalization (MAN) based architecture to synthesize a talking-person video of arbitrary length using an audio signal and a single image of the person as input.
The architecture uses multi-modal adaptive normalization, a keypoint heatmap predictor, an optical flow predictor, and class activation map based layers to learn the movements of expressive facial components.
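A minimal sketch of what audio-conditioned adaptive normalization usually looks like (AdaIN-style scale and shift predicted from an audio embedding); the names and sizes are placeholders, not the paper's MAN block.

```python
# Generic adaptive normalization: normalize visual feature maps, then modulate
# them with a scale and shift predicted from an audio embedding.
import torch
import torch.nn as nn

class AudioAdaptiveNorm(nn.Module):
    def __init__(self, channels=256, aud_dim=128):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.to_scale_shift = nn.Linear(aud_dim, 2 * channels)

    def forward(self, feat, aud_emb):
        # feat: (B, C, H, W) generator features; aud_emb: (B, aud_dim)
        scale, shift = self.to_scale_shift(aud_emb).chunk(2, dim=-1)
        scale = scale[..., None, None]
        shift = shift[..., None, None]
        return self.norm(feat) * (1 + scale) + shift
```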
arXiv Detail & Related papers (2020-12-14T07:39:45Z)
- Deep Variational Generative Models for Audio-visual Speech Separation [33.227204390773316]
We propose an unsupervised technique based on audio-visual generative modeling of clean speech.
To better utilize the visual information, the posteriors of the latent variables are inferred from mixed speech.
Our experiments show that the proposed unsupervised VAE-based method yields better separation performance than NMF-based approaches.
arXiv Detail & Related papers (2020-08-17T10:12:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information (including all content) and is not responsible for any consequences of its use.