Related papers: End to End Lip Synchronization with a Temporal AutoEncoder

URL: http://arxiv.org/abs/2203.16224v1
Date: Wed, 30 Mar 2022 12:00:18 GMT
Title: End to End Lip Synchronization with a Temporal AutoEncoder
Authors: Yoav Shalev, Lior Wolf
Abstract summary: We study the problem of syncing the lip movement in a video with the audio stream. Our solution finds an optimal alignment using a dual-domain recurrent neural network. As an application, we demonstrate our ability to robustly align text-to-speech generated audio with an existing video stream.
Score: 95.94432031144716
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We study the problem of syncing the lip movement in a video with the audio stream. Our solution finds an optimal alignment using a dual-domain recurrent neural network that is trained on synthetic data we generate by dropping and duplicating video frames. Once the alignment is found, we modify the video in order to sync the two sources. Our method is shown to greatly outperform the literature methods on a variety of existing and new benchmarks. As an application, we demonstrate our ability to robustly align text-to-speech generated audio with an existing video stream. Our code and samples are available at https://github.com/itsyoavshalev/End-to-End-Lip-Synchronization-with-a-Temporal-AutoEncoder.
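
The abstract mentions that training data is generated by dropping and duplicating video frames. The sketch below is one hypothetical way to produce such a misaligned frame sequence, not the authors' implementation. The function name `corrupt_frame_indices` and the parameters `drop_prob`, `dup_prob`, and `seed` are assumptions introduced for illustration only.

```python
import random


def corrupt_frame_indices(num_frames, drop_prob=0.1, dup_prob=0.1, seed=None):
    """Return source-frame indices after randomly dropping and duplicating frames.

    Hypothetical helper: the probabilities are illustrative and are not
    taken from the paper. Assumes drop_prob + dup_prob <= 1 so that the
    two events do not overlap.
    """
    rng = random.Random(seed)
    indices = []
    for i in range(num_frames):
        r = rng.random()  # uniform in [0, 1)
        if r < drop_prob:
            continue  # drop this frame entirely
        indices.append(i)
        if r >= 1.0 - dup_prob:
            indices.append(i)  # show this frame twice
    return indices


# Usage: apply the index list to any frame container.
frames = ["f0", "f1", "f2", "f3", "f4", "f5"]
misaligned = [frames[i] for i in corrupt_frame_indices(len(frames), seed=0)]
```

The returned index list is non-decreasing, so it also records, for each output frame, which original frame it came from. Under the assumptions above, that mapping could serve as a ground-truth alignment target for a model.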

Related papers

Long-Video Audio Synthesis with Multi-Agent Collaboration [20.332328741375363]
LVAS-Agent is a novel framework that emulates professional dubbing through collaborative role. Our approach decomposes long-video synthesis into four steps including scene segmentation, script generation, sound design and audio synthesis. Central innovations include a discussion-correction mechanism for scene/script refinement and a generation-retrieval loop for temporal-semantic alignment.
arXiv Detail & Related papers (2025-03-13T07:58:23Z)
Synchronized Video-to-Audio Generation via Mel Quantization-Continuum Decomposition [31.25956665297592]
We decompose the mel-spectrogram into three distinct types of signals, applying quantization or continuity to each. A devised video-to-all (V2X) predictor can effectively predict them from video. The predicted signals are then recomposed and fed into a ControlNet, along with a textual inversion design, to control the audio generation process.
arXiv Detail & Related papers (2025-03-10T07:04:03Z)
EgoSonics: Generating Synchronized Audio for Silent Egocentric Videos [3.6078215038168473]
We introduce EgoSonics, a method to generate semantically meaningful and synchronized audio tracks conditioned on silent egocentric videos. Generating audio for silent egocentric videos could open new applications in virtual reality, assistive technologies, or for augmenting existing datasets.
arXiv Detail & Related papers (2024-07-30T06:57:00Z)
FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds [14.636030346325578]
We study Neural Foley, the automatic generation of high-quality sound effects synchronizing with videos, enabling an immersive audio-visual experience. We propose FoleyCrafter, a novel framework that leverages a pre-trained text-to-audio model to ensure high-quality audio generation. One notable advantage of FoleyCrafter is its compatibility with text prompts, enabling the use of text descriptions to achieve controllable and diverse video-to-audio generation according to user intents.
arXiv Detail & Related papers (2024-07-01T17:35:56Z)
Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching [51.70360630470263]
Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video. We propose Frieren, a V2A model based on rectified flow matching. Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment.
arXiv Detail & Related papers (2024-06-01T06:40:22Z)
Large-scale unsupervised audio pre-training for video-to-speech synthesis [64.86087257004883]
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker. In this paper we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24kHz. We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task.
arXiv Detail & Related papers (2023-06-27T13:31:33Z)
Sparse in Space and Time: Audio-visual Synchronisation with Trainable Selectors [103.21152156339484]
The objective of this paper is audio-visual synchronisation of general videos 'in the wild'. We make four contributions: (i) in order to handle longer temporal sequences required for sparse synchronisation signals, we design a multi-modal transformer model that employs 'selectors'. We identify artefacts that can arise from the compression codecs used for audio and video and can be used by audio-visual models in training to artificially solve the synchronisation task.
arXiv Detail & Related papers (2022-10-13T14:25:37Z)
Real-time Streaming Video Denoising with Bidirectional Buffers [48.57108807146537]
Real-time denoising algorithms are typically adopted on the user device to remove the noise introduced during the shooting and transmission of video streams. Recent multi-output inference works propagate the bidirectional temporal feature with a parallel or recurrent framework. We propose a Bidirectional Streaming Video Denoising framework to achieve high-fidelity real-time denoising for streaming videos with both past and future temporal receptive fields.
arXiv Detail & Related papers (2022-07-14T14:01:03Z)
End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs). Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech. We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z)
Audio-based Near-Duplicate Video Retrieval with Audio Similarity Learning [19.730467023817123]
We propose the Audio Similarity Learning (AuSiL) approach that effectively captures temporal patterns of audio similarity between video pairs. We train our network following a triplet generation process and optimize the triplet loss function. The proposed approach achieves very competitive results compared to three state-of-the-art methods.
arXiv Detail & Related papers (2020-10-17T08:12:18Z)

This list is automatically generated from the titles and abstracts of the papers on this site.