End to End Lip Synchronization with a Temporal AutoEncoder
- URL: http://arxiv.org/abs/2203.16224v1
- Date: Wed, 30 Mar 2022 12:00:18 GMT
- Title: End to End Lip Synchronization with a Temporal AutoEncoder
- Authors: Yoav Shalev, Lior Wolf
- Abstract summary: We study the problem of syncing the lip movement in a video with the audio stream.
Our solution finds an optimal alignment using a dual-domain recurrent neural network.
As an application, we demonstrate our ability to robustly align text-to-speech generated audio with an existing video stream.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the problem of syncing the lip movement in a video with the audio
stream. Our solution finds an optimal alignment using a dual-domain recurrent
neural network that is trained on synthetic data we generate by dropping and
duplicating video frames. Once the alignment is found, we modify the video in
order to sync the two sources. Our method is shown to greatly outperform the
literature methods on a variety of existing and new benchmarks. As an
application, we demonstrate our ability to robustly align text-to-speech
generated audio with an existing video stream. Our code and samples are
available at
https://github.com/itsyoavshalev/End-to-End-Lip-Synchronization-with-a-Temporal-AutoEncoder.
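The abstract describes training on synthetic misalignments produced by dropping and duplicating video frames. A minimal sketch of that idea is below; the function name, probabilities, and exact corruption procedure are illustrative assumptions, not the authors' implementation:

```python
import random

def corrupt_frame_sequence(num_frames, drop_prob=0.1, dup_prob=0.1, seed=0):
    """Simulate lip-sync drift by dropping and duplicating video frames.

    Returns the list of original frame indices that survive, in order,
    in the corrupted video. The mapping corrupted_position -> original_index
    is exactly the alignment a model would then be trained to recover.
    """
    rng = random.Random(seed)
    alignment = []
    for i in range(num_frames):
        if rng.random() < drop_prob:
            continue              # frame dropped: video falls behind the audio
        alignment.append(i)       # frame kept as-is
        if rng.random() < dup_prob:
            alignment.append(i)   # frame duplicated: video runs ahead of the audio
    return alignment
```

Because the ground-truth alignment is known by construction, arbitrarily many supervised training pairs can be generated from ordinary in-sync videos.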
Related papers
- EgoSonics: Generating Synchronized Audio for Silent Egocentric Videos
We introduce EgoSonics, a method to generate semantically meaningful and synchronized audio tracks conditioned on silent egocentric videos.
Generating audio for silent egocentric videos could open new applications in virtual reality, assistive technologies, or for augmenting existing datasets.
arXiv Detail & Related papers (2024-07-30T06:57:00Z)
- FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds
We study Neural Foley, the automatic generation of high-quality sound effects synchronizing with videos, enabling an immersive audio-visual experience.
We propose FoleyCrafter, a novel framework that leverages a pre-trained text-to-audio model to ensure high-quality audio generation.
One notable advantage of FoleyCrafter is its compatibility with text prompts, enabling the use of text descriptions to achieve controllable and diverse video-to-audio generation according to user intents.
arXiv Detail & Related papers (2024-07-01T17:35:56Z)
- Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching
Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video.
We propose Frieren, a V2A model based on rectified flow matching.
Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment.
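Rectified flow matching, as used by Frieren, learns a velocity field along near-straight paths between noise and data; sampling then reduces to integrating an ODE from t=0 to t=1. A minimal Euler-integration sketch follows (the `velocity_field` callable stands in for the trained network and is a placeholder assumption):

```python
def euler_sample(velocity_field, x0, steps=10):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (data) with Euler steps.

    `velocity_field(x, t)` returns the predicted velocity at state x, time t;
    in Frieren this would be the trained model, here it is any callable.
    """
    x, dt = list(x0), 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity_field(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]  # one Euler step along the flow
    return x
```

Because rectified flows are nearly straight, very few integration steps suffice, which is where the efficiency claim comes from.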
arXiv Detail & Related papers (2024-06-01T06:40:22Z)
- Large-scale unsupervised audio pre-training for video-to-speech synthesis
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker.
In this paper we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24kHz.
We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task.
arXiv Detail & Related papers (2023-06-27T13:31:33Z) - Sparse in Space and Time: Audio-visual Synchronisation with Trainable
Selectors [103.21152156339484]
The objective of this paper is audio-visual synchronisation of general videos 'in the wild'.
We make four contributions: (i) in order to handle the longer temporal sequences required for sparse synchronisation signals, we design a multi-modal transformer model that employs 'selectors'.
We also identify artefacts arising from the audio and video compression codecs, which audio-visual models can exploit during training to artificially solve the synchronisation task.
arXiv Detail & Related papers (2022-10-13T14:25:37Z)
- Real-time Streaming Video Denoising with Bidirectional Buffers
Real-time denoising algorithms are typically adopted on the user device to remove the noise involved during the shooting and transmission of video streams.
Recent multi-output inference works propagate the bidirectional temporal feature with a parallel or recurrent framework.
We propose a Bidirectional Streaming Video Denoising framework, to achieve high-fidelity real-time denoising for streaming videos with both past and future temporal receptive fields.
arXiv Detail & Related papers (2022-07-14T14:01:03Z)
- End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs).
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z)
- Audio-based Near-Duplicate Video Retrieval with Audio Similarity Learning
We propose the Audio Similarity Learning (AuSiL) approach that effectively captures temporal patterns of audio similarity between video pairs.
We train our network following a triplet generation process and optimize the triplet loss function.
The proposed approach achieves very competitive results compared to three state-of-the-art methods.
arXiv Detail & Related papers (2020-10-17T08:12:18Z)
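The AuSiL entry above mentions optimizing a triplet loss over generated triplets. A minimal sketch of a similarity-based triplet hinge loss is shown below; representing embeddings as plain lists and using a dot-product similarity are simplifying assumptions, not AuSiL's actual formulation:

```python
def dot(u, v):
    """Dot-product similarity between two embedding vectors."""
    return sum(x * y for x, y in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss that pushes the anchor-positive similarity above the
    anchor-negative similarity by at least `margin`; zero once satisfied."""
    return max(0.0, margin - dot(anchor, positive) + dot(anchor, negative))
```

During training, the loss is zero for triplets the network already ranks correctly with margin to spare, so gradients concentrate on the hard cases.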
This list is automatically generated from the titles and abstracts of the papers in this site.