On the Audio-visual Synchronization for Lip-to-Speech Synthesis
- URL: http://arxiv.org/abs/2303.00502v1
- Date: Wed, 1 Mar 2023 13:35:35 GMT
- Title: On the Audio-visual Synchronization for Lip-to-Speech Synthesis
- Authors: Zhe Niu and Brian Mak
- Abstract summary: We show that commonly used audio-visual datasets, such as GRID, TCD-TIMIT, and Lip2Wav, can have data asynchrony issues.
Training lip-to-speech models on such datasets may in turn cause a model asynchrony issue -- that is, the generated speech and the input video are out of sync.
- Score: 22.407313748927393
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most lip-to-speech (LTS) synthesis models are trained and evaluated
under the assumption that the audio-video pairs in the dataset are perfectly
synchronized. In this work, we show that commonly used audio-visual datasets,
such as GRID, TCD-TIMIT, and Lip2Wav, can have data asynchrony issues. Training
lip-to-speech models on such datasets may in turn cause a model asynchrony
issue -- that is, the generated speech and the input video are out of sync. To
address these asynchrony issues, we propose a synchronized lip-to-speech (SLTS)
model with an automatic synchronization mechanism (ASM) that corrects data
asynchrony and penalizes model asynchrony. We further demonstrate the
limitations of the commonly adopted LTS evaluation metrics on asynchronous test
data and introduce an audio alignment frontend, placed before the metrics that
are sensitive to time alignment, for a fairer evaluation. We compare our method
with state-of-the-art approaches on both conventional and time-aligned metrics
to show the benefits of synchronization training.
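The abstract does not spell out the alignment frontend, but its role is to estimate the global offset between the reference and generated waveforms, shift the generated audio accordingly, and only then compute metrics that are sensitive to time alignment (e.g., STOI or PESQ). Below is a minimal sketch of such a frontend, assuming a simple cross-correlation-based offset estimate; the helper names and the use of scipy are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of an audio alignment frontend (illustrative, not the authors'
# implementation): estimate a global lag by cross-correlation, shift the
# generated audio, then hand both signals to a time-sensitive metric.
import numpy as np
from scipy.signal import correlate


def estimate_offset(reference: np.ndarray, generated: np.ndarray) -> int:
    """Lag (in samples) that best aligns `generated` with `reference`."""
    xcorr = correlate(reference, generated, mode="full")
    # Zero lag sits at index len(generated) - 1 of the full cross-correlation.
    return int(np.argmax(xcorr)) - (len(generated) - 1)


def align_to_reference(reference: np.ndarray, generated: np.ndarray) -> np.ndarray:
    """Shift `generated` by the estimated lag and pad/crop to the reference length."""
    lag = estimate_offset(reference, generated)
    if lag > 0:
        # Generated speech runs ahead of the reference: delay it with leading silence.
        shifted = np.concatenate([np.zeros(lag), generated])
    else:
        # Generated speech lags behind the reference: drop its leading samples.
        shifted = generated[-lag:]
    if len(shifted) < len(reference):
        shifted = np.concatenate([shifted, np.zeros(len(reference) - len(shifted))])
    return shifted[: len(reference)]


def aligned_metric(metric_fn, reference: np.ndarray, generated: np.ndarray) -> float:
    """Apply a time-alignment-sensitive metric (e.g. STOI, PESQ) after alignment."""
    return metric_fn(reference, align_to_reference(reference, generated))
```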
Related papers
- Audio-Visual Speech Representation Expert for Enhanced Talking Face Video Generation and Evaluation [51.92522679353731]
We propose utilizing an audio-visual speech representation expert (AV-HuBERT) for calculating a lip synchronization loss during training.
We introduce three novel lip synchronization evaluation metrics, aiming to provide a comprehensive assessment of lip synchronization performance.
arXiv Detail & Related papers (2024-05-07T13:55:50Z)
- Synchformer: Efficient Synchronization from Sparse Cues [100.89656994681934]
Our contributions include a novel audio-visual synchronization model, and training that decouples feature extraction from synchronization modelling.
This approach achieves state-of-the-art performance in both dense and sparse settings.
We also extend synchronization model training to AudioSet, a million-scale 'in-the-wild' dataset, investigate evidence attribution techniques for interpretability, and explore a new capability for synchronization models: audio-visual synchronizability.
arXiv Detail & Related papers (2024-01-29T18:59:55Z)
- GestSync: Determining who is speaking without a talking head [67.75387744442727]
We introduce Gesture-Sync: determining if a person's gestures are correlated with their speech or not.
In comparison to Lip-Sync, Gesture-Sync is far more challenging as there is a far looser relationship between the voice and body movement.
We show that the model can be trained using self-supervised learning alone, and evaluate its performance on the LRS3 dataset.
arXiv Detail & Related papers (2023-10-08T22:48:30Z)
- Sparse in Space and Time: Audio-visual Synchronisation with Trainable Selectors [103.21152156339484]
The objective of this paper is audio-visual synchronisation of general videos 'in the wild'.
We make four contributions: (i) in order to handle longer temporal sequences required for sparse synchronisation signals, we design a multi-modal transformer model that employs 'selectors'.
We identify artefacts that can arise from the compression codecs used for audio and video, which audio-visual models can exploit during training to artificially solve the synchronisation task.
arXiv Detail & Related papers (2022-10-13T14:25:37Z)
- End to End Lip Synchronization with a Temporal AutoEncoder [95.94432031144716]
We study the problem of syncing the lip movement in a video with the audio stream.
Our solution finds an optimal alignment using a dual-domain recurrent neural network.
As an application, we demonstrate our ability to robustly align text-to-speech generated audio with an existing video stream.
arXiv Detail & Related papers (2022-03-30T12:00:18Z)
- Audio-Visual Synchronisation in the wild [149.84890978170174]
We identify and curate a test set with high audio-visual correlation, namely VGG-Sound Sync.
We compare a number of transformer-based architectural variants specifically designed to model audio and visual signals of arbitrary length.
We set the first benchmark for general audio-visual synchronisation with over 160 diverse classes in the new VGG-Sound Sync video dataset.
arXiv Detail & Related papers (2021-12-08T17:50:26Z)
- Automatic audiovisual synchronisation for ultrasound tongue imaging [35.60751372748571]
Ultrasound and speech audio are recorded simultaneously, and in order to use this data correctly, the two modalities must be synchronised.
Synchronisation is achieved using specialised hardware at recording time, but this approach can fail in practice, resulting in data of limited usability.
In this paper, we address the problem of automatically synchronising ultrasound and audio after data collection.
We describe our approach for automatic synchronisation, which is driven by a self-supervised neural network, exploiting the correlation between the two signals to synchronise them.
arXiv Detail & Related papers (2021-05-31T17:11:28Z)
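Several of the related papers above (e.g., Synchformer, the VGG-Sound Sync work, and the ultrasound tongue-imaging paper) build on the same basic idea: learn audio and visual embeddings whose correlation peaks at the correct temporal offset. The sketch below illustrates only the offset-scoring step, under the assumption that per-frame embeddings from pretrained encoders are already available; the function names and the cosine-similarity scoring are generic assumptions, not any specific paper's method.

```python
# Generic sketch of correlation-based audio-visual offset estimation, as used
# conceptually in several of the papers above (encoder details omitted).
# `audio_emb` and `visual_emb` are assumed per-frame embeddings of shape
# (num_frames, dim) from hypothetical pretrained audio and visual encoders.
import numpy as np


def frame_cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Frame-wise cosine similarity between two (frames, dim) embedding matrices."""
    a_norm = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b_norm = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    return np.sum(a_norm * b_norm, axis=1)


def estimate_av_offset(audio_emb: np.ndarray, visual_emb: np.ndarray,
                       max_offset: int = 15) -> int:
    """Return the audio-to-video frame offset with the highest mean similarity."""
    best_offset, best_score = 0, -np.inf
    for offset in range(-max_offset, max_offset + 1):
        # Slide one stream relative to the other and compare overlapping frames.
        a = audio_emb[offset:] if offset >= 0 else audio_emb
        v = visual_emb if offset >= 0 else visual_emb[-offset:]
        n = min(len(a), len(v))
        if n == 0:
            continue
        score = float(np.mean(frame_cosine_similarity(a[:n], v[:n])))
        if score > best_score:
            best_offset, best_score = offset, score
    return best_offset
```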