Sparse in Space and Time: Audio-visual Synchronisation with Trainable Selectors
- URL: http://arxiv.org/abs/2210.07055v1
- Date: Thu, 13 Oct 2022 14:25:37 GMT
- Title: Sparse in Space and Time: Audio-visual Synchronisation with Trainable Selectors
- Authors: Vladimir Iashin and Weidi Xie and Esa Rahtu and Andrew Zisserman
- Abstract summary: The objective of this paper is audio-visual synchronisation of general videos 'in the wild'.
We make four contributions: (i) in order to handle longer temporal sequences required for sparse synchronisation signals, we design a multi-modal transformer model that employs 'selectors' to distil the long audio and visual streams into small sequences used to predict the temporal offset between streams.
We identify artefacts that can arise from the compression codecs used for audio and video and can be used by audio-visual models in training to artificially solve the synchronisation task.
- Score: 103.21152156339484
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The objective of this paper is audio-visual synchronisation of general videos
'in the wild'. For such videos, the events that may be harnessed for
synchronisation cues may be spatially small and may occur only infrequently
during a many seconds-long video clip, i.e. the synchronisation signal is
'sparse in space and time'. This contrasts with the case of synchronising
videos of talking heads, where audio-visual correspondence is dense in both
time and space.
We make four contributions: (i) in order to handle longer temporal sequences
required for sparse synchronisation signals, we design a multi-modal
transformer model that employs 'selectors' to distil the long audio and visual
streams into small sequences that are then used to predict the temporal offset
between streams. (ii) We identify artefacts that can arise from the compression
codecs used for audio and video and can be used by audio-visual models in
training to artificially solve the synchronisation task. (iii) We curate a
dataset with only sparse in time and space synchronisation signals; and (iv)
the effectiveness of the proposed model is shown on both dense and sparse
datasets quantitatively and qualitatively.
Project page: v-iashin.github.io/SparseSync
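As an illustration of the selector mechanism described in the abstract, below is a minimal PyTorch sketch, assuming learnable query tokens that cross-attend to long, pre-extracted audio and visual token streams and distil each into a short sequence, which a small fusion transformer then uses to classify the temporal offset. The module names (Selector, OffsetPredictor), query counts, dimensions, and the 21-way offset classification are illustrative assumptions rather than the authors' actual implementation; see the project page above for the official code.

```python
# Minimal sketch of the 'selector' idea: learnable queries cross-attend to
# long audio/visual token streams, distilling each into a short sequence;
# a small fusion transformer then classifies the temporal offset.
# All names, sizes, and the offset-class formulation are illustrative
# assumptions, not the paper's exact implementation.
import torch
import torch.nn as nn


class Selector(nn.Module):
    """Distils a long token stream into `num_queries` tokens via cross-attention."""

    def __init__(self, dim: int, num_queries: int, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T, dim) -> distilled: (B, num_queries, dim)
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        distilled, _ = self.attn(q, tokens, tokens)
        return distilled


class OffsetPredictor(nn.Module):
    """Fuses distilled audio/visual tokens and classifies the temporal offset."""

    def __init__(self, dim: int = 512, num_offsets: int = 21):
        super().__init__()
        self.audio_selector = Selector(dim, num_queries=16)
        self.visual_selector = Selector(dim, num_queries=16)
        encoder_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(encoder_layer, num_layers=3)
        self.head = nn.Linear(dim, num_offsets)  # offset treated as a class

    def forward(self, audio_tokens: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        a = self.audio_selector(audio_tokens)    # (B, 16, dim)
        v = self.visual_selector(visual_tokens)  # (B, 16, dim)
        fused = self.fusion(torch.cat([a, v], dim=1))
        return self.head(fused.mean(dim=1))      # (B, num_offsets) logits


# Usage with dummy pre-extracted features standing in for long streams:
model = OffsetPredictor()
audio = torch.randn(2, 400, 512)   # e.g. several seconds of audio tokens
video = torch.randn(2, 250, 512)   # e.g. several seconds of visual tokens
logits = model(audio, video)
print(logits.shape)  # torch.Size([2, 21])
```

Treating the offset as a classification over a discrete set of candidate shifts keeps the prediction head simple, while the selectors bound the cost of attending over long input streams.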
Related papers
- Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity [12.848371604063168]
We propose a V2A generative model, named MaskVAT, that interconnects a full-band high-quality general audio codec with a sequence-to-sequence masked generative model.
Our results show that, by combining a high-quality codec with the proper pre-trained audio-visual features and a sequence-to-sequence parallel structure, we are able to yield highly synchronized results.
arXiv Detail & Related papers (2024-07-15T01:49:59Z)
- Synchformer: Efficient Synchronization from Sparse Cues [100.89656994681934]
Our contributions include a novel audio-visual synchronization model, and training that decouples feature extraction from synchronization modelling.
This approach achieves state-of-the-art performance in both dense and sparse settings.
We also extend synchronization model training to AudioSet, a million-scale 'in-the-wild' dataset, investigate evidence attribution techniques for interpretability, and explore a new capability for synchronization models: audio-visual synchronizability.
arXiv Detail & Related papers (2024-01-29T18:59:55Z) - Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities [67.89368528234394]
One of the main challenges of multimodal learning is the need to combine heterogeneous modalities.
Video and audio are obtained at much higher rates than text and are roughly aligned in time.
Our approach achieves the state-of-the-art on well established multimodal benchmarks, outperforming much larger models.
arXiv Detail & Related papers (2023-11-09T19:15:12Z) - GestSync: Determining who is speaking without a talking head [67.75387744442727]
We introduce Gesture-Sync: determining if a person's gestures are correlated with their speech or not.
In comparison to Lip-Sync, Gesture-Sync is far more challenging as there is a far looser relationship between the voice and body movement.
We show that the model can be trained using self-supervised learning alone, and evaluate its performance on the LRS3 dataset.
arXiv Detail & Related papers (2023-10-08T22:48:30Z) - On the Audio-visual Synchronization for Lip-to-Speech Synthesis [22.407313748927393]
We show that the commonly used audio-visual datasets, such as GRID, TCD-TIMIT, and Lip2Wav, can have data asynchrony issues.
Training lip-to-speech with such datasets may further cause the model asynchrony issue -- that is, the generated speech and the input video are out of sync.
arXiv Detail & Related papers (2023-03-01T13:35:35Z) - End to End Lip Synchronization with a Temporal AutoEncoder [95.94432031144716]
We study the problem of syncing the lip movement in a video with the audio stream.
Our solution finds an optimal alignment using a dual-domain recurrent neural network.
As an application, we demonstrate our ability to robustly align text-to-speech generated audio with an existing video stream.
arXiv Detail & Related papers (2022-03-30T12:00:18Z) - Audio-Visual Synchronisation in the wild [149.84890978170174]
We identify and curate a test set with high audio-visual correlation, namely VGG-Sound Sync.
We compare a number of transformer-based architectural variants specifically designed to model audio and visual signals of arbitrary length.
We set the first benchmark for general audio-visual synchronisation with over 160 diverse classes in the new VGG-Sound Sync video dataset.
arXiv Detail & Related papers (2021-12-08T17:50:26Z)