Audio-Visual Synchronisation in the wild
- URL: http://arxiv.org/abs/2112.04432v1
- Date: Wed, 8 Dec 2021 17:50:26 GMT
- Title: Audio-Visual Synchronisation in the wild
- Authors: Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea
Vedaldi, Andrew Zisserman
- Abstract summary: We identify and curate a test set with high audio-visual correlation, namely VGG-Sound Sync.
We compare a number of transformer-based architectural variants specifically designed to model audio and visual signals of arbitrary length.
We set the first benchmark for general audio-visual synchronisation with over 160 diverse classes in the new VGG-Sound Sync video dataset.
- Score: 149.84890978170174
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we consider the problem of audio-visual synchronisation
applied to videos `in-the-wild' (i.e. of general classes beyond speech). As a new
task, we identify and curate a test set with high audio-visual correlation,
namely VGG-Sound Sync. We compare a number of transformer-based architectural
variants specifically designed to model audio and visual signals of arbitrary
length, while significantly reducing memory requirements during training. We
further conduct an in-depth analysis on the curated dataset and define an
evaluation metric for open domain audio-visual synchronisation. We apply our
method on standard lip reading speech benchmarks, LRS2 and LRS3, with ablations
on various aspects. Finally, we set the first benchmark for general
audio-visual synchronisation with over 160 diverse classes in the new VGG-Sound
Sync video dataset. In all cases, our proposed model outperforms the previous
state-of-the-art by a significant margin.
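At its core, the synchronisation task amounts to predicting the temporal offset between an audio track and a video track. The sketch below is a minimal PyTorch illustration of that formulation as offset classification over joint audio-visual tokens; the feature dimensions, offset granularity, and layer sizes are assumptions for illustration, not the paper's actual architecture, which uses dedicated backbones and memory-efficient attention over sequences of arbitrary length.

```python
import torch
import torch.nn as nn

class AVSyncClassifier(nn.Module):
    """Toy transformer that predicts the audio-visual offset as a class.

    Illustrative sketch only: positional encodings are omitted for brevity,
    and all dimensions are made up.
    """

    def __init__(self, dim=256, num_offsets=31, num_layers=4, num_heads=4):
        super().__init__()
        self.audio_proj = nn.Linear(128, dim)     # e.g. log-mel features -> dim
        self.visual_proj = nn.Linear(512, dim)    # e.g. CNN frame features -> dim
        self.modality_emb = nn.Embedding(2, dim)  # distinguish audio vs. video tokens
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.head = nn.Linear(dim, num_offsets)   # candidate offsets, e.g. -15..+15 frames

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (B, Ta, 128), visual_feats: (B, Tv, 512)
        a = self.audio_proj(audio_feats) + self.modality_emb.weight[0]
        v = self.visual_proj(visual_feats) + self.modality_emb.weight[1]
        cls = self.cls_token.expand(a.size(0), -1, -1)
        tokens = torch.cat([cls, a, v], dim=1)
        encoded = self.encoder(tokens)
        return self.head(encoded[:, 0])  # logits over candidate offsets

# Example: score a batch of 8 clips against 31 candidate offsets.
logits = AVSyncClassifier()(torch.randn(8, 50, 128), torch.randn(8, 25, 512))
print(logits.shape)  # torch.Size([8, 31])
```

In this framing, the predicted class with the highest logit gives the estimated shift needed to bring the two streams into alignment.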
Related papers
- Estimated Audio-Caption Correspondences Improve Language-Based Audio Retrieval [3.5570874721859016]
We propose a two-staged training procedure in which multiple retrieval models are first trained without estimated correspondences.
In the second stage, the audio-caption correspondences predicted by these models then serve as prediction targets.
We evaluate our method on the ClothoV2 and AudioCaps benchmarks and show that it improves retrieval performance, even in a restrictive self-distillation setting.
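The second-stage objective can be pictured as a distillation-style retrieval loss in which the estimated correspondences replace the usual one-hot targets. The snippet below is a rough PyTorch sketch of that idea; the tensor names and the KL formulation are chosen here for illustration and are not necessarily the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def second_stage_loss(sim, teacher_corr, temperature=0.07):
    """Distillation-style retrieval loss (illustrative, not the paper's exact form).

    sim:          (B, B) audio-to-caption similarity scores from the student.
    teacher_corr: (B, B) correspondences estimated by first-stage models,
                  used as soft targets instead of the usual identity matrix.
    """
    log_probs = F.log_softmax(sim / temperature, dim=1)
    targets = F.softmax(teacher_corr / temperature, dim=1)
    return F.kl_div(log_probs, targets, reduction="batchmean")

# Usage with random placeholders for a batch of 4 audio-caption pairs.
print(second_stage_loss(torch.randn(4, 4), torch.randn(4, 4)))
```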
arXiv Detail & Related papers (2024-08-21T14:10:58Z)
- SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization [29.53063463863921]
We present SyncVSR, an end-to-end learning framework that leverages quantized audio for frame-level crossmodal supervision.
By integrating a projection layer that synchronizes visual representation with acoustic data, our encoder learns to generate discrete audio tokens from a video sequence in a non-autoregressive manner.
Our empirical evaluations show that it not only achieves state-of-the-art results but also reduces data usage by up to ninefold.
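The frame-level supervision idea can be sketched as a projection from visual features to per-frame distributions over a quantized audio codebook, trained with cross-entropy in parallel over all frames. The snippet below is an illustrative approximation with invented sizes, not the released SyncVSR code.

```python
import torch
import torch.nn as nn

class AudioTokenHead(nn.Module):
    """Frame-level audio-token prediction head (illustrative sketch only).

    Assumes each video frame is paired with k quantized audio tokens from an
    off-the-shelf audio tokenizer; the sizes below are made up.
    """

    def __init__(self, visual_dim=512, codebook_size=1024, tokens_per_frame=4):
        super().__init__()
        self.codebook_size = codebook_size
        self.proj = nn.Linear(visual_dim, tokens_per_frame * codebook_size)

    def forward(self, visual_feats, audio_tokens):
        # visual_feats: (B, T, visual_dim); audio_tokens: (B, T, tokens_per_frame) int64
        logits = self.proj(visual_feats)                              # (B, T, k * V)
        logits = logits.view(*audio_tokens.shape, self.codebook_size) # (B, T, k, V)
        # Non-autoregressive: every frame's tokens are predicted in parallel.
        return nn.functional.cross_entropy(
            logits.reshape(-1, self.codebook_size), audio_tokens.reshape(-1))

head = AudioTokenHead()
loss = head(torch.randn(2, 30, 512), torch.randint(0, 1024, (2, 30, 4)))
print(loss)
```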
arXiv Detail & Related papers (2024-06-18T03:14:22Z)
- Unified Video-Language Pre-training with Synchronized Audio [21.607860535968356]
We propose an enhanced framework for Video-Language pre-training with Synchronized Audio.
Our framework learns tri-modal representations in a unified self-supervised transformer.
Our model, pre-trained on only 0.9M data, achieves improved results against state-of-the-art baselines.
arXiv Detail & Related papers (2024-05-12T07:59:46Z)
- PEAVS: Perceptual Evaluation of Audio-Visual Synchrony Grounded in Viewers' Opinion Scores [18.26082503192707]
We develop a PEAVS (Perceptual Evaluation of Audio-Visual Synchrony) score, a novel automatic metric with a 5-point scale that evaluates the quality of audio-visual synchronization.
In our experiments, we observe a relative gain of 50% over a natural extension of Fréchet-based metrics for audio-visual synchrony.
arXiv Detail & Related papers (2024-04-10T20:32:24Z)
- Synchformer: Efficient Synchronization from Sparse Cues [100.89656994681934]
Our contributions include a novel audio-visual synchronization model, and a training scheme that decouples feature extraction from synchronization modelling.
This approach achieves state-of-the-art performance in both dense and sparse settings.
We also extend synchronization model training to AudioSet, a million-scale 'in-the-wild' dataset, investigate evidence attribution techniques for interpretability, and explore a new capability for synchronization models: audio-visual synchronizability.
arXiv Detail & Related papers (2024-01-29T18:59:55Z)
- Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes.
We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model.
We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
arXiv Detail & Related papers (2023-09-28T13:26:26Z)
- Text-to-feature diffusion for audio-visual few-shot learning [59.45164042078649]
Few-shot learning from video data is a challenging and underexplored, yet much cheaper, setup.
We introduce a unified audio-visual few-shot video classification benchmark on three datasets.
We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual few-shot learning.
arXiv Detail & Related papers (2023-09-07T17:30:36Z)
- Large-scale unsupervised audio pre-training for video-to-speech synthesis [64.86087257004883]
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker.
In this paper we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24kHz.
We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task.
arXiv Detail & Related papers (2023-06-27T13:31:33Z)
- On the Audio-visual Synchronization for Lip-to-Speech Synthesis [22.407313748927393]
We show that the commonly used audio-visual datasets, such as GRID, TCD-TIMIT, and Lip2Wav, can have data asynchrony issues.
Training lip-to-speech with such datasets may further cause the model asynchrony issue -- that is, the generated speech and the input video are out of sync.
arXiv Detail & Related papers (2023-03-01T13:35:35Z)
- Sparse in Space and Time: Audio-visual Synchronisation with Trainable Selectors [103.21152156339484]
The objective of this paper is audio-visual synchronisation of general videos 'in the wild'.
We make four contributions: (i) in order to handle the longer temporal sequences required for sparse synchronisation signals, we design a multi-modal transformer model that employs 'selectors' (a generic sketch of this selector pattern is given after this entry).
We also identify artefacts introduced by the compression codecs used for audio and video, which audio-visual models can exploit during training to artificially solve the synchronisation task.
arXiv Detail & Related papers (2022-10-13T14:25:37Z)
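The 'selectors' mentioned above resemble the general pattern of learnable query tokens that cross-attend into a long token stream and distil it into a small, fixed-size summary before synchronisation is modelled. Below is a generic sketch of that pattern; the dimensions and module layout are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class Selector(nn.Module):
    """Learnable queries that summarise a long token sequence (generic sketch).

    Follows the broad 'learnable query + cross-attention' pattern; the paper's
    actual selector module may differ in detail.
    """

    def __init__(self, dim=256, num_selected=16, num_heads=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_selected, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens):
        # tokens: (B, T, dim) with T potentially in the thousands.
        q = self.queries.expand(tokens.size(0), -1, -1)
        selected, _ = self.attn(q, tokens, tokens)
        return self.norm(selected)  # (B, num_selected, dim)

# Compress 2,048 tokens down to 16 before a downstream synchronisation head.
summary = Selector()(torch.randn(2, 2048, 256))
print(summary.shape)  # torch.Size([2, 16, 256])
```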
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.