Automatic audiovisual synchronisation for ultrasound tongue imaging
- URL: http://arxiv.org/abs/2105.15162v1
- Date: Mon, 31 May 2021 17:11:28 GMT
- Title: Automatic audiovisual synchronisation for ultrasound tongue imaging
- Authors: Aciel Eshky, Joanne Cleland, Manuel Sam Ribeiro, Eleanor Sugden, Korin
Richmond, Steve Renals
- Abstract summary: Ultrasound and speech audio are recorded simultaneously, and to make proper use of this data the two modalities must be correctly synchronised.
Synchronisation is achieved using specialised hardware at recording time, but this approach can fail in practice, resulting in data of limited usability.
In this paper, we address the problem of automatically synchronising ultrasound and audio after data collection.
We describe our approach for automatic synchronisation, which is driven by a self-supervised neural network, exploiting the correlation between the two signals to synchronise them.
- Score: 35.60751372748571
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Ultrasound tongue imaging is used to visualise the intra-oral articulators
during speech production. It is utilised in a range of applications, including
speech and language therapy and phonetics research. Ultrasound and speech audio
are recorded simultaneously, and in order to correctly use this data, the two
modalities should be correctly synchronised. Synchronisation is achieved using
specialised hardware at recording time, but this approach can fail in practice
resulting in data of limited usability. In this paper, we address the problem
of automatically synchronising ultrasound and audio after data collection. We
first investigate the tolerance of expert ultrasound users to synchronisation
errors in order to find the thresholds for error detection. We use these
thresholds to define accuracy scoring boundaries for evaluating our system. We
then describe our approach for automatic synchronisation, which is driven by a
self-supervised neural network, exploiting the correlation between the two
signals to synchronise them. We train our model on data from multiple domains
with different speaker characteristics, different equipment, and different
recording environments, and achieve an accuracy >92.4% on held-out in-domain
data. Finally, we introduce a novel resource, the Cleft dataset, which we
gathered with a new clinical subgroup and for which hardware synchronisation
proved unreliable. We apply our model to this out-of-domain data, and evaluate
its performance subjectively with expert users. Results show that users prefer
our model's output over the original hardware output 79.3% of the time. Our
results demonstrate the strength of our approach and its ability to generalise
to data from new domains.
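The abstract does not spell out the model here, but the correlation-driven, self-supervised approach it describes is commonly realised as a two-stream, SyncNet-style network: one encoder per modality, a contrastive objective that separates synchronised from artificially offset window pairs, and a candidate-offset search at test time. The sketch below is a minimal illustration of that idea in PyTorch; the layer sizes, the loss, the offset search, and the threshold values in is_accurate are illustrative assumptions, not the authors' reported configuration.

```python
# Hedged sketch: a SyncNet-style two-stream model for ultrasound/audio
# synchronisation. Layer sizes, loss, offset search, and thresholds are
# illustrative assumptions, not the paper's reported configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class UltrasoundEncoder(nn.Module):
    """Embed a short window of ultrasound frames (B, T, H, W) into a vector."""
    def __init__(self, n_frames=5, dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(n_frames, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, dim)

    def forward(self, x):                        # x: (B, T, H, W)
        return self.fc(self.conv(x).flatten(1))  # -> (B, dim)


class AudioEncoder(nn.Module):
    """Embed the co-occurring MFCC window (B, n_mfcc, T_audio) into a vector."""
    def __init__(self, n_mfcc=20, dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mfcc, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.fc = nn.Linear(64, dim)

    def forward(self, x):                        # x: (B, n_mfcc, T_audio)
        return self.fc(self.conv(x).flatten(1))  # -> (B, dim)


def contrastive_loss(u, a, is_synced, margin=1.0):
    """Self-supervised objective: pull embeddings of genuinely synchronised
    ultrasound/audio windows together, push randomly offset pairs apart.
    is_synced is a 0/1 float tensor of shape (B,)."""
    d = F.pairwise_distance(u, a)
    return (is_synced * d.pow(2) +
            (1.0 - is_synced) * F.relu(margin - d).pow(2)).mean()


def predict_offset(ultra_emb, audio_embs_per_offset, candidate_offsets_ms):
    """Choose the candidate audio offset whose embeddings lie closest, on
    average over the utterance, to the ultrasound embeddings."""
    dists = torch.stack([F.pairwise_distance(ultra_emb, a).mean()
                         for a in audio_embs_per_offset])
    return candidate_offsets_ms[int(torch.argmin(dists))]


def is_accurate(pred_offset_ms, true_offset_ms, lo_ms=-100.0, hi_ms=100.0):
    """Score a prediction as correct if its error falls inside the detectability
    thresholds from the perception study (the +/-100 ms here are placeholders)."""
    err = pred_offset_ms - true_offset_ms
    return lo_ms <= err <= hi_ms
```

In this reading, the predicted offset is the candidate that minimises the average embedding distance over an utterance, and a prediction is scored as correct when its error falls inside the detectability thresholds established by the expert-user study.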
Related papers
- A contrastive-learning approach for auditory attention detection [11.28441753596964]
We propose a method based on self-supervised learning to minimize the difference between the latent representations of an attended speech signal and the corresponding EEG signal.
We compare our results with previously published methods and achieve state-of-the-art performance on the validation set.
arXiv Detail & Related papers (2024-10-24T03:13:53Z)
- Anomalous Sound Detection using Audio Representation with Machine ID based Contrastive Learning Pretraining [52.191658157204856]
This paper uses contrastive learning to refine audio representations for each machine ID, rather than for each audio sample.
The proposed two-stage method uses contrastive learning to pretrain the audio representation model.
Experiments show that our method outperforms the state-of-the-art methods using contrastive learning or self-supervised classification.
arXiv Detail & Related papers (2023-04-07T11:08:31Z)
- On the Audio-visual Synchronization for Lip-to-Speech Synthesis [22.407313748927393]
We show that the commonly used audio-visual datasets, such as GRID, TCD-TIMIT, and Lip2Wav, can have data asynchrony issues.
Training lip-to-speech with such datasets may further cause the model asynchrony issue -- that is, the generated speech and the input video are out of sync.
arXiv Detail & Related papers (2023-03-01T13:35:35Z)
- Sparse in Space and Time: Audio-visual Synchronisation with Trainable Selectors [103.21152156339484]
The objective of this paper is audio-visual synchronisation of general videos 'in the wild'.
We make four contributions: (i) in order to handle the longer temporal sequences required for sparse synchronisation signals, we design a multi-modal transformer model that employs 'selectors'.
We identify artefacts that can arise from the compression codecs used for audio and video and can be used by audio-visual models in training to artificially solve the synchronisation task.
arXiv Detail & Related papers (2022-10-13T14:25:37Z)
- Audio-Visual Synchronisation in the wild [149.84890978170174]
We identify and curate a test set with high audio-visual correlation, namely VGG-Sound Sync.
We compare a number of transformer-based architectural variants specifically designed to model audio and visual signals of arbitrary length.
We set the first benchmark for general audio-visual synchronisation with over 160 diverse classes in the new VGG-Sound Sync video dataset.
arXiv Detail & Related papers (2021-12-08T17:50:26Z)
- TASK3 DCASE2021 Challenge: Sound event localization and detection using squeeze-excitation residual CNNs [4.4973334555746]
This study is based on the one carried out by the same team last year.
The authors study how this technique improves performance on each of the datasets.
This modification shows an improvement in system performance compared to the baseline using the MIC dataset.
arXiv Detail & Related papers (2021-07-30T11:34:15Z)
- Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual Speech Separation [73.1652905564163]
We address the problem of separating individual speech signals from videos using audio-visual neural processing.
Most conventional approaches utilize frame-wise matching criteria to extract shared information between co-occurring audio and video.
We propose a cross-modal affinity network (CaffNet) that learns global correspondence as well as locally-varying affinities between audio and visual streams.
arXiv Detail & Related papers (2021-03-25T15:39:12Z)
- Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
arXiv Detail & Related papers (2021-02-23T09:59:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.