A Single Self-Supervised Model for Many Speech Modalities Enables
Zero-Shot Modality Transfer
- URL: http://arxiv.org/abs/2207.07036v1
- Date: Thu, 14 Jul 2022 16:21:33 GMT
- Title: A Single Self-Supervised Model for Many Speech Modalities Enables
Zero-Shot Modality Transfer
- Authors: Wei-Ning Hsu, Bowen Shi
- Abstract summary: We present u-HuBERT, a self-supervised pre-training framework that can leverage both multimodal and unimodal speech.
Our single model yields 1.2%/1.4%/27.2% speech recognition word error rate on LRS3 with audio-visual/audio/visual input.
- Score: 31.028408352051684
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While audio-visual speech models can yield superior performance and
robustness compared to audio-only models, their development and adoption are
hindered by the lack of labeled and unlabeled audio-visual data and the cost to
deploy one model per modality. In this paper, we present u-HuBERT, a
self-supervised pre-training framework that can leverage both multimodal and
unimodal speech with a unified masked cluster prediction objective. By
utilizing modality dropout during pre-training, we demonstrate that a single
fine-tuned model can achieve performance on par with or better than the
state-of-the-art modality-specific models. Moreover, our model fine-tuned only
on audio can perform well with audio-visual and visual speech input, achieving
zero-shot modality generalization for speech recognition and speaker
verification. In particular, our single model yields 1.2%/1.4%/27.2% speech
recognition word error rate on LRS3 with audio-visual/audio/visual input.
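The two ingredients highlighted in the abstract, modality dropout during pre-training and a unified masked cluster prediction objective, can be sketched in a few lines. The code below is only an illustration under assumed feature dimensions and stand-in front-ends (plain linear projections in place of the real audio and visual encoders); it is not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class UnifiedMaskedClusterPredictor(nn.Module):
    """Sketch of modality dropout + unified masked cluster prediction.

    Audio and visual streams are projected to a shared space, one stream may
    be zeroed out during training (modality dropout), masked frames are
    replaced by a learned embedding, and the model predicts cluster IDs at
    those masked frames.
    """

    def __init__(self, feat_dim=512, num_clusters=500,
                 p_drop_audio=0.25, p_drop_video=0.25):
        super().__init__()
        self.audio_enc = nn.Linear(80, feat_dim)    # stand-in audio front-end (log-mel)
        self.video_enc = nn.Linear(256, feat_dim)   # stand-in visual front-end (lip ROI)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True),
            num_layers=4)
        self.mask_emb = nn.Parameter(torch.zeros(feat_dim))
        self.cluster_head = nn.Linear(feat_dim, num_clusters)
        self.p_drop_audio = p_drop_audio
        self.p_drop_video = p_drop_video

    def forward(self, audio, video, mask, targets):
        # audio: (B, T, 80), video: (B, T, 256), mask: (B, T) bool,
        # targets: (B, T) long cluster IDs from an offline clustering step.
        a = self.audio_enc(audio)
        v = self.video_enc(video)
        if self.training:
            # Modality dropout: zero out at most one stream so the backbone
            # also learns from audio-only and visual-only inputs.
            r = torch.rand(1).item()
            if r < self.p_drop_audio:
                a = torch.zeros_like(a)
            elif r < self.p_drop_audio + self.p_drop_video:
                v = torch.zeros_like(v)
        x = a + v                                   # simple additive fusion
        x = torch.where(mask.unsqueeze(-1), self.mask_emb.expand_as(x), x)
        logits = self.cluster_head(self.backbone(x))
        # Unified objective: cross-entropy on cluster IDs at masked positions,
        # identical regardless of which modalities were present.
        return F.cross_entropy(logits[mask], targets[mask])
```

Because either stream can be zeroed out during pre-training while the backbone and objective stay the same, the model sees audio-visual, audio-only, and visual-only inputs, which is what makes the zero-shot modality transfer described above possible.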
Related papers
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages by generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%.
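The "hybrid CTC/RNN-T" objective named in the title can be sketched as a weighted sum of a CTC loss and a transducer loss computed from a shared encoder. The sketch below is only an illustration under assumed module sizes and an assumed interpolation weight; it uses torchaudio's RNN-T loss and is not the authors' Fast Conformer recipe.

```python
import torch
import torch.nn as nn
import torchaudio.functional as taf


class HybridCTCTransducerLoss(nn.Module):
    """Interpolate a CTC loss and an RNN-T loss computed from a shared encoder."""

    def __init__(self, enc_dim=256, pred_dim=256, vocab_size=1000, blank=0, ctc_weight=0.3):
        super().__init__()
        self.ctc_head = nn.Linear(enc_dim, vocab_size)
        self.joiner = nn.Linear(enc_dim + pred_dim, vocab_size)
        self.ctc_loss = nn.CTCLoss(blank=blank, zero_infinity=True)
        self.blank = blank
        self.ctc_weight = ctc_weight

    def forward(self, enc_out, enc_lens, pred_out, targets, target_lens):
        # enc_out: (B, T, enc_dim) encoder frames; pred_out: (B, U + 1, pred_dim)
        # prediction-network states; targets: (B, U) label IDs (no blanks).
        # CTC branch: frame-level log-probabilities over the vocabulary.
        log_probs = self.ctc_head(enc_out).log_softmax(-1).transpose(0, 1)  # (T, B, V)
        loss_ctc = self.ctc_loss(log_probs, targets, enc_lens, target_lens)
        # Transducer branch: joiner applied to every (frame, label-state) pair.
        joint = self.joiner(torch.cat(
            [enc_out.unsqueeze(2).expand(-1, -1, pred_out.size(1), -1),
             pred_out.unsqueeze(1).expand(-1, enc_out.size(1), -1, -1)], dim=-1))
        loss_rnnt = taf.rnnt_loss(joint, targets.int(), enc_lens.int(),
                                  target_lens.int(), blank=self.blank)
        # Weighted combination of the two objectives.
        return self.ctc_weight * loss_ctc + (1.0 - self.ctc_weight) * loss_rnnt
```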
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- Disentangling Voice and Content with Self-Supervision for Speaker Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated with experiments on the VoxCeleb and SITW datasets, with average reductions of 9.56% in EER and 8.24% in minDCF.
arXiv Detail & Related papers (2023-10-02T12:02:07Z)
- VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset [53.46019570679092]
We propose a Vision-Audio-Language Omni-peRception pretraining model (VALOR) for multi-modal understanding and generation.
VALOR jointly models relationships of vision, audio and language in an end-to-end manner.
It achieves new state-of-the-art performance on a series of public cross-modality benchmarks.
arXiv Detail & Related papers (2023-04-17T15:08:15Z)
- AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR [79.21857972093332]
We present AVFormer, a method for augmenting audio-only models with visual information, at the same time performing lightweight domain adaptation.
We show that these components can be trained on a small amount of weakly labelled video data with minimal additional training time and parameters.
We also introduce a simple curriculum scheme during training, which we show is crucial for enabling the model to jointly process audio and visual information effectively.
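The general recipe summarized here, a frozen speech encoder augmented with small trainable adapters and a trainable projection that prepends visual tokens, can be sketched as follows. Module names, dimensions, and the bottleneck adapter design are illustrative assumptions, not the AVFormer architecture itself.

```python
import torch
import torch.nn as nn


class VisualAdapterLayer(nn.Module):
    """Wrap one frozen audio-encoder layer with a small trainable adapter."""

    def __init__(self, audio_layer, dim=512, bottleneck=64):
        super().__init__()
        self.audio_layer = audio_layer
        for p in self.audio_layer.parameters():   # keep the speech model frozen
            p.requires_grad = False
        self.adapter = nn.Sequential(              # lightweight trainable adapter
            nn.Linear(dim, bottleneck), nn.ReLU(), nn.Linear(bottleneck, dim))

    def forward(self, x):
        h = self.audio_layer(x)
        return h + self.adapter(h)                 # residual adapter update


class FrozenASRWithVisualTokens(nn.Module):
    """Project visual features to tokens, prepend them to the audio sequence,
    and run the adapted (but otherwise frozen) encoder over the joint sequence."""

    def __init__(self, frozen_layers, dim=512, visual_dim=768):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, dim)   # trainable projection
        self.layers = nn.ModuleList(
            VisualAdapterLayer(layer, dim=dim) for layer in frozen_layers)

    def forward(self, audio_tokens, visual_feats):
        # audio_tokens: (B, T, dim) from the frozen front-end
        # visual_feats: (B, K, visual_dim) pooled frame-level visual features
        x = torch.cat([self.visual_proj(visual_feats), audio_tokens], dim=1)
        for layer in self.layers:
            x = layer(x)
        return x


# Stand-in frozen layers, just to show the call pattern.
frozen = [nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
          for _ in range(2)]
model = FrozenASRWithVisualTokens(frozen)
out = model(torch.randn(2, 100, 512), torch.randn(2, 4, 768))  # (2, 104, 512)
```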
arXiv Detail & Related papers (2023-03-29T07:24:28Z)
- VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose a unified cross-modal representation learning framework, VATLM (Visual-Audio-Text Language Model).
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
arXiv Detail & Related papers (2022-11-21T09:10:10Z)
- GANSpeech: Adversarial Training for High-Fidelity Multi-Speaker Speech Synthesis [6.632254395574993]
GANSpeech is a high-fidelity multi-speaker TTS model obtained by applying adversarial training to a non-autoregressive multi-speaker TTS model.
In the subjective listening tests, GANSpeech significantly outperformed the baseline multi-speaker FastSpeech and FastSpeech2 models.
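As a rough illustration of what adversarial training adds to a non-autoregressive TTS model, the sketch below scores predicted mel-spectrograms with a small convolutional discriminator and adds an LSGAN-style term to the usual reconstruction loss. The discriminator architecture, loss form, and weight are assumptions for illustration; GANSpeech's actual discriminator and loss scaling are described in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MelDiscriminator(nn.Module):
    """Small convolutional discriminator over mel-spectrogram frames."""

    def __init__(self, n_mels=80, channels=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, channels, kernel_size=5, padding=2), nn.LeakyReLU(0.2),
            nn.Conv1d(channels, channels, kernel_size=5, padding=2), nn.LeakyReLU(0.2),
            nn.Conv1d(channels, 1, kernel_size=3, padding=1))

    def forward(self, mel):              # mel: (B, n_mels, T)
        return self.net(mel)             # (B, 1, T) per-frame realness scores


def generator_step(disc, mel_pred, mel_true, adv_weight=0.1):
    # Usual non-autoregressive TTS reconstruction loss plus an LSGAN term that
    # pushes predicted frames toward the discriminator's "real" label.
    scores = disc(mel_pred)
    recon = F.l1_loss(mel_pred, mel_true)
    adv = F.mse_loss(scores, torch.ones_like(scores))
    return recon + adv_weight * adv


def discriminator_step(disc, mel_pred, mel_true):
    # LSGAN objective: real frames -> 1, generated frames -> 0.
    real = disc(mel_true)
    fake = disc(mel_pred.detach())       # do not backprop into the generator
    return (F.mse_loss(real, torch.ones_like(real))
            + F.mse_loss(fake, torch.zeros_like(fake)))
```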
arXiv Detail & Related papers (2021-06-29T08:15:30Z)
- Self-Supervised Learning for Personalized Speech Enhancement [25.05285328404576]
Speech enhancement systems can show improved performance by adapting the model towards a single test-time speaker.
The test-time user might provide only a small amount of noise-free speech data, which is likely insufficient for traditional fully supervised learning.
We propose self-supervised methods that are designed specifically to learn personalized and discriminative features from abundant in-the-wild noisy, but still personal speech recordings.
arXiv Detail & Related papers (2021-04-05T17:12:51Z)
- Audio-visual Speech Separation with Adversarially Disentangled Visual Representation [23.38624506211003]
Speech separation aims to separate individual voices from an audio mixture of multiple simultaneous talkers.
In our model, we use a face detector to determine the number of speakers in the scene and use visual information to avoid the permutation problem.
Our proposed model is shown to outperform the state-of-the-art audio-only model and three audio-visual models.
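How visual information sidesteps the permutation problem can be illustrated with a sketch in which one mask is predicted per detected face, conditioned on that face's embedding, so each output channel is tied to a specific speaker by construction. Feature sizes and the mask network below are illustrative assumptions, not the paper's model.

```python
import torch
import torch.nn as nn


class FaceConditionedSeparator(nn.Module):
    """Predict one spectrogram mask per detected face, conditioned on that
    face's visual embedding, so outputs are tied to speakers by construction."""

    def __init__(self, freq_bins=257, visual_dim=512, hidden=256):
        super().__init__()
        self.mask_net = nn.Sequential(
            nn.Linear(freq_bins + visual_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, freq_bins), nn.Sigmoid())

    def forward(self, mix_spec, face_embs):
        # mix_spec: (B, T, F) magnitude spectrogram of the mixture
        # face_embs: (B, S, visual_dim), one embedding per face from a detector;
        # S (the number of detected faces) sets the number of output sources.
        B, T, Fdim = mix_spec.shape
        S = face_embs.size(1)
        mix = mix_spec.unsqueeze(1).expand(B, S, T, Fdim)
        vis = face_embs.unsqueeze(2).expand(B, S, T, face_embs.size(-1))
        masks = self.mask_net(torch.cat([mix, vis], dim=-1))   # (B, S, T, F)
        # Each separated source keeps the index of its face, so no
        # permutation-invariant training criterion is needed.
        return masks * mix
```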
arXiv Detail & Related papers (2020-11-29T10:48:42Z)
- Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio Representation [51.37980448183019]
We propose Audio ALBERT, a lite version of the self-supervised speech representation model.
We show that Audio ALBERT achieves performance competitive with larger self-supervised models on downstream tasks.
In probing experiments, we find that the latent representations from intermediate layers encode richer phoneme and speaker information than those from the last layer.
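The "lite" aspect comes from the ALBERT-style idea of reusing a single transformer layer at every depth, which cuts parameters without reducing effective depth. Below is a minimal sketch of that weight sharing under assumed sizes and a simple linear front-end; returning all intermediate hidden states mirrors the layer-probing experiments mentioned above.

```python
import torch
import torch.nn as nn


class SharedLayerEncoder(nn.Module):
    """Minimal sketch of ALBERT-style cross-layer parameter sharing:
    one transformer layer is reused at every depth."""

    def __init__(self, dim=768, nhead=12, depth=12):
        super().__init__()
        self.input_proj = nn.Linear(80, dim)        # e.g. log-mel frames -> model dim
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=nhead, batch_first=True)
        self.depth = depth

    def forward(self, mel, return_all_layers=False):
        # mel: (B, T, 80). Intermediate hidden states can be probed for
        # phoneme and speaker information.
        x = self.input_proj(mel)
        hiddens = []
        for _ in range(self.depth):                 # same weights at every depth
            x = self.shared_layer(x)
            hiddens.append(x)
        return hiddens if return_all_layers else x
```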
arXiv Detail & Related papers (2020-05-18T10:42:44Z)