UniSpeech: Unified Speech Representation Learning with Labeled and
Unlabeled Data
- URL: http://arxiv.org/abs/2101.07597v1
- Date: Tue, 19 Jan 2021 12:53:43 GMT
- Title: UniSpeech: Unified Speech Representation Learning with Labeled and
Unlabeled Data
- Authors: Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei,
Michael Zeng and Xuedong Huang
- Abstract summary: We propose a unified pre-training approach called UniSpeech to learn speech representations with both unlabeled and labeled data.
We evaluate the effectiveness of UniSpeech for cross-lingual representation learning on the public CommonVoice corpus.
- Score: 54.733889961024445
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a unified pre-training approach called UniSpeech to
learn speech representations with both unlabeled and labeled data, in which
supervised phonetic CTC learning and phonetically-aware contrastive
self-supervised learning are conducted in a multi-task learning manner. The
resultant representations can capture information more correlated with phonetic
structures and improve the generalization across languages and domains. We
evaluate the effectiveness of UniSpeech for cross-lingual representation
learning on the public CommonVoice corpus. The results show that UniSpeech
outperforms self-supervised pretraining and supervised transfer learning for
speech recognition, with maximum relative phone error rate reductions of 13.4%
and 17.8% respectively (averaged over all testing languages). The
transferability of UniSpeech is also demonstrated on a domain-shift speech
recognition task, where it yields a 6% relative word error rate reduction over
the previous approach.
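The page does not include code, but the core idea, a single shared encoder trained jointly with a supervised phonetic CTC loss and a wav2vec 2.0-style contrastive loss, can be illustrated with a minimal PyTorch sketch. Everything below (module names, the negative-sampling scheme, the mixing weight alpha) is an assumption made for illustration, not the authors' implementation.

```python
# Minimal sketch of a UniSpeech-style multi-task objective (illustrative only,
# not the authors' code). Assumes a shared encoder producing framewise
# contextual features and a quantizer producing discrete latent targets.
import torch
import torch.nn.functional as F


def contrastive_loss(context, quantized, num_negatives=10, temperature=0.1):
    """InfoNCE-style term: each context frame must identify its own quantized
    latent among distractors sampled from other frames of the same utterance."""
    B, T, D = context.shape
    batch_idx = torch.arange(B, device=context.device).view(B, 1, 1)
    neg_idx = torch.randint(0, T, (B, T, num_negatives), device=context.device)
    negatives = quantized[batch_idx, neg_idx]                           # (B, T, K, D)
    candidates = torch.cat([quantized.unsqueeze(2), negatives], dim=2)  # (B, T, 1+K, D)
    logits = F.cosine_similarity(context.unsqueeze(2), candidates, dim=-1) / temperature
    targets = torch.zeros(B * T, dtype=torch.long, device=context.device)  # positive at index 0
    return F.cross_entropy(logits.view(B * T, -1), targets)


def unispeech_step(encoder, quantizer, ctc_head, batch, alpha=0.5):
    context = encoder(batch["waveform"])                 # (B, T, D) contextual features

    # Supervised branch: phonetic CTC on the labeled portion of the batch.
    log_probs = F.log_softmax(ctc_head(context), dim=-1).transpose(0, 1)  # (T, B, V)
    ctc = F.ctc_loss(log_probs, batch["phones"],
                     batch["input_lengths"], batch["phone_lengths"],
                     blank=0, zero_infinity=True)

    # Self-supervised branch: contrastive prediction of quantized latents
    # (masking omitted for brevity).
    quantized = quantizer(context)                       # (B, T, D)
    contrastive = contrastive_loss(context, quantized)

    # Multi-task combination; the weighting alpha is a placeholder choice.
    return alpha * ctc + (1.0 - alpha) * contrastive
```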
Related papers
- Analyzing Speech Unit Selection for Textless Speech-to-Speech Translation [23.757896930482342]
This work explores the selection process through a study of downstream tasks.
Units that perform well in resynthesis do not necessarily correlate with those that enhance translation efficacy.
arXiv Detail & Related papers (2024-07-08T08:53:26Z)
- Towards Unsupervised Speech Recognition Without Pronunciation Models [57.222729245842054]
Most languages lack sufficient paired speech and text data to effectively train automatic speech recognition systems.
We propose removing the reliance on a phoneme lexicon to develop unsupervised ASR systems.
We experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling.
arXiv Detail & Related papers (2024-06-12T16:30:58Z)
- Towards End-to-end Unsupervised Speech Recognition [120.4915001021405]
We introduce wav2vec-U 2.0, which does away with all audio-side pre-processing and improves accuracy through better architecture.
In addition, we introduce an auxiliary self-supervised objective that ties model predictions back to the input.
Experiments show that wav2vec-U 2.0 improves unsupervised recognition results across different languages while being conceptually simpler.
arXiv Detail & Related papers (2022-04-05T21:22:38Z)
- Self-Supervised Speech Representations Preserve Speech Characteristics while Anonymizing Voices [15.136348385992047]
We train several voice conversion models using self-supervised speech representations.
Converted voices retain a low word error rate, within 1% of that of the original voice.
Experiments on dysarthric speech data show that speech features relevant to articulation, prosody, phonation and phonology can be extracted from anonymized voices.
arXiv Detail & Related papers (2022-04-04T17:48:01Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets (a rough sketch of this swap appears after this list).
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
- Injecting Text in Self-Supervised Speech Pretraining [33.676479965610774]
We propose to jointly learn representations during pretraining from two different modalities: speech and text.
tts4pretrain complements the power of contrastive learning in self-supervision.
We demonstrate relative Word Error Rate (WER) reductions of 10% on the well-benchmarked LibriSpeech task.
arXiv Detail & Related papers (2021-08-27T11:36:40Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
- Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
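For comparison, the target-swapping idea summarized in the Wav2vec-Switch entry above can be sketched on top of the same contrastive helper. This is a hedged paraphrase of that abstract with hypothetical function names, not the released implementation; it reuses the contrastive_loss helper from the UniSpeech sketch.

```python
# Rough sketch of the Wav2vec-Switch target swap described above (illustrative
# paraphrase of the abstract, not the released code).
def wav2vec_switch_step(encoder, quantizer, clean_wave, noisy_wave):
    ctx_clean, ctx_noisy = encoder(clean_wave), encoder(noisy_wave)
    q_clean, q_noisy = quantizer(ctx_clean), quantizer(ctx_noisy)

    # Standard contrastive terms: each view predicts its own quantized targets.
    loss = contrastive_loss(ctx_clean, q_clean) + contrastive_loss(ctx_noisy, q_noisy)

    # Switched terms: the clean view must also predict the noisy targets and
    # vice versa, pushing the representations to be noise-invariant.
    loss = loss + contrastive_loss(ctx_clean, q_noisy) + contrastive_loss(ctx_noisy, q_clean)
    return loss
```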
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.