Collaborative Training of Acoustic Encoders for Speech Recognition
- URL: http://arxiv.org/abs/2106.08960v1
- Date: Wed, 16 Jun 2021 17:05:47 GMT
- Title: Collaborative Training of Acoustic Encoders for Speech Recognition
- Authors: Varun Nagaraja, Yangyang Shi, Ganesh Venkatesh, Ozlem Kalinli, Michael L. Seltzer, Vikas Chandra
- Abstract summary: We propose a method for collaboratively training acoustic encoders of different sizes for speech recognition.
We perform experiments using the LibriSpeech corpus and demonstrate that the collaboratively trained acoustic encoders can provide up to an 11% relative improvement in word error rate.
- Score: 15.200846745937763
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: On-device speech recognition requires training models of different sizes for
deploying on devices with various computational budgets. When building such
different models, we can benefit from training them jointly to take advantage
of the knowledge shared between them. Joint training is also efficient since it
reduces the redundancy in the training procedure's data handling operations. We
propose a method for collaboratively training acoustic encoders of different
sizes for speech recognition. We use a sequence transducer setup where
different acoustic encoders share a common predictor and joiner modules. The
acoustic encoders are also trained using co-distillation through an auxiliary
task for frame-level chenone prediction, along with the transducer loss. We
perform experiments using the LibriSpeech corpus and demonstrate that the
collaboratively trained acoustic encoders can provide up to an 11% relative
improvement in word error rate on both test partitions.
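Since the abstract compresses the whole recipe into a few sentences, a minimal sketch may help make the moving parts concrete: several acoustic encoders of different sizes drive one shared predictor and joiner, each encoder contributes a transducer loss and an auxiliary frame-level chenone loss, and a co-distillation term ties the encoders' chenone posteriors together. The sketch below assumes PyTorch and torchaudio's RNN-T loss; the module definitions, vocabulary and chenone-set sizes, loss weights, and the teacher-student direction of the distillation term are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch (not the authors' code): two acoustic encoders of
# different sizes share one predictor and one joiner, and are trained with
# the transducer loss plus an auxiliary frame-level chenone task and a
# co-distillation term. Sizes, weights, and names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchaudio.functional import rnnt_loss  # assumes torchaudio >= 0.10

VOCAB, CHENONES, BLANK, DIM = 4096, 9000, 0, 512   # assumed sizes

class Encoder(nn.Module):
    """Acoustic encoder; num_layers controls the size variant."""
    def __init__(self, in_dim=80, dim=DIM, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, dim, num_layers, batch_first=True)

    def forward(self, feats):                 # feats: (B, T, in_dim)
        out, _ = self.lstm(feats)
        return out                            # (B, T, dim); T is preserved

class Predictor(nn.Module):
    """Label predictor shared by all encoders."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.lstm = nn.LSTM(DIM, DIM, 1, batch_first=True)

    def forward(self, tokens):                # tokens: (B, U+1), leading blank
        out, _ = self.lstm(self.embed(tokens))
        return out                            # (B, U+1, dim)

class Joiner(nn.Module):
    """Joiner shared by all encoders."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(DIM, VOCAB)

    def forward(self, enc, pred):             # (B, T, D) and (B, U+1, D)
        joint = torch.tanh(enc.unsqueeze(2) + pred.unsqueeze(1))
        return self.proj(joint)               # (B, T, U+1, VOCAB)

encoders = nn.ModuleList([Encoder(num_layers=6),   # "large" variant
                          Encoder(num_layers=2)])  # "small" variant
predictor, joiner = Predictor(), Joiner()
chenone_heads = nn.ModuleList([nn.Linear(DIM, CHENONES) for _ in encoders])

def collab_loss(feats, feat_lens, targets, target_lens, chenone_targets,
                aux_weight=0.1, distill_weight=0.1):
    """One training step's loss; padding/masking is omitted for brevity."""
    pred_in = F.pad(targets, (1, 0), value=BLANK)  # prepend blank as SOS
    pred_out = predictor(pred_in)                  # shared across encoders
    total, chenone_logits = 0.0, []
    for enc, head in zip(encoders, chenone_heads):
        enc_out = enc(feats)
        logits = joiner(enc_out, pred_out)
        total = total + rnnt_loss(logits, targets.int(), feat_lens.int(),
                                  target_lens.int(), blank=BLANK)
        ch = head(enc_out)                         # (B, T, CHENONES)
        total = total + aux_weight * F.cross_entropy(ch.transpose(1, 2),
                                                     chenone_targets)
        chenone_logits.append(ch)
    # Co-distillation (one plausible form): pull the small encoder's chenone
    # posteriors toward the large encoder's, with the teacher detached.
    teacher = F.softmax(chenone_logits[0].detach(), dim=-1)
    student = F.log_softmax(chenone_logits[1], dim=-1)
    total = total + distill_weight * F.kl_div(student, teacher,
                                              reduction="batchmean")
    return total
```

Note how `predictor` and `joiner` sit outside the loop over encoders: one pass over the label sequence serves every encoder size, which is the sharing the abstract describes, while each encoder still pays its own transducer and chenone losses.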
Related papers
- Codec-ASR: Training Performant Automatic Speech Recognition Systems with Discrete Speech Representations [16.577870835480585]
We present a comprehensive analysis of building ASR systems with discrete codes.
We investigate different training choices, such as quantization schemes and time-domain versus spectral feature encodings.
We introduce a pipeline that outperforms Encodec at a similar bit rate.
arXiv Detail & Related papers (2024-07-03T20:51:41Z)
- DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding [51.32965203977845]
We propose the use of discrete speech units (DSU) instead of continuous-valued speech encoder outputs.
The proposed model shows robust performance on speech inputs from seen/unseen domains and instruction-following capability in spoken question answering.
Our findings suggest that the ASR task and datasets are not crucial in instruction-tuning for spoken question answering tasks.
arXiv Detail & Related papers (2024-06-13T17:28:13Z)
- Anomalous Sound Detection using Audio Representation with Machine ID based Contrastive Learning Pretraining [52.191658157204856]
This paper uses contrastive learning to refine audio representations for each machine ID, rather than for each audio sample.
The proposed two-stage method uses contrastive learning to pretrain the audio representation model.
Experiments show that our method outperforms the state-of-the-art methods using contrastive learning or self-supervised classification.
arXiv Detail & Related papers (2023-04-07T11:08:31Z)
- Continual Learning for On-Device Speech Recognition using Disentangled Conformers [54.32320258055716]
We introduce a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks.
We propose a novel compute-efficient continual learning algorithm called DisentangledCL.
Our experiments show that the DisConformer models significantly outperform baselines on general ASR.
arXiv Detail & Related papers (2022-12-02T18:58:51Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data [145.95460945321253]
We introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes.
The proposed Speech2C can relatively reduce the word error rate (WER) by 19.2% over the method without decoder pre-training.
arXiv Detail & Related papers (2022-03-31T15:33:56Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Training Robust Zero-Shot Voice Conversion Models with Self-supervised Features [24.182732872327183]
Unsupervised Zero-Shot Voice Conversion (VC) aims to modify the speaker characteristic of an utterance to match an unseen target speaker.
We show that high-quality audio samples can be achieved by using a length resampling decoder.
arXiv Detail & Related papers (2021-12-08T17:27:39Z)
- Revisiting joint decoding based multi-talker speech recognition with DNN acoustic model [34.061441900912136]
We argue that the conventional scheme of decoding each speaker's output stream separately is sub-optimal and propose a principled solution that decodes all speakers jointly.
We modify the acoustic model to predict joint state posteriors for all speakers, enabling the network to express uncertainty about the attribution of parts of the speech signal to the speakers.
arXiv Detail & Related papers (2021-10-31T09:28:04Z)
- Semi-supervised Learning for Singing Synthesis Timbre [22.75251024528604]
We propose a semi-supervised singing synthesizer, which is able to learn new voices from audio data only.
Our system is an encoder-decoder model with two encoders, linguistic and acoustic, and one (acoustic) decoder.
We evaluate our system with a listening test and show that the results are comparable to those obtained with an equivalent supervised approach.
arXiv Detail & Related papers (2020-11-05T13:33:34Z)
- Boosted Locality Sensitive Hashing: Discriminative Binary Codes for Source Separation [19.72987718461291]
We propose an adaptive boosting approach to learning locality sensitive hash codes, which represent audio spectra efficiently.
We use the learned hash codes for single-channel speech denoising tasks as an alternative to a complex machine learning model (a generic sketch of the idea follows this list).
arXiv Detail & Related papers (2020-02-14T20:10:00Z)
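As a rough illustration of the last entry's core idea, binary hash codes standing in for audio spectra in a nearest-neighbor denoiser, here is a minimal sketch. It uses plain sign random projections and a Hamming-distance lookup; the paper's boosted, discriminatively learned projections are replaced by random ones, and every name and size here is an assumption for illustration.

```python
# Minimal LSH-for-spectra sketch: sign random projections stand in for the
# paper's boosted, learned projections. All sizes and names are illustrative.
import numpy as np

rng = np.random.default_rng(0)
N_BITS, N_FREQ = 64, 513           # assumed code length and FFT bins
projections = rng.standard_normal((N_FREQ, N_BITS))

def hash_spectra(mag):
    """Binarize each frame with sign random projections.
    mag: (frames, N_FREQ) magnitude spectra -> (frames, N_BITS) codes."""
    return (mag @ projections > 0).astype(np.uint8)

def denoise_frame(noisy_code, dict_codes, dict_masks, k=5):
    """Retrieve the k dictionary frames with the smallest Hamming distance
    to the noisy frame's code and average their stored denoising masks."""
    dists = np.count_nonzero(dict_codes != noisy_code, axis=1)
    nearest = np.argsort(dists)[:k]
    return dict_masks[nearest].mean(axis=0)

# Toy usage: a small dictionary of (code, mask) pairs built offline from
# training data; random arrays stand in for real spectra and ideal masks.
dict_mag = rng.random((1000, N_FREQ))
dict_masks = rng.random((1000, N_FREQ))
dict_codes = hash_spectra(dict_mag)

noisy = rng.random((1, N_FREQ))
mask = denoise_frame(hash_spectra(noisy)[0], dict_codes, dict_masks)
enhanced = noisy[0] * mask          # apply the retrieved denoising mask
```

In the boosted variant the entry describes, the projections would be learned bit by bit so that Hamming distance better reflects similarity between spectra; the retrieval and mask-averaging step stays the same.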
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.