Fully Learnable Front-End for Multi-Channel Acoustic Modeling using Semi-Supervised Learning
- URL: http://arxiv.org/abs/2002.00125v1
- Date: Sat, 1 Feb 2020 02:06:05 GMT
- Title: Fully Learnable Front-End for Multi-Channel Acoustic Modeling using Semi-Supervised Learning
- Authors: Sanna Wager, Aparna Khare, Minhua Wu, Kenichi Kumatani, Shiva Sundaram
- Abstract summary: We train a fully learnable multi-channel acoustic model for far-field automatic speech recognition.
For the student, both multi-channel feature extraction layers and the higher classification layers were jointly trained.
We find that pre-training improves the word error rate by 10.7% compared to a multi-channel model directly initialized with a beamformer.
- Score: 20.97480659815297
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we investigated the teacher-student training paradigm to train
a fully learnable multi-channel acoustic model for far-field automatic speech
recognition (ASR). Using a large offline teacher model trained on beamformed
audio, we trained a simpler multi-channel student acoustic model used in the
speech recognition system. For the student, both multi-channel feature
extraction layers and the higher classification layers were jointly trained
using the logits from the teacher model. In our experiments, compared to a
baseline model trained on about 600 hours of transcribed data, a relative
word-error rate (WER) reduction of about 27.3% was achieved when using an
additional 1800 hours of untranscribed data. We also investigated the benefit
of pre-training the multi-channel front end to output the beamformed log-mel
filter bank energies (LFBE) using L2 loss. We find that pre-training improves
the word error rate by 10.7% when compared to a multi-channel model directly
initialized with a beamformer and mel-filter bank coefficients for the front
end. Finally, combining pre-training and teacher-student training produces a
WER reduction of 31% compared to our baseline.
Related papers
- Self-Supervised Learning for Multi-Channel Neural Transducer [3.045851438458641]
We explore a self-supervised learning method for a multi-channel end-to-end ASR model based on the wav2vec 2.0 framework.
We observed a 66% relative reduction in character error rate compared with the model without any pre-training for the far-field in-house dataset.
arXiv Detail & Related papers (2024-08-06T04:12:31Z)
- Low-rank Adaptation Method for Wav2vec2-based Fake Audio Detection [57.537583869961885]
Self-supervised speech models are a rapidly developing research topic in fake audio detection.
We apply low-rank adaptation (LoRA) to the wav2vec2 model, freezing the pre-trained model weights and injecting a trainable rank-decomposition matrix into each layer of the transformer architecture (see the LoRA sketch after this list).
Compared with fine-tuning the wav2vec2 model's 317M parameters with Adam, LoRA achieved similar performance while reducing the number of trainable parameters by a factor of 198.
arXiv Detail & Related papers (2023-06-09T01:43:41Z)
- Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification architecture by processing both audio and visual modalities.
Our experiments show that using both modalities improves speech recognition in the presence of environmental noise and significantly accelerates training, reaching a lower WER with 4 times fewer training steps.
arXiv Detail & Related papers (2023-01-04T05:36:56Z)
- Efficient Utilization of Large Pre-Trained Models for Low Resource ASR [31.57758062484189]
We study a challenging low resource conversational telephony speech corpus from the medical domain in Vietnamese and German.
We show the benefits of using unsupervised techniques beyond simple fine-tuning of large pre-trained models.
arXiv Detail & Related papers (2022-10-26T17:34:30Z)
- SPIRAL: Self-supervised Perturbation-Invariant Representation Learning for Speech Pre-Training [25.80559992732508]
SPIRAL works by learning a denoising representation of perturbed data in a teacher-student framework.
We address the problem of noise robustness, which is critical to real-world speech applications.
arXiv Detail & Related papers (2022-01-25T09:53:36Z)
- Self-Supervised Learning for speech recognition with Intermediate layer supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers (see the sketch after this list).
Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly.
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
- A Method to Reveal Speaker Identity in Distributed ASR Training, and How to Counter It [3.18475216176047]
We design the first method for revealing the identity of the speaker of a training utterance with access only to a gradient.
We show that it is possible to reveal the speaker's identity with 34% top-1 accuracy (51% top-5 accuracy) on the LibriSpeech dataset.
arXiv Detail & Related papers (2021-04-15T23:15:12Z)
- Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio Representation [51.37980448183019]
We propose Audio ALBERT, a lite version of the self-supervised speech representation model.
We show that Audio ALBERT is capable of achieving competitive performance with those huge models in the downstream tasks.
In probing experiments, we find that the latent representations encode richer phoneme and speaker information than those of the last layer.
arXiv Detail & Related papers (2020-05-18T10:42:44Z)
- You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.
Our system establishes a competitive result for end-to-end ASR trained on the LibriSpeech train-clean-100 set, with a WER of 4.3% on test-clean and 13.5% on test-other.
arXiv Detail & Related papers (2020-05-14T17:24:57Z)
- RNN-T Models Fail to Generalize to Out-of-Domain Audio: Causes and Solutions [73.45995446500312]
We analyze the generalization properties of streaming and non-streaming recurrent neural network transducer (RNN-T) based end-to-end models.
We propose two solutions: combining multiple regularization techniques during training, and using dynamic overlapping inference.
arXiv Detail & Related papers (2020-05-07T06:24:47Z)