Sequence-level self-learning with multiple hypotheses
- URL: http://arxiv.org/abs/2112.05826v1
- Date: Fri, 10 Dec 2021 20:47:58 GMT
- Title: Sequence-level self-learning with multiple hypotheses
- Authors: Kenichi Kumatani, Dimitrios Dimitriadis, Yashesh Gaur, Robert Gmyr,
Sefik Emre Eskimez, Jinyu Li and Michael Zeng
- Abstract summary: We develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR).
In contrast to conventional unsupervised learning approaches, we adopt the multi-task learning (MTL) framework.
Our experiment results show that our method reduces the WER on the British speech data from 14.55% (for the baseline model trained only on US English data) to 10.36%.
- Score: 53.04725240411895
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we develop new self-learning techniques with an attention-based
sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR).
For untranscribed speech data, the hypothesis from an ASR system must be used
as a label. However, imperfect ASR results make it difficult for unsupervised
learning to consistently improve recognition performance, especially when
multiple powerful teacher models are unavailable. In contrast to conventional
unsupervised learning approaches, we adopt the multi-task learning (MTL)
framework where the n-th best ASR hypothesis is used as the label of each task.
The seq2seq network is updated through the MTL framework so as to find the
common representation that can cover multiple hypotheses. By doing so, the
effect of the hard-decision errors can be alleviated.
We first demonstrate the effectiveness of our self-learning methods through
ASR experiments in an accent adaptation task between the US and British English
speech. Our experiment results show that our method can reduce the WER on the
British speech data from 14.55% to 10.36% compared to the baseline model
trained with the US English data only. Moreover, we investigate the effect of
our proposed methods in a federated learning scenario.
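To make the MTL objective concrete, each of the teacher's n-best hypotheses defines one task, and the student seq2seq model is updated with a weighted sum of the per-hypothesis sequence losses. The sketch below is a minimal PyTorch-style illustration under stated assumptions; the interface of `student`, the helper name `multi_hypothesis_loss`, and the use of normalized teacher scores as weights are hypothetical and not taken from the paper.
```python
# A minimal sketch of the multi-hypothesis (MTL) self-learning loss described
# above. Assumptions (not from the paper's released code): `student` is a
# seq2seq ASR model that returns per-token logits under teacher forcing,
# `nbest` holds the teacher's n-best token-id sequences for one untranscribed
# utterance, and `weights` are, e.g., normalized teacher confidence scores.
import torch
import torch.nn.functional as F


def multi_hypothesis_loss(student, features, nbest, weights):
    """Weighted sum of seq2seq losses, one 'task' per n-best hypothesis."""
    total = torch.zeros((), device=features.device)
    for hyp, w in zip(nbest, weights):
        # Teacher-forced forward pass using the n-th hypothesis as the label.
        logits = student(features, labels=hyp)        # shape: (T, vocab_size)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), hyp.view(-1))
        # Weighting across hypotheses softens the effect of hard-decision errors.
        total = total + w * loss
    return total
```
Because no single hypothesis is treated as ground truth, the update pulls the shared representation toward a consensus over the n-best list, which is how the hard-decision errors of any individual hypothesis are alleviated.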
Related papers
- ML-LMCL: Mutual Learning and Large-Margin Contrastive Learning for
Improving ASR Robustness in Spoken Language Understanding [55.39105863825107]
We propose Mutual Learning and Large-Margin Contrastive Learning (ML-LMCL) to improve automatic speech recognition (ASR) robustness.
In fine-tuning, we apply mutual learning and train two SLU models on the manual transcripts and the ASR transcripts, respectively.
Experiments on three datasets show that ML-LMCL outperforms existing models and achieves new state-of-the-art performance.
arXiv Detail & Related papers (2023-11-19T16:53:35Z)
- A Reference-less Quality Metric for Automatic Speech Recognition via
Contrastive-Learning of a Multi-Language Model with Self-Supervision [0.20999222360659603]
This work proposes a referenceless quality metric, which allows comparing the performance of different ASR models on a speech dataset without ground truth transcriptions.
To estimate the quality of ASR hypotheses, a pre-trained language model (LM) is fine-tuned with contrastive learning in a self-supervised learning manner.
The proposed referenceless metric obtains a much higher correlation with WER scores and their ranks than the perplexity metric from the state-of-the-art multi-lingual LM in all experiments.
arXiv Detail & Related papers (2023-06-21T21:33:39Z)
- From English to More Languages: Parameter-Efficient Model Reprogramming
for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z)
- Representative Subset Selection for Efficient Fine-Tuning in
Self-Supervised Speech Recognition [6.450618373898492]
We consider the task of identifying an optimal subset of data for efficient fine-tuning in self-supervised speech models for ASR.
We present the COWERAGE algorithm for representative subset selection in self-supervised ASR.
arXiv Detail & Related papers (2022-03-18T10:12:24Z)
- RescoreBERT: Discriminative Speech Recognition Rescoring with BERT [21.763672436079872]
We show how to train a BERT-based rescoring model with MWER loss, to incorporate the improvements of a discriminative loss into fine-tuning of deep bidirectional pretrained models for ASR.
We name this approach RescoreBERT and evaluate it on the LibriSpeech corpus, where it reduces WER by 6.6%/3.4% relative on the clean/other test sets over a BERT baseline without the discriminative objective.
arXiv Detail & Related papers (2022-02-02T15:45:26Z)
- Revisiting Self-Training for Few-Shot Learning of Language Model [61.173976954360334]
Unlabeled data carry rich task-relevant information and have proven useful for few-shot learning of language models.
In this work, we revisit the self-training technique for language model fine-tuning and present a state-of-the-art prompt-based few-shot learner, SFLM.
arXiv Detail & Related papers (2021-10-04T08:51:36Z)
- A bandit approach to curriculum generation for automatic speech
recognition [7.008190762572486]
We present an approach to mitigate the lack of training data by employing Automated Curriculum Learning.
The goal of the approach is to optimize the training sequence of mini-batches ranked by the level of difficulty.
We test our approach on a truly low-resource language and show that the bandit framework yields a clear improvement over the baseline transfer-learning model.
arXiv Detail & Related papers (2021-02-06T20:32:10Z)
- Improving Readability for Automatic Speech Recognition Transcription [50.86019112545596]
We propose a novel NLP task called ASR post-processing for readability (APR).
APR aims to transform the noisy ASR output into a readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker.
We compare fine-tuned models based on several open-sourced and adapted pre-trained models with the traditional pipeline method.
arXiv Detail & Related papers (2020-04-09T09:26:42Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)