Improving speech recognition models with small samples for air traffic
control systems
- URL: http://arxiv.org/abs/2102.08015v1
- Date: Tue, 16 Feb 2021 08:28:52 GMT
- Title: Improving speech recognition models with small samples for air traffic
control systems
- Authors: Yi Lin, Qin Li, Bo Yang, Zhen Yan, Huachun Tan, and Zhengmao Chen
- Abstract summary: In this work, a novel training approach based on pretraining and transfer learning is proposed to address the issue of small training samples.
Three real ATC datasets are used to validate the proposed ASR model and training strategies.
The experimental results demonstrate that the ASR performance is significantly improved on all three datasets.
- Score: 9.322392779428505
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the domain of air traffic control (ATC) systems, efforts to train a
practical automatic speech recognition (ASR) model always face the problem of
small training samples, since the collection and annotation of speech samples
are expert- and domain-dependent tasks. In this work, a novel training approach
based on pretraining and transfer learning is proposed to address this issue,
and an improved end-to-end deep learning model is developed to address the
specific challenges of ASR in the ATC domain. An unsupervised pretraining
strategy is first proposed to learn speech representations from unlabeled
samples for a certain dataset. Specifically, a masking strategy is applied to
improve the diversity of the samples without losing their general patterns.
Subsequently, transfer learning is applied to fine-tune the pretrained or other
optimized baseline models to finally achieve the supervised ASR task. By
virtue of the common terminology used in the ATC domain, the transfer learning
task can be regarded as a sub-domain adaptation task, in which the transferred
model is optimized using a joint corpus consisting of baseline samples and new
transcribed samples from the target dataset. This joint corpus construction
strategy enriches the size and diversity of the training samples, which is
important for addressing the issue of the small transcribed corpus. In
addition, speed perturbation is applied to augment the new transcribed samples
to further improve the quality of the speech corpus. Three real ATC datasets
are used to validate the proposed ASR model and training strategies. The
experimental results demonstrate that the ASR performance is significantly
improved on all three datasets, with an absolute character error rate only
one-third of that achieved through supervised training alone. The applicability
of the proposed strategies to other ASR approaches is also validated.
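The masking step in the unsupervised pretraining stage is described only at a high level. A minimal sketch of one plausible reading, a SpecAugment-style time mask over spectrogram features, is shown below; the function name, parameters, and (time, freq) feature layout are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def mask_time_spans(features, num_masks=2, max_width=10, rng=None):
    """Zero out random time spans of a (time, freq) feature matrix.

    Hiding short spans of each sample forces a model to represent speech
    from the surrounding context, which is one common way to "improve the
    diversity of the samples without losing their general patterns".
    All parameters here are illustrative, not from the paper.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    masked = features.copy()
    num_frames = masked.shape[0]
    for _ in range(num_masks):
        width = int(rng.integers(1, max_width + 1))
        start = int(rng.integers(0, max(1, num_frames - width)))
        masked[start:start + width, :] = 0.0  # hide this span entirely
    return masked
```

In practice the masked positions would serve as reconstruction targets (or simply as augmentation noise) during the pretraining stage.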
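Speed perturbation, used here to augment the newly transcribed samples, is conventionally done by resampling the waveform at rates such as 0.9, 1.0, and 1.1. The paper does not specify its resampler; the sketch below uses plain linear interpolation as a crude stand-in, with all names and rates being illustrative assumptions.

```python
import numpy as np

def speed_perturb(waveform, rate):
    """Return the waveform played back at `rate` times normal speed.

    rate > 1.0 shortens the signal (faster speech); rate < 1.0
    lengthens it (slower speech). Linear interpolation stands in for
    a proper band-limited resampler.
    """
    n_out = int(round(len(waveform) / rate))
    old_idx = np.arange(len(waveform))
    new_idx = np.linspace(0, len(waveform) - 1, n_out)
    return np.interp(new_idx, old_idx, waveform)

# Typical usage: triple the transcribed corpus with three playback rates.
def augment(waveform, rates=(0.9, 1.0, 1.1)):
    return [speed_perturb(waveform, r) for r in rates]
```

Each perturbed copy keeps the original transcript, so the labeled corpus grows without any extra annotation effort.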
Related papers
- Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs [73.74375912785689]
This paper proposes unified training strategies for speech recognition systems.
We demonstrate that training a single model for all three tasks enhances VSR and AVSR performance.
We also introduce a greedy pseudo-labelling approach to more effectively leverage unlabelled samples.
arXiv Detail & Related papers (2024-11-04T16:46:53Z)
- BoostAdapter: Improving Vision-Language Test-Time Adaptation via Regional Bootstrapping [64.8477128397529]
We propose a test-time adaptation framework that bridges training-required and training-free methods.
We maintain a light-weight key-value memory for feature retrieval from instance-agnostic historical samples and instance-aware boosting samples.
We theoretically justify the rationality behind our method and empirically verify its effectiveness on both the out-of-distribution and the cross-domain datasets.
arXiv Detail & Related papers (2024-10-20T15:58:43Z)
- Data Adaptive Traceback for Vision-Language Foundation Models in Image Classification [34.37262622415682]
We propose a new adaptation framework called Data Adaptive Traceback.
Specifically, we utilize a zero-shot-based method to extract the most downstream task-related subset of the pre-training data.
We adopt a pseudo-label-based semi-supervised technique to reuse the pre-training images and a vision-language contrastive learning method to address the confirmation bias issue in semi-supervised learning.
arXiv Detail & Related papers (2024-07-11T18:01:58Z)
- ACTRESS: Active Retraining for Semi-supervised Visual Grounding [52.08834188447851]
A previous study, RefTeacher, makes the first attempt to tackle this task by adopting the teacher-student framework to provide pseudo confidence supervision and attention-based supervision.
This approach is incompatible with current state-of-the-art visual grounding models, which follow the Transformer-based pipeline.
Our paper proposes the ACTive REtraining approach for Semi-Supervised Visual Grounding, abbreviated as ACTRESS.
arXiv Detail & Related papers (2024-07-03T16:33:31Z)
- Iterative self-transfer learning: A general methodology for response time-history prediction based on small dataset [0.0]
An iterative self-transfer learning method for training neural networks based on small datasets is proposed in this study.
The results show that the proposed method can improve model performance by nearly an order of magnitude on small datasets.
arXiv Detail & Related papers (2023-06-14T18:48:04Z)
- Adaptive Multi-Corpora Language Model Training for Speech Recognition [13.067901680326932]
We introduce a novel adaptive multi-corpora training algorithm that dynamically learns and adjusts the sampling probability of each corpus along the training process.
Compared with static sampling strategy baselines, the proposed approach yields remarkable improvement.
arXiv Detail & Related papers (2022-11-09T06:54:50Z)
- Listen, Adapt, Better WER: Source-free Single-utterance Test-time Adaptation for Automatic Speech Recognition [65.84978547406753]
Test-time Adaptation aims to adapt the model trained on source domains to yield better predictions for test samples.
Single-Utterance Test-time Adaptation (SUTA) is, to the best of our knowledge, the first TTA study in the speech area.
arXiv Detail & Related papers (2022-03-27T06:38:39Z)
- ATCSpeechNet: A multilingual end-to-end speech recognition framework for air traffic control systems [15.527854608553824]
ATCSpeechNet is proposed to tackle the issue of translating communication speech into human-readable text in air traffic control systems.
An end-to-end paradigm is developed to convert speech waveform into text directly, without any feature engineering or lexicon.
Experimental results on the ATCSpeech corpus demonstrate that the proposed approach achieves a high performance with a very small labeled corpus.
arXiv Detail & Related papers (2021-02-17T02:27:09Z)
- Open-set Short Utterance Forensic Speaker Verification using Teacher-Student Network with Explicit Inductive Bias [59.788358876316295]
We propose a pipeline solution to improve speaker verification on a small actual forensic field dataset.
By leveraging large-scale out-of-domain datasets, a knowledge distillation based objective function is proposed for teacher-student learning.
We show that the proposed objective function can efficiently improve the performance of teacher-student learning on short utterances.
arXiv Detail & Related papers (2020-09-21T00:58:40Z)
- One-Shot Object Detection without Fine-Tuning [62.39210447209698]
We introduce a two-stage model consisting of a first stage Matching-FCOS network and a second stage Structure-Aware Relation Module.
We also propose novel training strategies that effectively improve detection performance.
Our method exceeds the state-of-the-art one-shot performance consistently on multiple datasets.
arXiv Detail & Related papers (2020-05-08T01:59:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.