Speech recognition for air traffic control via feature learning and
end-to-end training
- URL: http://arxiv.org/abs/2111.02654v1
- Date: Thu, 4 Nov 2021 06:38:21 GMT
- Title: Speech recognition for air traffic control via feature learning and
end-to-end training
- Authors: Peng Fan, Dongyue Guo, Yi Lin, Bo Yang, Jianwei Zhang
- Abstract summary: We propose a new automatic speech recognition (ASR) system based on feature learning and an end-to-end training procedure for air traffic control (ATC) systems.
The proposed model integrates the feature learning block, recurrent neural network (RNN), and connectionist temporal classification loss.
Thanks to the ability to learn representations from raw waveforms, the proposed model can be optimized in a complete end-to-end manner.
- Score: 8.755785876395363
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we propose a new automatic speech recognition (ASR) system
based on feature learning and an end-to-end training procedure for air traffic
control (ATC) systems. The proposed model integrates the feature learning
block, recurrent neural network (RNN), and connectionist temporal
classification loss to build an end-to-end ASR model. To cope with the complex
acoustic environments of ATC speech, a learning block is designed to extract
informative features from raw waveforms for acoustic modeling, instead of
relying on handcrafted features. Both the SincNet and 1D convolution blocks are
applied to process the raw waveforms, and their outputs are concatenated and
fed to the RNN layers for temporal modeling. Thanks to this ability to learn
representations from raw waveforms, the proposed model can be optimized in a
complete end-to-end manner, i.e., from waveform to text. Finally, the
multilingual issue in the ATC domain is addressed by constructing a combined
vocabulary of Chinese characters and English letters. The proposed approach is
validated on a multilingual real-world corpus (ATCSpeech), and the experimental
results demonstrate that the proposed approach outperforms other baselines,
achieving a 6.9% character error rate.
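To make the architecture concrete, below is a minimal PyTorch sketch of the pipeline the abstract describes: a SincNet-style learnable band-pass front end and a plain 1D-convolution branch both applied to the raw waveform, their outputs concatenated and fed to a bidirectional RNN trained with CTC loss over a combined Chinese-character/English-letter vocabulary. This is an illustrative reconstruction, not the authors' released code: all layer sizes, kernel widths, strides, the `SincConv1d` initialization, and the toy vocabulary are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SincConv1d(nn.Module):
    """Simplified SincNet-style layer: a learnable band-pass filterbank
    applied directly to the raw waveform (hyperparameters assumed)."""
    def __init__(self, out_channels=64, kernel_size=129, sample_rate=8000):
        super().__init__()
        # Learnable band edges in Hz; linear initialization is an assumption.
        low = torch.linspace(30.0, sample_rate / 2 - 200.0, out_channels)
        self.low_hz = nn.Parameter(low.unsqueeze(1))
        self.band_hz = nn.Parameter(torch.full((out_channels, 1), 100.0))
        n = torch.arange(-(kernel_size // 2), kernel_size // 2 + 1) / sample_rate
        self.register_buffer("n", n)                        # time axis, seconds
        self.register_buffer("window", torch.hamming_window(kernel_size))

    def forward(self, x):                                   # x: (B, 1, T)
        f1 = self.low_hz.abs()
        f2 = f1 + self.band_hz.abs()
        # Band-pass = difference of two windowed low-pass sinc filters.
        lp1 = 2 * f1 * torch.sinc(2 * f1 * self.n)
        lp2 = 2 * f2 * torch.sinc(2 * f2 * self.n)
        filters = ((lp2 - lp1) * self.window).unsqueeze(1)  # (C, 1, K)
        return F.conv1d(x, filters, stride=16, padding=self.n.numel() // 2)

class WaveformCTCModel(nn.Module):
    """Raw waveform -> (SincNet branch || Conv1d branch) -> BiLSTM -> CTC."""
    def __init__(self, vocab_size, hidden=256):
        super().__init__()
        self.sinc = SincConv1d(out_channels=64)
        self.conv = nn.Conv1d(1, 64, kernel_size=129, stride=16, padding=64)
        self.rnn = nn.LSTM(128, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, vocab_size)  # vocab includes CTC blank

    def forward(self, wav):                                 # wav: (B, 1, T)
        feats = torch.cat([self.sinc(wav), self.conv(wav)], dim=1)
        feats = feats.transpose(1, 2)                       # (B, T', 128)
        hidden, _ = self.rnn(feats)
        return self.out(hidden).log_softmax(-1)             # (B, T', V)

# Combined vocabulary: CTC blank + English letters + Chinese characters.
letters = list("abcdefghijklmnopqrstuvwxyz")
chinese = ["洞", "拐", "幺"]                 # tiny illustrative subset
vocab = ["<blank>"] + letters + chinese

model = WaveformCTCModel(vocab_size=len(vocab))
wav = torch.randn(2, 1, 8000)               # two 1-second dummy clips
log_probs = model(wav)                      # (2, T', V)
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
targets = torch.randint(1, len(vocab), (2, 12))   # dummy label sequences
loss = ctc(log_probs.transpose(0, 1),       # CTCLoss expects (T, B, V)
           targets,
           input_lengths=torch.full((2,), log_probs.size(1)),
           target_lengths=torch.full((2,), 12))
loss.backward()
```

Greedy CTC decoding would then collapse repeated indices and drop the blank to recover the mixed Chinese/English transcript, matching the waveform-to-text training described above.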
Related papers
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z)
- Improved Contextual Recognition In Automatic Speech Recognition Systems By Semantic Lattice Rescoring [4.819085609772069]
We propose a novel approach for enhancing contextual recognition within ASR systems via semantic lattice processing.
Our solution consists of using Hidden Markov Models and Gaussian Mixture Models (HMM-GMM) along with Deep Neural Networks (DNN) models for better accuracy.
We demonstrate the effectiveness of our proposed framework on the LibriSpeech dataset with empirical analyses.
arXiv Detail & Related papers (2023-10-14T23:16:05Z)
- Text-only domain adaptation for end-to-end ASR using integrated text-to-mel-spectrogram generator [17.44686265224974]
We propose an end-to-end Automatic Speech Recognition (ASR) system that can be trained on transcribed speech data, text-only data, or a mixture of both.
We demonstrate that the proposed training method significantly improves ASR accuracy compared to the system trained on transcribed speech only.
arXiv Detail & Related papers (2023-02-27T18:47:55Z)
- Deliberation Model for On-Device Spoken Language Understanding [69.5587671262691]
We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU).
We show that our approach can significantly reduce the degradation incurred when moving from training on natural speech to training on synthetic speech.
arXiv Detail & Related papers (2022-04-04T23:48:01Z)
- Knowledge Transfer from Large-scale Pretrained Language Models to End-to-end Speech Recognizers [13.372686722688325]
Training of end-to-end speech recognizers always requires transcribed utterances.
This paper proposes a method for alleviating this issue by transferring knowledge from a language model neural network that can be pretrained with text-only data.
arXiv Detail & Related papers (2022-02-16T07:02:24Z)
- Factorized Neural Transducer for Efficient Language Model Adaptation [51.81097243306204]
We propose a novel model, factorized neural Transducer, by factorizing the blank and vocabulary prediction.
It is expected that this factorization can transfer the improvement of the standalone language model to the Transducer for speech recognition.
We demonstrate that the proposed factorized neural Transducer yields 15% to 20% WER improvements when out-of-domain text data is used for language model adaptation.
arXiv Detail & Related papers (2021-09-27T15:04:00Z)
- End-to-End Spoken Language Understanding using RNN-Transducer ASR [14.267028645397266]
We propose an end-to-end trained spoken language understanding (SLU) system that extracts transcripts, intents and slots from an input speech utterance.
It consists of a streaming recurrent neural network transducer (RNNT) based automatic speech recognition (ASR) model connected to a neural natural language understanding (NLU) model through a neural interface.
arXiv Detail & Related papers (2021-06-30T09:20:32Z)
- ATCSpeechNet: A multilingual end-to-end speech recognition framework for air traffic control systems [15.527854608553824]
ATCSpeechNet is proposed to tackle the issue of translating communication speech into human-readable text in air traffic control systems.
An end-to-end paradigm is developed to convert speech waveform into text directly, without any feature engineering or lexicon.
Experimental results on the ATCSpeech corpus demonstrate that the proposed approach achieves a high performance with a very small labeled corpus.
arXiv Detail & Related papers (2021-02-17T02:27:09Z)
- Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z)
- AutoSpeech: Neural Architecture Search for Speaker Recognition [108.69505815793028]
We propose the first neural architecture search approach for speaker recognition, named AutoSpeech.
Our algorithm first identifies the optimal operation combination in a neural cell and then derives a CNN model by stacking the neural cell multiple times.
Results demonstrate that the derived CNN architectures significantly outperform current speaker recognition systems based on VGG-M, ResNet-18, and ResNet-34 backbones, while enjoying lower model complexity.
arXiv Detail & Related papers (2020-05-07T02:53:47Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all listed content) and is not responsible for any consequences arising from its use.