ATCSpeechNet: A multilingual end-to-end speech recognition framework for
air traffic control systems
- URL: http://arxiv.org/abs/2102.08535v1
- Date: Wed, 17 Feb 2021 02:27:09 GMT
- Authors: Yi Lin, Bo Yang, Linchao Li, Dongyue Guo, Jianwei Zhang, Hu Chen, Yi
Zhang
- Abstract summary: ATCSpeechNet is proposed to tackle the issue of translating communication speech into human-readable text in air traffic control systems.
An end-to-end paradigm is developed to convert speech waveform into text directly, without any feature engineering or lexicon.
Experimental results on the ATCSpeech corpus demonstrate that the proposed approach achieves a high performance with a very small labeled corpus.
- Score: 15.527854608553824
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, a multilingual end-to-end framework, called ATCSpeechNet,
is proposed to tackle the issue of translating communication speech into
human-readable text in air traffic control (ATC) systems. In the proposed
framework, we focus on integrating the multilingual automatic speech
recognition (ASR) into one model, in which an end-to-end paradigm is developed
to convert speech waveform into text directly, without any feature engineering
or lexicon. To compensate for the shortcomings of handcrafted feature
engineering under ATC-specific challenges, a speech representation learning (SRL)
network is proposed to capture robust and discriminative speech representations
from the raw wave. The self-supervised training strategy is adopted to optimize
the SRL network from unlabeled data, and further to predict the speech
features, i.e., wave-to-feature. An end-to-end architecture is improved to
complete the ASR task, in which a grapheme-based modeling unit is applied to
address the multilingual ASR issue. Facing the problem of small transcribed
samples in the ATC domain, an unsupervised approach with mask prediction is
applied to pre-train the backbone network of the ASR model on unlabeled data by
a feature-to-feature process. Finally, by integrating the SRL with ASR, an
end-to-end multilingual ASR framework is formulated in a supervised manner,
which is able to translate the raw wave into text in one model, i.e.,
wave-to-text. Experimental results on the ATCSpeech corpus demonstrate that the
proposed approach achieves a high performance with a very small labeled corpus
and less resource consumption, only 4.20% label error rate on the 58-hour
transcribed corpus. Compared to the baseline model, the proposed approach
obtains over 100% relative performance improvement, which grows further as the
size of the transcribed corpus increases.
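The feature-to-feature mask-prediction pre-training described in the abstract can be sketched as follows; the span length, mask ratio, and L2 reconstruction loss here are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_frames(features, mask_ratio=0.15, span=4):
    """Randomly zero out contiguous spans of feature frames so the
    backbone can be pre-trained to predict them (feature-to-feature).
    Hyperparameters are assumptions for this sketch."""
    T, D = features.shape
    masked = features.copy()
    targets = np.zeros(T, dtype=bool)
    n_starts = max(1, int(T * mask_ratio / span))
    starts = rng.choice(T - span, size=n_starts, replace=False)
    for s in starts:
        masked[s:s + span] = 0.0      # hide this span from the model
        targets[s:s + span] = True    # remember which frames to predict
    return masked, targets

def reconstruction_loss(predicted, original, targets):
    """L2 loss computed only on the masked frames."""
    diff = predicted[targets] - original[targets]
    return float(np.mean(diff ** 2))

# Toy example: 100 frames of 40-dimensional speech features.
feats = rng.standard_normal((100, 40))
masked, targets = mask_frames(feats)
# A perfect predictor recovers the original frames, giving zero loss.
assert reconstruction_loss(feats, feats, targets) == 0.0
```

In a real setup the masked sequence would be fed to the backbone network and the loss back-propagated; the unmasked frames pass through unchanged.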
Related papers
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z)
- REBORN: Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR [54.64158282822995]
We propose REBORN, Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR.
REBORN alternates between training a segmentation model that predicts the boundaries of the segmental structures in speech signals and training the phoneme prediction model, whose input is the speech feature segmented by the segmentation model, to predict a phoneme transcription.
We conduct extensive experiments and find that under the same setting, REBORN outperforms all prior unsupervised ASR models on LibriSpeech, TIMIT, and five non-English languages in Multilingual LibriSpeech.
arXiv Detail & Related papers (2024-02-06T13:26:19Z)
- Communication-Efficient Personalized Federated Learning for Speech-to-Text Tasks [66.78640306687227]
To protect privacy and meet legal regulations, federated learning (FL) has gained significant attention for training speech-to-text (S2T) systems.
The commonly used FL approach (i.e., FedAvg) in S2T tasks typically suffers from extensive communication overhead.
We propose a personalized federated S2T framework that introduces FedLoRA, a lightweight LoRA module for client-side tuning and interaction with the server, and FedMem, a global model equipped with a $k$-near
arXiv Detail & Related papers (2024-01-18T15:39:38Z)
- Improved Contextual Recognition In Automatic Speech Recognition Systems By Semantic Lattice Rescoring [4.819085609772069]
We propose a novel approach for enhancing contextual recognition within ASR systems via semantic lattice processing.
Our solution consists of using Hidden Markov Models and Gaussian Mixture Models (HMM-GMM) along with Deep Neural Networks (DNN) models for better accuracy.
We demonstrate the effectiveness of our proposed framework on the LibriSpeech dataset with empirical analyses.
arXiv Detail & Related papers (2023-10-14T23:16:05Z)
- Using External Off-Policy Speech-To-Text Mappings in Contextual End-To-End Automated Speech Recognition [19.489794740679024]
We investigate the potential of leveraging external knowledge, particularly through off-policy key-value stores generated with text-to-speech methods.
In our approach, audio embeddings captured from text-to-speech, along with semantic text embeddings, are used to bias ASR.
Experiments on LibriSpeech and in-house voice assistant/search datasets show that the proposed approach can reduce domain adaptation time by up to 1K GPU-hours.
arXiv Detail & Related papers (2023-01-06T22:32:50Z)
- Deliberation Model for On-Device Spoken Language Understanding [69.5587671262691]
We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU).
We show that our approach can significantly reduce the degradation when moving from natural speech to synthetic speech training.
arXiv Detail & Related papers (2022-04-04T23:48:01Z)
- Speech recognition for air traffic control via feature learning and end-to-end training [8.755785876395363]
We propose a new automatic speech recognition (ASR) system based on feature learning and an end-to-end training procedure for air traffic control (ATC) systems.
The proposed model integrates the feature learning block, recurrent neural network (RNN), and connectionist temporal classification loss.
Thanks to the ability to learn representations from raw waveforms, the proposed model can be optimized in a complete end-to-end manner.
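The connectionist temporal classification (CTC) objective mentioned above is typically paired at inference with greedy decoding: take the per-frame argmax, collapse consecutive repeats, then drop the blank symbol. A minimal sketch (the three-symbol vocabulary and blank index 0 are assumptions for illustration):

```python
import numpy as np

BLANK = 0  # index of the CTC blank symbol (an assumption for this sketch)

def ctc_greedy_decode(logits):
    """Greedy CTC decoding: per-frame argmax, collapse consecutive
    repeats, then remove blank symbols."""
    path = np.argmax(logits, axis=-1)  # best label per frame
    collapsed = [p for i, p in enumerate(path) if i == 0 or p != path[i - 1]]
    return [int(p) for p in collapsed if p != BLANK]

# Toy example: 6 frames over a 3-symbol vocabulary {blank, 'a', 'b'}.
logits = np.array([
    [0.1, 0.8, 0.1],    # 'a'
    [0.1, 0.8, 0.1],    # 'a' (repeat, collapsed)
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.1, 0.8],    # 'b'
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.8, 0.1],    # 'a'
])
assert ctc_greedy_decode(logits) == [1, 2, 1]  # 'a', 'b', 'a'
```

Collapsing before removing blanks is what lets CTC emit the same symbol twice in a row ('a', blank, 'a' decodes to two a's, not one).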
arXiv Detail & Related papers (2021-11-04T06:38:21Z)
- A Comparative Study of Speaker Role Identification in Air Traffic Communication Using Deep Learning Approaches [9.565067058593316]
We formulate the speaker role identification (SRI) task of controller-pilot communication as a binary classification problem.
To ablate the impacts of the comparative approaches, various advanced neural network architectures are applied.
The proposed MMSRINet shows competitive performance and stronger robustness than the other methods on both seen and unseen data.
arXiv Detail & Related papers (2021-11-03T07:00:20Z)
- Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
We propose to predict the self-supervised discrete representations learned from an unlabeled speech corpus instead.
When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass.
arXiv Detail & Related papers (2021-07-12T17:40:43Z)
- Improving speech recognition models with small samples for air traffic control systems [9.322392779428505]
In this work, a novel training approach based on pretraining and transfer learning is proposed to address the issue of small training samples.
Three real ATC datasets are used to validate the proposed ASR model and training strategies.
The experimental results demonstrate that the ASR performance is significantly improved on all three datasets.
arXiv Detail & Related papers (2021-02-16T08:28:52Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.