Automatic Speech Recognition Benchmark for Air-Traffic Communications
- URL: http://arxiv.org/abs/2006.10304v2
- Date: Thu, 13 Aug 2020 06:46:34 GMT
- Title: Automatic Speech Recognition Benchmark for Air-Traffic Communications
- Authors: Juan Zuluaga-Gomez and Petr Motlicek and Qingran Zhan and Karel Vesely
and Rudolf Braun
- Abstract summary: CleanSky EC-H2020 ATCO2 aims to develop an ASR-based platform to collect, organize and automatically pre-process ATCo speech-data from air space.
Cross-accent errors caused by speakers' accents are minimized by the amount of training data, making the system feasible for ATC environments.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advances in Automatic Speech Recognition (ASR) over the last decade opened
new areas of speech-based automation such as in Air-Traffic Control (ATC)
environments. Currently, voice and data-link communications are the only means
of contact between pilots and Air-Traffic Controllers (ATCos); the former is
the most widely used, while the latter is a non-spoken method mandatory for
oceanic messages and limited to some domestic uses. ASR systems in ATC
environments face increased complexity due to the accents of non-English
speakers, cockpit noise, speaker-dependent biases, and small in-domain ATC
databases for training. Here, we introduce CleanSky EC-H2020 ATCO2, a project
that aims to develop an ASR-based platform to collect, organize, and
automatically pre-process ATCo speech data from the airspace. This
paper conveys an exploratory benchmark of several state-of-the-art ASR models
trained on more than 170 hours of ATCo speech-data. We demonstrate that the
cross-accent errors caused by speakers' accents are minimized by the amount of
training data, making the system feasible for ATC environments. The developed
ASR system achieves an average word error rate (WER) of 7.75% across four databases. An
additional 35% relative improvement in WER is achieved on one test set when
training a TDNNF system with byte-pair encoding.
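The reported WER is the standard edit-distance metric: substitutions, deletions, and insertions against a reference transcript, normalized by the reference length. A minimal sketch of its computation (the example utterance below is hypothetical, not from the paper's test sets):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# Hypothetical ATC-style utterance: the hypothesis drops one word ("flight").
ref = "lufthansa one two three descend flight level eight zero"
hyp = "lufthansa one two three descend level eight zero"
print(round(wer(ref, hyp), 3))  # → 0.111 (1 deletion / 9 reference words)
```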
Related papers
- Joint vs Sequential Speaker-Role Detection and Automatic Speech Recognition for Air-traffic Control [60.35553925189286]
We propose a transformer-based joint ASR-SRD system that solves both tasks jointly while relying on a standard ASR architecture.
We compare this joint system against two cascaded approaches for ASR and SRD on multiple ATC datasets.
arXiv Detail & Related papers (2024-06-19T21:11:01Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- Lessons Learned in ATCO2: 5000 hours of Air Traffic Control Communications for Robust Automatic Speech Recognition and Understanding [3.4713477325880464]
ATCO2 project aimed to develop a unique platform to collect and preprocess large amounts of ATC data from airspace in real time.
This paper reviews previous work from ATCO2 partners, including robust automatic speech recognition.
We believe that the pipeline developed during the ATCO2 project, along with the open-sourcing of its data, will encourage research in the ATC field.
arXiv Detail & Related papers (2023-05-02T02:04:33Z)
- A Virtual Simulation-Pilot Agent for Training of Air Traffic Controllers [0.797970449705065]
We propose a novel virtual simulation-pilot engine for speeding up air traffic controller (ATCo) training.
The engine receives spoken communications from ATCo trainees, and it performs automatic speech recognition and understanding.
To the best of our knowledge, this is the first work fully based on open-source ATC resources and AI tools.
arXiv Detail & Related papers (2023-04-16T17:45:21Z)
- ATCO2 corpus: A Large-Scale Dataset for Research on Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications [51.24043482906732]
We introduce the ATCO2 corpus, a dataset that aims at fostering research on the challenging air traffic control (ATC) field.
The ATCO2 corpus is split into three subsets.
We expect the ATCO2 corpus will foster research on robust ASR and NLU.
arXiv Detail & Related papers (2022-11-08T07:26:45Z)
- How Does Pre-trained Wav2Vec2.0 Perform on Domain Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications [1.3800173438685746]
We study the impact on performance when the data substantially differs between the pre-training and downstream fine-tuning phases.
We benchmark the proposed models on four challenging ATC test sets.
We also study the impact of fine-tuning data size on WERs, going from 5 minutes (few-shot) to 15 hours.
arXiv Detail & Related papers (2022-03-31T06:10:42Z)
- CI-AVSR: A Cantonese Audio-Visual Speech Dataset for In-car Command Recognition [91.33781557979819]
We introduce a new dataset, Cantonese In-car Audio-Visual Speech Recognition (CI-AVSR)
It consists of 4,984 samples (8.3 hours) of 200 in-car commands recorded by 30 native Cantonese speakers.
We provide detailed statistics of both the clean and the augmented versions of our dataset.
arXiv Detail & Related papers (2022-01-11T06:32:12Z)
- BERTraffic: A Robust BERT-Based Approach for Speaker Change Detection and Role Identification of Air-Traffic Communications [2.270534915073284]
When the Speech Activity Detection (SAD) or diarization system fails, two or more single-speaker segments end up in the same recording.
We developed a system that combines the segmentation of a SAD module with a BERT-based model that performs Speaker Change Detection (SCD) and Speaker Role Identification (SRI) based on ASR transcripts (i.e., diarization + SRI).
The proposed model reaches up to 0.90/0.95 F1-score on ATCo/pilot for SRI on several test sets.
arXiv Detail & Related papers (2021-10-12T07:25:12Z)
- Contextual Semi-Supervised Learning: An Approach To Leverage Air-Surveillance and Untranscribed ATC Data in ASR Systems [0.6465251961564605]
The callsign used to address an airplane is an essential part of all ATCo-pilot communications.
We propose a two-step approach that adds contextual knowledge during semi-supervised training to reduce the ASR system's error rates.
arXiv Detail & Related papers (2021-04-08T09:53:54Z)
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer establishes a competitive result with a 22.2% character error rate (CER) and a 38.9% word error rate (WER).
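BPE-dropout stochastically skips merge operations during byte-pair-encoding segmentation, so the same word is split into varying subword units across training epochs. A minimal sketch of the idea, using a toy hand-written merge table (a real table is learned from a corpus):

```python
import random

# Toy merge table in learned priority order (hypothetical; a real table
# is learned from corpus statistics).
MERGES = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]

def bpe_dropout_encode(word: str, dropout: float, rng: random.Random) -> list[str]:
    """Segment a word with BPE, randomly skipping each merge with prob `dropout`.

    dropout=0.0 gives standard deterministic BPE; dropout>0 yields varied
    subword segmentations of the same word (the augmentation effect).
    """
    symbols = list(word)
    for merge in MERGES:  # apply merges in priority order
        i = 0
        while i < len(symbols) - 1:
            if (symbols[i], symbols[i + 1]) == merge and rng.random() >= dropout:
                symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
            else:
                i += 1
    return symbols

rng = random.Random(0)
print(bpe_dropout_encode("lower", 0.0, rng))  # standard BPE → ['lower']
print(bpe_dropout_encode("lower", 0.5, rng))  # stochastic segmentation, varies per call
```

At training time, re-encoding each utterance per epoch with dropout > 0 exposes the acoustic model to multiple subword decompositions of the same word, which is the augmentation effect the paper's summary describes.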
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
- You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.
Our system establishes a competitive result for end-to-end ASR trained on LibriSpeech train-clean-100 set with WER 4.3% for test-clean and 13.5% for test-other.
arXiv Detail & Related papers (2020-05-14T17:24:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.