BERTraffic: A Robust BERT-Based Approach for Speaker Change Detection
and Role Identification of Air-Traffic Communications
- URL: http://arxiv.org/abs/2110.05781v1
- Date: Tue, 12 Oct 2021 07:25:12 GMT
- Title: BERTraffic: A Robust BERT-Based Approach for Speaker Change Detection
and Role Identification of Air-Traffic Communications
- Authors: Juan Zuluaga-Gomez and Seyyed Saeed Sarfjoo and Amrutha Prasad and
Iuliia Nigmatulina and Petr Motlicek and Oliver Ohneiser and Hartmut Helmke
- Abstract summary: A common problem arises when the Speech Activity Detection (SAD) or diarization system fails, leaving two or more single-speaker segments in the same recording.
We developed a system that combines the segmentation of a SAD module with a BERT-based model that performs Speaker Change Detection (SCD) and Speaker Role Identification (SRI) based on ASR transcripts (i.e., diarization + SRI).
The proposed model reaches up to 0.90/0.95 F1-score on ATCO/pilot for SRI on several test sets.
- Score: 2.270534915073284
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic Speech Recognition (ASR) is gaining special interest in Air Traffic
Control (ATC). ASR allows transcribing the communications between air traffic
controllers (ATCOs) and pilots. These transcriptions are used to extract ATC
command types and named entities such as aircraft callsigns. One common problem
arises when the Speech Activity Detection (SAD) or diarization system fails,
leaving two or more single-speaker segments in the same recording and
jeopardizing the overall system's performance. We developed a system that
combines the segmentation of a SAD module with a BERT-based model that performs
Speaker Change Detection (SCD) and Speaker Role Identification (SRI) based on
ASR transcripts (i.e., diarization + SRI). This research demonstrates on a
real-life ATC test set that performing diarization directly on textual data
surpasses acoustic-level diarization. The proposed model reaches up to
~0.90/~0.95 F1-score on ATCO/pilot for SRI on several test sets. The text-based
diarization system brings a 27% relative improvement on Diarization Error Rate
(DER) compared to standard acoustic-based diarization. These results were obtained on
ASR transcripts of a challenging ATC test set with an estimated ~13% word error
rate, validating the approach's robustness even on noisy ASR transcripts.
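The idea of diarizing directly on text can be illustrated with a minimal sketch. It assumes the model emits one role tag per transcribed word in a BIO-style scheme ("B-ATCO", "I-ATCO", "B-PILOT", "I-PILOT"); the tag names, the function name `segments_from_tags`, and the example utterance are hypothetical and not taken from the abstract. Grouping the tags into runs yields both the speaker change points (SCD) and the role of each segment (SRI).

```python
from typing import List, Tuple


def segments_from_tags(words: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group words into (role, text) segments from per-word role tags.

    Assumed (hypothetical) tag scheme: "B-<ROLE>" opens a segment and
    "I-<ROLE>" continues it. A change of role also opens a new segment,
    so a missing "B-" tag does not merge two speakers.
    """
    if len(words) != len(tags):
        raise ValueError("words and tags must have the same length")

    grouped: List[Tuple[str, List[str]]] = []
    for word, tag in zip(words, tags):
        prefix, _, role = tag.partition("-")
        starts_new = prefix == "B" or not grouped or grouped[-1][0] != role
        if starts_new:
            grouped.append((role, [word]))
        else:
            grouped[-1][1].append(word)

    return [(role, " ".join(ws)) for role, ws in grouped]


# Invented example: one recording containing an ATCO instruction
# followed by the pilot readback, with no acoustic segmentation.
words = (
    "lufthansa four five six descend flight level eight zero "
    "descending flight level eight zero lufthansa four five six"
).split()
tags = ["B-ATCO"] + ["I-ATCO"] * 8 + ["B-PILOT"] + ["I-PILOT"] * 8

for role, text in segments_from_tags(words, tags):
    print(f"{role}: {text}")
```

In this sketch the tags are supplied by hand; in a system like the one described they would come from a fine-tuned token classifier running on the ASR output.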
Related papers
- Joint vs Sequential Speaker-Role Detection and Automatic Speech Recognition for Air-traffic Control [60.35553925189286]
We propose a transformer-based joint ASR-SRD system that solves both tasks jointly while relying on a standard ASR architecture.
We compare this joint system against two cascaded approaches for ASR and SRD on multiple ATC datasets.
arXiv Detail & Related papers (2024-06-19T21:11:01Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- A Virtual Simulation-Pilot Agent for Training of Air Traffic Controllers [0.797970449705065]
We propose a novel virtual simulation-pilot engine for speeding up air traffic controller (ATCo) training.
The engine receives spoken communications from ATCo trainees, and it performs automatic speech recognition and understanding.
To the best of our knowledge, this is the first work fully based on open-source ATC resources and AI tools.
arXiv Detail & Related papers (2023-04-16T17:45:21Z)
- ATCO2 corpus: A Large-Scale Dataset for Research on Automatic Speech
Recognition and Natural Language Understanding of Air Traffic Control
Communications [51.24043482906732]
We introduce the ATCO2 corpus, a dataset that aims at fostering research on the challenging air traffic control (ATC) field.
The ATCO2 corpus is split into three subsets.
We expect the ATCO2 corpus will foster research on robust ASR and NLU.
arXiv Detail & Related papers (2022-11-08T07:26:45Z)
- Call-sign recognition and understanding for noisy air-traffic
transcripts using surveillance information [72.20674534231314]
Air traffic control (ATC) relies on communication via speech between pilot and air-traffic controller (ATCO).
The call-sign, as unique identifier for each flight, is used to address a specific pilot by the ATCO.
We propose a new call-sign recognition and understanding (CRU) system that addresses this issue.
The recognizer is trained to identify call-signs in noisy ATC transcripts and convert them into the standard International Civil Aviation Organization (ICAO) format.
arXiv Detail & Related papers (2022-04-13T11:30:42Z)
- Attention-based Multi-hypothesis Fusion for Speech Summarization [83.04957603852571]
Speech summarization can be achieved by combining automatic speech recognition (ASR) and text summarization (TS).
ASR errors directly affect the quality of the output summary in the cascade approach.
We propose a cascade speech summarization model that is robust to ASR errors and that exploits multiple hypotheses generated by ASR to attenuate the effect of ASR errors on the summary.
arXiv Detail & Related papers (2021-11-16T03:00:29Z)
- A Comparative Study of Speaker Role Identification in Air Traffic
Communication Using Deep Learning Approaches [9.565067058593316]
We formulate the speaker role identification (SRI) task of controller-pilot communication as a binary classification problem.
To ablate the impacts of the comparative approaches, various advanced neural network architectures are applied.
The proposed MMSRINet shows competitive performance and better robustness than the other methods on both seen and unseen data.
arXiv Detail & Related papers (2021-11-03T07:00:20Z)
- Grammar Based Identification Of Speaker Role For Improving ATCO And
Pilot ASR [1.1391158217994781]
Assistant Based Speech Recognition (ABSR) for air traffic control is generally trained by pooling both Air Traffic Controller (ATCO) and pilot data.
Due to data imbalance of ATCO and pilot and varying acoustic conditions, the ASR performance is usually significantly better for ATCOs than pilots.
arXiv Detail & Related papers (2021-08-27T08:40:08Z)
- Contextual Semi-Supervised Learning: An Approach To Leverage
Air-Surveillance and Untranscribed ATC Data in ASR Systems [0.6465251961564605]
The callsign used to address an airplane is an essential part of all ATCo-pilot communications.
We propose a two-step approach to add contextual knowledge during semi-supervised training to reduce the ASR system error rates.
arXiv Detail & Related papers (2021-04-08T09:53:54Z)
- Improving Readability for Automatic Speech Recognition Transcription [50.86019112545596]
We propose a novel NLP task called ASR post-processing for readability (APR).
APR aims to transform the noisy ASR output into a readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker.
We compare fine-tuned models based on several open-sourced and adapted pre-trained models with the traditional pipeline method.
arXiv Detail & Related papers (2020-04-09T09:26:42Z)