A Comparative Study of Speaker Role Identification in Air Traffic
Communication Using Deep Learning Approaches
- URL: http://arxiv.org/abs/2111.02041v1
- Date: Wed, 3 Nov 2021 07:00:20 GMT
- Title: A Comparative Study of Speaker Role Identification in Air Traffic
Communication Using Deep Learning Approaches
- Authors: Dongyue Guo, Jianwei Zhang, Bo Yang, Yi Lin
- Abstract summary: We formulate the speaker role identification (SRI) task of controller-pilot communication as a binary classification problem.
To ablate the impacts of the comparative approaches, various advanced neural network architectures are applied to optimize the text-based and speech-based methods.
The proposed MMSRINet shows competitive performance and robustness compared with the other methods on both seen and unseen data.
- Score: 9.565067058593316
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic spoken instruction understanding (SIU) of controller-pilot
conversations in air traffic control (ATC) requires not only recognizing the
words and semantics of the speech but also determining the role of the speaker.
However, few published works on automatic understanding systems for air traffic
communication focus on speaker role identification (SRI). In this paper, we
formulate the SRI task of controller-pilot communication as a binary
classification problem. Furthermore, text-based, speech-based, and multi-modal
(speech and text) methods are proposed to achieve a comprehensive comparison on
the SRI task. To ablate the impacts of the comparative approaches, various
advanced neural network architectures are applied to optimize the
implementations of the text-based and speech-based methods. Most importantly, a
multi-modal speaker role identification network (MMSRINet) is designed to
achieve the SRI task by considering both speech and textual modality features.
To aggregate the modality features, a modal fusion module is proposed to fuse
and squeeze the acoustic and textual representations with a modal attention
mechanism and a self-attention pooling layer, respectively. Finally, the
comparative approaches are validated on the ATCSpeech corpus collected from a
real-world ATC environment. The experimental results demonstrate that all the
comparative approaches work for the SRI task, and the proposed MMSRINet shows
competitive performance and robustness compared with the other methods on both
seen and unseen data, achieving 98.56% and 98.08% accuracy, respectively.
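
The abstract describes the fusion design only at a high level, so the block below is a minimal sketch of the stated idea: a modal attention mechanism that re-weights the acoustic and textual representations, followed by a self-attention pooling layer that squeezes the fused sequence into a single vector for binary classification. All module names, dimensions, and the exact attention formulation are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of the modal fusion idea described in the abstract: modal
# attention re-weights the two modality streams, and self-attention pooling
# squeezes the fused sequence into one vector for binary (pilot/controller)
# classification. Layer choices and dimensions are assumptions.
import torch
import torch.nn as nn

class ModalFusion(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.modal_attn = nn.Linear(dim, 1)   # scores each modality
        self.pool_attn = nn.Linear(dim, 1)    # scores each time step
        self.classifier = nn.Linear(dim, 2)   # pilot vs. controller logits

    def forward(self, acoustic, textual):
        # acoustic, textual: (batch, time, dim) outputs of the modality encoders.
        summaries = torch.stack([acoustic.mean(1), textual.mean(1)], dim=1)  # (B, 2, D)
        weights = torch.softmax(self.modal_attn(summaries), dim=1)           # (B, 2, 1)
        fused = torch.cat(
            [weights[:, 0:1] * acoustic, weights[:, 1:2] * textual], dim=1
        )                                                                    # (B, Ta+Tt, D)
        scores = torch.softmax(self.pool_attn(fused), dim=1)                 # (B, T, 1)
        pooled = (scores * fused).sum(dim=1)                                 # (B, D)
        return self.classifier(pooled)
```
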
Related papers
- Joint vs Sequential Speaker-Role Detection and Automatic Speech Recognition for Air-traffic Control [60.35553925189286]
We propose a transformer-based joint ASR-SRD system that solves both tasks jointly while relying on a standard ASR architecture.
We compare this joint system against two cascaded approaches for ASR and SRD on multiple ATC datasets.
arXiv Detail & Related papers (2024-06-19T21:11:01Z)
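
The summary above does not say how the two tasks are combined inside a standard ASR architecture; a common way to get joint ASR and role detection from one seq2seq model is to prepend a role tag to each target transcript so the decoder emits the role before the words. The snippet below illustrates only that target-formatting idea; the tag strings and example are hypothetical and not taken from the paper.

```python
# Hypothetical target formatting for joint ASR + speaker-role detection:
# prepend a role tag so a standard seq2seq ASR decoder predicts the role first.
ROLE_TAGS = {"controller": "<atco>", "pilot": "<pilot>"}

def make_joint_target(transcript: str, role: str) -> str:
    """Decoder target for one utterance, e.g. '<atco> descend flight level eight zero'."""
    return f"{ROLE_TAGS[role]} {transcript}"

def parse_joint_hypothesis(hypothesis: str) -> tuple[str, str]:
    """Split a decoded hypothesis back into (role, transcript)."""
    tag, _, text = hypothesis.partition(" ")
    role = {v: k for k, v in ROLE_TAGS.items()}.get(tag, "unknown")
    return role, text
```
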
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
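
As a rough illustration of how two encoders can be pulled into a joint phoneme-speech space, the sketch below uses a symmetric contrastive (CLIP-style) loss over pooled utterance embeddings; the actual CTAP objective works at a finer, frame-level granularity, and all names and shapes here are assumptions.

```python
# Symmetric contrastive loss between paired speech and phoneme embeddings
# (utterance-level for brevity; CTAP itself is frame-level). Assumed shapes.
import torch
import torch.nn.functional as F

def contrastive_loss(speech_emb, phoneme_emb, temperature=0.07):
    # speech_emb, phoneme_emb: (batch, dim), one pooled vector per utterance.
    speech = F.normalize(speech_emb, dim=-1)
    phoneme = F.normalize(phoneme_emb, dim=-1)
    logits = speech @ phoneme.t() / temperature              # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs sit on the diagonal; contrast in both directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```
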
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- Cross-modal Audio-visual Co-learning for Text-independent Speaker Verification [55.624946113550195]
This paper proposes a cross-modal speech co-learning paradigm.
Two cross-modal boosters are introduced based on an audio-visual pseudo-siamese structure to learn the modality-transformed correlation.
Experimental results on the LRSLip3, GridLip, LomGridLip, and VoxLip datasets demonstrate that our proposed method achieves 60% and 20% average relative performance improvement.
arXiv Detail & Related papers (2023-02-22T10:06:37Z)
- VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose a unified cross-modal representation learning framework, VATLM (Visual-Audio-Text Language Model).
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
arXiv Detail & Related papers (2022-11-21T09:10:10Z)
- Bridging Speech and Textual Pre-trained Models with Unsupervised ASR [70.61449720963235]
This work proposes a simple yet efficient unsupervised paradigm that connects speech and textual pre-trained models.
We show that unsupervised automatic speech recognition (ASR) can improve the representations from speech self-supervised models.
Notably, on spoken question answering, we reach the state-of-the-art result over the challenging NMSQA benchmark.
arXiv Detail & Related papers (2022-11-06T04:50:37Z)
- Direction-Aware Joint Adaptation of Neural Speech Enhancement and Recognition in Real Multiparty Conversational Environments [21.493664174262737]
This paper describes noisy speech recognition for an augmented reality headset that helps verbal communication within real multiparty conversational environments.
We propose a semi-supervised adaptation method that jointly updates the mask estimator and the ASR model at run-time using clean speech signals with ground-truth transcriptions and noisy speech signals with highly-confident estimated transcriptions.
arXiv Detail & Related papers (2022-07-15T03:43:35Z)
- Leveraging Acoustic Contextual Representation by Audio-textual Cross-modal Learning for Conversational ASR [25.75615870266786]
We propose an audio-textual cross-modal representation extractor to learn contextual representations directly from preceding speech.
The effectiveness of the proposed approach is validated on several Mandarin conversation corpora.
arXiv Detail & Related papers (2022-07-03T13:32:24Z)
- BERTraffic: A Robust BERT-Based Approach for Speaker Change Detection and Role Identification of Air-Traffic Communications [2.270534915073284]
When the Speech Activity Detection (SAD) or diarization system fails, two or more single-speaker segments can end up in the same recording.
We developed a system that combines the segmentation of a SAD module with a BERT-based model that performs Speaker Change Detection (SCD) and Speaker Role Identification (SRI) based on ASR transcripts (i.e., diarization + SRI).
The proposed model reaches up to 0.90/0.95 F1-score on ATCO/pilot for SRI on several test sets.
arXiv Detail & Related papers (2021-10-12T07:25:12Z)
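
One plausible reading of the described text-based pipeline is token-level classification over a merged ASR transcript: a BERT encoder tags each token with a speaker role, and a change of tag marks a speaker change. The sketch below shows only that reading with an assumed two-label set; it is not the authors' code.

```python
# Hypothetical diarization + SRI on ASR text as BERT token classification:
# tag each token as ATCO or PILOT; boundaries between runs of the same tag
# are the detected speaker changes. Labels and model choice are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["ATCO", "PILOT"]
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS)
)

text = ("lufthansa one two three descend flight level eight zero "
        "descending flight level eight zero lufthansa one two three")
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                  # (1, seq_len, num_labels)
tags = [LABELS[i] for i in logits.argmax(-1)[0].tolist()]
```
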
- ATCSpeechNet: A multilingual end-to-end speech recognition framework for air traffic control systems [15.527854608553824]
ATCSpeechNet is proposed to tackle the issue of translating communication speech into human-readable text in air traffic control systems.
An end-to-end paradigm is developed to convert speech waveform into text directly, without any feature engineering or lexicon.
Experimental results on the ATCSpeech corpus demonstrate that the proposed approach achieves a high performance with a very small labeled corpus.
arXiv Detail & Related papers (2021-02-17T02:27:09Z)
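
The summary only states that speech waveforms are mapped to text directly, without hand-crafted features or a lexicon; a minimal sketch of that general end-to-end recipe is a small convolutional encoder over raw audio trained with a character-level CTC loss, shown below. The architecture, vocabulary, and dummy data are assumptions and not ATCSpeechNet itself.

```python
# Minimal end-to-end waveform-to-text sketch with a character-level CTC loss;
# illustrates the general recipe only, not the ATCSpeechNet architecture.
import torch
import torch.nn as nn

VOCAB = ["<blank>"] + list("abcdefghijklmnopqrstuvwxyz '")

class TinyWaveToText(nn.Module):
    def __init__(self, vocab_size=len(VOCAB)):
        super().__init__()
        self.encoder = nn.Sequential(                    # raw samples -> frames
            nn.Conv1d(1, 64, kernel_size=80, stride=16), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=7, stride=2), nn.ReLU(),
        )
        self.head = nn.Linear(128, vocab_size)           # per-frame char logits

    def forward(self, wav):                              # wav: (batch, samples)
        feats = self.encoder(wav.unsqueeze(1))           # (batch, 128, frames)
        return self.head(feats.transpose(1, 2))          # (batch, frames, vocab)

model = TinyWaveToText()
wav = torch.randn(2, 16000)                              # two 1-second dummy clips
log_probs = model(wav).log_softmax(-1).transpose(0, 1)   # (frames, batch, vocab)
targets = torch.randint(1, len(VOCAB), (2, 12))          # dummy character targets
loss = nn.CTCLoss(blank=0)(
    log_probs, targets,
    input_lengths=torch.full((2,), log_probs.size(0)),
    target_lengths=torch.full((2,), 12),
)
```
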