ROSE: A Recognition-Oriented Speech Enhancement Framework in Air Traffic
Control Using Multi-Objective Learning
- URL: http://arxiv.org/abs/2312.06118v1
- Date: Mon, 11 Dec 2023 04:51:41 GMT
- Authors: Xincheng Yu, Dongyue Guo, Jianwei Zhang, Yi Lin
- Abstract summary: A recognition-oriented speech enhancement (ROSE) framework is proposed to improve speech intelligibility and advance ASR accuracy.
An encoder-decoder-based U-Net framework eliminates the radio speech echo using a corpus collected from real-world ATC environments.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Radio speech echo is a specific phenomenon in the air traffic control (ATC)
domain, which degrades speech quality and further impacts automatic speech
recognition (ASR) accuracy. In this work, a recognition-oriented speech
enhancement (ROSE) framework is proposed to improve speech intelligibility and
also advance ASR accuracy, which serves as a plug-and-play tool in ATC
scenarios and does not require additional retraining of the ASR model.
Specifically, an encoder-decoder-based U-Net framework is proposed to eliminate
the radio speech echo, trained on a corpus collected from real-world ATC
environments. By incorporating both SE-oriented and ASR-oriented losses, ROSE
is trained in a multi-objective manner, learning shared representations across
the two optimization objectives. An attention-based skip-fusion (ABSF) mechanism is
applied to skip connections to refine the features. A channel and sequence
attention (CSAtt) block is innovatively designed to guide the model to focus on
informative representations and suppress disturbing features. The experimental
results show that ROSE significantly outperforms other state-of-the-art
methods on both the SE and ASR tasks. In addition, the proposed approach also
yields the desired performance improvements on public datasets.
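
No reference code accompanies this summary, so the sketch below is a hypothetical PyTorch rendering of the two mechanisms the abstract names: an attention-based skip fusion that gates encoder features before merging them into the decoder, and a multi-objective loss coupling an SE term with an ASR-oriented feature term. All names (ABSF, rose_loss) and the weights alpha and beta are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ABSF(nn.Module):
    """Attention-based skip fusion (hypothetical sketch): gate the
    encoder features with a mask computed from both encoder and
    decoder features, instead of concatenating them unchanged."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv1d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, enc_feat, dec_feat):
        # enc_feat, dec_feat: (batch, channels, time)
        mask = self.gate(torch.cat([enc_feat, dec_feat], dim=1))
        return torch.cat([mask * enc_feat, dec_feat], dim=1)

def rose_loss(enhanced, clean, asr_feat_enh, asr_feat_clean,
              alpha=1.0, beta=0.5):
    """Multi-objective loss: an SE term on the signal plus an
    ASR-oriented term that pulls ASR-relevant features of the
    enhanced speech (e.g. outputs of a frozen ASR front-end) toward
    those of clean speech, so one shared representation serves both
    optimization objectives."""
    se_term = F.l1_loss(enhanced, clean)
    asr_term = F.mse_loss(asr_feat_enh, asr_feat_clean)
    return alpha * se_term + beta * asr_term
```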
Related papers
- Speech enhancement with frequency domain auto-regressive modeling (arXiv 2023-09-24)
Speech applications in far-field real-world settings often deal with signals that are corrupted by reverberation.
We propose a unified framework of speech dereverberation for improving the speech quality and the automatic speech recognition (ASR) performance.
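
That paper's exact formulation is not given in this summary; as an illustration of the technique family, here is per-frequency linear-prediction dereverberation in the classic weighted-prediction-error (WPE) style, where late reverberation is predicted from delayed STFT frames and subtracted. The name fdlp_dereverb and its defaults are assumptions.

```python
import numpy as np

def fdlp_dereverb(X, delay=3, order=10, iters=3, eps=1e-8):
    """Frequency-domain AR dereverberation of a complex STFT X with
    shape (freq_bins, frames), WPE-style: predict late reverberation
    from delayed past frames per bin and subtract it."""
    n_freq, n_frames = X.shape
    Y = np.empty_like(X)
    for f in range(n_freq):
        x = X[f]
        # Row k of Xd holds the observation delayed by (delay + k) frames.
        Xd = np.zeros((order, n_frames), dtype=X.dtype)
        for k in range(order):
            d = delay + k
            Xd[k, d:] = x[:n_frames - d]
        y = x.copy()
        for _ in range(iters):
            lam = np.maximum(np.abs(y) ** 2, eps)   # per-frame variance weights
            R = (Xd / lam) @ Xd.conj().T            # weighted covariance
            p = (Xd / lam) @ x.conj()               # weighted cross-correlation
            g = np.linalg.solve(R + eps * np.eye(order), p)
            y = x - g.conj() @ Xd                   # subtract predicted reverb
        Y[f] = y
    return Y
```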
- Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation (arXiv 2023-07-23)
This work provides an insightful investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end.
We explore multi-channel separation methods, namely mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model.
A proposed integration using TF-GridNet-based complex spectral mapping and WavLM-based SSLR achieves a 2.5% word error rate on the reverberant WHAMR! test set.
- USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval (arXiv 2023-01-17)
Image-Text Retrieval (ITR) aims at searching for the target instances that are semantically relevant to the given query from the other modality.
Existing approaches typically suffer from two major limitations.
- Leveraging Modality-specific Representations for Audio-visual Speech Recognition via Reinforcement Learning (arXiv 2022-12-10)
We propose a reinforcement learning (RL) based framework called MSRL.
We customize a reward function directly related to task-specific metrics.
Experimental results on the LRS3 dataset show that the proposed method achieves state-of-the-art performance in both clean and various noisy conditions.
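
A toy sketch of the core reward idea: tie the reward directly to the task metric (here negative WER of a sampled hypothesis) and train with REINFORCE. The functions wer and reinforce_loss and the scalar baseline are illustrative assumptions, not MSRL's actual components.

```python
import torch

def wer(ref, hyp):
    """Word error rate via edit distance; ref and hyp are word lists."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / max(len(ref), 1)

def reinforce_loss(log_prob_hyp, ref, hyp, baseline=0.0):
    """REINFORCE step: reward = -WER, so maximizing expected reward
    lowers WER. log_prob_hyp is the summed log-probability of the
    sampled hypothesis under the model."""
    reward = -wer(ref, hyp)
    return -(reward - baseline) * log_prob_hyp
```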
- Enhancing and Adversarial: Improve ASR with Speaker Labels (arXiv 2022-11-11)
We propose a novel adaptive gradient reversal layer for stable and effective adversarial training without tuning effort.
Detailed analysis and experimental verification are conducted to show the optimal positions in the ASR neural network (NN) to apply speaker enhancing and adversarial training.
Our best speaker-based MTL achieves 7% relative improvement on the Switchboard Hub5'00 set.
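
The entry's novelty is an adaptive gradient reversal layer; for grounding, here is the standard fixed-coefficient gradient reversal layer in PyTorch, on top of which such adaptive scaling would sit. The helper name grad_reverse is an assumption.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Gradient reversal: identity in the forward pass, gradients
    multiplied by -lam in the backward pass, so the shared encoder is
    trained to remove the speaker information that the adversarial
    speaker classifier exploits."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # One gradient per forward input; lam itself gets no gradient.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)
```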
- CTA-RNN: Channel and Temporal-wise Attention RNN Leveraging Pre-trained ASR Embeddings for Speech Emotion Recognition (arXiv 2022-03-31)
We propose a novel channel and temporal-wise attention RNN architecture based on the intermediate representations of pre-trained ASR models.
We evaluate our approach on two popular benchmark datasets, IEMOCAP and MSP-IMPROV.
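
The CTA-RNN architecture itself is not reproduced in this summary; below is a hypothetical sketch of the two attention axes it names, channel-wise gating (squeeze-excitation style) and temporal attention pooling, over a sequence of pre-trained ASR embeddings. ChannelTemporalAttention and reduction are assumptions.

```python
import torch
import torch.nn as nn

class ChannelTemporalAttention(nn.Module):
    """Channel- and temporal-wise attention over ASR embeddings of
    shape (batch, time, channels): gate channels from a time-averaged
    descriptor, then attention-pool over time to one utterance vector."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.channel_gate = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )
        self.time_score = nn.Linear(channels, 1)

    def forward(self, x):
        gate = self.channel_gate(x.mean(dim=1))       # (batch, channels)
        x = x * gate.unsqueeze(1)                     # re-weight channels
        w = torch.softmax(self.time_score(x), dim=1)  # (batch, time, 1)
        return (w * x).sum(dim=1)                     # (batch, channels)
```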
- ASR-Aware End-to-end Neural Diarization (arXiv 2022-02-02)
We present a Conformer-based end-to-end neural diarization (EEND) model that uses both acoustic input and features derived from an automatic speech recognition (ASR) model.
Three modifications to the Conformer-based EEND architecture are proposed to incorporate the features.
Experiments on the two-speaker English conversations of Switchboard+SRE data sets show that multi-task learning with position-in-word information is the most effective way of utilizing ASR features.
- Speech recognition for air traffic control via feature learning and end-to-end training (arXiv 2021-11-04)
We propose a new automatic speech recognition (ASR) system based on feature learning and an end-to-end training procedure for air traffic control (ATC) systems.
The proposed model integrates a feature learning block, a recurrent neural network (RNN), and a connectionist temporal classification (CTC) loss.
Thanks to its ability to learn representations from raw waveforms, the proposed model can be optimized in a completely end-to-end manner.
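
A minimal sketch of the described pipeline under assumed sizes: strided 1-D convolutions as the feature learning block over raw waveforms, an RNN encoder, and CTC training. WaveformCTCASR and all hyperparameters are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class WaveformCTCASR(nn.Module):
    def __init__(self, vocab_size, feat_dim=128):
        super().__init__()
        # Strided convolutions learn features directly from samples
        # (~25 ms windows with a 10 ms hop at 16 kHz).
        self.features = nn.Sequential(
            nn.Conv1d(1, feat_dim, kernel_size=400, stride=160),
            nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=5, stride=2),
            nn.ReLU(),
        )
        self.rnn = nn.LSTM(feat_dim, 256, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.proj = nn.Linear(512, vocab_size + 1)  # +1 for the CTC blank

    def forward(self, wav):
        # wav: (batch, samples) -> log-probs: (batch, frames, vocab+1)
        f = self.features(wav.unsqueeze(1)).transpose(1, 2)
        h, _ = self.rnn(f)
        return self.proj(h).log_softmax(dim=-1)

# Training uses nn.CTCLoss on the (frames, batch, vocab+1) layout:
# loss = nn.CTCLoss(blank=vocab_size)(logp.transpose(0, 1), targets,
#                                     frame_lens, target_lens)
```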
- Dual Causal/Non-Causal Self-Attention for Streaming End-to-End Speech Recognition (arXiv 2021-07-02)
Attention-based end-to-end automatic speech recognition (ASR) systems have recently demonstrated state-of-the-art results for numerous tasks.
However, the application of self-attention and attention-based encoder-decoder models remains challenging for streaming ASR.
We present the dual causal/non-causal self-attention architecture, which, in contrast to restricted self-attention, prevents the overall context from growing beyond the look-ahead of a single layer.
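
To make the look-ahead issue concrete: with restricted self-attention, each layer adds its own look-ahead, so L stacked layers with look-ahead a expose L * a future frames in total. Below is a small sketch of the two mask types such a dual design pairs; helper names are assumptions, and this illustrates the masking concept rather than the paper's architecture.

```python
import torch

def causal_mask(T):
    """Frame t may attend only to frames <= t (streaming-safe branch)."""
    return torch.tril(torch.ones(T, T, dtype=torch.bool))

def lookahead_mask(T, lookahead):
    """Frame t may attend up to frame t + lookahead. Chaining L such
    restricted layers compounds to L * lookahead future frames; the
    dual design pairs this branch with the causal one to keep the
    overall context bounded by a single layer's look-ahead."""
    return torch.tril(torch.ones(T, T, dtype=torch.bool),
                      diagonal=lookahead)
```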
- Directional ASR: A New Paradigm for E2E Multi-Speaker Speech Recognition with Source Localization (arXiv 2020-10-30)
This paper proposes a new paradigm for handling far-field multi-speaker data in an end-to-end neural network manner, called directional automatic speech recognition (D-ASR).
In D-ASR, the azimuth angle of the sources with respect to the microphone array is defined as a latent variable. This angle controls the quality of separation, which in turn determines the ASR performance.
- Characterizing Speech Adversarial Examples Using Self-Attention U-Net Enhancement (arXiv 2020-03-31)
We present a U-Net based attention model, U-Net$_{At}$, to enhance adversarial speech signals.
We conduct experiments on the automatic speech recognition (ASR) task with adversarial audio attacks.
This list is automatically generated from the titles and abstracts of the papers on this site.