How Does Pre-trained Wav2Vec2.0 Perform on Domain Shifted ASR? An
Extensive Benchmark on Air Traffic Control Communications
- URL: http://arxiv.org/abs/2203.16822v1
- Date: Thu, 31 Mar 2022 06:10:42 GMT
- Title: How Does Pre-trained Wav2Vec2.0 Perform on Domain Shifted ASR? An
Extensive Benchmark on Air Traffic Control Communications
- Authors: Juan Zuluaga-Gomez, Amrutha Prasad, Iuliia Nigmatulina, Saeed Sarfjoo,
Petr Motlicek, Matthias Kleinert, Hartmut Helmke, Oliver Ohneiser, Qingran
Zhan
- Abstract summary: We study the impact on performance when the data substantially differs between the pre-training and downstream fine-tuning phases.
We benchmark the proposed models on four challenging ATC test sets.
We also study the impact of fine-tuning data size on WERs, going from 5 minutes (few-shot) to 15 hours.
- Score: 1.3800173438685746
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent work on self-supervised pre-training focuses on leveraging
large-scale unlabeled speech data to build robust end-to-end (E2E) acoustic
models (AMs) that can later be fine-tuned on downstream tasks, e.g., automatic
speech recognition (ASR). Yet, few works have investigated the impact on
performance when the data substantially differs between the pre-training and
downstream fine-tuning phases (i.e., domain shift). We target this scenario by
analyzing the robustness of Wav2Vec2.0 and XLS-R models on downstream ASR for
a completely unseen domain: air traffic control (ATC) communications. We
benchmark the proposed models on four challenging ATC test sets
(signal-to-noise ratios vary between 5 and 20 dB). Relative word error rate
(WER) reductions of 20% to 40% are obtained over hybrid-based state-of-the-art
ASR baselines by fine-tuning the E2E acoustic models with a small fraction of
labeled data. We also study the impact of fine-tuning data size on WERs, going
from 5 minutes (few-shot) to 15 hours.
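As a concrete illustration of the recipe the abstract describes, below is a minimal sketch of CTC fine-tuning with the Hugging Face transformers API, plus the relative-WER arithmetic behind the reported 20-40% reductions. The checkpoint name, learning rate, and single-utterance training step are illustrative assumptions, not the authors' exact configuration (the ATC corpora are not included here).

```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Pre-trained checkpoint; the paper studies Wav2Vec2.0/XLS-R variants,
# so this specific model name is an assumption for the sketch.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
model.freeze_feature_encoder()  # common practice with little labeled data

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def fine_tune_step(waveform, transcript):
    """One CTC fine-tuning step on a single (audio, text) pair."""
    inputs = processor(waveform, sampling_rate=16_000, return_tensors="pt")
    labels = processor(text=transcript, return_tensors="pt").input_ids
    loss = model(inputs.input_values, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Relative WER reduction as reported in the abstract:
def relative_wer_reduction(wer_baseline, wer_finetuned):
    return (wer_baseline - wer_finetuned) / wer_baseline

# e.g. dropping from 20% to 14% WER is a 30% relative reduction,
# inside the 20%-40% range the paper reports against hybrid baselines.
assert abs(relative_wer_reduction(0.20, 0.14) - 0.30) < 1e-9
```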
Related papers
- Exploring Pathological Speech Quality Assessment with ASR-Powered Wav2Vec2 in Data-Scarce Context [7.567181073057191]
This paper introduces a novel approach where the system learns at the audio level instead of the segment level, despite data scarcity.
It shows that the ASR based Wav2Vec2 model brings the best results and may indicate a strong correlation between ASR and speech quality assessment.
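A minimal sketch of the audio-level idea as summarized above: use an ASR-trained Wav2Vec2 as a frozen feature extractor over the whole utterance and regress a quality score on top. The checkpoint, mean pooling, and linear head are illustrative assumptions, not the authors' exact setup.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h").eval()
quality_head = torch.nn.Linear(encoder.config.hidden_size, 1)  # one score

def predict_quality(waveform):
    inputs = extractor(waveform, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        frames = encoder(inputs.input_values).last_hidden_state  # (1, T, H)
    utterance = frames.mean(dim=1)  # pool frames into one audio-level vector
    return quality_head(utterance)  # train with e.g. MSE against ratings
```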
arXiv Detail & Related papers (2024-03-29T13:59:34Z)
- Semi-Autoregressive Streaming ASR With Label Context [70.76222767090638]
We propose a streaming "semi-autoregressive" ASR model that incorporates the labels emitted in previous blocks as additional context.
Experiments show that our method outperforms the existing streaming NAR model by 19% relative on Tedlium2, 16%/8% on Librispeech-100 clean/other test sets, and 19%/8% on the Switchboard(SWB)/Callhome(CH) test sets.
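A rough sketch of the label-context mechanism described above, where decode_block is a hypothetical stand-in for the actual model: each block is decoded non-autoregressively, but tokens emitted in earlier blocks are fed back in as context.

```python
def streaming_decode(audio_blocks, decode_block):
    """Block-wise 'semi-autoregressive' decoding (conceptual sketch)."""
    label_context = []  # tokens emitted in all previous blocks
    for block in audio_blocks:
        tokens = decode_block(block, context=label_context)  # NAR within block
        label_context.extend(tokens)  # autoregressive across blocks
    return label_context
```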
arXiv Detail & Related papers (2023-09-19T20:55:58Z)
- Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels [100.43280310123784]
We investigate the use of automatically-generated transcriptions of unlabelled datasets to increase the training set size.
We demonstrate that increasing the size of the training set, a recent trend in the literature, leads to reduced WER despite using noisy transcriptions.
The proposed model achieves new state-of-the-art performance on AV-ASR on LRS2 and LRS3.
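The recipe reduces to a simple pseudo-labeling loop; a sketch with a hypothetical transcribe() placeholder standing in for the seed ASR model:

```python
def expand_training_set(labelled_pairs, unlabelled_clips, transcribe):
    """Add automatically transcribed clips to the training set."""
    pseudo_pairs = [(clip, transcribe(clip)) for clip in unlabelled_clips]
    return labelled_pairs + pseudo_pairs  # larger, but with noisy transcripts
```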
arXiv Detail & Related papers (2023-03-25T00:37:34Z)
- Multiple-hypothesis RNN-T Loss for Unsupervised Fine-tuning and Self-training of Neural Transducer [20.8850874806462]
This paper proposes a new approach to perform unsupervised fine-tuning and self-training using unlabeled speech data.
For both the fine-tuning and self-training tasks, ASR models are trained using supervised data from Wall Street Journal (WSJ) and Aurora-4, along with CHiME-4 real noisy data as unlabeled data.
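A sketch of the multiple-hypothesis idea under stated assumptions: instead of trusting only the 1-best transcript of unlabeled audio, average the RNN-T loss over an N-best list from a seed model. rnnt_loss and the N-best decoder are hypothetical placeholders.

```python
def multi_hypothesis_loss(audio, nbest_hypotheses, rnnt_loss):
    """Average transducer loss over N-best pseudo-labels (conceptual sketch)."""
    losses = [rnnt_loss(audio, hyp) for hyp in nbest_hypotheses]
    return sum(losses) / len(losses)
```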
arXiv Detail & Related papers (2022-07-29T15:14:03Z)
- Vision in adverse weather: Augmentation using CycleGANs with various object detectors for robust perception in autonomous racing [70.16043883381677]
In autonomous racing, the weather can change abruptly, causing significant degradation in perception, resulting in ineffective manoeuvres.
In order to improve detection in adverse weather, deep-learning-based models typically require extensive datasets captured in such conditions.
We introduce an approach of using synthesised adverse condition datasets in autonomous racing (generated using CycleGAN) to improve the performance of four out of five state-of-the-art detectors.
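The augmentation step amounts to translating clear-weather frames through a trained CycleGAN generator and mixing them into the detector's training set; a sketch with hypothetical generator and image placeholders:

```python
def augment_with_cyclegan(clear_images, generator):
    """Synthesize adverse-condition images and mix them with the originals."""
    adverse = [generator(img) for img in clear_images]  # e.g. clear -> rain
    return clear_images + adverse  # mixed set for training the detector
```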
arXiv Detail & Related papers (2022-01-10T10:02:40Z)
- BERTraffic: A Robust BERT-Based Approach for Speaker Change Detection and Role Identification of Air-Traffic Communications [2.270534915073284]
When the Speech Activity Detection (SAD) or diarization system fails, two or more single-speaker segments can end up in the same recording.
We developed a system that combines the segmentation of a SAD module with a BERT-based model that performs Speaker Change Detection (SCD) and Speaker Role Identification (SRI) based on ASR transcripts (i.e., diarization + SRI).
The proposed model reaches up to 0.90/0.95 F1-score on ATCO/pilot for SRI on several test sets.
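The pipeline as summarized above, sketched with hypothetical sad, asr, and bert_scd_sri components: SAD segments the recording, ASR transcribes each segment, and the BERT model splits at speaker changes and tags each side as ATCO or pilot.

```python
def diarize_and_identify(recording, sad, asr, bert_scd_sri):
    """SAD segmentation + ASR + BERT-based SCD/SRI (conceptual sketch)."""
    results = []
    for segment in sad(recording):  # a segment may still hold two speakers
        transcript = asr(segment)
        for chunk, role in bert_scd_sri(transcript):  # role: "ATCO" | "pilot"
            results.append((chunk, role))
    return results
```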
arXiv Detail & Related papers (2021-10-12T07:25:12Z)
- Wav2vec-S: Semi-Supervised Pre-Training for Speech Recognition [44.347739529374124]
Self-supervised pre-training has dramatically improved the performance of automatic speech recognition (ASR).
Most existing self-supervised pre-training approaches are task-agnostic, i.e., they can be applied to various downstream tasks.
We propose a novel pre-training paradigm called wav2vec-S, where we use task-specific semi-supervised pre-training to bridge this gap.
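One way to read "task-specific semi-supervised pre-training" is a joint objective: unlabeled batches keep the self-supervised loss, while labeled batches add a supervised ASR term. The loss functions and weighting below are illustrative assumptions, not the paper's exact formulation.

```python
def wav2vec_s_step(batch, ssl_loss, ctc_loss, alpha=0.5):
    """Semi-supervised pre-training step (conceptual sketch)."""
    loss = ssl_loss(batch["audio"])  # contrastive/masked-prediction objective
    if batch.get("labels") is not None:  # labeled subset of the corpus
        loss = loss + alpha * ctc_loss(batch["audio"], batch["labels"])
    return loss
```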
arXiv Detail & Related papers (2021-10-09T07:09:22Z)
- Prediction of Traffic Flow via Connected Vehicles [77.11902188162458]
We propose a Short-term Traffic flow Prediction framework so that transportation authorities can take early action to control flow and prevent congestion.
We anticipate flow at future time frames on a target road segment based on historical flow data and innovative features such as real-time feeds and trajectory data provided by Connected Vehicles (CV) technology.
We show how this novel approach allows advanced modelling by integrating, into the forecasting of flow, the impact of various events that CVs realistically encounter on segments along their trajectories.
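A sketch of the forecasting setup under stated assumptions: historical flow on the target segment plus real-time Connected Vehicle features feed a standard regressor. The feature names and model choice are illustrative, not the authors' exact design.

```python
from sklearn.ensemble import GradientBoostingRegressor

def build_features(history, cv_feed):
    """Combine historical flow with real-time CV-derived features."""
    return [
        history["flow_last_hour"],       # past flow on the target segment
        cv_feed["mean_speed"],           # real-time CV speeds on the segment
        cv_feed["upstream_event_flag"],  # event a CV met along its trajectory
    ]

model = GradientBoostingRegressor()  # fit on (features, future_flow) pairs
```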
arXiv Detail & Related papers (2020-07-10T16:00:44Z)
- Automatic Speech Recognition Benchmark for Air-Traffic Communications [1.175956452196938]
CleanSky EC-H2020 ATCO2 aims to develop an ASR-based platform to collect, organize, and automatically pre-process ATCo speech data from airspace.
Errors due to speakers' accents are minimized by the sheer amount of data, making the system feasible for ATC environments.
arXiv Detail & Related papers (2020-06-18T06:49:22Z)
- You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.
Our system establishes a competitive result for end-to-end ASR trained on LibriSpeech train-clean-100 set with WER 4.3% for test-clean and 13.5% for test-other.
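The augmentation loop reduces to synthesizing speech for extra transcripts and mixing it with the real corpus; a sketch with a hypothetical tts() placeholder for the authors' synthesizer:

```python
def augment_with_tts(real_pairs, extra_texts, tts):
    """Extend an ASR corpus with synthesized (audio, text) pairs."""
    synthetic = [(tts(text), text) for text in extra_texts]
    return real_pairs + synthetic  # mixed corpus for end-to-end ASR training
```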
arXiv Detail & Related papers (2020-05-14T17:24:57Z)
- Towards a Competitive End-to-End Speech Recognition for CHiME-6 Dinner Party Transcription [73.66530509749305]
In this paper, we argue that, even in difficult cases, some end-to-end approaches show performance close to the hybrid baseline.
We experimentally compare and analyze CTC-Attention versus RNN-Transducer approaches along with RNN versus Transformer architectures.
Our best end-to-end model, based on the RNN-Transducer together with improved beam search, reaches quality only 3.8% absolute WER worse than the LF-MMI TDNN-F CHiME-6 Challenge baseline.
arXiv Detail & Related papers (2020-04-22T19:08:33Z)