ChannelAugment: Improving generalization of multi-channel ASR by
training with input channel randomization
- URL: http://arxiv.org/abs/2109.11225v1
- Date: Thu, 23 Sep 2021 09:13:47 GMT
- Title: ChannelAugment: Improving generalization of multi-channel ASR by
training with input channel randomization
- Authors: Marco Gaudesi, Felix Weninger, Dushyant Sharma, Puming Zhan
- Abstract summary: End-to-end (E2E) multi-channel ASR systems show state-of-the-art performance in far-field ASR tasks.
The main limitation of such systems is that they are usually trained with data from a fixed array geometry.
We present a simple and effective data augmentation technique, which is based on randomly dropping channels in the multi-channel audio input during training.
- Score: 6.42706307642403
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end (E2E) multi-channel ASR systems show state-of-the-art performance
in far-field ASR tasks by joint training of a multi-channel front-end along
with the ASR model. The main limitation of such systems is that they are
usually trained with data from a fixed array geometry, which can lead to
degradation in accuracy when a different array is used in testing. This makes
it challenging to deploy these systems in practice, as it is costly to retrain
and deploy different models for various array configurations. To address this,
we present a simple and effective data augmentation technique, which is based
on randomly dropping channels in the multi-channel audio input during training,
in order to improve the robustness to various array configurations at test
time. We call this technique ChannelAugment, in contrast to SpecAugment (SA),
which drops time and/or frequency components of single-channel input audio.
We apply ChannelAugment to the Spatial Filtering (SF) and Minimum Variance
Distortionless Response (MVDR) neural beamforming approaches. For SF, we
observe 10.6% WER improvement across various array configurations employing
different numbers of microphones. For MVDR, we achieve a 74% reduction in
training time without causing degradation of recognition accuracy.
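The augmentation itself is simple enough to state in a few lines of code. The following is a minimal sketch of training-time channel randomization, assuming the input is a (channels, samples) array; the uniform sampling policy and the `min_channels` parameter are illustrative assumptions, not details taken from the paper.

```python
import numpy as np
from typing import Optional

def channel_augment(x: np.ndarray, min_channels: int = 1,
                    rng: Optional[np.random.Generator] = None) -> np.ndarray:
    """ChannelAugment sketch: randomly drop input channels during training.

    x: multi-channel audio of shape (num_channels, num_samples).
    Returns the utterance with a random subset of channels kept in their
    original order, so the neural front-end sees a different virtual
    array geometry on every training example.
    """
    rng = rng or np.random.default_rng()
    num_channels = x.shape[0]
    # Sample how many channels to keep, then which ones to keep.
    n_keep = int(rng.integers(min_channels, num_channels + 1))
    keep = np.sort(rng.choice(num_channels, size=n_keep, replace=False))
    return x[keep]

# Example: an 8-channel utterance becomes a random sub-array on each call.
utterance = np.random.randn(8, 16000)
print(channel_augment(utterance).shape)  # e.g. (3, 16000)
```

Applied per utterance in the data loader, this exposes the front-end to many effective array geometries during training, which is what the abstract credits for robustness to unseen arrays at test time; processing fewer channels per example would also plausibly account for the reported MVDR training-time reduction.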
Related papers
- Self-Supervised Learning for Multi-Channel Neural Transducer [3.045851438458641]
We explore a self-supervised learning method for a multi-channel end-to-end ASR model based on the wav2vec 2.0 framework.
We observed a 66% relative reduction in character error rate compared with the model without any pre-training for the far-field in-house dataset.
arXiv Detail & Related papers (2024-08-06T04:12:31Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- Joint Channel Estimation and Feedback with Masked Token Transformers in Massive MIMO Systems [74.52117784544758]
This paper proposes an encoder-decoder based network that unveils the intrinsic frequency-domain correlation within the CSI matrix.
The entire encoder-decoder network is utilized for channel compression.
Our method outperforms state-of-the-art channel estimation and feedback techniques in joint tasks.
arXiv Detail & Related papers (2023-06-08T06:15:17Z)
- High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks.
It consists of a streaming encoder-decoder architecture with a quantized latent space trained in an end-to-end fashion.
We simplify and speed up the training by using a single multiscale spectrogram adversary.
arXiv Detail & Related papers (2022-10-24T17:52:02Z)
- Disentangled Representation Learning for RF Fingerprint Extraction under Unknown Channel Statistics [77.13542705329328]
We propose a disentangled representation learning (DRL) framework that first learns to factor the input signals into a device-relevant component and a device-irrelevant component via adversarial learning.
The implicit data augmentation in the proposed framework imposes a regularization on the RFF extractor to avoid possible overfitting to device-irrelevant channel statistics.
Experiments validate that the proposed approach, referred to as DR-RFF, outperforms conventional methods in terms of generalizability to unknown complicated propagation environments.
arXiv Detail & Related papers (2022-08-04T15:46:48Z)
- Omni-frequency Channel-selection Representations for Unsupervised Anomaly Detection [11.926787216956459]
We propose a novel Omni-frequency Channel-selection Reconstruction (OCR-GAN) network that handles the anomaly detection task from a frequency perspective.
We show that our approach markedly surpasses the reconstruction-based baseline by +38.1 and the current SOTA method by +0.3.
arXiv Detail & Related papers (2022-03-01T06:35:15Z)
- Multi-Channel End-to-End Neural Diarization with Distributed Microphones [53.99406868339701]
We replace Transformer encoders in EEND with two types of encoders that process a multi-channel input.
We also propose a model adaptation method using only single-channel recordings.
arXiv Detail & Related papers (2021-10-10T03:24:03Z)
- Self-Attention Channel Combinator Frontend for End-to-End Multichannel Far-field Speech Recognition [1.0276024900942875]
When sufficiently large far-field training data is available, jointly optimizing a multichannel frontend and an end-to-end (E2E) Automatic Speech Recognition (ASR) backend shows promising results.
Recent literature has shown that traditional beamformer designs, such as MVDR (Minimum Variance Distortionless Response) or fixed beamformers, can be successfully integrated into an E2E ASR system with learnable parameters.
We propose the Self-Attention Channel Combinator (SACC) ASR frontend, which leverages the self-attention mechanism to combine multichannel audio signals in the magnitude spectral domain (a rough sketch of this idea follows the list below).
arXiv Detail & Related papers (2021-09-10T11:03:43Z)
- FedRec: Federated Learning of Universal Receivers over Fading Channels [92.15358738530037]
We propose a neural network-based symbol detection technique for downlink fading channels.
Multiple users collaborate to jointly learn a universal data-driven detector, hence the name FedRec.
The performance of the resulting receiver is shown to approach the MAP performance in diverse channel conditions without requiring knowledge of the fading statistics.
arXiv Detail & Related papers (2020-11-14T11:29:55Z)
- Robust Multi-channel Speech Recognition using Frequency Aligned Network [23.397670239950187]
We use a frequency aligned network for robust automatic speech recognition.
We show that our multi-channel acoustic model with a frequency aligned network shows up to 18% relative reduction in word error rate.
arXiv Detail & Related papers (2020-02-06T21:47:39Z)
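The Self-Attention Channel Combinator entry above states only the high-level mechanism, so here is a rough sketch of attention-based channel combination in the magnitude spectral domain, as referenced in that entry; the mean-pooled query, the projections `w_k`/`w_q`, and all shapes are illustrative assumptions, not the published SACC architecture.

```python
import numpy as np

def softmax(z: np.ndarray, axis: int) -> np.ndarray:
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sacc_combine(mag: np.ndarray, w_k: np.ndarray, w_q: np.ndarray) -> np.ndarray:
    """Attention-weighted combination of per-channel magnitude spectra.

    mag: (C, T, F) magnitude spectrogram for C channels, T frames, F bins.
    w_k, w_q: (F, D) projections (learned in a real model, random here).
    Returns a single (T, F) combined spectrogram.
    """
    keys = mag @ w_k                               # (C, T, D) per-channel keys
    query = mag.mean(axis=0) @ w_q                 # (T, D) array-wide query
    scores = np.einsum('ctd,td->ct', keys, query) / np.sqrt(w_k.shape[1])
    attn = softmax(scores, axis=0)                 # channel weights per frame
    return np.einsum('ct,ctf->tf', attn, mag)      # weighted sum over channels

# Example with random data: 4 channels, 100 frames, 257 frequency bins.
rng = np.random.default_rng(0)
mag = np.abs(rng.standard_normal((4, 100, 257)))
w_k, w_q = rng.standard_normal((257, 64)), rng.standard_normal((257, 64))
print(sacc_combine(mag, w_k, w_q).shape)  # (100, 257)
```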
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.