Self-Supervised Learning for Multi-Channel Neural Transducer
- URL: http://arxiv.org/abs/2408.02945v1
- Date: Tue, 6 Aug 2024 04:12:31 GMT
- Title: Self-Supervised Learning for Multi-Channel Neural Transducer
- Authors: Atsushi Kojima
- Abstract summary: We explore a self-supervised learning method for a multi-channel end-to-end ASR model based on the wav2vec 2.0 framework.
We observed a 66% relative reduction in character error rate compared with the model without any pre-training for the far-field in-house dataset.
- Score: 3.045851438458641
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised learning, such as with the wav2vec 2.0 framework, significantly improves the accuracy of end-to-end automatic speech recognition (ASR). Wav2vec 2.0 has been applied to single-channel end-to-end ASR models. In this work, we explored a self-supervised learning method for a multi-channel end-to-end ASR model based on the wav2vec 2.0 framework. As the multi-channel end-to-end ASR model, we focused on a multi-channel neural transducer. In pre-training, we compared three different methods of feature quantization to train a multi-channel conformer audio encoder: joint quantization, feature-wise quantization, and channel-wise quantization. In fine-tuning, we trained the multi-channel conformer-transducer. All experiments were conducted using the far-field in-house and CHiME-4 datasets. The experiments showed that feature-wise quantization was the most effective of the three methods. We observed a 66% relative reduction in character error rate compared with the model without any pre-training for the far-field in-house dataset.
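The abstract does not spell out how the three quantization variants differ, so the following PyTorch sketch is only one plausible reading: a wav2vec 2.0-style Gumbel-softmax quantizer applied jointly over channel-concatenated features, to channel-pooled features ("feature-wise"), or independently per channel ("channel-wise"). All module names and dimensions are illustrative assumptions, not the paper's.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelQuantizer(nn.Module):
    """Minimal wav2vec 2.0-style quantizer: pick one code vector per frame
    via a straight-through Gumbel-softmax (single codebook group)."""
    def __init__(self, dim, num_codes=320):
        super().__init__()
        self.logits = nn.Linear(dim, num_codes)       # code-selection logits
        self.codebook = nn.Embedding(num_codes, dim)  # learnable code vectors

    def forward(self, x, tau=2.0):  # x: (batch, frames, dim)
        onehot = F.gumbel_softmax(self.logits(x), tau=tau, hard=True)
        return onehot @ self.codebook.weight  # quantized targets, same shape

B, T, C, D = 2, 50, 4, 64    # batch, frames, microphone channels, feature dim
x = torch.randn(B, T, C, D)  # hypothetical per-channel encoder features

# (a) Joint quantization: one quantizer over channel-concatenated features.
q_joint = GumbelQuantizer(C * D)(x.reshape(B, T, C * D))

# (b) Feature-wise quantization: quantize a channel-pooled feature once.
q_feat = GumbelQuantizer(D)(x.mean(dim=2))

# (c) Channel-wise quantization: an independent quantizer per channel.
quants = nn.ModuleList(GumbelQuantizer(D) for _ in range(C))
q_chan = torch.stack([quants[c](x[:, :, c]) for c in range(C)], dim=2)

print(q_joint.shape, q_feat.shape, q_chan.shape)
```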
Related papers
- Mutual Learning of Single- and Multi-Channel End-to-End Neural Diarization [34.65357110940456]
This paper focuses on speaker diarization and proposes to conduct bi-directional knowledge transfer between single- and multi-channel models alternately.
We introduce an end-to-end neural diarization model that can handle both single- and multi-channel inputs.
Experimental results on two-speaker data show that the proposed method mutually improved single- and multi-channel speaker diarization performances.
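As a rough illustration of alternating bi-directional knowledge transfer, here is a hedged PyTorch sketch that alternates distillation between a single-channel and a multi-channel model; the models, losses, and schedule are placeholder assumptions, not the paper's recipe.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-ins for the single- and multi-channel diarization
# models: both map frame features to per-frame speaker-activity logits.
single_ch = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, 2))
multi_ch = nn.Sequential(nn.Linear(4 * 40, 64), nn.ReLU(), nn.Linear(64, 2))
opt_s = torch.optim.Adam(single_ch.parameters(), lr=1e-3)
opt_m = torch.optim.Adam(multi_ch.parameters(), lr=1e-3)

def kd_step(student, teacher, x_s, x_t, labels, opt, alpha=0.5):
    """One transfer step: supervised BCE plus distillation toward the
    frozen teacher's soft speaker-activity posteriors."""
    with torch.no_grad():
        soft = torch.sigmoid(teacher(x_t))
    logits = student(x_s)
    loss = (1 - alpha) * F.binary_cross_entropy_with_logits(logits, labels) \
         + alpha * F.binary_cross_entropy_with_logits(logits, soft)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

x_multi = torch.randn(8, 100, 4 * 40)  # 4-mic features, flattened per frame
x_single = x_multi[:, :, :40]          # reference-channel features
labels = torch.randint(0, 2, (8, 100, 2)).float()  # 2-speaker activities

for step in range(10):  # alternate the direction of knowledge transfer
    if step % 2 == 0:
        kd_step(single_ch, multi_ch, x_single, x_multi, labels, opt_s)
    else:
        kd_step(multi_ch, single_ch, x_multi, x_single, labels, opt_m)
```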
arXiv Detail & Related papers (2022-10-07T11:03:32Z)
- End-to-End Multi-speaker ASR with Independent Vector Analysis [80.83577165608607]
We develop an end-to-end system for multi-channel, multi-speaker automatic speech recognition.
We propose a joint source separation and dereverberation approach based on the independent vector analysis (IVA) paradigm.
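For readers unfamiliar with IVA, below is a minimal NumPy sketch of a standard AuxIVA iteration (spherical Laplacian source model, auxiliary-function updates); it omits the dereverberation component of the proposed approach.
```python
import numpy as np

def auxiva(X, n_iter=20):
    """Minimal AuxIVA with a spherical Laplacian source model.
    X: STFT tensor (freqs, frames, mics); returns separated spectra
    of shape (freqs, frames, sources), with sources == mics."""
    F_, T, M = X.shape
    W = np.tile(np.eye(M, dtype=complex), (F_, 1, 1))  # demixing per frequency
    for _ in range(n_iter):
        Y = np.einsum('fkm,ftm->ftk', W, X)                 # current estimates
        r = np.sqrt((np.abs(Y) ** 2).sum(axis=0)) + 1e-10   # (frames, sources)
        for k in range(M):
            # Weighted spatial covariance for source k (auxiliary variable).
            V = np.einsum('t,ftm,ftn->fmn', 1.0 / r[:, k], X, X.conj()) / T
            for f in range(F_):
                w = np.linalg.solve(W[f] @ V[f], np.eye(M)[:, k].astype(complex))
                w /= np.sqrt((w.conj() @ V[f] @ w).real) + 1e-10
                W[f, k] = w.conj()                           # row k holds w^H
    return np.einsum('fkm,ftm->ftk', W, X)

# Toy usage: 2 mics, 129 frequency bins, 200 frames of random "STFT" data.
X = np.random.randn(129, 200, 2) + 1j * np.random.randn(129, 200, 2)
print(auxiva(X).shape)
```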
arXiv Detail & Related papers (2022-04-01T05:45:33Z)
- Multi-Channel End-to-End Neural Diarization with Distributed Microphones [53.99406868339701]
We replace Transformer encoders in EEND with two types of encoders that process a multi-channel input.
We also propose a model adaptation method using only single-channel recordings.
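A hedged PyTorch sketch of what a multi-channel Transformer encoder could look like: attention over time within each channel, then attention across channels within each frame, with channel-averaged pooling so the output is invariant to the number of microphones. The paper's actual encoders may be structured differently.
```python
import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    """Hypothetical multi-channel encoder block: self-attention across time
    within each channel, then across channels within each frame."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.chan_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                  # x: (batch, channels, time, dim)
        B, C, T, D = x.shape
        h = x.reshape(B * C, T, D)         # attend over time, per channel
        h = h + self.time_attn(h, h, h)[0]
        h = h.reshape(B, C, T, D).transpose(1, 2).reshape(B * T, C, D)
        h = h + self.chan_attn(h, h, h)[0]  # attend over channels, per frame
        return h.reshape(B, T, C, D).transpose(1, 2)

x = torch.randn(2, 4, 100, 64)    # 4 distributed microphones
out = SpatioTemporalBlock()(x).mean(dim=1)  # channel-invariant pooling
print(out.shape)                  # (2, 100, 64)
```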
arXiv Detail & Related papers (2021-10-10T03:24:03Z)
- ChannelAugment: Improving generalization of multi-channel ASR by training with input channel randomization [6.42706307642403]
End-to-end (E2E) multi-channel ASR systems show state-of-the-art performance in far-field ASR tasks.
The main limitation of such systems is that they are usually trained on data from a fixed array geometry.
We present a simple and effective data augmentation technique, which is based on randomly dropping channels in the multi-channel audio input during training.
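The core idea fits in a few lines of code. A minimal PyTorch sketch (the paper's exact dropping policy may differ):
```python
import torch

def channel_augment(x, min_keep=1):
    """Randomly zero out microphone channels during training, independently
    per utterance. x: (batch, channels, samples)."""
    batch, n_ch, _ = x.shape
    out = x.clone()
    for b in range(batch):
        n_keep = torch.randint(min_keep, n_ch + 1, (1,)).item()
        dropped = torch.randperm(n_ch)[n_keep:]  # channels to drop
        out[b, dropped] = 0.0
    return out

x = torch.randn(8, 4, 16000)     # 8 utterances, 4 mics, 1 s at 16 kHz
print(channel_augment(x).shape)  # same shape, random channels zeroed
```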
arXiv Detail & Related papers (2021-09-23T09:13:47Z)
- Self-Attention Channel Combinator Frontend for End-to-End Multichannel Far-field Speech Recognition [1.0276024900942875]
Given sufficiently large far-field training data, jointly optimizing a multichannel frontend and an end-to-end (E2E) automatic speech recognition (ASR) backend shows promising results.
Recent literature has shown that traditional beamformer designs, such as MVDR (minimum variance distortionless response) or fixed beamformers, can be successfully integrated into an E2E ASR system with learnable parameters.
We propose the self-attention channel combinator (SACC) ASR frontend, which leverages the self-attention mechanism to combine multichannel audio signals in the magnitude spectral domain.
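A hedged PyTorch sketch of an SACC-style frontend, assuming self-attention scores across the channel axis of magnitude spectra are pooled into per-channel combination weights; the weighting scheme here is illustrative, not the paper's exact design.
```python
import torch
import torch.nn as nn

class SelfAttentionChannelCombinator(nn.Module):
    """Illustrative SACC-like frontend: attention across channels yields
    per-channel weights for combining magnitude spectra."""
    def __init__(self, n_bins=257):
        super().__init__()
        self.query = nn.Linear(n_bins, n_bins)
        self.key = nn.Linear(n_bins, n_bins)

    def forward(self, mag):                 # mag: (batch, channels, time, bins)
        B, C, T, F_ = mag.shape
        x = mag.permute(0, 2, 1, 3).reshape(B * T, C, F_)  # attend over channels
        scores = self.query(x) @ self.key(x).transpose(1, 2) / F_ ** 0.5
        w = scores.mean(dim=2).softmax(dim=-1)             # (B*T, C) weights
        combined = (w.unsqueeze(-1) * x).sum(dim=1)        # weighted channel sum
        return combined.reshape(B, T, F_)

# Toy usage: batch of 2 utterances, 4 mics, magnitude STFT features.
sig = torch.randn(2 * 4, 16000)
mags = torch.stft(sig, n_fft=512, window=torch.hann_window(512),
                  return_complex=True).abs()
mags = mags.reshape(2, 4, 257, -1).transpose(2, 3)  # (batch, ch, time, bins)
print(SelfAttentionChannelCombinator()(mags).shape)
```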
arXiv Detail & Related papers (2021-09-10T11:03:43Z)
- Learning to Perform Downlink Channel Estimation in Massive MIMO Systems [72.76968022465469]
We study downlink (DL) channel estimation in a massive multiple-input multiple-output (MIMO) system.
A common approach is to use the mean value as the estimate, motivated by channel hardening.
We propose two novel estimation methods.
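A tiny NumPy demonstration of channel hardening, the effect that motivates the mean-value estimate: the normalized channel gain concentrates around its mean as the number of antennas grows.
```python
import numpy as np

rng = np.random.default_rng(0)
for m in (8, 64, 512):  # number of base-station antennas
    # i.i.d. Rayleigh channels h ~ CN(0, I_m), many realizations
    h = (rng.standard_normal((10000, m))
         + 1j * rng.standard_normal((10000, m))) / np.sqrt(2)
    g = np.linalg.norm(h, axis=1) ** 2 / m  # normalized channel gain
    print(f"m={m:4d}  mean={g.mean():.3f}  std={g.std():.3f}")
# The std shrinks like 1/sqrt(m): the instantaneous gain "hardens" around
# its mean, so the mean value becomes a reasonable DL channel estimate.
```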
arXiv Detail & Related papers (2021-09-06T13:42:32Z)
- Neural Calibration for Scalable Beamforming in FDD Massive MIMO with Implicit Channel Estimation [10.775558382613077]
Channel estimation and beamforming play critical roles in frequency-division duplexing (FDD) massive multiple-input multiple-output (MIMO) systems.
We propose a deep learning-based approach that directly optimizes the beamformers at the base station according to the received uplink pilots.
A neural calibration method is proposed to improve the scalability of the end-to-end design.
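As context, a minimal PyTorch sketch of the plain end-to-end mapping from received uplink pilots to downlink beamformers that such designs learn; the paper's neural calibration adds structure on top of a mapping like this, and all sizes and layer choices here are hypothetical.
```python
import torch
import torch.nn as nn

n_ant, n_users, pilot_len = 64, 4, 8  # illustrative system dimensions

net = nn.Sequential(  # real/imag pilot parts in, beamformer parts out
    nn.Linear(2 * n_ant * pilot_len, 512), nn.ReLU(),
    nn.Linear(512, 2 * n_ant * n_users),
)

y = torch.randn(16, 2 * n_ant * pilot_len)      # received pilot batch
w = net(y).reshape(16, n_users, n_ant, 2)       # per-user beamformers
w = torch.view_as_complex(w.contiguous())
w = w / w.norm(dim=-1, keepdim=True)            # unit-power constraint
print(w.shape)                                  # (16, 4, 64)
```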
arXiv Detail & Related papers (2021-08-03T14:26:14Z)
- Attention-based Neural Beamforming Layers for Multi-channel Speech Recognition [17.009051842682677]
We propose a 2D Conv-Attention module which combines convolutional neural networks with attention for beamforming.
We apply self- and cross-attention to explicitly model the correlations within and between the input channels.
The results show a relative improvement of 3.8% in WER by the proposed model over the baseline neural beamformer.
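A hedged PyTorch sketch of a conv-plus-attention beamforming layer: a 2D convolution over the time-frequency plane, then cross-attention from a reference channel over all channels; the paper's 2D Conv-Attention module likely differs in detail.
```python
import torch
import torch.nn as nn

class ConvAttentionBeamformer(nn.Module):
    """Illustrative beamforming layer: 2D convolution over time-frequency
    features, then cross-attention across the channel axis."""
    def __init__(self, n_ch=4, dim=257):
        super().__init__()
        self.conv = nn.Conv2d(n_ch, n_ch, kernel_size=3, padding=1)
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)

    def forward(self, x):                  # x: (batch, channels, time, bins)
        h = torch.relu(self.conv(x))
        B, C, T, F_ = h.shape
        h = h.permute(0, 2, 1, 3).reshape(B * T, C, F_)
        ref = h[:, :1]                     # reference channel as the query
        out, _ = self.attn(ref, h, h)      # cross-attention over channels
        return out.reshape(B, T, F_)

x = torch.randn(2, 4, 100, 257)
print(ConvAttentionBeamformer()(x).shape)  # (2, 100, 257)
```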
arXiv Detail & Related papers (2021-05-12T19:32:24Z)
- Many-to-Many Voice Transformer Network [55.17770019619078]
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework.
It enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech.
arXiv Detail & Related papers (2020-05-18T04:02:08Z)
- Data-Driven Symbol Detection via Model-Based Machine Learning [117.58188185409904]
We review a data-driven framework for symbol detection design that combines machine learning (ML) and model-based algorithms.
In this hybrid approach, well-known channel-model-based algorithms are augmented with ML-based algorithms to remove their channel-model dependence.
Our results demonstrate that these techniques can yield near-optimal performance of model-based algorithms without knowing the exact channel input-output statistical relationship.
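One instance of this hybrid recipe is to keep the Viterbi recursion model-based while learning the branch metrics from data (in the spirit of ViterbiNet). A minimal NumPy sketch, with a Gaussian placeholder standing in for the trained likelihood estimator:
```python
import numpy as np

def learned_log_likelihood(y_t, state):
    """Stand-in for an NN estimating log p(y_t | state); here a Gaussian
    around a hypothetical 2-tap ISI response h = [1.0, 0.5]."""
    s_cur, s_prev = state
    return -0.5 * (y_t - (1.0 * s_cur + 0.5 * s_prev)) ** 2

def viterbi(y, symbols=(-1.0, 1.0)):
    states = [(a, b) for a in symbols for b in symbols]  # (s_t, s_{t-1})
    n = len(states)
    score, back = np.zeros(n), []
    for y_t in y:
        new, ptr = np.full(n, -np.inf), np.zeros(n, dtype=int)
        for j, (s_cur, s_prev) in enumerate(states):
            # Valid predecessors: their current symbol is our previous one.
            preds = [i for i, st in enumerate(states) if st[0] == s_prev]
            best = max(preds, key=lambda i: score[i])
            new[j] = score[best] + learned_log_likelihood(y_t, states[j])
            ptr[j] = best
        score, back = new, back + [ptr]
    j, out = int(np.argmax(score)), []  # backtrack the best path
    for ptr in reversed(back):
        out.append(states[j][0])
        j = ptr[j]
    return out[::-1]

true = np.random.choice([-1.0, 1.0], size=20)
y = 1.0 * true + 0.5 * np.concatenate([[0.0], true[:-1]]) \
    + 0.1 * np.random.randn(20)
print(viterbi(y))
```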
arXiv Detail & Related papers (2020-02-14T06:58:27Z)
- End-to-End Multi-speaker Speech Recognition with Transformer [88.22355110349933]
We replace the RNN-based encoder-decoder in the speech recognition model with a Transformer architecture.
We also modify the self-attention component to be restricted to a segment rather than the whole sequence in order to reduce computation.
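A hedged PyTorch sketch of segment-restricted self-attention via an attention mask; whether the paper uses fixed segments or a sliding local window is not stated in the snippet above, so fixed segments are assumed here.
```python
import torch

def segment_attention_mask(seq_len, segment=16):
    """Boolean mask restricting self-attention to fixed-size segments:
    position i may only attend to positions in the same segment.
    True means "masked out", per nn.MultiheadAttention conventions."""
    seg_id = torch.arange(seq_len) // segment
    return seg_id.unsqueeze(0) != seg_id.unsqueeze(1)

attn = torch.nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
x = torch.randn(2, 128, 64)
out, _ = attn(x, x, x, attn_mask=segment_attention_mask(128))
print(out.shape)  # (2, 128, 64): attention confined to 16-frame segments
```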
arXiv Detail & Related papers (2020-02-10T16:29:26Z)