Mutual Learning of Single- and Multi-Channel End-to-End Neural
Diarization
- URL: http://arxiv.org/abs/2210.03459v1
- Date: Fri, 7 Oct 2022 11:03:32 GMT
- Title: Mutual Learning of Single- and Multi-Channel End-to-End Neural
Diarization
- Authors: Shota Horiguchi, Yuki Takashima, Shinji Watanabe, Paola Garcia
- Abstract summary: This paper focuses on speaker diarization and proposes to conduct bi-directional knowledge transfer between single- and multi-channel models alternately.
We introduce an end-to-end neural diarization model that can handle both single- and multi-channel inputs.
Experimental results on two-speaker data show that the proposed method mutually improves single- and multi-channel speaker diarization performance.
- Score: 34.65357110940456
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Because multi-channel speech processing generally outperforms its
single-channel counterpart, the outputs of a multi-channel model can be used
as teacher labels when training a single-channel model with knowledge
distillation. Conversely, it is also known that single-channel speech data can
benefit multi-channel models when it is mixed into the multi-channel training
data or used for model pretraining. This paper focuses on speaker diarization
and proposes to conduct this bi-directional knowledge transfer alternately. We
first introduce an end-to-end neural diarization model that can handle both
single- and multi-channel inputs. Using this model, we alternately conduct i)
knowledge distillation from a multi-channel model to a single-channel model
and ii) finetuning from the distilled single-channel model to a multi-channel
model. Experimental results on two-speaker data show that the proposed method
mutually improves single- and multi-channel speaker diarization performance.
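The alternating procedure lends itself to a compact sketch. The following is a minimal, illustrative PyTorch loop, not the authors' code: the shared diarization model is stood in for by a toy frame-wise classifier, and the shapes, names, and channel-pooling scheme are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_SPK, N_FEAT, N_CH = 2, 40, 4   # speakers, features per frame, channels

class ToyEEND(nn.Module):
    """Stand-in for the shared single-/multi-channel diarization model."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(N_FEAT, N_SPK)

    def forward(self, x):           # x: (batch, time, channels, feat)
        x = x.mean(dim=2)           # channel pooling covers 1..N channels
        return self.proj(x)         # frame-wise speaker-activity logits

def distill_step(student, teacher, multi_x, opt):
    """i) Multi-channel teacher -> single-channel student distillation."""
    with torch.no_grad():
        soft = torch.sigmoid(teacher(multi_x))       # teacher posteriors
    logits = student(multi_x[:, :, :1])              # first channel only
    loss = F.binary_cross_entropy_with_logits(logits, soft)
    opt.zero_grad(); loss.backward(); opt.step()

def finetune_step(model, multi_x, labels, opt):
    """ii) Finetune the distilled model on labeled multi-channel data."""
    loss = F.binary_cross_entropy_with_logits(model(multi_x), labels)
    opt.zero_grad(); loss.backward(); opt.step()

single, multi = ToyEEND(), ToyEEND()
opt_s = torch.optim.Adam(single.parameters())
opt_m = torch.optim.Adam(multi.parameters())
x = torch.randn(8, 100, N_CH, N_FEAT)                # dummy 4-channel batch
y = torch.randint(0, 2, (8, 100, N_SPK)).float()     # dummy speaker activities

for _ in range(3):                                   # alternate the directions
    distill_step(single, multi, x, opt_s)
    multi.load_state_dict(single.state_dict())       # warm-start from student
    finetune_step(multi, x, y, opt_m)
```

In the actual method, each direction would presumably run for full training passes rather than single steps; the sketch only shows the alternation and the warm-start described in the abstract.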
Related papers
- Self-Supervised Learning for Multi-Channel Neural Transducer [3.045851438458641]
We explore a self-supervised learning method for a multi-channel end-to-end ASR model based on the wav2vec 2.0 framework.
We observed a 66% relative reduction in character error rate compared with a model without any pre-training on a far-field in-house dataset.
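For a concrete reference point, the contrastive objective at the heart of wav2vec 2.0-style pretraining can be sketched as follows. This is a simplified illustration under assumptions (cosine similarities against distractors drawn from other time steps), not the paper's multi-channel implementation, which also involves quantized targets and a diversity loss.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(context, targets, n_negatives=10, temperature=0.1):
    """context, targets: (time, dim) vectors at masked frames of one utterance."""
    T, D = context.shape
    # Sample negative targets from other time steps of the same utterance.
    neg_idx = torch.randint(0, T, (T, n_negatives))
    candidates = torch.cat([targets.unsqueeze(1),        # positive first
                            targets[neg_idx]], dim=1)    # (T, 1+K, D)
    sims = F.cosine_similarity(context.unsqueeze(1), candidates, dim=-1)
    logits = sims / temperature                          # (T, 1+K)
    labels = torch.zeros(T, dtype=torch.long)            # index 0 = positive
    return F.cross_entropy(logits, labels)

loss = contrastive_loss(torch.randn(50, 256), torch.randn(50, 256))
```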
arXiv Detail & Related papers (2024-08-06T04:12:31Z)
- End-to-end Multichannel Speaker-Attributed ASR: Speaker Guided Decoder and Input Feature Analysis [0.0]
We present an end-to-end multichannel speaker-attributed automatic speech recognition (MC-SA-ASR) system that combines a Conformer-based encoder with multi-frame cross-channel attention and a speaker-attributed Transformer-based decoder.
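Multi-frame cross-channel attention can be pictured as each channel's frame attending over a short window of frames from all channels. The following is a rough sketch under assumed shapes and layer choices, not the MC-SA-ASR implementation.

```python
import torch
import torch.nn as nn

class CrossChannelAttention(nn.Module):
    def __init__(self, dim=256, heads=4, context=2):
        super().__init__()
        self.context = context      # frames of left/right context
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):           # x: (batch, channels, time, dim)
        B, C, T, D = x.shape
        q = x.permute(0, 2, 1, 3).reshape(B * T, C, D)    # per-frame queries
        # Keys/values: a small window of frames from every channel.
        pad = nn.functional.pad(x, (0, 0, self.context, self.context))
        win = pad.unfold(2, 2 * self.context + 1, 1)      # (B, C, T, D, W)
        win = win.permute(0, 2, 1, 4, 3).reshape(B * T, -1, D)
        out, _ = self.attn(q, win, win)                   # (B*T, C, D)
        return out.reshape(B, T, C, D).permute(0, 2, 1, 3)

y = CrossChannelAttention()(torch.randn(2, 4, 50, 256))  # 4-channel features
```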
arXiv Detail & Related papers (2023-10-16T06:40:18Z)
- Implicit Neural Spatial Filtering for Multichannel Source Separation in the Waveform Domain [131.74762114632404]
The model is trained end-to-end and performs spatial processing implicitly.
We evaluate the proposed model on a real-world dataset and show that the model matches the performance of an oracle beamformer.
arXiv Detail & Related papers (2022-06-30T17:13:01Z)
- BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis [129.86743102915986]
We formulate the synthesis process from a different perspective by decomposing the binaural audio into a common part shared by the two channels and a channel-specific part.
We propose BinauralGrad, a novel two-stage framework equipped with diffusion models to synthesize them respectively.
Experiment results show that BinauralGrad outperforms the existing baselines by a large margin in terms of both objective and subjective evaluation metrics.
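The decomposition itself is simple to state. A minimal sketch, assuming the common part is the per-sample average of the two channels (stage one would generate it, stage two the channel-specific remainders conditioned on it):

```python
import torch

left, right = torch.randn(1, 16000), torch.randn(1, 16000)  # dummy audio
common = (left + right) / 2            # shared content (stage-1 target)
spec_l, spec_r = left - common, right - common               # stage-2 residue

# The decomposition reconstructs each channel exactly:
assert torch.allclose(common + spec_l, left)
assert torch.allclose(common + spec_r, right)
```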
arXiv Detail & Related papers (2022-05-30T02:09:26Z)
- End-to-End Multi-speaker ASR with Independent Vector Analysis [80.83577165608607]
We develop an end-to-end system for multi-channel, multi-speaker automatic speech recognition.
We propose a framework for joint source separation and dereverberation based on independent vector analysis (IVA).
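For orientation, a compact AuxIVA-style update, one standard realization of IVA, is sketched below in NumPy. The paper's joint separation-and-dereverberation front-end is more involved, so treat this purely as an illustration of the IVA building block.

```python
import numpy as np

def auxiva(X, n_iter=20):
    """X: (freq, time, mics) mixture STFT; returns estimates (freq, time, srcs)."""
    F_, T, M = X.shape
    W = np.tile(np.eye(M, dtype=complex), (F_, 1, 1))      # per-freq demixing
    for _ in range(n_iter):
        Y = np.einsum('fkm,ftm->ftk', W, X)                # current sources
        r = np.sqrt((np.abs(Y) ** 2).sum(axis=0)) + 1e-8   # (time, srcs)
        for k in range(M):
            # Weighted spatial covariance for source k (spherical source model).
            V = np.einsum('ftm,ftn,t->fmn', X, X.conj(), 1.0 / r[:, k]) / T
            e_k = np.tile(np.eye(M)[:, k], (F_, 1))
            w = np.linalg.solve(W @ V, e_k)                # (freq, mics)
            scale = np.sqrt(np.einsum('fm,fmn,fn->f', w.conj(), V, w).real)
            W[:, k, :] = w.conj() / scale[:, None]
    return np.einsum('fkm,ftm->ftk', W, X)

# Dummy usage on random 2-mic data (in practice X comes from an STFT):
X = np.random.randn(129, 200, 2) + 1j * np.random.randn(129, 200, 2)
Y = auxiva(X)
```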
arXiv Detail & Related papers (2022-04-01T05:45:33Z)
- Multi-Channel End-to-End Neural Diarization with Distributed Microphones [53.99406868339701]
We replace Transformer encoders in EEND with two types of encoders that process a multi-channel input.
We also propose a model adaptation method using only single-channel recordings.
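One way such encoders are commonly structured is as a spatio-temporal block: temporal self-attention within each channel, then cross-channel self-attention at each frame. The sketch below is an assumption-laden illustration rather than the paper's exact encoders; it does show how the same block accepts a single-channel input (C=1) unchanged, which is consistent with adapting the model on single-channel recordings.

```python
import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                    # x: (batch, channels, time, dim)
        B, C, T, D = x.shape
        t = x.reshape(B * C, T, D)           # attention along time, per channel
        x = x + self.temporal(t, t, t)[0].reshape(B, C, T, D)
        s = x.permute(0, 2, 1, 3).reshape(B * T, C, D)  # along channels, per frame
        x = x + self.spatial(s, s, s)[0].reshape(B, T, C, D).permute(0, 2, 1, 3)
        return x

block = SpatioTemporalBlock()
multi = block(torch.randn(2, 4, 50, 256))    # 4-channel input
single = block(torch.randn(2, 1, 50, 256))   # same block, 1 channel
```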
arXiv Detail & Related papers (2021-10-10T03:24:03Z)
- Attention-based Neural Beamforming Layers for Multi-channel Speech Recognition [17.009051842682677]
We propose a 2D Conv-Attention module which combines convolutional neural networks with attention for beamforming.
We apply self- and cross-attention to explicitly model the correlations within and between the input channels.
The results show a relative improvement of 3.8% in WER by the proposed model over the baseline neural beamformer.
arXiv Detail & Related papers (2021-05-12T19:32:24Z)
- Many-to-Many Voice Transformer Network [55.17770019619078]
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework.
It enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech.
arXiv Detail & Related papers (2020-05-18T04:02:08Z)
- Fully Learnable Front-End for Multi-Channel Acoustic Modeling using Semi-Supervised Learning [20.97480659815297]
We train a fully learnable multi-channel acoustic model for far-field automatic speech recognition.
For the student, both the multi-channel feature extraction layers and the higher classification layers are jointly trained.
We find that pre-training improves the word error rate by 10.7% compared to a multi-channel model trained directly with a beamformer.
arXiv Detail & Related papers (2020-02-01T02:06:05Z)