Multi-Channel End-to-End Neural Diarization with Distributed Microphones
- URL: http://arxiv.org/abs/2110.04694v1
- Date: Sun, 10 Oct 2021 03:24:03 GMT
- Title: Multi-Channel End-to-End Neural Diarization with Distributed Microphones
- Authors: Shota Horiguchi, Yuki Takashima, Paola Garcia, Shinji Watanabe, Yohei
Kawaguchi
- Abstract summary: We replace Transformer encoders in EEND with two types of encoders that process a multi-channel input.
We also propose a model adaptation method using only single-channel recordings.
- Score: 53.99406868339701
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent progress on end-to-end neural diarization (EEND) has enabled
overlap-aware speaker diarization with a single neural network. This paper
proposes to enhance EEND by using multi-channel signals from distributed
microphones. We replace Transformer encoders in EEND with two types of encoders
that process a multi-channel input: spatio-temporal and co-attention encoders.
Both are independent of the number and geometry of microphones and suitable for
distributed microphone settings. We also propose a model adaptation method
using only single-channel recordings. With simulated and real-recorded
datasets, we demonstrated that the proposed method outperformed conventional
EEND when a multi-channel input was given while maintaining comparable
performance with a single-channel input. We also showed that the proposed
method performed well even when the spatial information in multi-channel
inputs was uninformative, such as in hybrid meetings in which the utterances
of multiple remote participants are played back from the same loudspeaker.
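As a concrete picture of the encoder swap, here is a minimal sketch of the spatio-temporal idea (PyTorch assumed; layer sizes and the exact block layout are hypothetical, not the authors' released implementation): self-attention is applied alternately along the time axis and along the channel axis, so the same weights handle any number of microphones.

```python
import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    """Alternating self-attention over time and over channels.

    Input: (batch, channels, time, features). Attention pools over a set,
    so the block is indifferent to the number and ordering of microphones.
    A hypothetical reading of the paper's spatio-temporal encoder, not
    the authors' code.
    """
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.temporal = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.spatial = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, t, f = x.shape
        xt = x.reshape(b * c, t, f)                    # attend across time
        xt = self.norm1(xt + self.temporal(xt, xt, xt)[0])
        x = xt.reshape(b, c, t, f)
        xs = x.transpose(1, 2).reshape(b * t, c, f)    # attend across channels
        xs = self.norm2(xs + self.spatial(xs, xs, xs)[0])
        return xs.reshape(b, t, c, f).transpose(1, 2)

x = torch.randn(2, 3, 100, 256)            # 3 distributed microphones
print(SpatioTemporalBlock()(x).shape)      # torch.Size([2, 3, 100, 256])
```

Because attention pools over an unordered set, nothing in the block depends on the microphone count or geometry, which is what makes it suitable for distributed-array settings.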
Related papers
- End-to-end Multichannel Speaker-Attributed ASR: Speaker Guided Decoder and Input Feature Analysis [0.0]
We present an end-to-end multichannel speaker-attributed automatic speech recognition (MC-SA-ASR) system that combines a Conformer-based encoder with multi-frame cross-channel attention and a speaker-attributed Transformer-based decoder.
arXiv Detail & Related papers (2023-10-16T06:40:18Z)
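A hedged sketch of the multi-frame cross-channel attention named above (PyTorch; the window size, reference-channel choice, and per-frame loop are assumptions for illustration, not the authors' code): each frame of a reference channel attends over a short window of frames gathered from all channels.

```python
import torch
import torch.nn as nn

def multiframe_crosschannel_attention(
    x: torch.Tensor, attn: nn.MultiheadAttention, context: int = 2
) -> torch.Tensor:
    """Simplified multi-frame cross-channel attention.

    x: (batch, channels, time, features); channel 0 is the reference.
    Every reference frame attends over a +/-context window of frames
    taken from every channel."""
    b, c, t, f = x.shape
    outs = []
    for i in range(t):
        lo, hi = max(0, i - context), min(t, i + context + 1)
        q = x[:, 0, i:i + 1]                    # query: one reference frame
        kv = x[:, :, lo:hi].reshape(b, -1, f)   # keys/values: all channels
        outs.append(attn(q, kv, kv)[0].squeeze(1))
    return torch.stack(outs, dim=1)             # (batch, time, features)

attn = nn.MultiheadAttention(64, 4, batch_first=True)
y = multiframe_crosschannel_attention(torch.randn(2, 3, 50, 64), attn)
print(y.shape)                                   # torch.Size([2, 50, 64])
```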
- Joint Channel Estimation and Feedback with Masked Token Transformers in Massive MIMO Systems [74.52117784544758]
This paper proposes an encoder-decoder based network that unveils the intrinsic frequency-domain correlation within the CSI matrix.
The entire encoder-decoder network is utilized for channel compression.
Our method outperforms state-of-the-art channel estimation and feedback techniques in joint tasks.
arXiv Detail & Related papers (2023-06-08T06:15:17Z)
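The channel-compression idea above reduces to an auto-encoder over the CSI matrix; a minimal sketch follows (dimensions hypothetical; the masked-token Transformer mechanism the paper describes is not reproduced here).

```python
import torch
import torch.nn as nn

# Hypothetical sizes: a CSI matrix flattened to 2048 reals, compressed
# to a 128-dim codeword for feedback. Illustrative only.
class CSIAutoencoder(nn.Module):
    def __init__(self, csi_dim: int = 2048, code_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(csi_dim, 512), nn.ReLU(),
                                     nn.Linear(512, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 512), nn.ReLU(),
                                     nn.Linear(512, csi_dim))

    def forward(self, csi: torch.Tensor) -> torch.Tensor:
        code = self.encoder(csi)        # compressed feedback codeword
        return self.decoder(code)       # reconstruction at the base station

model = CSIAutoencoder()
csi = torch.randn(4, 2048)
loss = nn.functional.mse_loss(model(csi), csi)   # reconstruction objective
```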
- On Neural Architectures for Deep Learning-based Source Separation of Co-Channel OFDM Signals [104.11663769306566]
We study the single-channel source separation problem involving orthogonal frequency-division multiplexing (OFDM) signals.
We propose critical domain-informed modifications to the network parameterization, based on insights from OFDM structures.
arXiv Detail & Related papers (2023-03-11T16:29:13Z)
- MIMO-DBnet: Multi-channel Input and Multiple Outputs DOA-aware Beamforming Network for Speech Separation [55.533789120204055]
We propose an end-to-end beamforming network for direction-guided speech separation given merely the mixture signal.
Specifically, we design a multi-channel input and multiple outputs architecture to predict the direction-of-arrival-based embeddings and beamforming weights for each source.
arXiv Detail & Related papers (2022-12-07T01:52:40Z)
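A minimal sketch of the "multiple outputs" half of that design: once per-source complex beamforming weights are predicted, each source is recovered as a weighted sum over channels. The shapes below, and leaving out the DOA-based weight predictor itself, are assumptions for illustration.

```python
import torch

def apply_beamforming(stft: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    """stft:    (batch, channels, freq, time), complex STFT of the mixture
    weights: (batch, sources, channels, freq, time), complex, one
             beamformer per source (here random stand-ins; MIMO-DBnet
             predicts them from DOA-based embeddings).
    Returns one beamformed spectrogram per source: (batch, sources, freq, time).
    """
    return torch.einsum('bcft,bscft->bsft', stft, weights.conj())

stft = torch.randn(2, 4, 257, 100, dtype=torch.complex64)
w = torch.randn(2, 3, 4, 257, 100, dtype=torch.complex64)   # 3 sources
print(apply_beamforming(stft, w).shape)  # torch.Size([2, 3, 257, 100])
```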
- Mutual Learning of Single- and Multi-Channel End-to-End Neural Diarization [34.65357110940456]
This paper focuses on speaker diarization and proposes to conduct bi-directional knowledge transfer between single- and multi-channel models alternately.
We introduce an end-to-end neural diarization model that can handle both single- and multi-channel inputs.
Experimental results on two-speaker data show that the proposed method mutually improved single- and multi-channel speaker diarization performances.
arXiv Detail & Related papers (2022-10-07T11:03:32Z)
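A minimal sketch of alternating bi-directional transfer (the loss and schedule are assumptions, and the two models below are placeholders, not real EEND encoders): the single- and multi-channel models take turns acting as teacher for each other's frame-wise speaker-activity posteriors.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MeanPoolDiarizer(nn.Module):
    """Placeholder multi-channel model: average channels, then project
    to per-frame speaker posteriors (stands in for a real EEND encoder)."""
    def __init__(self, feat: int = 40, n_spk: int = 2):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(feat, n_spk), nn.Sigmoid())

    def forward(self, x):                  # x: (batch, channels, time, feat)
        return self.head(x.mean(dim=1))    # -> (batch, time, n_spk)

single = nn.Sequential(nn.Linear(40, 2), nn.Sigmoid())   # 1-channel model
multi = MeanPoolDiarizer()

x = torch.randn(2, 3, 100, 40)             # 3-channel features
for step in range(4):                      # alternate the teacher role
    p_single = single(x[:, 0])             # single model sees channel 0 only
    p_multi = multi(x)
    if step % 2 == 0:                      # multi-channel model teaches
        loss = F.binary_cross_entropy(p_single, p_multi.detach())
    else:                                  # single-channel model teaches
        loss = F.binary_cross_entropy(p_multi, p_single.detach())
    loss.backward()
```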
- Self-Attention Channel Combinator Frontend for End-to-End Multichannel Far-field Speech Recognition [1.0276024900942875]
When sufficiently large far-field training data is available, jointly optimizing a multichannel frontend and an end-to-end (E2E) Automatic Speech Recognition (ASR) backend shows promising results.
Recent literature has shown that traditional beamformer designs, such as MVDR (minimum variance distortionless response) or fixed beamformers, can be successfully integrated into an E2E ASR system with learnable parameters.
We propose the self-attention channel combinator (SACC) ASR frontend, which leverages the self-attention mechanism to combine multichannel audio signals in the magnitude spectral domain.
arXiv Detail & Related papers (2021-09-10T11:03:43Z)
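A hedged sketch of that magnitude-domain channel combination (the scoring layer is an assumption, not the authors' exact SACC layer): softmax weights over channels are computed per frame and used to collapse the multichannel spectrogram into one stream for the ASR backend.

```python
import torch
import torch.nn as nn

class ChannelCombinator(nn.Module):
    """Softmax-attention combination of channel magnitude spectra.
    An illustrative stand-in for SACC."""
    def __init__(self, n_freq: int = 257):
        super().__init__()
        self.score = nn.Linear(n_freq, 1)    # per-channel, per-frame score

    def forward(self, mag: torch.Tensor) -> torch.Tensor:
        # mag: (batch, channels, time, freq) magnitude spectrogram
        w = torch.softmax(self.score(mag), dim=1)   # weights over channels
        return (w * mag).sum(dim=1)                 # (batch, time, freq)

mag = torch.rand(2, 4, 100, 257)
print(ChannelCombinator()(mag).shape)   # torch.Size([2, 100, 257])
```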
- Model-Driven Deep Learning Based Channel Estimation and Feedback for Millimeter-Wave Massive Hybrid MIMO Systems [61.78590389147475]
This paper proposes a model-driven deep learning (MDDL)-based channel estimation and feedback scheme for millimeter-wave (mmWave) systems.
To reduce the uplink pilot overhead for estimating the high-dimensional channels from a limited number of radio frequency (RF) chains, we propose to jointly train the phase shift network and the channel estimator as an auto-encoder.
Numerical results show that the proposed MDDL-based channel estimation and feedback scheme outperforms the state-of-the-art approaches.
arXiv Detail & Related papers (2021-04-22T13:34:53Z)
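A loose sketch of training the phase shift network and channel estimator as one auto-encoder (antenna and RF-chain counts and the MLP decoder are hypothetical, not the paper's exact scheme): learnable unit-modulus phases compress the pilots through a few RF chains, and an MLP recovers the high-dimensional channel.

```python
import torch
import torch.nn as nn

class PilotAutoencoder(nn.Module):
    """Encoder = learnable unit-modulus phase shifts (analog combining
    through few RF chains); decoder = a channel-estimator MLP.
    Illustrative sizes: 64 antennas, 8 RF chains."""
    def __init__(self, n_ant: int = 64, n_rf: int = 8):
        super().__init__()
        self.phase = nn.Parameter(2 * torch.pi * torch.rand(n_rf, n_ant))
        self.estimator = nn.Sequential(
            nn.Linear(2 * n_rf, 256), nn.ReLU(), nn.Linear(256, 2 * n_ant))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, n_ant) complex channel observed via pilots
        W = torch.exp(1j * self.phase)             # unit-modulus combining
        y = h @ W.T.conj() / W.shape[1] ** 0.5     # (batch, n_rf) measurements
        y_ri = torch.cat([y.real, y.imag], dim=-1)
        h_ri = self.estimator(y_ri)                # estimate in real/imag form
        n = h_ri.shape[-1] // 2
        return torch.complex(h_ri[..., :n], h_ri[..., n:])

model = PilotAutoencoder()
h = torch.randn(4, 64, dtype=torch.complex64)
loss = (model(h) - h).abs().pow(2).mean()          # reconstruction objective
```

- Continuous Speech Separation with Ad Hoc Microphone Arrays [35.87274524040486]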
Speech separation has been shown effective for multi-talker speech recognition.
In this paper, we extend this approach to continuous speech separation.
Two methods are proposed to mitigate a speech duplication problem during single-talker segments.
arXiv Detail & Related papers (2021-03-03T13:01:08Z)
- Audio-visual Multi-channel Recognition of Overlapped Speech [79.21950701506732]
This paper presents an audio-visual multi-channel overlapped speech recognition system featuring tightly integrated separation front-end and recognition back-end.
Experiments suggest that the proposed multi-channel AVSR system outperforms the baseline audio-only ASR system by up to 6.81% (26.83% relative) and 22.22% (56.87% relative) absolute word error rate (WER) reductions on overlapped speech constructed using either simulation or replaying of the Lip Reading Sentences 2 (LRS2) dataset, respectively.
arXiv Detail & Related papers (2020-05-18T10:31:19Z)
- Neural Speech Separation Using Spatially Distributed Microphones [19.242927805448154]
This paper proposes a neural network based speech separation method using spatially distributed microphones.
Unlike in traditional microphone array settings, neither the number of microphones nor their spatial arrangement is known in advance.
Speech recognition experimental results show that the proposed method significantly outperforms baseline multi-channel speech separation systems.
arXiv Detail & Related papers (2020-04-28T17:16:31Z)
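The geometry-agnostic requirement above is typically met with blocks that pool across channels. Below is a transform-average-concatenate style sketch, an illustrative device under that assumption rather than the paper's exact network: each channel is transformed, a cross-channel average provides shared context, and the two are concatenated back, so the same weights serve any microphone count.

```python
import torch
import torch.nn as nn

class TransformAverageConcat(nn.Module):
    """Per-channel transform, cross-channel average, concatenate back.
    Invariant to the number and ordering of microphones; an
    illustrative block, not the paper's exact architecture."""
    def __init__(self, feat: int = 128):
        super().__init__()
        self.pre = nn.Linear(feat, feat)
        self.post = nn.Linear(2 * feat, feat)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, feat); works for any channel count
        z = torch.relu(self.pre(x))
        avg = z.mean(dim=1, keepdim=True).expand_as(z)   # shared context
        return self.post(torch.cat([z, avg], dim=-1))

block = TransformAverageConcat()
x3 = torch.randn(2, 3, 50, 128)   # 3 microphones
x7 = torch.randn(2, 7, 50, 128)   # 7 microphones, same weights
print(block(x3).shape, block(x7).shape)
```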