Neural Speaker Diarization Using Memory-Aware Multi-Speaker Embedding
with Sequence-to-Sequence Architecture
- URL: http://arxiv.org/abs/2309.09180v2
- Date: Tue, 26 Dec 2023 07:33:46 GMT
- Title: Neural Speaker Diarization Using Memory-Aware Multi-Speaker Embedding
with Sequence-to-Sequence Architecture
- Authors: Gaobin Yang, Maokui He, Shutong Niu, Ruoyu Wang, Yanyan Yue,
Shuangqing Qian, Shilong Wu, Jun Du, Chin-Hui Lee
- Abstract summary: We propose a novel neural speaker diarization system using memory-aware multi-speaker embedding and sequence-to-sequence architecture.
NSD-MS2S achieved a macro diarization error rate (DER) of 15.9% on the CHiME-7 EVAL set.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel neural speaker diarization system using memory-aware
multi-speaker embedding with sequence-to-sequence architecture (NSD-MS2S),
which integrates the strengths of memory-aware multi-speaker embedding (MA-MSE)
and sequence-to-sequence (Seq2Seq) architecture, leading to improvement in both
efficiency and performance. We further decrease the memory occupation of
decoding by incorporating input feature fusion, and employ a multi-head
attention mechanism to capture features at different levels. NSD-MS2S achieved
a macro diarization error rate (DER) of 15.9% on the CHiME-7 EVAL set, which
signifies a relative improvement of 49% over the official baseline system, and
was the key technique behind our best result on the main track of the
CHiME-7 DASR Challenge. Additionally, we introduce a deep interactive module
(DIM) in the MA-MSE module to retrieve cleaner and more discriminative
multi-speaker embeddings, enabling the current model to outperform the system we
used in the CHiME-7 DASR Challenge. Our code will be available at
https://github.com/liyunlongaaa/NSD-MS2S.
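
The abstract describes retrieving multi-speaker embeddings from a memory module through attention. As a rough illustration of that general idea (not the paper's actual MA-MSE implementation; all names, shapes, and the memory contents below are assumptions), an attention read over a fixed memory of speaker basis vectors can be sketched in NumPy:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def retrieve_speaker_embedding(query, memory):
    """Attention read over a fixed memory of speaker basis vectors.

    query:  (d,)   acoustic summary vector for one speaker in a chunk
    memory: (m, d) speaker-embedding basis (the "memory" in this sketch)
    Returns a (d,) embedding: an attention-weighted sum of memory rows.
    """
    scores = memory @ query / np.sqrt(query.shape[0])  # scaled dot-product
    weights = softmax(scores)                          # (m,) attention weights
    return weights @ memory                            # (d,) retrieved embedding

rng = np.random.default_rng(0)
mem = rng.standard_normal((8, 16))   # 8 basis vectors, dimension 16
q = rng.standard_normal(16)
emb = retrieve_speaker_embedding(q, mem)
print(emb.shape)  # (16,)
```

Because the output is a convex combination of fixed memory rows, the retrieved embedding stays inside the span of the memory, which is one plausible way such a module can yield cleaner speaker representations than raw per-chunk statistics.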
Related papers
- Generative Early Stage Ranking [14.15517442047903]
We propose a Generative Early Stage Ranking (GESR) paradigm to balance effectiveness and efficiency. The GESR paradigm has shown substantial improvements in topline metrics, engagement, and consumption tasks. To the best of our knowledge, this marks the first successful deployment of full target-aware attention sequence modeling within an ESR stage at such a scale.
arXiv Detail & Related papers (2025-11-26T06:29:18Z)
- IML-Spikeformer: Input-aware Multi-Level Spiking Transformer for Speech Processing [37.95536541492917]
Spiking Neural Networks (SNNs) offer energy-efficient alternatives to traditional Artificial Neural Networks (ANNs). IML-Spikeformer is a spiking Transformer architecture specifically designed for large-scale speech processing. IML-Spikeformer achieves word error rates of 6.0% on AiShell-1 and 3.4% on Librispeech-960, comparable to conventional ANN transformers.
arXiv Detail & Related papers (2025-07-10T03:26:24Z)
- Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation [129.45368843861917]
We introduce the Gated Memory Unit (GMU), a simple yet effective mechanism for efficient memory sharing across layers. We apply it to create SambaY, a decoder-hybrid-decoder architecture that incorporates GMUs to share memory readout states from a Samba-based self-decoder.
arXiv Detail & Related papers (2025-07-09T07:27:00Z)
- Exploring Speaker Diarization with Mixture of Experts [39.02603646215667]
We propose a novel neural speaker diarization system using memory-aware multi-speaker embedding with sequence-to-sequence architecture. The proposed methods achieve state-of-the-art results, showcasing their effectiveness in challenging real-world scenarios.
arXiv Detail & Related papers (2025-06-17T17:42:54Z)
- BemaGANv2: A Tutorial and Comparative Survey of GAN-based Vocoders for Long-Term Audio Generation [5.716013795091872]
This paper presents a tutorial-style survey and implementation guide of BemaGANv2. BemaGANv2 is an advanced GAN-based vocoder designed for high-fidelity and long-term audio generation.
arXiv Detail & Related papers (2025-06-11T07:57:05Z)
- Masked Graph Learning with Recurrent Alignment for Multimodal Emotion Recognition in Conversation [12.455034591553506]
Multimodal Emotion Recognition in Conversation (MERC) can be applied to public opinion monitoring, intelligent dialogue robots, and other fields.
Previous work ignored the inter-modal alignment process and the intra-modal noise information before multimodal fusion.
We have developed a novel approach called Masked Graph Learning with Recursive Alignment (MGLRA) to tackle this problem.
arXiv Detail & Related papers (2024-07-23T02:23:51Z)
- Cross-Speaker Encoding Network for Multi-Talker Speech Recognition [74.97576062152709]
The Cross-Speaker Encoding network addresses limitations of SIMO models by aggregating cross-speaker representations.
The network is integrated with SOT to leverage the advantages of both SIMO and SISO.
arXiv Detail & Related papers (2024-01-08T16:37:45Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- Multi-Head State Space Model for Speech Recognition [44.04124537862432]
State space models (SSMs) have recently shown promising results on small-scale sequence and language modelling tasks.
In this paper, we propose a multi-head state space (MH-SSM) architecture equipped with special gating mechanisms.
As a drop-in replacement for multi-head attention in transformer encoders, this new model significantly outperforms the transformer transducer on the LibriSpeech speech recognition corpus.
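
The MH-SSM summary above describes parallel state-space heads replacing multi-head attention. As a minimal sketch of the underlying mechanics (not the paper's architecture; the head layout, matrices, and shapes here are assumptions for illustration), each head runs an independent linear state-space recurrence over a slice of the features and the outputs are concatenated:

```python
import numpy as np

def ssm_scan(A, B, C, u):
    """One discrete linear state-space recurrence over a sequence.

    x_k = A @ x_{k-1} + B @ u_k,   y_k = C @ x_k
    A: (n, n), B: (n, d), C: (d, n), u: (T, d) -> y: (T, d)
    """
    x = np.zeros(A.shape[0])
    ys = []
    for u_k in u:
        x = A @ x + B @ u_k   # update hidden state
        ys.append(C @ x)      # read out
    return np.stack(ys)

def multi_head_ssm(heads, u):
    # Each head is an (A, B, C) triple applied to its own feature slice;
    # concatenating the outputs loosely mirrors multi-head attention.
    d = u.shape[1] // len(heads)
    outs = [ssm_scan(A, B, C, u[:, i * d:(i + 1) * d])
            for i, (A, B, C) in enumerate(heads)]
    return np.concatenate(outs, axis=1)

rng = np.random.default_rng(1)
d, n, T, H = 4, 3, 5, 2
heads = [(0.9 * np.eye(n),                 # stable state transition
          rng.standard_normal((n, d)),
          rng.standard_normal((d, n))) for _ in range(H)]
y = multi_head_ssm(heads, rng.standard_normal((T, H * d)))
print(y.shape)  # (5, 8)
```

Unlike attention, each step touches only a fixed-size state, so the cost is linear in sequence length, which is the usual motivation for SSM layers.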
arXiv Detail & Related papers (2023-05-21T16:28:57Z)
- The RoyalFlush System of Speech Recognition for M2MeT Challenge [5.863625637354342]
This paper describes our RoyalFlush system for the track of multi-speaker automatic speech recognition (ASR) in the M2MeT challenge.
We adopted the serialized output training (SOT) based multi-speakers ASR system with large-scale simulation data.
Our system got a 12.22% absolute Character Error Rate (CER) reduction on the validation set and 12.11% on the test set.
arXiv Detail & Related papers (2022-02-03T14:38:26Z)
- Self-Gated Memory Recurrent Network for Efficient Scalable HDR Deghosting [59.04604001936661]
We propose a novel recurrent network-based HDR deghosting method for fusing arbitrary length dynamic sequences.
We introduce a new recurrent cell architecture, namely Self-Gated Memory (SGM) cell, that outperforms the standard LSTM cell.
The proposed approach achieves state-of-the-art performance compared to existing HDR deghosting methods quantitatively across three publicly available datasets.
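
The SGM summary mentions a gated recurrent cell that outperforms a standard LSTM. As a generic illustration of the single-gate idea (this is not the paper's actual SGM cell; the gating equations, names, and shapes below are assumptions), one gate can blend a candidate state with the previous state instead of using separate input/forget/output gates:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_memory_step(h_prev, x, Wg, Ug, Wc, Uc):
    """One step of a single-gate recurrent cell (illustrative only).

        g   = sigmoid(Wg x + Ug h_prev)      # self-gate
        c   = tanh(Wc x + Uc h_prev)         # candidate state
        h_t = g * c + (1 - g) * h_prev       # convex blend
    """
    g = sigmoid(Wg @ x + Ug @ h_prev)
    c = np.tanh(Wc @ x + Uc @ h_prev)
    return g * c + (1.0 - g) * h_prev

rng = np.random.default_rng(2)
d, n = 6, 4                                  # input dim, state dim
Wg, Wc = rng.standard_normal((2, n, d))
Ug, Uc = rng.standard_normal((2, n, n))
h = np.zeros(n)
for x in rng.standard_normal((10, d)):       # run over a short sequence
    h = gated_memory_step(h, x, Wg, Ug, Wc, Uc)
print(h.shape)  # (4,)
```

Because the tanh candidate is bounded and the update is a convex combination, the state stays in [-1, 1], so arbitrarily long sequences cannot blow up the hidden state.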
arXiv Detail & Related papers (2021-12-24T12:36:33Z)
- Serialized Multi-Layer Multi-Head Attention for Neural Speaker Embedding [93.16866430882204]
In prior works, frame-level features from one layer are aggregated to form an utterance-level representation.
Inspired by the Transformer network, our proposed method utilizes the hierarchical architecture of stacked self-attention mechanisms.
With more layers stacked, the neural network can learn more discriminative speaker embeddings.
arXiv Detail & Related papers (2021-07-14T05:38:48Z)
- Unsupervised Speaker Adaptation using Attention-based Speaker Memory for End-to-End ASR [61.55606131634891]
We propose an unsupervised speaker adaptation method inspired by the neural Turing machine for end-to-end (E2E) automatic speech recognition (ASR).
The proposed model contains a memory block that holds speaker i-vectors extracted from the training data and reads relevant i-vectors from the memory through an attention mechanism.
We show that M-vectors, which do not require an auxiliary speaker embedding extraction system at test time, achieve similar word error rates (WERs) compared to i-vectors for single-speaker utterances and significantly lower WERs for utterances in which there are speaker changes.
arXiv Detail & Related papers (2020-02-14T18:31:31Z)
- MCSAE: Masked Cross Self-Attentive Encoding for Speaker Embedding [8.942112181408158]
We propose masked cross self-attentive encoding (MCSAE) using ResNet.
It focuses on the features of both high-level and low-level layers.
The experimental results showed an equal error rate of 2.63% and a minimum detection cost function of 0.1453.
arXiv Detail & Related papers (2020-01-28T04:09:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.