Neural Speaker Diarization Using Memory-Aware Multi-Speaker Embedding
with Sequence-to-Sequence Architecture
- URL: http://arxiv.org/abs/2309.09180v2
- Date: Tue, 26 Dec 2023 07:33:46 GMT
- Title: Neural Speaker Diarization Using Memory-Aware Multi-Speaker Embedding
with Sequence-to-Sequence Architecture
- Authors: Gaobin Yang, Maokui He, Shutong Niu, Ruoyu Wang, Yanyan Yue,
Shuangqing Qian, Shilong Wu, Jun Du, Chin-Hui Lee
- Abstract summary: We propose a novel neural speaker diarization system using memory-aware multi-speaker embedding and sequence-to-sequence architecture.
NSD-MS2S achieved a macro diarization error rate (DER) of 15.9% on the CHiME-7 EVAL set.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel neural speaker diarization system using memory-aware
multi-speaker embedding with sequence-to-sequence architecture (NSD-MS2S),
which integrates the strengths of memory-aware multi-speaker embedding (MA-MSE)
and sequence-to-sequence (Seq2Seq) architecture, leading to improvement in both
efficiency and performance. We further decrease the memory occupation of
decoding by incorporating input feature fusion, and employ a multi-head
attention mechanism to capture features at different levels. NSD-MS2S achieved
a macro diarization error rate (DER) of 15.9% on the CHiME-7 EVAL set, which
signifies a relative improvement of 49% over the official baseline system, and
was the key technique behind our best result on the main track of the
CHiME-7 DASR Challenge. Additionally, we introduce a deep interactive module
(DIM) in the MA-MSE module to retrieve cleaner and more discriminative
multi-speaker embeddings, enabling the current model to outperform the system we
used in the CHiME-7 DASR Challenge. Our code will be available at
https://github.com/liyunlongaaa/NSD-MS2S.
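
The abstract describes retrieving multi-speaker embeddings from a memory module through attention. As a rough illustration of that general idea (not the paper's actual MA-MSE implementation; all names, shapes, and the memory contents below are assumptions), an attention read over a fixed memory of speaker basis vectors can be sketched in NumPy:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def retrieve_speaker_embedding(query, memory):
    """Attention read over a fixed memory of speaker basis vectors.

    query:  (d,)   acoustic summary vector for one speaker in a chunk
    memory: (m, d) speaker-embedding basis (the "memory" in this sketch)
    Returns a (d,) embedding: an attention-weighted sum of memory rows.
    """
    scores = memory @ query / np.sqrt(query.shape[0])  # scaled dot-product
    weights = softmax(scores)                          # (m,) attention weights
    return weights @ memory                            # (d,) retrieved embedding

rng = np.random.default_rng(0)
mem = rng.standard_normal((8, 16))   # 8 basis vectors, dimension 16
q = rng.standard_normal(16)
emb = retrieve_speaker_embedding(q, mem)
print(emb.shape)  # (16,)
```

Because the output is a convex combination of fixed memory rows, the retrieved embedding stays inside the span of the memory, which is one plausible way such a module can yield cleaner speaker representations than raw per-chunk statistics.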
Related papers
- Generative Early Stage Ranking [14.15517442047903]
We propose a Generative Early Stage Ranking (GESR) paradigm to balance effectiveness and efficiency. The GESR paradigm has shown substantial improvements in topline metrics, engagement, and consumption tasks. To the best of our knowledge, this marks the first successful deployment of full target-aware attention sequence modeling within an ESR stage at such a scale.
arXiv Detail & Related papers (2025-11-26T06:29:18Z)
- IML-Spikeformer: Input-aware Multi-Level Spiking Transformer for Speech Processing [37.95536541492917]
Spiking Neural Networks (SNNs) offer energy-efficient alternatives to traditional Artificial Neural Networks (ANNs). IML-Spikeformer is a spiking Transformer architecture specifically designed for large-scale speech processing. IML-Spikeformer achieves word error rates of 6.0% on AiShell-1 and 3.4% on Librispeech-960, comparable to conventional ANN transformers.
arXiv Detail & Related papers (2025-07-10T03:26:24Z)
- Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation [129.45368843861917]
We introduce the Gated Memory Unit (GMU), a simple yet effective mechanism for efficient memory sharing across layers. We apply it to create SambaY, a decoder-hybrid-decoder architecture that incorporates GMUs to share memory readout states from a Samba-based self-decoder.
arXiv Detail & Related papers (2025-07-09T07:27:00Z)
- Exploring Speaker Diarization with Mixture of Experts [39.02603646215667]
We propose a novel neural speaker diarization system using memory-aware multi-speaker embedding with sequence-to-sequence architecture. The proposed methods achieve state-of-the-art results, showcasing their effectiveness in challenging real-world scenarios.
arXiv Detail & Related papers (2025-06-17T17:42:54Z)
- BemaGANv2: A Tutorial and Comparative Survey of GAN-based Vocoders for Long-Term Audio Generation [5.716013795091872]
This paper presents a tutorial-style survey and implementation guide of BemaGANv2. BemaGANv2 is an advanced GAN-based vocoder designed for high-fidelity and long-term audio generation.
arXiv Detail & Related papers (2025-06-11T07:57:05Z)
- Masked Graph Learning with Recurrent Alignment for Multimodal Emotion Recognition in Conversation [12.455034591553506]
Multimodal Emotion Recognition in Conversation (MERC) can be applied to public opinion monitoring, intelligent dialogue robots, and other fields.
Previous work ignored the inter-modal alignment process and the intra-modal noise information before multimodal fusion.
We have developed a novel approach called Masked Graph Learning with Recursive Alignment (MGLRA) to tackle this problem.
arXiv Detail & Related papers (2024-07-23T02:23:51Z)
- Cross-Speaker Encoding Network for Multi-Talker Speech Recognition [74.97576062152709]
The Cross-Speaker Encoding network addresses limitations of SIMO models by aggregating cross-speaker representations.
The network is integrated with SOT to leverage the advantages of both SIMO and SISO.
arXiv Detail & Related papers (2024-01-08T16:37:45Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- Multi-Head State Space Model for Speech Recognition [44.04124537862432]
State space models (SSMs) have recently shown promising results on small-scale sequence and language modelling tasks.
In this paper, we propose a multi-head state space (MH-SSM) architecture equipped with special gating mechanisms.
As a drop-in replacement for multi-head attention in transformer encoders, this new model significantly outperforms the transformer transducer on the LibriSpeech speech recognition corpus.
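
The MH-SSM summary above describes parallel state-space heads replacing multi-head attention. As a minimal sketch of the underlying mechanics (not the paper's architecture; the head layout, matrices, and shapes here are assumptions for illustration), each head runs an independent linear state-space recurrence over a slice of the features and the outputs are concatenated:

```python
import numpy as np

def ssm_scan(A, B, C, u):
    """One discrete linear state-space recurrence over a sequence.

    x_k = A @ x_{k-1} + B @ u_k,   y_k = C @ x_k
    A: (n, n), B: (n, d), C: (d, n), u: (T, d) -> y: (T, d)
    """
    x = np.zeros(A.shape[0])
    ys = []
    for u_k in u:
        x = A @ x + B @ u_k   # update hidden state
        ys.append(C @ x)      # read out
    return np.stack(ys)

def multi_head_ssm(heads, u):
    # Each head is an (A, B, C) triple applied to its own feature slice;
    # concatenating the outputs loosely mirrors multi-head attention.
    d = u.shape[1] // len(heads)
    outs = [ssm_scan(A, B, C, u[:, i * d:(i + 1) * d])
            for i, (A, B, C) in enumerate(heads)]
    return np.concatenate(outs, axis=1)

rng = np.random.default_rng(1)
d, n, T, H = 4, 3, 5, 2
heads = [(0.9 * np.eye(n),                 # stable state transition
          rng.standard_normal((n, d)),
          rng.standard_normal((d, n))) for _ in range(H)]
y = multi_head_ssm(heads, rng.standard_normal((T, H * d)))
print(y.shape)  # (5, 8)
```

Unlike attention, each step touches only a fixed-size state, so the cost is linear in sequence length, which is the usual motivation for SSM layers.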
arXiv Detail & Related papers (2023-05-21T16:28:57Z)
- The RoyalFlush System of Speech Recognition for M2MeT Challenge [5.863625637354342]
This paper describes our RoyalFlush system for the track of multi-speaker automatic speech recognition (ASR) in the M2MeT challenge.
We adopted the serialized output training (SOT) based multi-speakers ASR system with large-scale simulation data.
Our system got a 12.22% absolute Character Error Rate (CER) reduction on the validation set and 12.11% on the test set.
arXiv Detail & Related papers (2022-02-03T14:38:26Z)
- Self-Gated Memory Recurrent Network for Efficient Scalable HDR Deghosting [59.04604001936661]
We propose a novel recurrent network-based HDR deghosting method for fusing arbitrary length dynamic sequences.
We introduce a new recurrent cell architecture, namely Self-Gated Memory (SGM) cell, that outperforms the standard LSTM cell.
The proposed approach achieves state-of-the-art performance compared to existing HDR deghosting methods quantitatively across three publicly available datasets.
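
The SGM summary mentions a gated recurrent cell that outperforms a standard LSTM. As a generic illustration of the single-gate idea (this is not the paper's actual SGM cell; the gating equations, names, and shapes below are assumptions), one gate can blend a candidate state with the previous state instead of using separate input/forget/output gates:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_memory_step(h_prev, x, Wg, Ug, Wc, Uc):
    """One step of a single-gate recurrent cell (illustrative only).

        g   = sigmoid(Wg x + Ug h_prev)      # self-gate
        c   = tanh(Wc x + Uc h_prev)         # candidate state
        h_t = g * c + (1 - g) * h_prev       # convex blend
    """
    g = sigmoid(Wg @ x + Ug @ h_prev)
    c = np.tanh(Wc @ x + Uc @ h_prev)
    return g * c + (1.0 - g) * h_prev

rng = np.random.default_rng(2)
d, n = 6, 4                                  # input dim, state dim
Wg, Wc = rng.standard_normal((2, n, d))
Ug, Uc = rng.standard_normal((2, n, n))
h = np.zeros(n)
for x in rng.standard_normal((10, d)):       # run over a short sequence
    h = gated_memory_step(h, x, Wg, Ug, Wc, Uc)
print(h.shape)  # (4,)
```

Because the tanh candidate is bounded and the update is a convex combination, the state stays in [-1, 1], so arbitrarily long sequences cannot blow up the hidden state.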
arXiv Detail & Related papers (2021-12-24T12:36:33Z)
- Serialized Multi-Layer Multi-Head Attention for Neural Speaker Embedding [93.16866430882204]
In prior works, frame-level features from one layer are aggregated to form an utterance-level representation.
Inspired by the Transformer network, our proposed method utilizes the hierarchical architecture of stacked self-attention mechanisms.
With more layers stacked, the neural network can learn more discriminative speaker embeddings.
arXiv Detail & Related papers (2021-07-14T05:38:48Z)
- Unsupervised Speaker Adaptation using Attention-based Speaker Memory for End-to-End ASR [61.55606131634891]
We propose an unsupervised speaker adaptation method inspired by the neural Turing machine for end-to-end (E2E) automatic speech recognition (ASR).
The proposed model contains a memory block that holds speaker i-vectors extracted from the training data and reads relevant i-vectors from the memory through an attention mechanism.
We show that M-vectors, which do not require an auxiliary speaker embedding extraction system at test time, achieve similar word error rates (WERs) compared to i-vectors for single-speaker utterances and significantly lower WERs for utterances in which there are speaker changes.
arXiv Detail & Related papers (2020-02-14T18:31:31Z)
- MCSAE: Masked Cross Self-Attentive Encoding for Speaker Embedding [8.942112181408158]
We propose masked cross self-attentive encoding (MCSAE) using ResNet.
It focuses on the features of both high-level and low-level layers.
The experimental results showed an equal error rate of 2.63% and a minimum detection cost function of 0.1453.
arXiv Detail & Related papers (2020-01-28T04:09:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.