Exploring Speaker Diarization with Mixture of Experts
- URL: http://arxiv.org/abs/2506.14750v1
- Date: Tue, 17 Jun 2025 17:42:54 GMT
- Title: Exploring Speaker Diarization with Mixture of Experts
- Authors: Gaobin Yang, Maokui He, Shutong Niu, Ruoyu Wang, Hang Chen, Jun Du,
- Abstract summary: We propose a novel neural speaker diarization system using memory-aware multi-speaker embedding with a sequence-to-sequence architecture. The proposed methods achieve state-of-the-art results, showcasing their effectiveness in challenging real-world scenarios.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a novel neural speaker diarization system using memory-aware multi-speaker embedding with sequence-to-sequence architecture (NSD-MS2S), which integrates a memory-aware multi-speaker embedding module with a sequence-to-sequence architecture. The system leverages a memory module to enhance speaker embeddings and employs a Seq2Seq framework to efficiently map acoustic features to speaker labels. Additionally, we explore the application of mixture of experts in speaker diarization, and introduce a Shared and Soft Mixture of Experts (SS-MoE) module to further mitigate model bias and enhance performance. Incorporating SS-MoE leads to the extended model NSD-MS2S-SSMoE. Experiments on multiple complex acoustic datasets, including CHiME-6, DiPCo, Mixer 6 and DIHARD-III evaluation sets, demonstrate meaningful improvements in robustness and generalization. The proposed methods achieve state-of-the-art results, showcasing their effectiveness in challenging real-world scenarios.
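The abstract does not spell out the SS-MoE computation. As a rough sketch only, a "shared and soft" mixture-of-experts layer can be read as one always-active shared expert plus a bank of experts combined with soft (non-top-k) gate weights; all names, shapes, and the residual combination rule below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class SharedSoftMoE:
    """Hypothetical sketch of a shared + soft MoE layer.

    Every frame passes through the shared expert; the routed experts are
    combined with softmax gate weights (soft routing, no expert pruning).
    """
    def __init__(self, d_model, n_experts):
        self.gate = rng.standard_normal((d_model, n_experts)) * 0.02
        self.shared = rng.standard_normal((d_model, d_model)) * 0.02
        self.experts = [rng.standard_normal((d_model, d_model)) * 0.02
                        for _ in range(n_experts)]

    def __call__(self, x):                   # x: (T, d_model) acoustic frames
        w = softmax(x @ self.gate)           # (T, E) soft gate weights, rows sum to 1
        routed = np.stack([x @ We for We in self.experts], axis=-1)  # (T, d, E)
        mixed = (routed * w[:, None, :]).sum(-1)    # gate-weighted sum over experts
        return x + x @ self.shared + mixed          # residual + shared + routed experts

layer = SharedSoftMoE(d_model=16, n_experts=4)
frames = rng.standard_normal((10, 16))       # 10 acoustic frames
out = layer(frames)
print(out.shape)                             # (10, 16)
```

Soft routing keeps every expert active for every frame, which is one plausible way an MoE can reduce model bias without the load-balancing issues of hard top-k routing.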
Related papers
- A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations [24.302280709646563]
We propose a modular Mixture-of-Experts for Recognition of Emotions (MiSTER-E) framework to decouple two core challenges in Emotion Recognition in Conversations (ERC). MiSTER-E leverages large language models (LLMs) fine-tuned for both speech and text to provide rich utterance-level embeddings. The system integrates predictions from three experts (speech-only, text-only, and cross-modal) using a learned gating mechanism.
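The gating step described above can be sketched as a convex combination of the three experts' class distributions; the function names, shapes, and the source of the gate logits here are illustrative assumptions, not MiSTER-E's actual API:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gated_fusion(p_speech, p_text, p_cross, gate_logits):
    """Combine per-expert emotion distributions with learned gate weights.

    In a real system, gate_logits would come from a small learned network
    over the utterance embedding; here they are supplied directly.
    """
    w = softmax(gate_logits)                          # one weight per expert
    experts = np.stack([p_speech, p_text, p_cross])   # (3, n_classes)
    return w @ experts                                # gate-weighted mixture

fused = gated_fusion(
    np.array([0.7, 0.2, 0.1]),   # speech-only expert
    np.array([0.5, 0.3, 0.2]),   # text-only expert
    np.array([0.6, 0.3, 0.1]),   # cross-modal expert
    gate_logits=np.array([0.2, 0.1, 0.5]),
)
print(fused)   # still a valid probability distribution
```

Because the gate weights are a softmax and each expert output is a probability vector, the fused output remains a valid distribution by construction.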
arXiv Detail & Related papers (2026-02-26T18:08:40Z)
- Unifying Diarization, Separation, and ASR with Multi-Speaker Encoder [53.00939565103065]
We present a novel architecture that jointly learns representations for speaker diarization (SD), speech separation (SS), and multi-speaker automatic speech recognition (ASR) tasks. We leverage the hidden representations from multiple layers of UME as a residual weighted-sum encoding (RWSE) to effectively use information from different semantic levels. This joint training approach captures the inherent interdependencies among the tasks, enhancing overall performance on overlapping speech data.
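A weighted sum over per-layer hidden states, as described above, can be sketched as follows; the exact residual and normalization details of the paper's RWSE are not given here, so this is a generic layer-weighting illustration with assumed shapes:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def rwse(layer_hiddens, layer_logits):
    """Weighted-sum encoding over encoder layers (sketch).

    layer_hiddens: list of (T, d) hidden states, one per encoder layer;
    layer_logits: one learnable scalar per layer, softmax-normalized.
    """
    w = softmax(layer_logits)                      # per-layer weights
    stacked = np.stack(layer_hiddens)              # (L, T, d)
    return (w[:, None, None] * stacked).sum(0)     # (T, d) fused encoding

rng = np.random.default_rng(0)
hiddens = [rng.standard_normal((5, 8)) for _ in range(4)]   # 4 layers, 5 frames
fused = rwse(hiddens, layer_logits=np.zeros(4))
print(fused.shape)   # (5, 8)
```

With all-zero logits the weights are uniform and the encoding reduces to the mean over layers; training the logits lets the model emphasize whichever semantic level each downstream task needs.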
arXiv Detail & Related papers (2025-08-28T06:50:57Z)
- Towards Robust Overlapping Speech Detection: A Speaker-Aware Progressive Approach Using WavLM [53.17360668423001]
Overlapping Speech Detection (OSD) aims to identify regions where multiple speakers overlap in a conversation. This work proposes a speaker-aware progressive OSD model that leverages a progressive training strategy to enhance the correlation between subtasks. Experimental results show that the proposed method achieves state-of-the-art performance, with an F1 score of 82.76% on the AMI test set.
arXiv Detail & Related papers (2025-05-29T07:47:48Z)
- MoHAVE: Mixture of Hierarchical Audio-Visual Experts for Robust Speech Recognition [23.406334722946163]
MoHAVE (Mixture of Hierarchical Audio-Visual Experts) is a novel robust AVSR framework designed to address scalability constraints. MoHAVE activates modality-specific expert groups, ensuring dynamic adaptation to various audio-visual inputs with minimal computational overhead.
arXiv Detail & Related papers (2025-02-11T11:01:05Z)
- MSA-ASR: Efficient Multilingual Speaker Attribution with frozen ASR Models [59.80042864360884]
Speaker-attributed automatic speech recognition (SA-ASR) aims to transcribe speech while assigning transcripts to the corresponding speakers accurately. This paper introduces a novel approach, leveraging a frozen multilingual ASR model to incorporate speaker attribution into the transcriptions.
arXiv Detail & Related papers (2024-11-27T09:01:08Z)
- Disentangling Voice and Content with Self-Supervision for Speaker Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated with experiments on the VoxCeleb and SITW datasets, yielding average reductions of 9.56% in EER and 8.24% in minDCF.
arXiv Detail & Related papers (2023-10-02T12:02:07Z)
- Neural Speaker Diarization Using Memory-Aware Multi-Speaker Embedding with Sequence-to-Sequence Architecture [45.476602010520764]
We propose a novel neural speaker diarization system using memory-aware multi-speaker embedding and sequence-to-sequence architecture.
NSD-MS2S achieved a macro diarization error rate (DER) of 15.9% on the CHiME-7 EVAL set.
arXiv Detail & Related papers (2023-09-17T07:08:06Z)
- High-resolution embedding extractor for speaker diarisation [15.392429990363492]
This study proposes a novel embedding extractor architecture, referred to as a high-resolution embedding extractor (HEE).
HEE consists of a feature-map extractor and an enhancer, where the enhancer with the self-attention mechanism is the key to success.
Through experiments on five evaluation sets, including four public datasets, the proposed HEE demonstrates at least 10% improvement on each evaluation set.
arXiv Detail & Related papers (2022-11-08T07:41:18Z)
- The USTC-Ximalaya system for the ICASSP 2022 multi-channel multi-party meeting transcription (M2MeT) challenge [43.262531688434215]
We propose two improvements to target-speaker voice activity detection (TS-VAD). These techniques are designed to handle multi-speaker conversations in real-world meeting scenarios with high speaker-overlap ratios and under heavy reverberant and noisy conditions.
arXiv Detail & Related papers (2022-02-10T06:06:48Z)
- Serialized Multi-Layer Multi-Head Attention for Neural Speaker Embedding [93.16866430882204]
In prior works, frame-level features from one layer are aggregated to form an utterance-level representation.
Inspired by the Transformer network, our proposed method utilizes the hierarchical architecture of stacked self-attention mechanisms.
With more layers stacked, the neural network can learn more discriminative speaker embeddings.
arXiv Detail & Related papers (2021-07-14T05:38:48Z)
- Improving Multi-Scale Aggregation Using Feature Pyramid Module for Robust Speaker Verification of Variable-Duration Utterances [15.887661651035712]
We propose a module that enhances speaker-discriminative information of features from multiple layers via a top-down pathway and lateral connections.
It achieves better performance than state-of-the-art approaches for both short and long utterances.
arXiv Detail & Related papers (2020-04-07T08:35:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.