Exploring Speaker Diarization with Mixture of Experts
- URL: http://arxiv.org/abs/2506.14750v1
- Date: Tue, 17 Jun 2025 17:42:54 GMT
- Title: Exploring Speaker Diarization with Mixture of Experts
- Authors: Gaobin Yang, Maokui He, Shutong Niu, Ruoyu Wang, Hang Chen, Jun Du,
- Abstract summary: We propose a novel neural speaker diarization system using memory-aware multi-speaker embedding with a sequence-to-sequence architecture. The proposed methods achieve state-of-the-art results, showcasing their effectiveness in challenging real-world scenarios.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a novel neural speaker diarization system using memory-aware multi-speaker embedding with sequence-to-sequence architecture (NSD-MS2S), which integrates a memory-aware multi-speaker embedding module with a sequence-to-sequence architecture. The system leverages a memory module to enhance speaker embeddings and employs a Seq2Seq framework to efficiently map acoustic features to speaker labels. Additionally, we explore the application of mixture of experts in speaker diarization, and introduce a Shared and Soft Mixture of Experts (SS-MoE) module to further mitigate model bias and enhance performance. Incorporating SS-MoE leads to the extended model NSD-MS2S-SSMoE. Experiments on multiple complex acoustic datasets, including CHiME-6, DiPCo, Mixer 6 and DIHARD-III evaluation sets, demonstrate meaningful improvements in robustness and generalization. The proposed methods achieve state-of-the-art results, showcasing their effectiveness in challenging real-world scenarios.
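The abstract does not spell out the SS-MoE computation. As a rough sketch only, a "shared and soft" mixture-of-experts layer can be read as one always-active shared expert plus a bank of experts combined with soft (non-top-k) gate weights; all names, shapes, and the residual combination rule below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class SharedSoftMoE:
    """Hypothetical sketch of a shared + soft MoE layer.

    Every frame passes through the shared expert; the routed experts are
    combined with softmax gate weights (soft routing, no expert pruning).
    """
    def __init__(self, d_model, n_experts):
        self.gate = rng.standard_normal((d_model, n_experts)) * 0.02
        self.shared = rng.standard_normal((d_model, d_model)) * 0.02
        self.experts = [rng.standard_normal((d_model, d_model)) * 0.02
                        for _ in range(n_experts)]

    def __call__(self, x):                   # x: (T, d_model) acoustic frames
        w = softmax(x @ self.gate)           # (T, E) soft gate weights, rows sum to 1
        routed = np.stack([x @ We for We in self.experts], axis=-1)  # (T, d, E)
        mixed = (routed * w[:, None, :]).sum(-1)    # gate-weighted sum over experts
        return x + x @ self.shared + mixed          # residual + shared + routed experts

layer = SharedSoftMoE(d_model=16, n_experts=4)
frames = rng.standard_normal((10, 16))       # 10 acoustic frames
out = layer(frames)
print(out.shape)                             # (10, 16)
```

Soft routing keeps every expert active for every frame, which is one plausible way an MoE can reduce model bias without the load-balancing issues of hard top-k routing.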
Related papers
- A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations [24.302280709646563]
We propose a modular Mixture-of-Experts for Recognition of Emotions (MiSTER-E) framework to decouple two core challenges in Emotion Recognition in Conversations (ERC). MiSTER-E leverages large language models (LLMs) fine-tuned for both speech and text to provide rich utterance-level embeddings. The system integrates predictions from three experts (speech-only, text-only, and cross-modal) using a learned gating mechanism.
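The gating step described above can be sketched as a convex combination of the three experts' class distributions; the function names, shapes, and the source of the gate logits here are illustrative assumptions, not MiSTER-E's actual API:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gated_fusion(p_speech, p_text, p_cross, gate_logits):
    """Combine per-expert emotion distributions with learned gate weights.

    In a real system, gate_logits would come from a small learned network
    over the utterance embedding; here they are supplied directly.
    """
    w = softmax(gate_logits)                          # one weight per expert
    experts = np.stack([p_speech, p_text, p_cross])   # (3, n_classes)
    return w @ experts                                # gate-weighted mixture

fused = gated_fusion(
    np.array([0.7, 0.2, 0.1]),   # speech-only expert
    np.array([0.5, 0.3, 0.2]),   # text-only expert
    np.array([0.6, 0.3, 0.1]),   # cross-modal expert
    gate_logits=np.array([0.2, 0.1, 0.5]),
)
print(fused)   # still a valid probability distribution
```

Because the gate weights are a softmax and each expert output is a probability vector, the fused output remains a valid distribution by construction.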
arXiv Detail & Related papers (2026-02-26T18:08:40Z)
- Unifying Diarization, Separation, and ASR with Multi-Speaker Encoder [53.00939565103065]
We present a novel architecture that jointly learns representations for speaker diarization (SD), speech separation (SS), and multi-speaker automatic speech recognition (ASR) tasks. We leverage the hidden representations from multiple layers of UME as a residual weighted-sum encoding (RWSE) to effectively use information from different semantic levels. This joint training approach captures the inherent interdependencies among the tasks, enhancing overall performance on overlapping speech data.
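A weighted sum over per-layer hidden states, as described above, can be sketched as follows; the exact residual and normalization details of the paper's RWSE are not given here, so this is a generic layer-weighting illustration with assumed shapes:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def rwse(layer_hiddens, layer_logits):
    """Weighted-sum encoding over encoder layers (sketch).

    layer_hiddens: list of (T, d) hidden states, one per encoder layer;
    layer_logits: one learnable scalar per layer, softmax-normalized.
    """
    w = softmax(layer_logits)                      # per-layer weights
    stacked = np.stack(layer_hiddens)              # (L, T, d)
    return (w[:, None, None] * stacked).sum(0)     # (T, d) fused encoding

rng = np.random.default_rng(0)
hiddens = [rng.standard_normal((5, 8)) for _ in range(4)]   # 4 layers, 5 frames
fused = rwse(hiddens, layer_logits=np.zeros(4))
print(fused.shape)   # (5, 8)
```

With all-zero logits the weights are uniform and the encoding reduces to the mean over layers; training the logits lets the model emphasize whichever semantic level each downstream task needs.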
arXiv Detail & Related papers (2025-08-28T06:50:57Z)
- Towards Robust Overlapping Speech Detection: A Speaker-Aware Progressive Approach Using WavLM [53.17360668423001]
Overlapping Speech Detection (OSD) aims to identify regions where multiple speakers overlap in a conversation. This work proposes a speaker-aware progressive OSD model that leverages a progressive training strategy to enhance the correlation between subtasks. Experimental results show that the proposed method achieves state-of-the-art performance, with an F1 score of 82.76% on the AMI test set.
arXiv Detail & Related papers (2025-05-29T07:47:48Z)
- MoHAVE: Mixture of Hierarchical Audio-Visual Experts for Robust Speech Recognition [23.406334722946163]
MoHAVE (Mixture of Hierarchical Audio-Visual Experts) is a novel robust AVSR framework designed to address scalability constraints. MoHAVE activates modality-specific expert groups, ensuring dynamic adaptation to various audio-visual inputs with minimal computational overhead.
arXiv Detail & Related papers (2025-02-11T11:01:05Z)
- MSA-ASR: Efficient Multilingual Speaker Attribution with frozen ASR Models [59.80042864360884]
Speaker-attributed automatic speech recognition (SA-ASR) aims to transcribe speech while assigning transcripts to the corresponding speakers accurately. This paper introduces a novel approach, leveraging a frozen multilingual ASR model to incorporate speaker attribution into the transcriptions.
arXiv Detail & Related papers (2024-11-27T09:01:08Z)
- Disentangling Voice and Content with Self-Supervision for Speaker Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated with experiments on the VoxCeleb and SITW datasets, yielding average reductions of 9.56% in EER and 8.24% in minDCF.
arXiv Detail & Related papers (2023-10-02T12:02:07Z)
- Neural Speaker Diarization Using Memory-Aware Multi-Speaker Embedding with Sequence-to-Sequence Architecture [45.476602010520764]
We propose a novel neural speaker diarization system using memory-aware multi-speaker embedding and sequence-to-sequence architecture.
NSD-MS2S achieved a macro diarization error rate (DER) of 15.9% on the CHiME-7 EVAL set.
arXiv Detail & Related papers (2023-09-17T07:08:06Z)
- High-resolution embedding extractor for speaker diarisation [15.392429990363492]
This study proposes a novel embedding extractor architecture, referred to as a high-resolution embedding extractor (HEE).
HEE consists of a feature-map extractor and an enhancer, where the enhancer with the self-attention mechanism is the key to success.
Through experiments on five evaluation sets, including four public datasets, the proposed HEE demonstrates at least 10% improvement on each evaluation set.
arXiv Detail & Related papers (2022-11-08T07:41:18Z)
- The USTC-Ximalaya system for the ICASSP 2022 multi-channel multi-party meeting transcription (M2MeT) challenge [43.262531688434215]
We propose two improvements to target-speaker voice activity detection (TS-VAD). These techniques are designed to handle multi-speaker conversations in real-world meeting scenarios with high speaker-overlap ratios and under heavy reverberant and noisy conditions.
arXiv Detail & Related papers (2022-02-10T06:06:48Z)
- Serialized Multi-Layer Multi-Head Attention for Neural Speaker Embedding [93.16866430882204]
In prior works, frame-level features from one layer are aggregated to form an utterance-level representation.
Inspired by the Transformer network, our proposed method utilizes the hierarchical architecture of stacked self-attention mechanisms.
With more layers stacked, the neural network can learn more discriminative speaker embeddings.
arXiv Detail & Related papers (2021-07-14T05:38:48Z)
- Improving Multi-Scale Aggregation Using Feature Pyramid Module for Robust Speaker Verification of Variable-Duration Utterances [15.887661651035712]
We propose a module that enhances speaker-discriminative information of features from multiple layers via a top-down pathway and lateral connections.
It achieves better performance than state-of-the-art approaches for both short and long utterances.
arXiv Detail & Related papers (2020-04-07T08:35:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.