StutterFuse: Mitigating Modality Collapse in Stuttering Detection with Jaccard-Weighted Metric Learning and Gated Fusion
- URL: http://arxiv.org/abs/2512.13632v1
- Date: Mon, 15 Dec 2025 18:28:39 GMT
- Title: StutterFuse: Mitigating Modality Collapse in Stuttering Detection with Jaccard-Weighted Metric Learning and Gated Fusion
- Authors: Guransh Singh, Md Shah Fahad
- Abstract summary: Stuttering detection breaks down when disfluencies overlap. Existing parametric models struggle to distinguish complex, simultaneous disfluencies. We introduce StutterFuse, the first Retrieval-Augmented Classifier (RAC) for multi-label stuttering detection.
- Score: 0.40105987447353786
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stuttering detection breaks down when disfluencies overlap. Existing parametric models struggle to distinguish complex, simultaneous disfluencies (e.g., a 'block' with a 'prolongation') due to the scarcity of these specific combinations in training data. While Retrieval-Augmented Generation (RAG) has revolutionized NLP by grounding models in external knowledge, this paradigm remains unexplored in pathological speech processing. To bridge this gap, we introduce StutterFuse, the first Retrieval-Augmented Classifier (RAC) for multi-label stuttering detection. By conditioning a Conformer encoder on a non-parametric memory bank of clinical examples, we allow the model to classify by reference rather than memorization. We further identify and solve "Modality Collapse", an "Echo Chamber" effect where naive retrieval boosts recall but degrades precision. We mitigate this using: (1) SetCon, a Jaccard-Weighted Metric Learning objective that optimizes for multi-label set similarity, and (2) a Gated Mixture-of-Experts fusion strategy that dynamically arbitrates between acoustic evidence and retrieved context. On the SEP-28k dataset, StutterFuse achieves a weighted F1-score of 0.65, outperforming strong baselines and demonstrating remarkable zero-shot cross-lingual generalization.
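The abstract names two mechanisms without giving formulas: a Jaccard-weighted set similarity over multi-label annotations, and a gate that arbitrates between acoustic evidence and retrieved context. The following is a minimal sketch of both ideas, not the paper's implementation. The function names `jaccard` and `gated_fusion`, the scalar sigmoid gate, and the convention that two empty label sets have similarity 1.0 are all assumptions made for illustration; the actual SetCon loss and the Gated Mixture-of-Experts architecture are not specified in this listing.

```python
import math


def jaccard(labels_a, labels_b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two multi-label sets.

    Assumption: two empty sets (e.g. two fluent clips) count as identical.
    """
    a, b = set(labels_a), set(labels_b)
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def gated_fusion(acoustic_score, retrieved_score, gate_logit):
    """Blend two per-class scores with a sigmoid gate g in (0, 1).

    Hypothetical stand-in for a learned gate: g near 1 trusts the acoustic
    branch, g near 0 trusts the retrieved neighbours.
    """
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    return g * acoustic_score + (1.0 - g) * retrieved_score


# A clip labelled {block, prolongation} partially matches one labelled {block}.
pair_weight = jaccard({"block", "prolongation"}, {"block"})
fused = gated_fusion(acoustic_score=2.0, retrieved_score=0.0, gate_logit=0.0)
```

In a metric-learning setting, a weight such as `pair_weight` could scale how strongly two embeddings are pulled together, so that partially overlapping label sets count as partial positives rather than as strict matches or mismatches.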
Related papers
- ReHear: Iterative Pseudo-Label Refinement for Semi-Supervised Speech Recognition via Audio Large Language Models [12.527207210862151]
ReHear is a framework for iterative pseudo-label refinement in automatic speech recognition. It integrates an instruction-tuned, audio-aware large language model into the self-training loop. We show that ReHear effectively mitigates error propagation, consistently outperforming both supervised and pseudo-labeling baselines.
arXiv Detail & Related papers (2026-02-21T05:04:22Z) - Beyond Raw Detection Scores: Markov-Informed Calibration for Boosting Machine-Generated Text Detection [105.14032334647932]
Machine-generated texts (MGTs) pose risks such as disinformation and phishing, highlighting the need for reliable detection. Metric-based methods, which extract statistically distinguishable features of MGTs, are often more practical than complex model-based methods that are prone to overfitting. We propose a Markov-informed score calibration strategy that models two relationships of context detection scores that may aid calibration.
arXiv Detail & Related papers (2026-02-08T16:06:12Z) - Robust Duality Learning for Unsupervised Visible-Infrared Person Re-Identification [24.24793934981947]
We introduce a new learning paradigm that considers Pseudo-Label Noise (PLN). PLN is characterized by three key challenges: noise overfitting, error accumulation, and noisy cluster correspondence. We propose a novel Robust Duality Learning framework (RoDE) for UVI-ReID to mitigate the effects of noisy pseudo-labels.
arXiv Detail & Related papers (2025-05-05T10:36:52Z) - Robust Learning under Hybrid Noise [24.36707245704713]
We propose a novel unified learning framework called "Feature and Label Recovery" (FLR) to combat the hybrid noise from the perspective of data recovery.
arXiv Detail & Related papers (2024-07-04T16:13:25Z) - Decoupled Prototype Learning for Reliable Test-Time Adaptation [50.779896759106784]
Test-time adaptation (TTA) is a task that continually adapts a pre-trained source model to the target domain during inference.
One popular approach involves fine-tuning the model with cross-entropy loss according to estimated pseudo-labels.
This study reveals that minimizing the classification error of each sample causes the cross-entropy loss's vulnerability to label noise.
We propose a novel Decoupled Prototype Learning (DPL) method that features prototype-centric loss computation.
arXiv Detail & Related papers (2024-01-15T03:33:39Z) - Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation [63.180725016463974]
Cross-modal retrieval relies on well-matched large-scale datasets that are laborious in practice.
We introduce a novel noisy correspondence learning framework, namely Self-Reinforcing Errors Mitigation (SREM).
arXiv Detail & Related papers (2023-12-27T09:03:43Z) - Combating Label Noise With A General Surrogate Model For Sample Selection [77.45468386115306]
We propose to leverage the vision-language surrogate model CLIP to filter noisy samples automatically. We validate the effectiveness of our proposed method on both real-world and synthetic noisy datasets.
arXiv Detail & Related papers (2023-10-16T14:43:27Z) - Learning Prompt-Enhanced Context Features for Weakly-Supervised Video Anomaly Detection [37.99031842449251]
Video anomaly detection under weak supervision presents significant challenges.
We present a weakly supervised anomaly detection framework that focuses on efficient context modeling and enhanced semantic discriminability.
Our approach significantly improves the detection accuracy of certain anomaly sub-classes, underscoring its practical value and efficacy.
arXiv Detail & Related papers (2023-06-26T06:45:16Z) - Mitigating Closed-model Adversarial Examples with Bayesian Neural Modeling for Enhanced End-to-End Speech Recognition [18.83748866242237]
We focus on a rigorous and empirical "closed-model adversarial robustness" setting.
We propose an advanced Bayesian neural network (BNN) based adversarial detector.
We improve detection rate by +2.77 to +5.42% (relative +3.03 to +6.26%) and reduce the word error rate by 5.02 to 7.47% on LibriSpeech datasets.
arXiv Detail & Related papers (2022-02-17T09:17:58Z) - S3: Supervised Self-supervised Learning under Label Noise [53.02249460567745]
In this paper we address the problem of classification in the presence of label noise.
At the heart of our method is a sample selection mechanism that relies on the consistency between the annotated label of a sample and the distribution of the labels in its neighborhood in the feature space.
Our method significantly surpasses previous methods on both CIFAR10 and CIFAR100 with artificial noise and on real-world noisy datasets such as WebVision and ANIMAL-10N.
arXiv Detail & Related papers (2021-11-22T15:49:20Z) - Visualizing Classifier Adjacency Relations: A Case Study in Speaker Verification and Voice Anti-Spoofing [72.4445825335561]
We propose a simple method to derive 2D representation from detection scores produced by an arbitrary set of binary classifiers.
Based upon rank correlations, our method facilitates a visual comparison of classifiers with arbitrary scores.
While the approach is fully versatile and can be applied to any detection task, we demonstrate the method using scores produced by automatic speaker verification and voice anti-spoofing systems.
arXiv Detail & Related papers (2021-06-11T13:03:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.