Related papers: Semi-Supervised Diseased Detection from Speech Dialogues with Multi-Level Data Modeling

URL: http://arxiv.org/abs/2601.04744v1
Date: Thu, 08 Jan 2026 09:10:16 GMT
Title: Semi-Supervised Diseased Detection from Speech Dialogues with Multi-Level Data Modeling
Authors: Xingyuan Li, Mengyue Wu
Abstract summary: We propose a novel framework for learning to detect medical conditions from speech acoustics. Our end-to-end approach dynamically aggregates multi-granularity features and generates high-quality pseudo-labels. This work provides a principled approach to learning from weak, far-end supervision in medical speech analysis.
Score: 27.224093715611534
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Detecting medical conditions from speech acoustics is fundamentally a weakly-supervised learning problem: a single, often noisy, session-level label must be linked to nuanced patterns within a long, complex audio recording. This task is further hampered by severe data scarcity and the subjective nature of clinical annotations. While semi-supervised learning (SSL) offers a viable path to leverage unlabeled data, existing audio methods often fail to address the core challenge that pathological traits are not uniformly expressed in a patient's speech. We propose a novel, audio-only SSL framework that explicitly models this hierarchy by jointly learning from frame-level, segment-level, and session-level representations within unsegmented clinical dialogues. Our end-to-end approach dynamically aggregates these multi-granularity features and generates high-quality pseudo-labels to efficiently utilize unlabeled data. Extensive experiments show the framework is model-agnostic, robust across languages and conditions, and highly data-efficient, achieving, for instance, 90% of fully-supervised performance using only 11 labeled samples. This work provides a principled approach to learning from weak, far-end supervision in medical speech analysis.

Related papers

Resp-Agent: An Agent-Based System for Multimodal Respiratory Sound Generation and Disease Diagnosis [14.922065513695294]
Resp-Agent is an autonomous multimodal system orchestrated by a novel Active Adversarial Curriculum Agent (Thinker-A$^2$CA). To address the representation gap, we introduce a Modality-Weaving Diagnoser that weaves EHR data with audio tokens via Strategic Global Attention. To address the data gap, we design a Flow Matching Generator that adapts a text-only Large Language Model (LLM) via modality injection.
arXiv Detail & Related papers (2026-02-16T14:48:24Z)
Language Models as Semantic Teachers: Post-Training Alignment for Medical Audio Understanding [15.79973026677169]
Pre-trained audio models excel at detecting acoustic patterns in auscultation sounds but often fail to grasp their clinical significance. We introduce AcuLa, a framework that instills semantic understanding into any audio encoder by aligning it with a medical language model. Our work demonstrates that this audio-language alignment transforms purely acoustic models into clinically-aware diagnostic tools.
arXiv Detail & Related papers (2025-12-04T14:30:58Z)
Hierarchical Self-Supervised Representation Learning for Depression Detection from Speech [51.14752758616364]
Speech-based depression detection (SDD) is a promising, non-invasive alternative to traditional clinical assessments. We propose HAREN-CTC, a novel architecture that integrates multi-layer SSL features using cross-attention within a multitask learning framework. The model achieves state-of-the-art macro F1-scores of 0.81 on DAIC-WOZ and 0.82 on MODMA, outperforming prior methods across both evaluation scenarios.
arXiv Detail & Related papers (2025-10-05T09:32:12Z)
An Effective Strategy for Modeling Score Ordinality and Non-uniform Intervals in Automated Speaking Assessment [14.003981407136072]
Self-supervised learning representations capture rich acoustic and linguistic patterns in non-native speech without underlying assumptions of feature curation. Most prior work treats proficiency levels as nominal classes, ignoring their ordinal structure and the non-uniform intervals between proficiency labels. We propose an effective ASA approach combining SSL with handcrafted indicator features via a novel modeling paradigm.
arXiv Detail & Related papers (2025-08-27T09:18:51Z)
Semi-Supervised Cognitive State Classification from Speech with Multi-View Pseudo-Labeling [21.82879779173242]
The lack of labeled data is a common challenge in speech classification tasks. We propose a Semi-Supervised Learning (SSL) framework, introducing a novel multi-view pseudo-labeling method. We evaluate our SSL framework on emotion recognition and dementia detection tasks.
arXiv Detail & Related papers (2024-09-25T13:51:19Z)
Show from Tell: Audio-Visual Modelling in Clinical Settings [58.88175583465277]
We consider audio-visual modelling in a clinical setting, providing a solution to learn medical representations without human expert annotation. A simple yet effective multi-modal self-supervised learning framework is proposed for this purpose. The proposed approach is able to localise anatomical regions of interest during ultrasound imaging, with only speech audio as a reference.
arXiv Detail & Related papers (2023-10-25T08:55:48Z)
Combating Label Noise With A General Surrogate Model For Sample Selection [77.45468386115306]
We propose to leverage the vision-language surrogate model CLIP to filter noisy samples automatically. We validate the effectiveness of our proposed method on both real-world and synthetic noisy datasets.
arXiv Detail & Related papers (2023-10-16T14:43:27Z)
Improving Multiple Sclerosis Lesion Segmentation Across Clinical Sites: A Federated Learning Approach with Noise-Resilient Training [75.40980802817349]
Deep learning models have shown promise for automatically segmenting MS lesions, but the scarcity of accurately annotated data hinders progress in this area. We introduce a Decoupled Hard Label Correction (DHLC) strategy that considers the imbalanced distribution and fuzzy boundaries of MS lesions. We also introduce a Centrally Enhanced Label Correction (CELC) strategy, which leverages the aggregated central model as a correction teacher for all sites.
arXiv Detail & Related papers (2023-08-31T00:36:10Z)
Robust Medical Image Classification from Noisy Labeled Data with Global and Local Representation Guided Co-training [73.60883490436956]
We propose a novel collaborative training paradigm with global and local representation learning for robust medical image classification. We employ the self-ensemble model with a noisy label filter to efficiently select the clean and noisy samples. We also design a novel global and local representation learning scheme to implicitly regularize the networks to utilize noisy samples.
arXiv Detail & Related papers (2022-05-10T07:50:08Z)
Speech Detection For Child-Clinician Conversations In Danish For Low-Resource In-The-Wild Conditions: A Case Study [6.4461798613033405]
We study the performance of a pre-trained speech model on a dataset comprising child-clinician conversations in Danish. We learned that the model with the default classification threshold performs worse on children from the patient group. Our study on few-instance adaptation shows that three minutes of clinician-child conversation is sufficient to obtain the optimum classification threshold.
arXiv Detail & Related papers (2022-04-25T10:51:54Z)
Integrating end-to-end neural and clustering-based diarization: Getting the best of both worlds [71.36164750147827]
Clustering-based approaches assign speaker labels to speech regions by clustering speaker embeddings such as x-vectors. End-to-end neural diarization (EEND) directly predicts diarization labels using a neural network. We propose a simple but effective hybrid diarization framework that works with overlapped speech and for long recordings containing an arbitrary number of speakers.
arXiv Detail & Related papers (2020-10-26T06:33:02Z)

This list is automatically generated from the titles and abstracts of the papers on this site.