SpeechCT-CLIP: Distilling Text-Image Knowledge to Speech for Voice-Native Multimodal CT Analysis
- URL: http://arxiv.org/abs/2510.02322v1
- Date: Wed, 24 Sep 2025 15:17:21 GMT
- Title: SpeechCT-CLIP: Distilling Text-Image Knowledge to Speech for Voice-Native Multimodal CT Analysis
- Authors: Lukas Buess, Jan Geier, David Bani-Harouni, Chantal Pellegrini, Matthias Keicher, Paula Andrea Perez-Toro, Nassir Navab, Andreas Maier, Tomas Arias-Vergara
- Abstract summary: We train a contrastive model that aligns speech and 3D CT volumes in a shared representation space. Experiments demonstrate improved zero-shot classification F1 from 0.623 to 0.705, recovering 88% of the performance difference. These findings highlight speech as a practical alternative to text in multimodal pretraining and open the door to voice-driven diagnostic support tools in clinical practice.
- Score: 33.90335501244261
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Spoken communication plays a central role in clinical workflows. In radiology, for example, most reports are created through dictation. Yet, nearly all medical AI systems rely exclusively on written text. In this work, we address this gap by exploring the feasibility of learning visual-language representations directly from spoken radiology reports. Specifically, we synthesize a large-scale dataset (Speech-RATE) of spoken radiology reports and train SpeechCT-CLIP, a contrastive model that aligns speech and 3D CT volumes in a shared representation space. While naive speech-based models underperform compared to text-trained counterparts, we show that knowledge distillation from a pretrained text-image CLIP model effectively transfers semantic alignment capabilities from text to speech, substantially narrowing this gap. Experiments demonstrate improved zero-shot classification F1 from 0.623 to 0.705, recovering 88% of the performance difference, and strong retrieval results without requiring text at inference. These findings highlight speech as a practical alternative to text in multimodal pretraining and open the door to voice-driven diagnostic support tools in clinical practice.
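The recipe in the abstract (CLIP-style contrastive alignment between a speech encoder and a 3D CT encoder, plus knowledge distillation from a frozen text-image CLIP teacher) is compact enough to sketch. The PyTorch fragment below is a minimal illustration, not the authors' released code: the projector stubs, feature dimensions, cosine distillation target, and loss weight are all assumptions.

```python
# Minimal sketch of CLIP-style speech-CT training with text-to-speech
# distillation. Encoder stubs, dimensions, the cosine distillation target,
# and the loss weight are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Projector(nn.Module):
    """Stand-in encoder head mapping precomputed features to the shared space."""
    def __init__(self, in_dim, out_dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU(),
                                 nn.Linear(out_dim, out_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)  # unit-norm CLIP embeddings

def info_nce(speech, ct, temperature=0.07):
    """Symmetric CLIP loss: the i-th speech/CT pair is the only positive."""
    logits = speech @ ct.t() / temperature
    targets = torch.arange(len(logits), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def distill(speech, text_teacher):
    """Pull each speech embedding toward the frozen text-CLIP embedding of
    the same report (cosine distance; other targets would be plausible)."""
    return (1.0 - F.cosine_similarity(speech, text_teacher, dim=-1)).mean()

speech_enc, ct_enc = Projector(768), Projector(512)
opt = torch.optim.AdamW(
    list(speech_enc.parameters()) + list(ct_enc.parameters()), lr=1e-4)

# Toy batch: pooled audio features, pooled 3D CT features, and frozen
# text-teacher embeddings of the same dictated reports.
speech_feats, ct_feats = torch.randn(8, 768), torch.randn(8, 512)
text_teacher = F.normalize(torch.randn(8, 512), dim=-1)

s, c = speech_enc(speech_feats), ct_enc(ct_feats)
loss = info_nce(s, c) + 1.0 * distill(s, text_teacher)  # weight 1.0 is a guess
opt.zero_grad(); loss.backward(); opt.step()
```

At inference, zero-shot classification scores a CT volume against embeddings of spoken prompts exactly as in text CLIP, which is why no text is needed at test time.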
Related papers
- SpeechMedAssist: Efficiently and Effectively Adapting Speech Language Models for Medical Consultation [30.493851883411878]
We propose SpeechMedAssist, a SpeechLM capable of conducting speech-based multi-turn interactions with patients. By exploiting the architectural properties of SpeechLMs, we decouple the conventional one-stage training into a two-stage paradigm. Our model outperforms all baselines in both effectiveness and robustness in most evaluation settings.
arXiv Detail & Related papers (2026-01-08T06:14:58Z)
- MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance [66.74042564585942]
MOSS-Speech is a true speech-to-speech large language model that directly understands and generates speech without relying on text guidance. Our work establishes a new paradigm for expressive and efficient end-to-end speech interaction.
arXiv Detail & Related papers (2025-10-01T04:32:37Z)
- Towards Inclusive Communication: A Unified Framework for Generating Spoken Language from Sign, Lip, and Audio [52.859261069569165]
We propose the first unified framework capable of handling diverse combinations of sign language, lip movements, and audio for spoken-language text generation. We focus on three main objectives: (i) designing a unified, modality-agnostic architecture capable of effectively processing heterogeneous inputs; (ii) exploring the underexamined synergy among modalities, particularly the role of lip movements as non-manual cues in sign language comprehension; and (iii) achieving performance on par with or better than state-of-the-art models specialized for individual tasks.
arXiv Detail & Related papers (2025-08-28T06:51:42Z)
- Seamless Dysfluent Speech Text Alignment for Disordered Speech Analysis [8.5693791544413]
We propose Neural LCS, a novel approach for dysfluent text-text and speech-text alignment. We evaluate our method on a large-scale simulated dataset. Our results demonstrate the potential of Neural LCS to enhance automated systems for diagnosing and analyzing speech disorders.
arXiv Detail & Related papers (2025-06-05T03:06:37Z)
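For intuition about the alignment task in the entry above: Neural LCS is a learned model, but the classical longest-common-subsequence dynamic program it generalizes is easy to state. A hedged Python sketch follows; word-level tokens and this particular backtrace convention are assumptions for illustration.

```python
# Classical LCS alignment between a reference transcript and a dysfluent
# hypothesis. Neural LCS replaces this exact-match dynamic program with a
# learned alignment model, so treat this purely as the baseline intuition.
def lcs_align(ref, hyp):
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if ref[i] == hyp[j] \
                else max(dp[i][j + 1], dp[i + 1][j])
    # Backtrace to recover the aligned (ref_index, hyp_index) pairs.
    pairs, i, j = [], m, n
    while i and j:
        if ref[i - 1] == hyp[j - 1]:
            pairs.append((i - 1, j - 1)); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

ref = "please call stella".split()
hyp = "please please call uh stella".split()
print(lcs_align(ref, hyp))  # [(0, 1), (1, 2), (2, 4)]: repeat and filler stay unaligned
```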
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
- Improving Medical Speech-to-Text Accuracy with Vision-Language Pre-training Model [36.9873998348851]
Speech-To-Text (STT) has the potential to significantly reduce the workload of clinicians who rely on typists to transcribe their voice recordings.
We propose a medical-domain text correction method that modifies the output text of a general STT system.
Our experiments demonstrate that the proposed method offers quantitatively and clinically significant improvements in STT performance in the medical field.
arXiv Detail & Related papers (2023-02-27T08:06:04Z)
- Decoding speech perception from non-invasive brain recordings [48.46819575538446]
We introduce a model trained with contrastive learning to decode self-supervised representations of perceived speech from non-invasive recordings.
Our model can identify, from 3 seconds of MEG signals, the corresponding speech segment with up to 41% accuracy out of more than 1,000 distinct possibilities.
arXiv Detail & Related papers (2022-08-25T10:01:43Z)
- Text-Aware End-to-end Mispronunciation Detection and Diagnosis [17.286013739453796]
Mispronunciation detection and diagnosis (MDD) technology is a key component of computer-assisted pronunciation training (CAPT) systems.
In this paper, we present a gating strategy that assigns more importance to the relevant audio features while suppressing irrelevant text information.
arXiv Detail & Related papers (2022-06-15T04:08:10Z)
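The gating idea in the mispronunciation-detection entry above can be illustrated with a small fusion module. This is a generic sketch of gated audio-text fusion under assumed dimensions and an assumed gate form, not the paper's architecture.

```python
# Generic gated fusion: a sigmoid gate computed from both streams scales the
# text features, so misleading text can be suppressed while audio evidence
# passes through unchanged. Dimensions and gate form are assumptions.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, audio, text):
        g = torch.sigmoid(self.gate(torch.cat([audio, text], dim=-1)))
        return audio + g * text  # g -> 0 silences irrelevant text cues

fusion = GatedFusion()
audio = torch.randn(4, 50, 256)  # per-frame acoustic features
text = torch.randn(4, 50, 256)   # aligned canonical-phoneme embeddings
print(fusion(audio, text).shape)  # torch.Size([4, 50, 256])
```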
- Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
We propose to predict the self-supervised discrete representations learned from an unlabeled speech corpus instead.
When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass.
arXiv Detail & Related papers (2021-07-12T17:40:43Z)
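A common way to obtain the discrete units mentioned in the entry above is k-means quantization of self-supervised speech features; the translation decoder is then trained to predict unit IDs rather than characters. A sketch with stand-in features and an assumed codebook size:

```python
# Discretizing frame-level speech features with k-means. Random stand-in
# features replace e.g. HuBERT activations, and k=50 is an assumed codebook
# size, so this shows the mechanism only, not the paper's exact setup.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
pool = rng.normal(size=(2000, 64))   # corpus of frame features (stand-in)
codebook = KMeans(n_clusters=50, n_init=10, random_state=0).fit(pool)

frames = rng.normal(size=(120, 64))  # one utterance's frame features
units = codebook.predict(frames)     # frame-level discrete unit IDs
dedup = [u for i, u in enumerate(units) if i == 0 or u != units[i - 1]]
print(len(units), len(dedup))        # run-length dedup shortens the target
```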
- Synth2Aug: Cross-domain speaker recognition with TTS synthesized speech [8.465993273653554]
We investigate the use of a multi-speaker Text-To-Speech system to synthesize speech in support of speaker recognition.
We observe on our datasets that TTS synthesized speech improves cross-domain speaker recognition performance.
We also explore the effectiveness of different types of text transcripts used for TTS synthesis.
arXiv Detail & Related papers (2020-11-24T00:48:54Z)
- Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS).
A multi-speaker TTS model can learn from the untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
arXiv Detail & Related papers (2020-05-16T15:47:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.