StethoLM: Audio Language Model for Cardiopulmonary Analysis Across Clinical Tasks
- URL: http://arxiv.org/abs/2603.00355v1
- Date: Fri, 27 Feb 2026 22:39:23 GMT
- Title: StethoLM: Audio Language Model for Cardiopulmonary Analysis Across Clinical Tasks
- Authors: Yishan Wang, Tsai-Ning Wang, Mathias Funk, Aaqib Saeed
- Abstract summary: We present StethoLM, the first audio-language model specialized for cardiopulmonary auscultation. It is capable of performing instruction-driven clinical tasks across the full spectrum of auscultation analysis. Our work establishes a foundation for instruction-following AI systems in clinical auscultation.
- Score: 14.936669090239548
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Listening to heart and lung sounds - auscultation - is one of the first and most fundamental steps in a clinical examination. Despite being fast and non-invasive, it demands years of experience to interpret subtle audio cues. Recent deep learning methods have made progress in automating cardiopulmonary sound analysis, yet most are restricted to simple classification and offer little clinical interpretability or decision support. We present StethoLM, the first audio-language model specialized for cardiopulmonary auscultation, capable of performing instruction-driven clinical tasks across the full spectrum of auscultation analysis. StethoLM integrates audio encoding with a medical language model backbone and is trained on StethoBench, a comprehensive benchmark comprising 77,027 instruction-response pairs synthesized from 16,125 labeled cardiopulmonary recordings spanning seven clinical task categories: binary classification, detection, reporting, reasoning, differential diagnosis, comparison, and location-based analysis. Through multi-stage training that combines supervised fine-tuning and direct preference optimization, StethoLM achieves substantial gains in performance and robustness on out-of-distribution data. Our work establishes a foundation for instruction-following AI systems in clinical auscultation.
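The abstract's "multi-stage training" combines supervised fine-tuning with direct preference optimization (DPO). The paper does not publish its training code, but the standard DPO objective it refers to can be sketched for a single preference pair; the function name and inputs below are illustrative, not from the paper.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Direct Preference Optimization loss for one preference pair.

    Inputs are the summed log-probabilities of the chosen and rejected
    responses under the policy being trained and a frozen reference model.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response than the rejected one, relative to the reference model.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # Logistic loss on the scaled margin: -log(sigmoid(beta * margin)).
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# A larger margin in favour of the chosen response lowers the loss.
easy = dpo_loss(-10.0, -30.0, -20.0, -20.0)  # policy strongly prefers chosen
hard = dpo_loss(-30.0, -10.0, -20.0, -20.0)  # policy prefers rejected
```

When the policy and reference agree exactly (zero margin), the loss is -log(0.5) = log 2, the usual starting point before preference training shifts the policy.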
Related papers
- ClinDEF: A Dynamic Evaluation Framework for Large Language Models in Clinical Reasoning [58.01333341218153]
We propose ClinDEF, a dynamic framework for assessing clinical reasoning in LLMs through simulated diagnostic dialogues. Our method generates patient cases and facilitates multi-turn interactions between an LLM-based doctor and an automated patient agent. Experiments show that ClinDEF effectively exposes critical clinical reasoning gaps in state-of-the-art LLMs.
arXiv Detail & Related papers (2025-12-29T12:58:58Z) - Language Models as Semantic Teachers: Post-Training Alignment for Medical Audio Understanding [15.79973026677169]
Pre-trained audio models excel at detecting acoustic patterns in auscultation sounds but often fail to grasp their clinical significance. We introduce AcuLa, a framework that instills semantic understanding into any audio encoder by aligning it with a medical language model. Our work demonstrates that this audio-language alignment transforms purely acoustic models into clinically-aware diagnostic tools.
arXiv Detail & Related papers (2025-12-04T14:30:58Z) - Simulating Viva Voce Examinations to Evaluate Clinical Reasoning in Large Language Models [51.91760712805404]
We introduce VivaBench, a benchmark for evaluating sequential clinical reasoning in large language models (LLMs). Our dataset consists of 1762 physician-curated clinical vignettes structured as interactive scenarios that simulate an oral examination in medical training. Our analysis identified several failure modes that mirror common cognitive errors in clinical practice.
arXiv Detail & Related papers (2025-10-11T16:24:35Z) - CaReAQA: A Cardiac and Respiratory Audio Question Answering Model for Open-Ended Diagnostic Reasoning [17.462121203082006]
CaReAQA is an audio-language model that integrates a foundation audio model with the reasoning capabilities of large language models. We introduce CaReSound, a benchmark dataset of annotated medical audio recordings enriched with metadata. Evaluation results show that CaReAQA achieves 86.2% accuracy on open-ended diagnostic reasoning tasks.
arXiv Detail & Related papers (2025-05-02T11:42:46Z) - Dr-LLaVA: Visual Instruction Tuning with Symbolic Clinical Grounding [53.629132242389716]
Vision-Language Models (VLM) can support clinicians by analyzing medical images and engaging in natural language interactions.
VLMs often exhibit "hallucinogenic" behavior, generating textual outputs not grounded in contextual multimodal information.
We propose a new alignment algorithm that uses symbolic representations of clinical reasoning to ground VLMs in medical knowledge.
arXiv Detail & Related papers (2024-05-29T23:19:28Z) - Show from Tell: Audio-Visual Modelling in Clinical Settings [58.88175583465277]
We consider audio-visual modelling in a clinical setting, providing a solution to learn medical representations without human expert annotation.
A simple yet effective multi-modal self-supervised learning framework is proposed for this purpose.
The proposed approach is able to localise anatomical regions of interest during ultrasound imaging, with only speech audio as a reference.
arXiv Detail & Related papers (2023-10-25T08:55:48Z) - Deep CardioSound: An Ensembled Deep Learning Model for Heart Sound
MultiLabelling [5.830356769562823]
This work proposes a deep multilabel learning model that can automatically annotate heart sound recordings with labels from different label groups.
Experiment results show that the proposed method has achieved outstanding performance on the holdout data.
arXiv Detail & Related papers (2022-04-15T11:13:11Z) - Assessing clinical utility of Machine Learning and Artificial Intelligence approaches to analyze speech recordings in Multiple Sclerosis: A Pilot Study [1.6582693134062305]
The aim of this study was to determine the potential clinical utility of machine learning and deep learning/AI approaches for aiding diagnosis, biomarker extraction, and progression monitoring of multiple sclerosis using speech recordings.
The Random Forest model performed best, achieving an accuracy of 0.82 on the validation dataset and an area under the curve of 0.76 across 5 k-fold cycles on the training dataset.
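The area-under-curve figure reported above has a simple interpretation: it is the probability that the classifier ranks a randomly chosen positive case above a randomly chosen negative one. A minimal pure-Python sketch of that pairwise-ranking computation (not the study's code):

```python
def auc(labels, scores):
    """Area under the ROC curve via the rank statistic: the fraction of
    (positive, negative) pairs the classifier orders correctly, with
    ties counting half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.76 thus means roughly three out of four such pairs are ordered correctly, regardless of any decision threshold.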
arXiv Detail & Related papers (2021-09-20T21:02:37Z) - Benchmarking Automated Clinical Language Simplification: Dataset, Algorithm, and Evaluation [48.87254340298189]
We construct a new dataset named MedLane to support the development and evaluation of automated clinical language simplification approaches.
We propose a new model called DECLARE that follows the human annotation procedure and achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-12-04T06:09:02Z) - Noise-Resilient Automatic Interpretation of Holter ECG Recordings [67.59562181136491]
We present a three-stage process for analysing Holter recordings that is robust to noisy signals.
The first stage is a segmentation neural network (NN) with an encoder-decoder architecture that detects the positions of heartbeats.
The second stage is a classification NN that labels each heartbeat as wide or narrow.
The third stage is a gradient-boosted decision tree (GBDT) model built on top of the NN features that also incorporates patient-wise features.
arXiv Detail & Related papers (2020-11-17T16:15:49Z)
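The three-stage Holter design above is a staged composition: segment beats, classify each beat's QRS morphology, then aggregate beat-level features into a patient-level decision. A minimal sketch of that wiring, with the stage models as placeholder callables (the abstract names the stages; the types and helper names here are assumptions):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Beat:
    position: int        # sample index found by the segmentation stage
    qrs_class: str = ""  # "wide" or "narrow" from the classification stage

def holter_pipeline(signal: List[float],
                    segment: Callable[[List[float]], List[int]],
                    classify: Callable[[List[float], int], str],
                    rhythm_model: Callable[[List[Beat]], str]) -> str:
    """Three-stage analysis: detect beat positions, label each beat's
    QRS morphology, then feed the beat sequence to a patient-level
    model (a GBDT in the paper; any callable here)."""
    beats = [Beat(p) for p in segment(signal)]      # stage 1: segmentation
    for b in beats:
        b.qrs_class = classify(signal, b.position)  # stage 2: per-beat class
    return rhythm_model(beats)                      # stage 3: recording-level label
```

Keeping the stages as separate callables mirrors the paper's motivation: each stage can be trained and made noise-robust independently, and the final model sees structured beat features rather than raw signal.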
This list is automatically generated from the titles and abstracts of the papers on this site.