Foundation Model Hidden Representations for Heart Rate Estimation from Auscultation
- URL: http://arxiv.org/abs/2505.20745v2
- Date: Thu, 29 May 2025 17:51:17 GMT
- Title: Foundation Model Hidden Representations for Heart Rate Estimation from Auscultation
- Authors: Jingping Nie, Dung T. Tran, Karan Thakkar, Vasudha Kowtha, Jon Huang, Carlos Avendano, Erdrin Azemi, Vikramjit Mitra
- Abstract summary: Auscultation, particularly heart sound, is a non-invasive technique that provides essential vital sign information. Recently, self-supervised acoustic representation foundation models (FMs) have been proposed to offer insights into acoustics-based vital signs.
- Score: 3.1379239557375223
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Auscultation, particularly heart sound, is a non-invasive technique that provides essential vital sign information. Recently, self-supervised acoustic representation foundation models (FMs) have been proposed to offer insights into acoustics-based vital signs. However, there has been little exploration of the extent to which auscultation is encoded in these pre-trained FM representations. In this work, using a publicly available phonocardiogram (PCG) dataset and a heart rate (HR) estimation model, we conduct a layer-wise investigation of six acoustic representation FMs: HuBERT, wav2vec2, wavLM, Whisper, Contrastive Language-Audio Pretraining (CLAP), and an in-house CLAP model. Additionally, we implement the baseline method from Nie et al., 2024 (which relies on acoustic features) and show that overall, representation vectors from pre-trained FMs offer performance comparable to the baseline. Notably, HR estimation using the representations from the audio encoder of the in-house CLAP model outperforms the baseline, achieving a lower mean absolute error (MAE) across various train/validation/test splits despite the domain mismatch.
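To make the layer-wise probing setup concrete, below is a minimal sketch. The backbone choice (HuggingFace `facebook/wav2vec2-base`), mean pooling over frames, the ridge-regression probe, and the synthetic placeholder clips and HR labels are all illustrative assumptions; the paper's actual probe head, dataset handling, and train/validation/test protocol may differ.

```python
# A minimal layer-wise probing sketch. Assumptions (not from the paper):
# a HuggingFace wav2vec2 backbone ("facebook/wav2vec2-base"), mean pooling
# over frames, a ridge-regression probe, and synthetic placeholder PCG clips
# with made-up HR labels.
import numpy as np
import torch
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from transformers import AutoFeatureExtractor, Wav2Vec2Model

extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

def layerwise_embeddings(waveform: np.ndarray, sr: int = 16000):
    """One mean-pooled vector per hidden layer for a mono 16 kHz waveform."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        out = model(inputs.input_values, output_hidden_states=True)
    # out.hidden_states: tuple of (num_layers + 1) tensors [1, frames, dim]
    return [h.mean(dim=1).squeeze(0).numpy() for h in out.hidden_states]

# Placeholder data: random 5 s "PCG" clips with hypothetical HR labels (bpm).
rng = np.random.default_rng(0)
clips = [rng.standard_normal(16000 * 5).astype(np.float32) for _ in range(40)]
hr_bpm = rng.uniform(50, 120, size=40)

feats = [layerwise_embeddings(c) for c in clips]  # indexed [clip][layer]
split = 30  # simple train/test split for illustration
for layer in range(len(feats[0])):
    X = np.stack([f[layer] for f in feats])
    probe = Ridge(alpha=1.0).fit(X[:split], hr_bpm[:split])
    mae = mean_absolute_error(hr_bpm[split:], probe.predict(X[split:]))
    print(f"layer {layer:2d}: MAE = {mae:.1f} bpm")
```

Reporting MAE per layer, as above, is what makes the investigation "layer-wise": different encoder depths can encode HR-relevant structure to different degrees.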
Related papers
- $C^2$AV-TSE: Context and Confidence-aware Audio Visual Target Speaker Extraction [80.57232374640911]
We propose a model-agnostic strategy called Mask-And-Recover (MAR).
MAR integrates both inter- and intra-modality contextual correlations to enable global inference within extraction modules.
To better target challenging parts within each sample, we introduce a Fine-grained Confidence Score (FCS) model.
arXiv Detail & Related papers (2025-04-01T13:01:30Z)
- On the Utility of Speech and Audio Foundation Models for Marmoset Call Analysis [19.205671029694074]
This study assesses feature representations derived from speech and general audio domains, across pre-training bandwidths of 4, 8, and 16 kHz for marmoset call-type and caller classification tasks.
Results show that models with higher bandwidth improve performance, and pre-training on speech or general audio yields comparable results, improving over a spectral baseline.
arXiv Detail & Related papers (2024-07-23T12:00:44Z)
- RepAugment: Input-Agnostic Representation-Level Augmentation for Respiratory Sound Classification [2.812716452984433]
This paper explores the efficacy of pretrained speech models for respiratory sound classification.
We find that there is a characterization gap between speech and lung sound samples, and to bridge this gap, data augmentation is essential.
We propose RepAugment, an input-agnostic representation-level augmentation technique that outperforms SpecAugment.
arXiv Detail & Related papers (2024-05-05T16:45:46Z)
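For intuition, here is a hedged sketch of representation-level augmentation in the spirit of RepAugment: it perturbs pretrained-encoder hidden states directly, rather than the input spectrogram. The dimension-masking and Gaussian-noise scheme and the `mask_frac` and `noise_std` parameters are illustrative assumptions, not the paper's exact recipe.

```python
# A hedged sketch of representation-level augmentation: perturb pretrained
# hidden states instead of the input spectrogram (hence "input-agnostic").
# The specific masking/noise scheme here is an assumption for illustration.
import torch

def augment_representation(reps: torch.Tensor,
                           mask_frac: float = 0.15,
                           noise_std: float = 0.1) -> torch.Tensor:
    """reps: [batch, frames, dim] hidden states from a pretrained encoder."""
    out = reps.clone()
    b, _, d = out.shape
    # Zero out a random fraction of feature dimensions per example; no
    # waveform or spectrogram assumptions are needed at this level.
    dim_mask = torch.rand(b, 1, d, device=out.device) < mask_frac
    out = out.masked_fill(dim_mask, 0.0)
    # Add small Gaussian noise to the remaining activations.
    return out + noise_std * torch.randn_like(out)

reps = torch.randn(8, 100, 768)  # e.g., wav2vec2-style hidden states
print(augment_representation(reps).shape)  # torch.Size([8, 100, 768])
```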
- Impact of Noisy Supervision in Foundation Model Learning [91.56591923244943]
This paper is the first work to comprehensively understand and analyze the nature of noise in pre-training datasets.
We propose a tuning method (NMTune) that applies an affine transformation to the feature space to mitigate the malignant effect of noise and improve generalization.
arXiv Detail & Related papers (2024-03-11T16:22:41Z)
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
The non-autoregressive framework enhances controllability, and the duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- Pre-trained Model Representations and their Robustness against Noise for Speech Emotion Analysis [6.382013662443799]
We used multi-modal fusion representations from pre-trained models to achieve state-of-the-art speech emotion estimation.
We discovered that lexical representations are more robust to distortions compared to acoustic representations.
arXiv Detail & Related papers (2023-03-03T18:22:32Z)
- Ontology-aware Learning and Evaluation for Audio Tagging [56.59107110017436]
The mean average precision (mAP) metric treats different kinds of sound as independent classes without considering their relations.
Ontology-aware mean average precision (OmAP) addresses the weaknesses of mAP by utilizing the AudioSet ontology information during the evaluation.
We conduct human evaluations and demonstrate that OmAP is more consistent with human perception than mAP.
arXiv Detail & Related papers (2022-11-22T11:35:14Z)
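As a rough illustration of ontology-aware scoring, the sketch below computes mAP at multiple ontology levels by pooling leaf classes into coarser groups and averaging across levels. The grouping, the pooling rule (max over children), and the level set are assumptions for illustration, not the paper's exact OmAP definition.

```python
# A hedged sketch of ontology-aware evaluation: score predictions at several
# ontology levels by pooling leaf classes into ancestors, then average.
# The hypothetical two-level "ontology" below stands in for AudioSet's.
import numpy as np
from sklearn.metrics import average_precision_score

def pooled_map(y_true, y_score, groups):
    """groups: list of index lists; each group is one coarse class."""
    aps = []
    for idx in groups:
        t = y_true[:, idx].max(axis=1)   # coarse label = any child present
        s = y_score[:, idx].max(axis=1)  # coarse score = best child score
        if t.min() != t.max():           # skip classes with one label value
            aps.append(average_precision_score(t, s))
    return float(np.mean(aps))

rng = np.random.default_rng(0)
y_true = (rng.random((200, 6)) < 0.2).astype(int)
y_score = np.clip(y_true * 0.6 + rng.random((200, 6)) * 0.5, 0, 1)

levels = [
    [[0], [1], [2], [3], [4], [5]],  # leaf level: reduces to plain mAP
    [[0, 1, 2], [3, 4, 5]],          # a hypothetical coarser ontology level
]
omap = np.mean([pooled_map(y_true, y_score, lvl) for lvl in levels])
print(f"ontology-aware mAP (sketch): {omap:.3f}")
```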
- A Causal Intervention Scheme for Semantic Segmentation of Quasi-periodic Cardiovascular Signals [7.182731690965173]
We propose contrastive causal intervention (CCI) to form a novel training paradigm under a frame-level contrastive framework.
The intervention can eliminate the implicit statistical bias introduced by a single attribute and lead to more objective representations.
arXiv Detail & Related papers (2022-09-19T13:54:51Z)
- Prior Knowledge-Guided Attention in Self-Supervised Vision Transformers [79.60022233109397]
We present spatial prior attention (SPAN), a framework that takes advantage of consistent spatial and semantic structure in unlabeled image datasets.
SPAN operates by regularizing attention masks from separate transformer heads to follow various priors over semantic regions.
We find that the resulting attention masks are more interpretable than those derived from domain-agnostic pretraining.
arXiv Detail & Related papers (2022-09-07T02:30:36Z)
- Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
We show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.