Contextual Phenotyping of Pediatric Sepsis Cohort Using Large Language Models
- URL: http://arxiv.org/abs/2505.09805v1
- Date: Wed, 14 May 2025 21:05:40 GMT
- Title: Contextual Phenotyping of Pediatric Sepsis Cohort Using Large Language Models
- Authors: Aditya Nagori, Ayush Gautam, Matthew O. Wiens, Vuong Nguyen, Nathan Kenya Mugisha, Jerome Kabakyenga, Niranjan Kissoon, John Mark Ansermino, Rishikesan Kamaleswaran
- Abstract summary: This study evaluates Large Language Model (LLM) based clustering against classical methods. Patient records were serialized into text with and without a clustering objective. LLAMA 3.1 8B with the clustering objective performed better with a higher number of clusters.
- Score: 2.593361890114316
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Clustering patient subgroups is essential for personalized care and efficient resource use. Traditional clustering methods struggle with high-dimensional, heterogeneous healthcare data and lack contextual understanding. This study evaluates Large Language Model (LLM) based clustering against classical methods using a pediatric sepsis dataset from a low-income country (LIC), containing 2,686 records with 28 numerical and 119 categorical variables. Patient records were serialized into text with and without a clustering objective. Embeddings were generated using quantized LLAMA 3.1 8B, DeepSeek-R1-Distill-Llama-8B with low-rank adaptation (LoRA), and Stella-En-400M-V5 models. K-means clustering was applied to these embeddings. Classical comparisons included K-Medoids clustering on UMAP and FAMD-reduced mixed data. Silhouette scores and statistical tests evaluated cluster quality and distinctiveness. Stella-En-400M-V5 achieved the highest Silhouette Score (0.86). LLAMA 3.1 8B with the clustering objective performed better with a higher number of clusters, identifying subgroups with distinct nutritional, clinical, and socioeconomic profiles. LLM-based methods outperformed classical techniques by capturing richer context and prioritizing key features. These results highlight the potential of LLMs for contextual phenotyping and informed decision-making in resource-limited settings.
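The pipeline the abstract describes (serialize records to text, embed, run K-means, score with silhouette) can be sketched in miniature. This is an illustrative sketch, not the paper's code: `fake_embed` is a hypothetical stand-in for the quantized LLAMA 3.1 8B / Stella-En-400M-V5 embedding models, and the k-means and silhouette routines are deliberately simplified pure-Python versions.

```python
import math
import random

def serialize(record, objective=True):
    """Flatten a patient record into text; optionally prepend a clustering objective."""
    prefix = "Cluster this patient by clinical profile. " if objective else ""
    return prefix + ", ".join(f"{k}: {v}" for k, v in record.items())

def fake_embed(text, dim=8):
    """Hypothetical stand-in for an LLM embedding model (hash-seeded vector)."""
    rng = random.Random(hash(text) % (2 ** 32))
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

def dist(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's k-means on a list of vectors; returns one label per point."""
    centers = random.Random(seed).sample(X, k)
    labels = [0] * len(X)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist(x, centers[j])) for x in X]
        for j in range(k):
            members = [x for x, lab in zip(X, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def silhouette(X, labels):
    """Mean silhouette score: (b - a) / max(a, b) averaged over all points."""
    scores = []
    for i, x in enumerate(X):
        same = [dist(x, y) for j, y in enumerate(X) if labels[j] == labels[i] and j != i]
        a = sum(same) / len(same) if same else 0.0
        b = min(
            sum(dist(x, y) for j, y in enumerate(X) if labels[j] == c)
            / sum(1 for lab in labels if lab == c)
            for c in set(labels) if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

On the real cohort, the embedding call would be the LLM, and the number of clusters would be swept while comparing silhouette scores across methods, as the study does.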
Related papers
- ESMC: MLLM-Based Embedding Selection for Explainable Multiple Clustering [79.69917150582633]
Multi-modal large language models (MLLMs) can be leveraged to achieve user-driven clustering. Our method first discovers that MLLMs' hidden states of text tokens are strongly related to the corresponding features. We also employ a lightweight clustering head augmented with pseudo-label learning, significantly enhancing clustering accuracy.
arXiv Detail & Related papers (2025-11-30T04:36:51Z) - Balancing Complexity and Informativeness in LLM-Based Clustering: Finding the Goldilocks Zone [0.0]
This paper investigates the optimal number of clusters by quantifying the trade-off between informativeness and cognitive simplicity. We use large language models (LLMs) to generate cluster names and evaluate their effectiveness. We identify an optimal range of 16-22 clusters, paralleling linguistic efficiency in lexical categorization.
arXiv Detail & Related papers (2025-04-06T01:16:22Z) - An Improved Deep Learning Model for Word Embeddings Based Clustering for Large Text Datasets [0.0]
We present an improved clustering technique for large textual datasets by leveraging fine-tuned word embeddings. We show significant improvements in clustering metrics such as silhouette score, purity, and adjusted Rand index (ARI). The proposed technique will help bridge the gap between semantic understanding and statistical robustness for large-scale text-mining tasks.
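The external metrics that blurb names can be computed in a few lines. This is a generic sketch of purity and adjusted Rand index from ground-truth and predicted labels, not code from any of the papers listed here.

```python
from collections import Counter
from math import comb

def purity(true_labels, pred_labels):
    """Fraction of points whose cluster's majority true class matches their own."""
    clusters = {}
    for t, p in zip(true_labels, pred_labels):
        clusters.setdefault(p, []).append(t)
    hits = sum(max(Counter(members).values()) for members in clusters.values())
    return hits / len(true_labels)

def adjusted_rand_index(true_labels, pred_labels):
    """Rand index corrected for chance: 1.0 = identical partitions, ~0 = random."""
    n = len(true_labels)
    pair_agree = sum(comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values())
    sum_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:  # degenerate case: both partitions trivial
        return 1.0
    return (pair_agree - expected) / (max_index - expected)
```

Note that ARI is invariant to relabeling the clusters, while purity requires ground-truth classes, which is why the unsupervised comparisons in the main study lean on silhouette instead.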
arXiv Detail & Related papers (2025-02-22T08:28:41Z) - Dial-In LLM: Human-Aligned LLM-in-the-loop Intent Clustering for Customer Service Dialogues [18.744211667479995]
This paper investigates the effectiveness of fine-tuned LLMs in semantic coherence evaluation and intent cluster naming. It also proposes an LLM-in-the-loop (LLM-ITL) clustering algorithm that facilitates the iterative discovery of coherent intent clusters.
arXiv Detail & Related papers (2024-12-12T08:19:01Z) - Dirichlet Process-based Robust Clustering using the Median-of-Means Estimator [16.774378814288806]
We propose an efficient and automatic clustering technique by integrating the strengths of model-based and centroid-based methodologies. Our method mitigates the effect of noise on the quality of clustering while, at the same time, estimating the number of clusters.
arXiv Detail & Related papers (2023-11-26T19:01:15Z) - Large Language Models Enable Few-Shot Clustering [88.06276828752553]
We show that large language models can amplify an expert's guidance to enable query-efficient, few-shot semi-supervised text clustering.
We find that incorporating LLMs in the first two stages routinely provides significant improvements in cluster quality.
arXiv Detail & Related papers (2023-07-02T09:17:11Z) - Simple and Scalable Algorithms for Cluster-Aware Precision Medicine [0.0]
We propose a simple and scalable approach to joint clustering and embedding.
This novel, cluster-aware embedding approach overcomes the complexity and limitations of current joint embedding and clustering methods.
Our approach does not require the user to choose the desired number of clusters, but instead yields interpretable dendrograms of hierarchically clustered embeddings.
arXiv Detail & Related papers (2022-11-29T19:27:26Z) - CAC: A Clustering Based Framework for Classification [20.372627144885158]
We design a simple, efficient, and generic framework called Classification Aware Clustering (CAC).
Our experiments on synthetic and real benchmark datasets demonstrate the efficacy of CAC over previous methods for combined clustering and classification.
arXiv Detail & Related papers (2021-02-23T18:59:39Z) - Scalable Hierarchical Agglomerative Clustering [65.66407726145619]
Existing scalable hierarchical clustering methods sacrifice quality for speed.
We present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points.
arXiv Detail & Related papers (2020-10-22T15:58:35Z) - Progressive Cluster Purification for Unsupervised Feature Learning [48.87365358296371]
In unsupervised feature learning, sample specificity based methods ignore the inter-class information.
We propose a novel clustering based method, which excludes class inconsistent samples during progressive cluster formation.
Our approach, referred to as Progressive Cluster Purification (PCP), implements progressive clustering by gradually reducing the number of clusters during training.
arXiv Detail & Related papers (2020-07-06T08:11:03Z) - LSD-C: Linearly Separable Deep Clusters [145.89790963544314]
We present LSD-C, a novel method to identify clusters in an unlabeled dataset.
Our method draws inspiration from recent semi-supervised learning practice and proposes to combine our clustering algorithm with self-supervised pretraining and strong data augmentation.
We show that our approach significantly outperforms competitors on popular public image benchmarks including CIFAR 10/100, STL 10 and MNIST, as well as the document classification dataset Reuters 10K.
arXiv Detail & Related papers (2020-06-17T17:58:10Z) - Predictive Modeling of ICU Healthcare-Associated Infections from Imbalanced Data. Using Ensembles and a Clustering-Based Undersampling Approach [55.41644538483948]
This work is focused on both the identification of risk factors and the prediction of healthcare-associated infections in intensive-care units.
The aim is to support decision making addressed at reducing the incidence rate of infections.
arXiv Detail & Related papers (2020-05-07T16:13:12Z) - Clustering Binary Data by Application of Combinatorial Optimization Heuristics [52.77024349608834]
We study clustering methods for binary data, first defining aggregation criteria that measure the compactness of clusters.
Five new and original methods are introduced, using neighborhoods and population behavior optimization metaheuristics.
From a set of 16 data tables generated by a quasi-Monte Carlo experiment, one of the aggregation criteria, using L1 dissimilarity, is compared against hierarchical clustering and a version of k-means: partitioning around medoids (PAM).
arXiv Detail & Related papers (2020-01-06T23:33:31Z)
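For the classical baseline flavor in that last comparison, a minimal PAM (partitioning around medoids) under L1 dissimilarity can be sketched like this. It is an illustrative greedy-swap version on binary data, not the referenced implementation.

```python
def l1(a, b):
    """L1 (Manhattan) dissimilarity between two equal-length vectors."""
    return sum(abs(u - v) for u, v in zip(a, b))

def pam(X, k):
    """Greedy PAM: start from the first k points as medoids, then keep
    applying any (medoid, non-medoid) swap that lowers total L1 cost."""
    medoids = list(range(k))

    def cost(meds):
        return sum(min(l1(x, X[m]) for m in meds) for x in X)

    best = cost(medoids)
    improved = True
    while improved:
        improved = False
        for mi in range(k):
            for h in range(len(X)):
                if h in medoids:
                    continue
                trial = medoids[:mi] + [h] + medoids[mi + 1:]
                c = cost(trial)
                if c < best:
                    medoids, best, improved = trial, c, True
    # Label each point with the index of its nearest medoid.
    labels = [min(medoids, key=lambda m: l1(x, X[m])) for x in X]
    return medoids, labels
```

Unlike k-means, the cluster centers here are actual data points, which is why PAM (K-Medoids) is the classical comparator applied to the UMAP- and FAMD-reduced mixed data in the main study.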
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences arising from its use.