A Dynamic Framework for Semantic Grouping of Common Data Elements (CDE) Using Embeddings and Clustering
- URL: http://arxiv.org/abs/2506.02160v1
- Date: Mon, 02 Jun 2025 18:43:37 GMT
- Title: A Dynamic Framework for Semantic Grouping of Common Data Elements (CDE) Using Embeddings and Clustering
- Authors: Madan Krishnamurthy, Daniel Korn, Melissa A Haendel, Christopher J Mungall, Anne E Thessen
- Abstract summary: This research aims to develop a dynamic and scalable framework to facilitate harmonization of Common Data Elements (CDEs) across heterogeneous biomedical datasets. Our methodology leverages Large Language Models (LLMs) for context-aware text embeddings that convert CDEs into dense vectors capturing semantic relationships and patterns.
- Score: 0.782496834711349
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This research aims to develop a dynamic and scalable framework to facilitate harmonization of Common Data Elements (CDEs) across heterogeneous biomedical datasets by addressing challenges such as semantic heterogeneity, structural variability, and context dependence to streamline integration, enhance interoperability, and accelerate scientific discovery. Our methodology leverages Large Language Models (LLMs) for context-aware text embeddings that convert CDEs into dense vectors capturing semantic relationships and patterns. These embeddings are clustered using Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) to group semantically similar CDEs. The framework incorporates four key steps: (1) LLM-based text embedding to mathematically represent semantic context, (2) unsupervised clustering of embeddings via HDBSCAN, (3) automated labeling using LLM summarization, and (4) supervised learning to train a classifier assigning new or unclustered CDEs to labeled clusters. Evaluated on the NIH NLM CDE Repository with over 24,000 CDEs, the system identified 118 meaningful clusters at an optimized minimum cluster size of 20. The classifier achieved 90.46 percent overall accuracy, performing best in larger categories. External validation against Gravity Projects Social Determinants of Health domains showed strong agreement (Adjusted Rand Index 0.52, Normalized Mutual Information 0.78), indicating that embeddings effectively capture cluster characteristics. This adaptable and scalable approach offers a practical solution to CDE harmonization, improving selection efficiency and supporting ongoing data interoperability.
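The four-step pipeline described in the abstract can be sketched end to end. This is a minimal illustrative stand-in, not the paper's implementation: a hashed bag-of-words vector replaces the LLM embedding, plain DBSCAN replaces HDBSCAN (which adds a hierarchy and density-varying cluster extraction on top of the same idea), a most-frequent-token label replaces LLM summarization, nearest-centroid assignment replaces the trained supervised classifier, and the CDE strings are invented examples.

```python
import math
from collections import Counter

def embed(text, dim=32):
    """Step 1 stand-in: hashed bag-of-words mapped to a dense unit vector."""
    v = [0.0] * dim
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dbscan(points, eps=0.8, min_pts=2):
    """Step 2 stand-in: plain DBSCAN; returns one label per point, -1 = noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neigh = [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]
        if len(neigh) < min_pts:
            labels[i] = -1  # noise (may later become a border point)
            continue
        cluster += 1
        labels[i] = cluster
        frontier = list(neigh)
        while frontier:
            j = frontier.pop()
            if labels[j] is not None and labels[j] >= 0:
                continue
            labels[j] = cluster
            jn = [k for k in range(len(points)) if dist(points[j], points[k]) <= eps]
            if len(jn) >= min_pts:  # core point: keep expanding the cluster
                frontier.extend(jn)
    return labels

def label_clusters(texts, labels):
    """Step 3 stand-in: name each cluster by its most frequent token
    (the paper uses LLM summarization here)."""
    names = {}
    for c in set(labels) - {-1}:
        toks = Counter(t for i, txt in enumerate(texts)
                       if labels[i] == c for t in txt.lower().split())
        names[c] = toks.most_common(1)[0][0]
    return names

def classify(vec, points, labels):
    """Step 4 stand-in: nearest-centroid assignment of a new/unclustered CDE
    (the paper trains a supervised classifier instead)."""
    centroids = {}
    for c in set(labels) - {-1}:
        members = [points[i] for i in range(len(points)) if labels[i] == c]
        centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return min(centroids, key=lambda c: dist(vec, centroids[c]))

# Demo on a handful of made-up CDE descriptions.
cdes = ["patient blood pressure systolic", "systolic blood pressure measurement",
        "blood pressure diastolic reading", "tumor stage at diagnosis"]
vecs = [embed(t) for t in cdes]
labs = dbscan(vecs)
names = label_clusters(cdes, labs)
assigned = classify(embed("diastolic blood pressure"), vecs, labs)
print(labs, names.get(assigned))
```

Noise points (label -1) are exactly the "unclustered CDEs" that step 4 routes to a labeled cluster; in the paper this routing is done by a classifier trained on the HDBSCAN output rather than raw centroid distance.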
Related papers
- MacNet: An End-to-End Manifold-Constrained Adaptive Clustering Network for Interpretable Whole Slide Image Classification [9.952997875404634]
Clustering-based approaches can provide an explainable decision-making process but suffer from high-dimensional features and semantically ambiguous centroids. We propose an end-to-end MIL framework that integrates Grassmann re-embedding and manifold-adaptive clustering. Experiments on multicentre WSI datasets demonstrate that: 1) our cluster-incorporated model achieves superior performance in both grading accuracy and interpretability; 2) end-to-end learning refines feature representations while requiring acceptable resources.
arXiv Detail & Related papers (2026-02-16T06:43:36Z) - Bridging the Semantic Gap for Categorical Data Clustering via Large Language Models [64.58262227709842]
ARISE (Attention-weighted Representation with Integrated Semantic Embeddings) is presented. It builds semantic-aware representations that complement the metric space of categorical data for accurate clustering. Experiments on eight benchmark datasets demonstrate consistent improvements over seven representative counterparts.
arXiv Detail & Related papers (2026-01-03T11:37:46Z) - ESMC: MLLM-Based Embedding Selection for Explainable Multiple Clustering [79.69917150582633]
Multi-modal large language models (MLLMs) can be leveraged to achieve user-driven clustering. Our method first discovers that MLLMs' hidden states of text tokens are strongly related to the corresponding features. We also employ a lightweight clustering head augmented with pseudo-label learning, significantly enhancing clustering accuracy.
arXiv Detail & Related papers (2025-11-30T04:36:51Z) - ClustRecNet: A Novel End-to-End Deep Learning Framework for Clustering Algorithm Recommendation [9.419239935565376]
ClustRecNet is a novel deep learning (DL)-based recommendation framework for determining the most suitable clustering algorithms for a given dataset. We construct a comprehensive data repository comprising 34,000 synthetic datasets with diverse structural properties. The proposed network architecture integrates convolutional, residual, and attention mechanisms to capture both local and global structural patterns.
arXiv Detail & Related papers (2025-09-29T13:48:33Z) - IOCC: Aligning Semantic and Cluster Centers for Few-shot Short Text Clustering [15.657808381423736]
In clustering tasks, it is essential to structure the feature space into clear, well-separated distributions. We propose IOCC, a novel few-shot contrastive learning method that aligns cluster centers with semantic centers. IOCC outperforms previous methods, achieving up to a 7.34% improvement on the challenging Biomedical dataset.
arXiv Detail & Related papers (2025-08-08T08:47:13Z) - Adaptive and Robust DBSCAN with Multi-agent Reinforcement Learning [53.527506374566485]
We propose AR-DBSCAN, a novel adaptive and robust DBSCAN clustering framework based on multi-agent reinforcement learning. We show that AR-DBSCAN not only improves clustering accuracy by up to 144.1% and 175.3% on the NMI and ARI metrics, respectively, but also robustly finds dominant parameters.
arXiv Detail & Related papers (2025-05-07T11:37:23Z) - Manifold Clustering with Schatten p-norm Maximization [16.90743611125625]
We develop a new clustering framework based on manifold clustering. Specifically, the algorithm uses labels to guide the manifold structure and performs clustering on it. To naturally maintain class balance during clustering, we maximize the Schatten p-norm of the labels.
arXiv Detail & Related papers (2025-04-29T03:23:06Z) - Hierarchical clustering with maximum density paths and mixture models [44.443538161979056]
t-NEB is a probabilistically grounded hierarchical clustering method. It yields state-of-the-art clustering performance on naturalistic high-dimensional data.
arXiv Detail & Related papers (2025-03-19T15:37:51Z) - Dial-In LLM: Human-Aligned LLM-in-the-loop Intent Clustering for Customer Service Dialogues [13.891718772119575]
This paper proposes an LLM-in-the-loop intent clustering framework. It integrates the semantic understanding capabilities of LLMs into conventional clustering algorithms. It achieves over 95% accuracy aligned with human judgments.
arXiv Detail & Related papers (2024-12-12T08:19:01Z) - AdaptiveMDL-GenClust: A Robust Clustering Framework Integrating Normalized Mutual Information and Evolutionary Algorithms [0.0]
We introduce a robust clustering framework that integrates the Minimum Description Length (MDL) principle with a genetic optimization algorithm. The framework begins with an ensemble clustering approach to generate an initial clustering solution, which is refined using MDL-guided evaluation functions and optimized through a genetic algorithm. Experimental results demonstrate that our approach consistently outperforms traditional clustering methods, yielding higher accuracy, improved stability, and reduced bias.
arXiv Detail & Related papers (2024-11-26T20:26:14Z) - Self-Supervised Graph Embedding Clustering [70.36328717683297]
The K-means one-step dimensionality-reduction clustering method has made some progress in addressing the curse of dimensionality in clustering tasks.
We propose a unified framework that integrates manifold learning with K-means, resulting in a self-supervised graph embedding framework.
arXiv Detail & Related papers (2024-09-24T08:59:51Z) - FedAC: An Adaptive Clustered Federated Learning Framework for Heterogeneous Data [21.341280782748278]
Clustered federated learning (CFL) is proposed to mitigate the performance deterioration stemming from data heterogeneity in federated learning (FL).
We propose an adaptive CFL framework, named FedAC, which efficiently integrates global knowledge into intra-cluster learning.
Experiments show that FedAC achieves superior empirical performance, increasing the test accuracy by around 1.82% and 12.67%.
arXiv Detail & Related papers (2024-03-25T06:43:28Z) - A Framework for Joint Unsupervised Learning of Cluster-Aware Embedding for Heterogeneous Networks [6.900303913555705]
Heterogeneous Information Network (HIN) embedding refers to the low-dimensional projections of the HIN nodes that preserve the HIN structure and semantics.
We propose a framework for joint learning of cluster embeddings as well as cluster-aware HIN embeddings.
arXiv Detail & Related papers (2021-08-09T11:36:36Z) - You Never Cluster Alone [150.94921340034688]
We extend the mainstream contrastive learning paradigm to a cluster-level scheme, where all data assigned to the same cluster contribute to a unified representation.
We define a set of categorical variables as clustering assignment confidence, which links the instance-level learning track with the cluster-level one.
By reparametrizing the assignment variables, the resulting model, TCC, is trained end-to-end, requiring no alternating steps.
arXiv Detail & Related papers (2021-06-03T14:59:59Z) - Towards Uncovering the Intrinsic Data Structures for Unsupervised Domain Adaptation using Structurally Regularized Deep Clustering [119.88565565454378]
Unsupervised domain adaptation (UDA) aims to learn classification models that make predictions for unlabeled data on a target domain.
We propose a hybrid model of Structurally Regularized Deep Clustering, which integrates the regularized discriminative clustering of target data with a generative one.
Our proposed H-SRDC outperforms all the existing methods under both the inductive and transductive settings.
arXiv Detail & Related papers (2020-12-08T08:52:00Z) - Scalable Hierarchical Agglomerative Clustering [65.66407726145619]
Existing scalable hierarchical clustering methods sacrifice quality for speed.
We present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points.
arXiv Detail & Related papers (2020-10-22T15:58:35Z) - Contrastive Clustering [57.71729650297379]
We propose Contrastive Clustering (CC), which explicitly performs instance- and cluster-level contrastive learning.
In particular, CC achieves an NMI of 0.705 (0.431) on the CIFAR-10 (CIFAR-100) dataset, which is an up to 19% (39%) performance improvement compared with the best baseline.
arXiv Detail & Related papers (2020-09-21T08:54:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.