ctELM: Decoding and Manipulating Embeddings of Clinical Trials with Embedding Language Models
- URL: http://arxiv.org/abs/2601.18796v1
- Date: Mon, 26 Jan 2026 18:58:46 GMT
- Title: ctELM: Decoding and Manipulating Embeddings of Clinical Trials with Embedding Language Models
- Authors: Brian Ondov, Chia-Hsuan Chang, Yujia Zhou, Mauro Giuffrè, Hua Xu
- Abstract summary: We develop an open-source, domain-agnostic ELM architecture and training framework for embeddings of clinical trials. We show that generated trial abstracts are responsive to moving embeddings along concept vectors for age and sex of study subjects.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text embeddings have become an essential part of a variety of language applications. However, methods for interpreting, exploring and reversing embedding spaces are limited, reducing transparency and precluding potentially valuable generative use cases. In this work, we align Large Language Models to embeddings of clinical trials using the recently reported Embedding Language Model (ELM) method. We develop an open-source, domain-agnostic ELM architecture and training framework, design training tasks for clinical trials, and introduce an expert-validated synthetic dataset. We then train a series of ELMs exploring the impact of tasks and training regimes. Our final model, ctELM, can accurately describe and compare unseen clinical trials from embeddings alone and produce plausible clinical trials from novel vectors. We further show that generated trial abstracts are responsive to moving embeddings along concept vectors for age and sex of study subjects. Our public ELM implementation and experimental results will aid the alignment of Large Language Models to embedding spaces in the biomedical domain and beyond.
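As a concrete illustration of the embedding manipulation described above, the sketch below builds a concept direction as a difference of group means and shifts a trial embedding along it. This is a minimal, hypothetical example rather than the authors' implementation: `embed` stands in for any trial-text encoder, and `decode` for a trained ELM that maps vectors back to text.

```python
import numpy as np

def concept_vector(embed, group_a, group_b):
    """Unit direction from the mean embedding of group_b toward group_a,
    e.g. trials on older vs. younger study subjects."""
    a = np.mean([embed(t) for t in group_a], axis=0)
    b = np.mean([embed(t) for t in group_b], axis=0)
    v = a - b
    return v / np.linalg.norm(v)

def move_along(vec, direction, alpha):
    """Shift an embedding along a concept direction by step size alpha."""
    return vec + alpha * direction

# Hypothetical usage: decode(move_along(embed(trial_text), age_vec, 2.0))
# should generate an abstract describing an older study population.
```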
Related papers
- Rephrasing Electronic Health Records for Pretraining Clinical Language Models
We rephrase existing clinical notes using LLMs to generate synthetic pretraining corpora. We find that augmenting original clinical notes with synthetic corpora from different LLMs improves performance even at a small token budget.
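A rough sketch of this rephrasing-style augmentation follows; it is not the paper's actual pipeline, and `llm` is a placeholder for any text-generation callable.

```python
REPHRASE_PROMPT = (
    "Rewrite the following clinical note in different words, "
    "preserving every clinical fact:\n\n{note}"
)

def augment_corpus(notes, llm, n_variants=2):
    """Pair each original note with LLM-rephrased variants to enlarge
    a pretraining corpus at a fixed source-token budget."""
    corpus = []
    for note in notes:
        corpus.append(note)  # keep the original note
        for _ in range(n_variants):
            corpus.append(llm(REPHRASE_PROMPT.format(note=note)))
    return corpus
```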
arXiv Detail & Related papers (2024-11-28T06:12:28Z)
- EEG-Language Pretraining for Highly Label-Efficient Clinical Phenotyping
Multimodal language modeling has enabled breakthroughs for representation learning, yet remains unexplored in the realm of functional brain data for clinical phenotyping. This paper pioneers EEG-language models (ELMs) trained on clinical reports and 15,000 EEGs.
arXiv Detail & Related papers (2024-09-02T10:03:03Z)
- Medical Vision-Language Pre-Training for Brain Abnormalities
We show how to automatically collect aligned medical image-text data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z)
- Neural Machine Translation of Clinical Text: An Empirical Investigation into Multilingual Pre-Trained Language Models and Transfer-Learning
Experimental results are reported on three subtasks: 1) clinical case (CC), 2) clinical terminology (CT), and 3) ontological concept (OC).
Our models achieved top-level performance in the ClinSpEn-2022 shared task on English-Spanish clinical domain data.
Transfer learning with the WMT21fb model worked well in our experimental setting for accommodating a new language, Spanish.
arXiv Detail & Related papers (2023-12-12T13:26:42Z)
- On Preserving the Knowledge of Long Clinical Texts
A bottleneck in using transformer encoders for processing clinical texts comes from the input length limit of these models. This paper proposes a novel method to preserve the knowledge of long clinical texts by using aggregated ensembles of transformer encoders.
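The aggregated-ensemble method itself is more involved, but the underlying chunk-and-aggregate idea can be sketched with Hugging Face transformers; the model choice and mean pooling below are illustrative assumptions.

```python
import torch
from transformers import AutoModel, AutoTokenizer

def encode_long_text(text, model_name="bert-base-uncased",
                     max_len=512, stride=256):
    """Embed a long document by windowing it into overlapping chunks,
    encoding each chunk, and mean-pooling the chunk embeddings."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    enc = tok(text, truncation=True, max_length=max_len, stride=stride,
              return_overflowing_tokens=True, padding=True,
              return_tensors="pt")
    with torch.no_grad():
        out = model(input_ids=enc["input_ids"],
                    attention_mask=enc["attention_mask"])
    chunk_vecs = out.last_hidden_state[:, 0]  # [CLS] vector per chunk
    return chunk_vecs.mean(dim=0)             # aggregate across chunks
```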
arXiv Detail & Related papers (2023-11-02T19:50:02Z)
- Knowledge-Infused Prompting: Assessing and Advancing Clinical Text Data Generation with Large Language Models
Large language models (LLMs) have shown promise in this domain, but their direct deployment can lead to privacy issues. We propose an innovative, resource-efficient approach, ClinGen, which infuses clinical knowledge into the data-generation process. Our empirical study across 7 clinical NLP tasks and 16 datasets reveals that ClinGen consistently enhances performance across various tasks.
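The concrete ClinGen prompt format is not given in the summary; the sketch below only illustrates the general idea of seeding generation with domain concepts, and every name in it is an assumption.

```python
def knowledge_infused_prompt(task, label, kb_terms, style_hint):
    """Compose a data-generation prompt seeded with domain knowledge
    (e.g. UMLS terms) and a writing-style hint."""
    return (
        f"Generate a synthetic clinical text for the task '{task}' "
        f"with label '{label}'. Mention these clinical concepts: "
        f"{', '.join(kb_terms)}. Write in the style of {style_hint}."
    )

# Hypothetical usage:
# prompt = knowledge_infused_prompt(
#     task="smoking status classification", label="current smoker",
#     kb_terms=["nicotine dependence", "pack-years"],
#     style_hint="a discharge summary")
```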
arXiv Detail & Related papers (2023-11-01T04:37:28Z)
- Interpretable Medical Diagnostics with Structured Data Extraction by Large Language Models
Tabular data is often hidden in text, particularly in medical diagnostic reports.
We propose a novel, simple, and effective methodology for extracting structured tabular data from textual medical reports, called TEMED-LLM.
We demonstrate that our approach significantly outperforms state-of-the-art text classification models in medical diagnostics.
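The TEMED-LLM internals are not spelled out here; the sketch below shows the general extract-then-model pattern, with `llm` standing in for any completion callable and the schema keys purely illustrative.

```python
import json

EXTRACTION_PROMPT = (
    "Extract the following fields from the medical report as JSON "
    "with keys {keys}. Use null when a field is absent.\n\nReport:\n{report}"
)

def report_to_row(report, keys, llm):
    """Turn one free-text report into a structured row via an LLM."""
    raw = llm(EXTRACTION_PROMPT.format(keys=list(keys), report=report))
    row = json.loads(raw)
    return {k: row.get(k) for k in keys}  # keep the schema stable

# rows = [report_to_row(r, ["age", "tumor_size_mm", "er_status"], llm)
#         for r in reports]
# A small interpretable classifier (e.g. a decision tree) can then be
# fit on these rows for the diagnostic task.
```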
arXiv Detail & Related papers (2023-06-08T09:12:28Z)
- Development and validation of a natural language processing algorithm to pseudonymize documents in the context of a clinical data warehouse
The study highlights the difficulties faced in sharing tools and resources in this domain.
We annotated a corpus of clinical documents according to 12 types of identifying entities.
We built a hybrid system that merges the results of a deep learning model with manual rules.
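A minimal sketch of such a hybrid merge, with rules taking precedence over model predictions on overlaps; the patterns are assumed examples for French-style identifiers, not the study's actual rules.

```python
import re

RULES = {
    "PHONE": re.compile(r"\b0\d(?:[ .-]?\d{2}){4}\b"),
    "ID":    re.compile(r"\b[12]\d{12}(?:\d{2})?\b"),  # NIR-like numbers
}

def rule_spans(text):
    """Identifying entities caught by handcrafted patterns."""
    return [(m.start(), m.end(), label)
            for label, rx in RULES.items() for m in rx.finditer(text)]

def hybrid_spans(text, model_spans):
    """Union of rule matches and model predictions (start, end, label);
    rule matches win wherever the two overlap."""
    spans = rule_spans(text)
    covered = [(s, e) for s, e, _ in spans]
    for s, e, label in model_spans:
        if not any(s < ce and e > cs for cs, ce in covered):
            spans.append((s, e, label))
    return sorted(spans)
```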
arXiv Detail & Related papers (2023-03-23T17:17:46Z)
- A Multi-View Joint Learning Framework for Embedding Clinical Codes and Text Using Graph Neural Networks
We propose a framework that learns from codes and text, combining the availability and forward-looking nature of text with the better performance of ICD codes.
Our approach uses a Graph Neural Network (GNN) to process ICD codes and a Bi-LSTM to process text.
In experiments using planned surgical procedure text, our model outperforms BERT models fine-tuned to clinical data.
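The summary does not specify the exact architecture or objective; the PyTorch sketch below only outlines the two-view setup, and the layer sizes, single message-passing step, and mean pooling are all assumptions.

```python
import torch
import torch.nn as nn

class MultiViewEncoder(nn.Module):
    """Two views: a simple graph layer over ICD code embeddings and a
    Bi-LSTM over text tokens, each producing one vector per example."""
    def __init__(self, n_codes, vocab_size, dim=128):
        super().__init__()
        self.code_emb = nn.Embedding(n_codes, dim)
        self.gnn = nn.Linear(dim, dim)  # one message-passing step
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim // 2, bidirectional=True,
                            batch_first=True)

    def forward(self, adj, code_ids, token_ids):
        # Graph view: mix neighbor features via a normalized adjacency
        # matrix of shape (n_codes, n_codes), then transform.
        h = torch.relu(self.gnn(adj @ self.code_emb.weight))
        code_vec = h[code_ids].mean(dim=1)           # (batch, dim)
        # Text view: Bi-LSTM over token embeddings, mean-pooled.
        out, _ = self.lstm(self.tok_emb(token_ids))  # (batch, seq, dim)
        text_vec = out.mean(dim=1)
        return code_vec, text_vec  # e.g. aligned with a contrastive loss
```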
arXiv Detail & Related papers (2023-01-27T09:19:03Z)
- Biomedical and Clinical Language Models for Spanish: On the Benefits of Domain-Specific Pretraining in a Mid-Resource Scenario
This work presents biomedical and clinical language models for Spanish by experimenting with different pretraining choices.
In the absence of enough clinical data to train a model from scratch, we applied mixed-domain pretraining and cross-domain transfer approaches to generate a performant bio-clinical model.
arXiv Detail & Related papers (2021-09-08T12:12:07Z)
- VBridge: Connecting the Dots Between Features, Explanations, and Data for Healthcare Models
VBridge is a visual analytics tool that seamlessly incorporates machine learning explanations into clinicians' decision-making workflow.
We identified three key challenges: clinicians' unfamiliarity with ML features, lack of contextual information, and the need for cohort-level evidence.
We demonstrated the effectiveness of VBridge through two case studies and expert interviews with four clinicians.
arXiv Detail & Related papers (2021-08-04T17:34:13Z)
- Clinical Named Entity Recognition using Contextualized Token Representations
This paper introduces contextualized word embeddings to better capture the semantic meaning of each word based on its context.
We pre-train two deep contextualized language models, Clinical Embeddings from Language Model (C-ELMo) and Clinical Contextual String Embeddings (C-Flair).
Experiments show that our models gain dramatic improvements compared to both static word embeddings and domain-generic language models.
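C-ELMo and C-Flair are custom clinical models, but the flair library's public API illustrates how such contextual string embeddings are consumed; the pretrained 'news-*' models below are generic stand-ins, not the clinical ones.

```python
from flair.data import Sentence
from flair.embeddings import FlairEmbeddings, StackedEmbeddings

# Stack forward and backward character-level language models.
embeddings = StackedEmbeddings([
    FlairEmbeddings("news-forward"),
    FlairEmbeddings("news-backward"),
])

sentence = Sentence("Patient denies chest pain on exertion.")
embeddings.embed(sentence)  # embeds tokens in place
for token in sentence:
    # Each token vector depends on its sentence context.
    print(token.text, token.embedding.shape)
```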
arXiv Detail & Related papers (2021-06-23T18:12:58Z) - UmlsBERT: Clinical Domain Knowledge Augmentation of Contextual
Embeddings Using the Unified Medical Language System Metathesaurus [73.86656026386038]
We introduce UmlsBERT, a contextual embedding model that integrates domain knowledge during the pre-training process.
Through its knowledge-augmentation strategies, UmlsBERT can encode clinical domain knowledge into word embeddings and outperform existing domain-specific models.
arXiv Detail & Related papers (2020-10-20T15:56:31Z)