Embedding-Driven Diversity Sampling to Improve Few-Shot Synthetic Data Generation
- URL: http://arxiv.org/abs/2501.11199v2
- Date: Sat, 25 Jan 2025 22:44:58 GMT
- Title: Embedding-Driven Diversity Sampling to Improve Few-Shot Synthetic Data Generation
- Authors: Ivan Lopez, Fateme Nateghi Haredasht, Kaitlin Caoili, Jonathan H Chen, Akshay Chaudhari,
- Abstract summary: We propose an embedding-driven approach that uses diversity sampling from a small set of real clinical notes to guide large language models in few-shot prompting.
Using cosine similarity and a Turing test, our approach produced synthetic notes that more closely align with real clinical text.
- Score: 4.684310901243605
- License:
- Abstract: Accurate classification of clinical text often requires fine-tuning pre-trained language models, a process that is costly and time-consuming due to the need for high-quality data and expert annotators. Synthetic data generation offers an alternative, though pre-trained models may not capture the syntactic diversity of clinical notes. We propose an embedding-driven approach that uses diversity sampling from a small set of real clinical notes to guide large language models in few-shot prompting, generating synthetic text that better reflects clinical syntax. We evaluated this method using the CheXpert dataset on a classification task, comparing it to random few-shot and zero-shot approaches. Using cosine similarity and a Turing test, our approach produced synthetic notes that more closely align with real clinical text. Our pipeline reduced the data needed to reach the 0.85 AUC cutoff by 40% for AUROC and 30% for AUPRC, while augmenting models with synthetic data improved AUROC by 57% and AUPRC by 68%. Additionally, our synthetic data was 0.9 times as effective as real data, a 60% improvement in value.
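The core pipeline described above (embed a small pool of real notes, sample a diverse subset, use it as few-shot context, and check outputs with cosine similarity) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the all-MiniLM-L6-v2 encoder, the k-means centroid selection rule, and the prompt template are placeholders for details the abstract does not specify.

```python
# Sketch: embedding-driven diversity sampling for few-shot synthetic note generation.
# Assumptions (not from the paper): encoder choice, k-means as the diversity rule,
# and the prompt wording are illustrative placeholders.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

def select_diverse_exemplars(notes, k=5, seed=0):
    """Embed real notes and keep one representative per k-means cluster."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder encoder
    embeddings = model.encode(notes, normalize_embeddings=True)
    k = min(k, len(notes))
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(embeddings)
    exemplars = []
    for c in range(k):
        idx = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(embeddings[idx] - km.cluster_centers_[c], axis=1)
        exemplars.append(notes[idx[np.argmin(dists)]])  # note nearest the centroid
    return exemplars, embeddings

def build_fewshot_prompt(exemplars, finding="cardiomegaly"):
    """Assemble a few-shot prompt from the diversity-sampled exemplars."""
    shots = "\n\n".join(f"Example note:\n{n}" for n in exemplars)
    return (f"{shots}\n\nWrite a new, de-identified radiology report in the same "
            f"style that documents {finding}.")

def mean_cosine_to_real(synthetic_notes, real_embeddings):
    """Evaluation proxy: mean cosine similarity of synthetic notes to real ones."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    syn = model.encode(synthetic_notes, normalize_embeddings=True)
    return float(cosine_similarity(syn, real_embeddings).mean())
```

The prompt returned by build_fewshot_prompt would be sent to an LLM to generate candidate notes, and mean_cosine_to_real reuses the same embedding space as a rough stand-in for the paper's cosine-similarity evaluation.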
Related papers
- Clear Preferences Leave Traces: Reference Model-Guided Sampling for Preference Learning [59.11519451499754]
Direct Preference Optimization (DPO) has emerged as a de-facto approach for aligning language models with human preferences.
Recent work has shown DPO's effectiveness relies on training data quality.
We discover that the reference model's probability space naturally detects high-quality training samples.
arXiv Detail & Related papers (2025-01-25T07:21:50Z)
- Embryo 2.0: Merging Synthetic and Real Data for Advanced AI Predictions [69.07284335967019]
We train two generative models on two datasets: one newly created and made publicly available, and one existing public dataset.
We generate synthetic embryo images at various cell stages, including 2-cell, 4-cell, 8-cell, morula, and blastocyst.
These were combined with real images to train classification models for embryo cell stage prediction.
arXiv Detail & Related papers (2024-12-02T08:24:49Z)
- Can Medical Vision-Language Pre-training Succeed with Purely Synthetic Data? [8.775988650381397]
Training medical vision-language pre-training models requires datasets with paired, high-quality image-text data.
Recent advancements in Large Language Models have made it possible to generate large-scale synthetic image-text pairs.
We propose an automated pipeline to build a diverse, high-quality synthetic dataset.
arXiv Detail & Related papers (2024-10-17T13:11:07Z)
- Enhancing Clinical Documentation with Synthetic Data: Leveraging Generative Models for Improved Accuracy [0.0]
This paper proposes a novel approach to augment clinical documentation by leveraging synthetic data generation techniques.
We present a methodology that combines state-of-the-art generative models, such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs).
We demonstrate the effectiveness of our approach in generating high-quality synthetic transcripts that closely resemble real-world data.
arXiv Detail & Related papers (2024-06-03T15:49:03Z)
- Improving Text Embeddings with Large Language Models [59.930513259982725]
We introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps.
We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages.
Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data.
arXiv Detail & Related papers (2023-12-31T02:13:18Z)
- Synthetically Enhanced: Unveiling Synthetic Data's Potential in Medical Imaging Research [4.475998415951477]
Generative AI offers a promising approach to generating synthetic images, enhancing dataset diversity.
This study investigates the impact of synthetic data supplementation on the performance and generalizability of medical imaging research.
arXiv Detail & Related papers (2023-11-15T21:58:01Z)
- TarGEN: Targeted Data Generation with Large Language Models [51.87504111286201]
TarGEN is a multi-step prompting strategy for generating high-quality synthetic datasets.
We augment TarGEN with a method known as self-correction, empowering LLMs to rectify inaccurately labeled instances.
A comprehensive analysis of the synthetic dataset compared to the original dataset reveals similar or higher levels of dataset complexity and diversity.
arXiv Detail & Related papers (2023-10-27T03:32:17Z)
- Synthetic Augmentation with Large-scale Unconditional Pre-training [4.162192894410251]
We propose a synthetic augmentation method called HistoDiffusion to reduce the dependency on annotated data.
HistoDiffusion can be pre-trained on large-scale unlabeled datasets and later applied to a small-scale labeled dataset for augmented training.
We evaluate our proposed method by pre-training on three histopathology datasets and testing on a histopathology dataset of colorectal cancer (CRC) excluded from the pre-training datasets.
arXiv Detail & Related papers (2023-08-08T03:34:04Z)
- BeCAPTCHA-Type: Biometric Keystroke Data Generation for Improved Bot Detection [63.447493500066045]
This work proposes a data-driven learning model for the synthesis of keystroke biometric data.
The proposed method is compared with two statistical approaches based on Universal and User-dependent models.
Our experimental framework considers a dataset with 136 million keystroke events from 168 thousand subjects.
arXiv Detail & Related papers (2022-07-27T09:26:15Z)
- Medical Scientific Table-to-Text Generation with Human-in-the-Loop under the Data Sparsity Constraint [11.720364723821993]
An efficient table-to-text summarization system can drastically reduce the manual effort needed to condense this data into reports.
However, in practice, the problem is heavily impeded by data paucity, data sparsity, and the inability of state-of-the-art natural language generation models to produce accurate and reliable outputs.
We propose a novel table-to-text approach that tackles these problems with a two-step architecture enhanced by auto-correction, a copy mechanism, and synthetic data augmentation.
arXiv Detail & Related papers (2022-05-24T21:10:57Z)
- Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)