DermaSynth: Rich Synthetic Image-Text Pairs Using Open Access Dermatology Datasets
- URL: http://arxiv.org/abs/2502.00196v1
- Date: Fri, 31 Jan 2025 22:26:33 GMT
- Title: DermaSynth: Rich Synthetic Image-Text Pairs Using Open Access Dermatology Datasets
- Authors: Abdurrahim Yilmaz, Furkan Yuceyalcin, Ece Gokyayla, Donghee Choi, Ozan Erdem Ali Anil Demircali, Rahmetullah Varol, Ufuk Gorkem Kirabali, Gulsum Gencoglan, Joram M. Posma, Burak Temelkuran
- Abstract summary: DermaSynth is a dataset of 92,020 synthetic image--text pairs curated from 45,205 images.
We leverage a state-of-the-art vision large language model, Gemini 2.0, to generate diverse and rich synthetic texts.
- Abstract: A major barrier to developing vision large language models (LLMs) in dermatology is the lack of large image--text pair datasets. We introduce DermaSynth, a dataset comprising 92,020 synthetic image--text pairs curated from 45,205 images (13,568 clinical and 35,561 dermatoscopic) for dermatology-related clinical tasks. Leveraging a state-of-the-art vision LLM, Gemini 2.0, we used clinically relevant prompts and the self-instruct method to generate diverse and rich synthetic texts. Metadata of the source datasets were incorporated into the input prompts to reduce potential hallucinations. The resulting dataset builds upon open access dermatological image repositories (DERM12345, BCN20000, PAD-UFES-20, SCIN, and HIBA) that have permissive CC-BY-4.0 licenses. We also fine-tuned a preliminary Llama-3.2-11B-Vision-Instruct model, DermatoLlama 1.0, on 5,000 samples. We anticipate that this dataset will support and accelerate AI research in dermatology. Data and code underlying this work are accessible at https://github.com/abdurrahimyilmaz/DermaSynth.
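As a rough illustration of how such image--text pairs feed into vision-LLM instruction tuning, the sketch below converts raw records into (image, prompt, response) triples. The record layout and field names here are assumptions for illustration only; the actual schema of the released dataset is documented in the GitHub repository.

```python
import json

# Hypothetical record layout for a DermaSynth-style image-text pair;
# the actual field names in the released dataset may differ.
records = [
    {"image": "derm12345/img_00001.jpg",
     "text": "Clinical photograph showing a pigmented lesion ..."},
    {"image": "bcn20000/img_00002.jpg",
     "text": "Dermatoscopic view with a reticular pigment network ..."},
]

def to_instruction_pairs(records, prompt="Describe this dermatological image."):
    """Convert raw image-text records into (image, prompt, response)
    triples suitable for vision-LLM instruction tuning."""
    return [
        {"image": r["image"], "prompt": prompt, "response": r["text"]}
        for r in records
    ]

pairs = to_instruction_pairs(records)
print(len(pairs), pairs[0]["prompt"])
```

In practice a fine-tuning script would serialize such triples (e.g. as JSON Lines) and pair each record with the image file loaded from the corresponding source repository.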
Related papers
- Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation
CoSyn is a framework that creates synthetic text-rich multimodal data.
It can generate high-quality instruction-tuning data.
It can also produce synthetic pointing data, enabling vision-language models to ground information within input images.
arXiv Detail & Related papers (2025-02-20T18:55:30Z)
- BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature
BIOMEDICA is a scalable, open-source framework to extract, annotate, and serialize the entirety of the PubMed Central Open Access subset into an easy-to-use, publicly accessible dataset.
Our framework produces a comprehensive archive with over 24 million unique image-text pairs from over 6 million articles.
BMCA-CLIP is a suite of CLIP-style models continuously pretrained on the BIOMEDICA dataset via streaming, eliminating the need to download 27 TB of data locally.
arXiv Detail & Related papers (2025-01-13T09:58:03Z)
- Cancer-Net SCa-Synth: An Open Access Synthetically Generated 2D Skin Lesion Dataset for Skin Cancer Classification
In the United States, skin cancer ranks as the most commonly diagnosed cancer, presenting a significant public health issue.
Recent advancements in dataset curation and deep learning have shown promise in quick and accurate detection of skin cancer.
Cancer-Net SCa-Synth is an open access, synthetically generated 2D skin lesion dataset for skin cancer classification.
arXiv Detail & Related papers (2024-11-08T02:04:21Z)
- MedSyn: LLM-based Synthetic Medical Text Generation Framework
We introduce MedSyn, a novel medical text generation framework that integrates large language models with a Medical Knowledge Graph.
We use the MKG to sample prior medical information for the prompt and generate synthetic clinical notes with GPT-4 and fine-tuned LLaMA models.
Our research indicates that synthetic data can increase the classification accuracy of vital and challenging codes by up to 17.8% compared to settings without synthetic data.
arXiv Detail & Related papers (2024-08-04T15:07:44Z)
- SkinCAP: A Multi-modal Dermatology Dataset Annotated with Rich Medical Captions
SkinCAP comprises 4,000 images sourced from the Fitzpatrick 17k skin disease dataset and the Diverse Dermatology Images dataset.
Notably, SkinCAP represents the world's first such dataset and is publicly available at https://huggingface.co/datasets/joshuachou/SkinCAP.
arXiv Detail & Related papers (2024-05-28T09:48:23Z)
- SynthCLIP: Are We Ready for a Fully Synthetic CLIP Training?
We present SynthCLIP, a CLIP model trained on entirely synthetic text-image pairs.
We generate synthetic datasets of images and corresponding captions at scale, with no human intervention.
arXiv Detail & Related papers (2024-02-02T18:59:58Z)
- Utilizing Synthetic Data for Medical Vision-Language Pre-training: Bypassing the Need for Real Images
Medical Vision-Language Pre-training learns representations jointly from medical images and paired radiology reports.
We replace real medical images with their synthetic equivalents, generated from authentic medical reports.
Our empirical evaluation reveals that the performance achieved through synthetic data is on par with or even exceeds that obtained with real images.
arXiv Detail & Related papers (2023-10-10T21:29:41Z)
- WSSS4LUAD: Grand Challenge on Weakly-supervised Tissue Semantic Segmentation for Lung Adenocarcinoma
This challenge includes 10,091 patch-level annotations and over 130 million labeled pixels.
The first-place team achieved an mIoU of 0.8413 (tumor: 0.8389, stroma: 0.7931, normal: 0.8919).
arXiv Detail & Related papers (2022-04-13T15:27:05Z)
- Scientific Language Models for Biomedical Knowledge Base Completion: An Empirical Study
We study scientific LMs for KG completion, exploring whether we can tap into their latent knowledge to enhance biomedical link prediction.
We integrate the LM-based models with KG embedding models using a router method that learns to assign each input example to either type of model, which provides a substantial boost in performance.
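The router idea can be sketched as follows: each query goes to exactly one of the two scorers, chosen by a learned classifier over query features. This is an illustrative toy reconstruction with hand-written stand-in scorers and a hard-coded routing rule, not the authors' implementation.

```python
# Illustrative router: send each link-prediction query to whichever
# scorer (LM-based or KG-embedding) is expected to be more reliable.
# Both scorers and the routing rule below are toy stand-ins.

def lm_score(query):
    # stand-in for a language-model-based link-prediction score
    return 0.9 if "drug" in query else 0.2

def kg_score(query):
    # stand-in for a KG-embedding (e.g. translation-based) score
    return 0.8 if "gene" in query else 0.3

def route(query, prefer_lm):
    """Pick one model per example; in the real method, prefer_lm would
    be a learned classifier rather than a fixed rule."""
    return lm_score(query) if prefer_lm(query) else kg_score(query)

# A trivial hand-written 'router': use the LM for drug-related queries.
prefer_lm = lambda q: "drug" in q
print(route("drug treats disease", prefer_lm))      # LM branch
print(route("gene associated disease", prefer_lm))  # KG branch
```

The design point is that routing makes a discrete per-example choice, in contrast to ensembling, which would blend both scores for every example.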
arXiv Detail & Related papers (2021-06-17T17:55:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.