A Typology of Synthetic Datasets for Dialogue Processing in Clinical Contexts
- URL: http://arxiv.org/abs/2505.03025v1
- Date: Mon, 05 May 2025 20:58:08 GMT
- Title: A Typology of Synthetic Datasets for Dialogue Processing in Clinical Contexts
- Authors: Steven Bedrick, A. Seza Doğruöz, Sergiu Nisioi
- Abstract summary: We provide an overview of how synthetic datasets are created, evaluated, and used for dialogue-related tasks in the medical domain. We propose a novel typology for classifying types and degrees of data synthesis, to facilitate comparison and evaluation.
- Score: 1.215281324470423
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Synthetic datasets are used across linguistic domains and NLP tasks, particularly in scenarios where authentic data is limited (or even non-existent). One such domain is that of clinical (healthcare) contexts, where significant and long-standing challenges (e.g., privacy, anonymization, and data governance) have led to the development of an increasing number of synthetic datasets. One increasingly important category of clinical dataset is that of clinical dialogues, which are especially sensitive and difficult to collect and, as such, are commonly synthesized. While such synthetic datasets have been shown to be sufficient in some situations, little theory exists to inform how they may be best used and generalized to new applications. In this paper, we provide an overview of how synthetic datasets are created, evaluated, and used for dialogue-related tasks in the medical domain. Additionally, we propose a novel typology for classifying types and degrees of data synthesis, to facilitate comparison and evaluation.
Related papers
- Data-Constrained Synthesis of Training Data for De-Identification [0.0]
We domain-adapt large language models (LLMs) to the clinical domain. We generate synthetic clinical texts that are machine-annotated with tags for personally identifiable information. The synthetic corpora are then used to train synthetic NER models.
arXiv Detail & Related papers (2025-02-20T16:09:27Z)
- SynSUM -- Synthetic Benchmark with Structured and Unstructured Medical Records [6.897301398584943]
We present the SynSUM benchmark, a synthetic dataset linking unstructured clinical notes to structured background variables. The dataset consists of 10,000 artificial patient records containing a fictional patient encounter in the domain of respiratory diseases.
arXiv Detail & Related papers (2024-09-13T15:55:15Z)
- A Novel Taxonomy for Navigating and Classifying Synthetic Data in Healthcare Applications [9.66493160220239]
This paper proposes a novel taxonomy of synthetic data in healthcare to navigate the landscape in terms of three main varieties.
Data Proportion comprises different ratios of synthetic data in a dataset and associated pros and cons.
Data Modality refers to the different data formats amenable to synthesis and format-specific challenges.
Data Transformation concerns improving specific aspects of a dataset like its utility or privacy with synthetic data.
arXiv Detail & Related papers (2024-09-01T12:04:03Z)
- Generative AI for Synthetic Data Across Multiple Medical Modalities: A Systematic Review of Recent Developments and Challenges [2.1835659964186087]
This paper presents a systematic review of generative models used to synthesize various medical data types.
Our study encompasses a broad array of medical data modalities and explores various generative models.
arXiv Detail & Related papers (2024-06-27T14:00:11Z)
- Knowledge-Infused Prompting: Assessing and Advancing Clinical Text Data Generation with Large Language Models [46.32860360019374]
Large language models (LLMs) have shown promise in this domain, but their direct deployment can lead to privacy issues. We propose an innovative, resource-efficient approach, ClinGen, which infuses knowledge into the process. Our empirical study across 7 clinical NLP tasks and 16 datasets reveals that ClinGen consistently enhances performance across various tasks.
arXiv Detail & Related papers (2023-11-01T04:37:28Z)
- Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
Ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
- Exploring the Potential of AI-Generated Synthetic Datasets: A Case Study on Telematics Data with ChatGPT [0.0]
This research delves into the construction and utilization of synthetic datasets, specifically within the telematics sphere, leveraging OpenAI's powerful language model, ChatGPT.
To illustrate this data creation process, a hands-on case study is conducted, focusing on the generation of a synthetic telematics dataset.
arXiv Detail & Related papers (2023-06-23T15:15:13Z)
- Synthetic Data in Healthcare [10.555189948915492]
We present the cases for physical and statistical simulations for creating data and the proposed applications in healthcare and medicine.
We discuss how, while synthetic data can promote privacy, equity, safety, and continual and causal learning, it also runs the risk of introducing flaws and blind spots and of propagating or exaggerating biases.
arXiv Detail & Related papers (2023-04-06T17:23:39Z)
- Synthetic-to-Real Domain Adaptation for Action Recognition: A Dataset and Baseline Performances [76.34037366117234]
We introduce a new dataset called Robot Control Gestures (RoCoG-v2).
The dataset is composed of both real and synthetic videos from seven gesture classes.
We present results using state-of-the-art action recognition and domain adaptation algorithms.
arXiv Detail & Related papers (2023-03-17T23:23:55Z)
- PLACES: Prompting Language Models for Social Conversation Synthesis [103.94325597273316]
We use a small set of expert-written conversations as in-context examples to synthesize a social conversation dataset using prompting.
We perform several thorough evaluations of our synthetic conversations compared to human-collected conversations.
arXiv Detail & Related papers (2023-02-07T05:48:16Z)
- Generating Realistic Synthetic Relational Data through Graph Variational Autoencoders [47.89542334125886]
We combine the variational autoencoder framework with graph neural networks to generate realistic synthetic relational databases.
The results indicate that real databases' structures are accurately preserved in the resulting synthetic datasets.
arXiv Detail & Related papers (2022-11-30T10:40:44Z)
- GDPR Compliant Collection of Therapist-Patient-Dialogues [48.091760741427656]
We elaborate on the challenges we faced in starting our collection of therapist-patient dialogues in a psychiatry clinic under the General Data Protection Regulation of the European Union.
We give an overview of each step in our procedure and point out the potential pitfalls to motivate further research in this field.
arXiv Detail & Related papers (2022-11-22T15:51:10Z)
- Self-supervised Answer Retrieval on Clinical Notes [68.87777592015402]
We introduce CAPR, a rule-based self-supervision objective for training Transformer language models for domain-specific passage matching.
We apply our objective in four Transformer-based architectures: Contextual Document Vectors, Bi-, Poly- and Cross-encoders.
We report that CAPR outperforms strong baselines in the retrieval of domain-specific passages and effectively generalizes across rule-based and human-labeled passages.
arXiv Detail & Related papers (2021-08-02T10:42:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.