Synthetic Data for Veterinary EHR De-identification: Benefits, Limits, and Safety Trade-offs Under Fixed Compute
- URL: http://arxiv.org/abs/2601.09756v1
- Date: Tue, 13 Jan 2026 19:35:25 GMT
- Title: Synthetic Data for Veterinary EHR De-identification: Benefits, Limits, and Safety Trade-offs Under Fixed Compute
- Authors: David Brundage
- Abstract summary: This study evaluates whether large language model (LLM)-generated synthetic narratives improve de-identification safety. We generated 10,382 synthetic notes using a privacy-preserving "template-only" regime. We conclude that synthetic augmentation is effective for expanding exposure but is complementary, not substitutive, for safety-critical veterinary de-identification.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Veterinary electronic health records (vEHRs) contain privacy-sensitive identifiers that limit secondary use. While PetEVAL provides a benchmark for veterinary de-identification, the domain remains low-resource. This study evaluates whether large language model (LLM)-generated synthetic narratives improve de-identification safety under distinct training regimes, emphasizing (i) synthetic augmentation and (ii) fixed-budget substitution. We conducted a controlled simulation using a PetEVAL-derived corpus (3,750 holdout/1,249 train). We generated 10,382 synthetic notes using a privacy-preserving "template-only" regime where identifiers were removed prior to LLM prompting. Three transformer backbones (PetBERT, VetBERT, Bio_ClinicalBERT) were trained under varying mixtures. Evaluation prioritized document-level leakage rate (the fraction of documents with at least one missed identifier) as the primary safety outcome. Results show that under fixed-sample substitution, replacing real notes with synthetic ones monotonically increased leakage, indicating synthetic data cannot safely replace real supervision. Under compute-matched training, moderate synthetic mixing matched real-only performance, but high synthetic dominance degraded utility. Conversely, epoch-scaled augmentation improved performance: PetBERT span-overlap F1 increased from 0.831 to 0.850 +/- 0.014, and leakage decreased from 6.32% to 4.02% +/- 0.19%. However, these gains largely reflect increased training exposure rather than intrinsic synthetic data quality. Corpus diagnostics revealed systematic synthetic-real mismatches in note length and label distribution that align with persistent leakage. We conclude that synthetic augmentation is effective for expanding exposure but is complementary, not substitutive, for safety-critical veterinary de-identification.
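The primary safety outcome described above, document-level leakage rate, can be made concrete with a short sketch. The span representation and exact-match criterion below are illustrative assumptions, not the paper's actual implementation; the study reports both this leakage rate and span-overlap F1, and its matching rules may differ.

```python
def document_leakage_rate(gold_spans, predicted_spans):
    """Fraction of documents with at least one missed identifier.

    gold_spans / predicted_spans: one entry per document, each a set of
    (start, end) character offsets marking identifier spans.
    """
    leaked = 0
    for gold, pred in zip(gold_spans, predicted_spans):
        # A document "leaks" if any gold identifier has no matching
        # prediction (here: exact span match, a simplifying assumption).
        if any(g not in pred for g in gold):
            leaked += 1
    return leaked / len(gold_spans) if gold_spans else 0.0

# Example: 3 documents; the model misses the identifier in document 2.
gold = [{(0, 5), (10, 15)}, {(3, 8)}, set()]
pred = [{(0, 5), (10, 15)}, set(), set()]
print(document_leakage_rate(gold, pred))  # → 0.3333333333333333
```

Note that this metric is deliberately stricter than span-level F1: a document with nine of ten identifiers redacted still counts as fully leaked, which is why the abstract treats it as the safety-critical outcome.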
Related papers
- Do Generative Metrics Predict YOLO Performance? An Evaluation Across Models, Augmentation Ratios, and Dataset Complexity [43.338311770275745]
We present a controlled evaluation of synthetic augmentation for YOLOv11 across three single-class detection regimes. We benchmark six GAN-, diffusion-, and hybrid-based generators over augmentation ratios from 10% to 150% of the real training split. For each dataset-generator-augmentation configuration, we compute pre-training dataset metrics under a matched-size bootstrap protocol.
arXiv Detail & Related papers (2026-02-20T03:02:36Z)
- Generating High-quality Privacy-preserving Synthetic Data [0.0]
We study a model-agnostic post-processing framework that can be applied on top of any synthetic data generator to improve this trade-off. We instantiate this framework for two neural generative models for tabular data: a feed-forward generator and a variational autoencoder. We evaluate it on three public datasets covering credit card transactions, cardiovascular health, and census-based income.
arXiv Detail & Related papers (2026-02-06T05:03:49Z)
- Boosting Overlapping Organoid Instance Segmentation Using Pseudo-Label Unmixing and Synthesis-Assisted Learning [17.2312303630893]
Organoids are sophisticated in vitro models of human tissues. They are crucial for medical research due to their ability to simulate organ functions and assess drug responses accurately. Accurate organoid instance segmentation is critical for quantifying their dynamic behaviors.
arXiv Detail & Related papers (2026-01-10T17:51:09Z)
- Unveiling and Bridging the Functional Perception Gap in MLLMs: Atomic Visual Alignment and Hierarchical Evaluation via PET-Bench [48.60251555171943]
Multimodal Large Language Models (MLLMs) have demonstrated remarkable proficiency in tasks such as abnormality detection and report generation for anatomical modalities. In this work, we quantify a fundamental functional perception gap: the inability of current vision encoders to decode functional tracer biodistribution independent of morphological priors. We introduce PET-Bench, the first large-scale functional imaging benchmark comprising 52,308 hierarchical QA pairs from 9,732 multi-site, multi-tracer PET studies. Our results demonstrate that AVA effectively bridges the perception gap, transforming CoT from a source of hallucination into a robust inference tool and improving diagnostic
arXiv Detail & Related papers (2026-01-06T05:58:50Z)
- One-shot synthesis of rare gastrointestinal lesions improves diagnostic accuracy and clinical training [45.49415063761575]
EndoRare is a one-shot, retraining-free generative framework that synthesizes diverse, high-fidelity lesion exemplars from a single reference image. We validated the framework across four rare pathologies. These results establish a practical, data-efficient pathway to bridge the rare-disease gap in both computer-aided diagnostics and clinical education.
arXiv Detail & Related papers (2025-12-30T15:07:09Z)
- CONFIDE: Hallucination Assessment for Reliable Biomolecular Structure Prediction and Design [46.12506067241116]
We present CODE (Chain of Diffusion Embeddings), a self-evaluating metric to quantify topological frustration. We propose CONFIDE, a unified evaluation framework that combines energetic and topological perspectives. By combining data-driven embeddings with theoretical insight, CODE and CONFIDE outperform existing metrics across a wide range of biomolecular systems.
arXiv Detail & Related papers (2025-11-20T03:38:46Z)
- Predictive Modeling and Explainable AI for Veterinary Safety Profiles, Residue Assessment, and Health Outcomes Using Real-World Data and Physicochemical Properties [4.53318808068234]
Adverse events (AEs) may signal unexpected or toxicokinetic effects, increasing the risk of violative residues in the food chain. This study introduces a predictive framework for classifying outcomes (Death vs. Recovery) using 1.28 million reports from the U.S. FDA's OpenFDA Center for Veterinary Medicine.
arXiv Detail & Related papers (2025-10-01T23:34:46Z)
- <think> So let's replace this phrase with insult... </think> Lessons learned from generation of toxic texts with LLMs [60.169913160819]
This paper explores the possibility of using synthetic toxic data as an alternative to human-generated data for training models for detoxification. Experiments show that models fine-tuned on synthetic data consistently perform worse than those trained on human data. The root cause is identified as a critical lexical diversity gap: LLMs generate toxic content using a small, repetitive vocabulary of insults that fails to capture the nuances and variety of human toxicity.
arXiv Detail & Related papers (2025-09-10T07:48:24Z)
- Improving Heart Rejection Detection in XPCI Images Using Synthetic Data Augmentation [0.0]
StyleGAN was trained on available 3R biopsy patches and subsequently used to generate 10,000 realistic synthetic images. These were combined with real 0R samples (that is, samples without rejection) in various configurations to train ResNet-18 classifiers for binary rejection classification. Results demonstrate that synthetic data improves classification performance, particularly when used in combination with real samples.
arXiv Detail & Related papers (2025-05-26T09:26:36Z)
- SAU: A Dual-Branch Network to Enhance Long-Tailed Recognition via Generative Models [9.340077455871736]
Long-tailed distributions in image recognition pose a considerable challenge due to the severe imbalance between a few dominant head classes and the many underrepresented tail classes.
Recently, the use of large generative models to create synthetic data for image classification has become practical.
We propose the use of synthetic data as a complement to long-tailed datasets to mitigate the impact of data imbalance.
arXiv Detail & Related papers (2024-08-29T05:33:59Z)
- FactPEGASUS: Factuality-Aware Pre-training and Fine-tuning for Abstractive Summarization [91.46015013816083]
We present FactPEGASUS, an abstractive summarization model that addresses the problem of factuality during pre-training and fine-tuning.
Our analysis suggests FactPEGASUS is more factual than using the original pre-training objective in zero-shot and few-shot settings.
arXiv Detail & Related papers (2022-05-16T17:39:14Z)
- Improved Certified Defenses against Data Poisoning with (Deterministic) Finite Aggregation [122.83280749890078]
We propose an improved certified defense against general poisoning attacks, namely Finite Aggregation.
In contrast to DPA (Deep Partition Aggregation), which directly splits the training set into disjoint subsets, our method first splits the training set into smaller disjoint subsets.
We offer an alternative view of our method, bridging the designs of deterministic and aggregation-based certified defenses.
arXiv Detail & Related papers (2022-02-05T20:08:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.