Improving Clinical Dataset Condensation with Mode Connectivity-based Trajectory Surrogates
- URL: http://arxiv.org/abs/2510.05805v2
- Date: Thu, 16 Oct 2025 18:34:15 GMT
- Title: Improving Clinical Dataset Condensation with Mode Connectivity-based Trajectory Surrogates
- Authors: Pafue Christy Nganjimi, Andrew Soltan, Danielle Belgrave, Lei Clifton, David A. Clifton, Anshul Thakur
- Abstract summary: State-of-the-art dataset condensation (DC) enables the creation of privacy-preserving synthetic datasets that can match the utility of real patient records. DC methods supervise synthetic data by aligning the training dynamics of models trained on real data and those trained on synthetic data. We address these limitations by replacing full SGD trajectories with smooth, low-loss parametric surrogates. These mode-connected paths provide noise-free, low-curvature supervision signals that stabilise gradients, accelerate convergence, and eliminate the need for dense trajectory storage.
- Score: 15.665823714894605
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dataset condensation (DC) enables the creation of compact, privacy-preserving synthetic datasets that can match the utility of real patient records, supporting democratised access to highly regulated clinical data for developing downstream clinical models. State-of-the-art DC methods supervise synthetic data by aligning the training dynamics of models trained on real and those trained on synthetic data, typically using full stochastic gradient descent (SGD) trajectories as alignment targets; however, these trajectories are often noisy, high-curvature, and storage-intensive, leading to unstable gradients, slow convergence, and substantial memory overhead. We address these limitations by replacing full SGD trajectories with smooth, low-loss parametric surrogates, specifically quadratic Bézier curves that connect the initial and final model states from real training trajectories. These mode-connected paths provide noise-free, low-curvature supervision signals that stabilise gradients, accelerate convergence, and eliminate the need for dense trajectory storage. We theoretically justify Bézier-mode connections as effective surrogates for SGD paths and empirically show that the proposed method outperforms state-of-the-art condensation approaches across five clinical datasets, yielding condensed datasets that enable clinically effective model development.
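For intuition, the quadratic Bézier surrogate described in the abstract can be sketched in a few lines: a path in parameter space from an initial model state to a final one, bent through a single control point. The names and toy values below are illustrative only; in the paper the control point is presumably learned so that the whole path stays in a low-loss region, whereas here it is fixed by hand.

```python
import numpy as np

def bezier_surrogate(theta0, theta1, theta_c, t):
    # Quadratic Bezier point at t in [0, 1]: interpolates between the
    # initial parameter vector theta0 and the final parameter vector
    # theta1, bent through the control point theta_c.
    return (1 - t) ** 2 * theta0 + 2 * t * (1 - t) * theta_c + t ** 2 * theta1

# Toy 2-D "model states" (illustrative values, not from the paper).
theta0 = np.array([0.0, 0.0])    # initial model state
theta1 = np.array([1.0, 1.0])    # final model state
theta_c = np.array([0.5, 1.5])   # control point lifted off the straight chord

# A smooth, noise-free trajectory sampled at 5 points, usable as an
# alignment target in place of a stored SGD trajectory.
path = [bezier_surrogate(theta0, theta1, theta_c, t) for t in np.linspace(0, 1, 5)]
```

Because the curve is defined by just three parameter vectors, only the endpoints and the control point need storing, rather than every intermediate SGD checkpoint.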
Related papers
- ManifoldGD: Training-Free Hierarchical Manifold Guidance for Diffusion-Based Dataset Distillation [9.230247128710865]
We propose a training-free diffusion-based framework that integrates manifold-consistent guidance at every denoising timestep. ManifoldGD improves representativeness, diversity, and image fidelity without requiring any model retraining.
arXiv Detail & Related papers (2026-02-26T18:07:10Z) - Bladder Vessel Segmentation using a Hybrid Attention-Convolution Framework [1.6270825077960571]
Urinary bladder cancer surveillance requires tracking tumor sites across repeated interventions, yet the deformable and hollow bladder lacks stable landmarks for orientation. We introduce a Hybrid Attention-Convolution architecture that combines Transformers to capture global vessel topology prior to a CNN. To prioritize structural connectivity, the Transformer is trained on optimized ground truth data that exclude short and terminal branches.
arXiv Detail & Related papers (2026-02-10T16:34:17Z) - Consolidation or Adaptation? PRISM: Disentangling SFT and RL Data via Gradient Concentration [56.074760766965085]
PRISM is a dynamics-aware framework that arbitrates data based on its degree of cognitive conflict with the model's existing knowledge. Our findings suggest that disentangling data based on internal optimization regimes is crucial for scalable and robust agent alignment.
arXiv Detail & Related papers (2026-01-12T05:43:20Z) - Grad-CL: Source Free Domain Adaptation with Gradient Guided Feature Disalignment [3.2371089062298317]
Grad-CL is a novel source-free domain adaptation framework. It adapts segmentation performance without requiring access to original source data. It outperforms state-of-the-art unsupervised and source-free domain adaptation methods.
arXiv Detail & Related papers (2025-09-12T10:51:46Z) - Latent Space Synergy: Text-Guided Data Augmentation for Direct Diffusion Biomedical Segmentation [2.4912767911151015]
We present SynDiff, a framework combining text-guided synthetic data generation with efficient diffusion-based segmentation. Our approach employs latent diffusion models to generate clinically realistic synthetic polyps through text-conditioned inpainting. On CVC-ClinicDB, SynDiff achieves 96.4% Dice and 92.9% IoU while maintaining real-time capability suitable for clinical deployment.
arXiv Detail & Related papers (2025-07-21T08:15:17Z) - Trajectory Consistency Distillation: Improved Latent Consistency Distillation by Semi-Linear Consistency Function with Trajectory Mapping [75.72212215739746]
Trajectory Consistency Distillation (TCD) encompasses a trajectory consistency function and strategic sampling.
TCD not only significantly enhances image quality at low NFEs but also yields more detailed results compared to the teacher model.
arXiv Detail & Related papers (2024-02-29T13:44:14Z) - Latent Class-Conditional Noise Model [54.56899309997246]
We introduce a Latent Class-Conditional Noise model (LCCN) to parameterize the noise transition under a Bayesian framework.
We then deduce a dynamic label regression method for LCCN, whose Gibbs sampler allows us to efficiently infer the latent true labels.
Our approach safeguards the stable update of the noise transition, which avoids previous arbitrarily tuning from a mini-batch of samples.
arXiv Detail & Related papers (2023-02-19T15:24:37Z) - Minimizing the Accumulated Trajectory Error to Improve Dataset Distillation [151.70234052015948]
We propose a novel approach that encourages the optimization algorithm to seek a flat trajectory.
We show that the weights trained on synthetic data are robust against accumulated-error perturbations, thanks to the regularization towards the flat trajectory.
Our method, called Flat Trajectory Distillation (FTD), is shown to boost the performance of gradient-matching methods by up to 4.7%.
arXiv Detail & Related papers (2022-11-20T15:49:11Z) - Dataset Condensation via Efficient Synthetic-Data Parameterization [40.56817483607132]
Machine learning with massive amounts of data comes at a price of huge computation costs and storage for training and tuning.
Recent studies on dataset condensation attempt to reduce the dependence on such massive data by synthesizing a compact training dataset.
We propose a novel condensation framework that generates multiple synthetic data with a limited storage budget via efficient parameterization considering data regularity.
arXiv Detail & Related papers (2022-05-30T09:55:31Z) - CAFE: Learning to Condense Dataset by Aligning Features [72.99394941348757]
We propose a novel scheme to Condense dataset by Aligning FEatures (CAFE).
At the heart of our approach is an effective strategy to align features from the real and synthetic data across various scales.
We validate the proposed CAFE across various datasets, and demonstrate that it generally outperforms the state of the art.
arXiv Detail & Related papers (2022-03-03T05:58:49Z) - COVI-AgentSim: an Agent-based Model for Evaluating Methods of Digital Contact Tracing [68.68882022019272]
COVI-AgentSim is an agent-based compartmental simulator based on virology, disease progression, social contact networks, and mobility patterns.
We use COVI-AgentSim to perform cost-adjusted analyses comparing no DCT to: 1) standard binary contact tracing (BCT) that assigns binary recommendations based on binary test results; and 2) a rule-based method for feature-based contact tracing (FCT) that assigns a graded level of recommendation based on diverse individual features.
arXiv Detail & Related papers (2020-10-30T00:47:01Z)