Label-Consistent Dataset Distillation with Detector-Guided Refinement
- URL: http://arxiv.org/abs/2507.13074v1
- Date: Thu, 17 Jul 2025 12:42:54 GMT
- Title: Label-Consistent Dataset Distillation with Detector-Guided Refinement
- Authors: Yawen Zou, Guang Li, Zi Wang, Chunzhi Gu, Chao Zhang,
- Abstract summary: We propose a detector-guided dataset distillation framework to generate compact yet informative datasets. Our method can synthesize high-quality representative images with richer details, achieving state-of-the-art performance on the validation set.
- Score: 9.74050046377107
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dataset distillation (DD) aims to generate a compact yet informative dataset that achieves performance comparable to the original dataset, thereby reducing demands on storage and computational resources. Although diffusion models have made significant progress in dataset distillation, the generated surrogate datasets often contain samples with label inconsistencies or insufficient structural detail, leading to suboptimal downstream performance. To address these issues, we propose a detector-guided dataset distillation framework that explicitly leverages a pre-trained detector to identify and refine anomalous synthetic samples, thereby ensuring label consistency and improving image quality. Specifically, a detector model trained on the original dataset is employed to identify anomalous images exhibiting label mismatches or low classification confidence. For each defective image, multiple candidates are generated using a pre-trained diffusion model conditioned on the corresponding image prototype and label. The optimal candidate is then selected by jointly considering the detector's confidence score and dissimilarity to existing qualified synthetic samples, thereby ensuring both label accuracy and intra-class diversity. Experimental results demonstrate that our method can synthesize high-quality representative images with richer details, achieving state-of-the-art performance on the validation set.
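The detect-then-refine loop described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the confidence threshold, the use of cosine distance on feature embeddings, the additive scoring rule with a `weight` parameter, and the function names `flag_anomalous` and `select_candidate` are all assumptions introduced here. The detector outputs and the diffusion-generated candidates are represented by plain arrays.

```python
import numpy as np


def flag_anomalous(probs, labels, conf_threshold=0.5):
    """Flag samples whose predicted class differs from the assigned label,
    or whose detector confidence in the assigned label is low.

    probs: (n_samples, n_classes) detector class probabilities.
    labels: (n_samples,) assigned integer labels.
    conf_threshold: hypothetical cutoff, not taken from the paper.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    predicted = probs.argmax(axis=1)
    confidence = probs[np.arange(len(labels)), labels]
    return (predicted != labels) | (confidence < conf_threshold)


def cosine_distance(a, b):
    """Pairwise cosine distance between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return 1.0 - a @ b.T


def select_candidate(cand_conf, cand_feats, accepted_feats, weight=0.5):
    """Pick the candidate index with the best joint score.

    The score is detector confidence plus `weight` times the distance to
    the nearest already-accepted sample (an assumed combination rule).
    """
    cand_conf = np.asarray(cand_conf, dtype=float)
    cand_feats = np.asarray(cand_feats, dtype=float)
    accepted_feats = np.asarray(accepted_feats, dtype=float)
    if accepted_feats.size == 0:
        # Nothing accepted yet, so only confidence can rank candidates.
        dissimilarity = np.zeros(len(cand_conf))
    else:
        dissimilarity = cosine_distance(cand_feats, accepted_feats).min(axis=1)
    score = cand_conf + weight * dissimilarity
    return int(score.argmax())


if __name__ == "__main__":
    # Toy detector outputs for three synthetic images, all labeled class 0.
    probs = [[0.9, 0.1], [0.4, 0.6], [0.55, 0.45]]
    labels = [0, 0, 0]
    print(flag_anomalous(probs, labels))

    # Two candidate replacements for a flagged image; one sample accepted.
    print(select_candidate([0.9, 0.85], [[1, 0], [0, 1]], [[1, 0]]))
```

Under this scoring rule, a slightly less confident candidate can win when it is far from the samples already accepted, which is one way to trade label accuracy against intra-class diversity.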
Related papers
- Enhancing Diffusion-based Dataset Distillation via Adversary-Guided Curriculum Sampling [22.21686398518648]
Adversary-guided Curriculum Sampling (ACS) partitions distilled dataset into multiple curricula. ACS guides diffusion sampling process by an adversarial loss to challenge a discriminator trained on sampled images. ACS achieves substantial improvements of 4.1% on Imagewoof and 2.1% on ImageNet-1k over the state-of-the-art.
arXiv Detail & Related papers (2025-08-02T08:48:32Z)
- DiffDoctor: Diagnosing Image Diffusion Models Before Treating [57.82359018425674]
We propose DiffDoctor, a two-stage pipeline to assist image diffusion models in generating fewer artifacts. We collect a dataset of over 1M flawed synthesized images and set up an efficient human-in-the-loop annotation process. The learned artifact detector is then involved in the second stage to optimize the diffusion model by providing pixel-level feedback.
arXiv Detail & Related papers (2025-01-21T18:56:41Z)
- A Bias-Free Training Paradigm for More General AI-generated Image Detection [15.421102443599773]
A well-designed forensic detector should detect generator specific artifacts rather than reflect data biases. We propose B-Free, a bias-free training paradigm, where fake images are generated from real ones. We show significant improvements in both generalization and robustness over state-of-the-art detectors.
arXiv Detail & Related papers (2024-12-23T15:54:32Z)
- Heavy Labels Out! Dataset Distillation with Label Space Lightening [69.67681224137561]
HeLlO aims at effective image-to-label projectors, with which synthetic labels can be directly generated online from synthetic images.
We demonstrate that with only about 0.003% of the original storage required for a complete set of soft labels, we achieve comparable performance to current state-of-the-art dataset distillation methods on large-scale datasets.
arXiv Detail & Related papers (2024-08-15T15:08:58Z)
- Visual Car Brand Classification by Implementing a Synthetic Image Dataset Creation Pipeline [3.524869467682149]
We propose an automatic pipeline for generating synthetic image datasets using Stable Diffusion.
We leverage YOLOv8 for automatic bounding box detection and quality assessment of synthesized images.
arXiv Detail & Related papers (2024-06-03T07:44:08Z)
- DetDiffusion: Synergizing Generative and Perceptive Models for Enhanced Data Generation and Perception [78.26734070960886]
Current perceptive models heavily depend on resource-intensive datasets.
We introduce perception-aware loss (P.A. loss) through segmentation, improving both quality and controllability.
Our method customizes data augmentation by extracting and utilizing perception-aware attribute (P.A. Attr) during generation.
arXiv Detail & Related papers (2024-03-20T04:58:03Z)
- Model Selection of Anomaly Detectors in the Absence of Labeled Validation Data [18.233908098602114]
We propose SWSA: a framework to select image-based anomaly detectors without labeled validation data.
Instead of collecting labeled validation data, we generate synthetic anomalies without any training or fine-tuning.
Our synthetic anomalies are used to create detection tasks that compose a validation framework for model selection.
arXiv Detail & Related papers (2023-10-16T14:42:22Z)
- Synthetic Augmentation with Large-scale Unconditional Pre-training [4.162192894410251]
We propose a synthetic augmentation method called HistoDiffusion to reduce the dependency on annotated data.
HistoDiffusion can be pre-trained on large-scale unlabeled datasets and later applied to a small-scale labeled dataset for augmented training.
We evaluate our proposed method by pre-training on three histopathology datasets and testing on a histopathology dataset of colorectal cancer (CRC) excluded from the pre-training datasets.
arXiv Detail & Related papers (2023-08-08T03:34:04Z)
- Generative Modeling Helps Weak Supervision (and Vice Versa) [87.62271390571837]
We propose a model fusing weak supervision and generative adversarial networks.
It captures discrete variables in the data alongside the weak supervision derived label estimate.
It is the first approach to enable data augmentation through weakly supervised synthetic images and pseudolabels.
arXiv Detail & Related papers (2022-03-22T20:24:21Z)
- Semi-supervised Salient Object Detection with Effective Confidence Estimation [35.0990691497574]
We study semi-supervised salient object detection with access to a small number of labeled samples and a large number of unlabeled samples.
We model the nature of human saliency labels using the latent variable of the Conditional Energy-based Model.
With only 1/16 labeled samples, our model achieves competitive performance compared with state-of-the-art fully-supervised models.
arXiv Detail & Related papers (2021-12-28T07:14:48Z)
- Negative Data Augmentation [127.28042046152954]
We show that negative data augmentation samples provide information on the support of the data distribution.
We introduce a new GAN training objective where we use NDA as an additional source of synthetic data for the discriminator.
Empirically, models trained with our method achieve improved conditional/unconditional image generation along with improved anomaly detection capabilities.
arXiv Detail & Related papers (2021-02-09T20:28:35Z)
- Weakly Supervised-Based Oversampling for High Imbalance and High Dimensionality Data Classification [2.9283685972609494]
Oversampling is an effective method to solve imbalanced classification.
Inaccurate labels of synthetic samples would distort the distribution of the dataset.
This paper introduces the idea of weakly supervised learning to handle the inaccurate labeling of synthetic samples.
arXiv Detail & Related papers (2020-09-29T15:26:34Z)
- Semi-supervised Medical Image Classification with Relation-driven Self-ensembling Model [71.80319052891817]
We present a relation-driven semi-supervised framework for medical image classification.
It exploits the unlabeled data by encouraging the prediction consistency of given input under perturbations.
Our method outperforms many state-of-the-art semi-supervised learning methods on both single-label and multi-label image classification scenarios.
arXiv Detail & Related papers (2020-05-15T06:57:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.