Related papers: Data-Efficient Generation for Dataset Distillation

URL: http://arxiv.org/abs/2409.03929v1
Date: Thu, 5 Sep 2024 22:31:53 GMT
Title: Data-Efficient Generation for Dataset Distillation
Authors: Zhe Li, Weitong Zhang, Sarah Cechnicka, Bernhard Kainz
Abstract summary: We train a conditional latent diffusion model capable of generating realistic synthetic images with labels. We demonstrate that models can be effectively trained using only a small set of synthetic images and evaluated on a large real test set.
Score: 12.106527496044473
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: While deep learning techniques have proven successful in image-related tasks, the exponentially increased data storage and computation costs become a significant challenge. Dataset distillation addresses these challenges by synthesizing only a few images for each class that encapsulate all essential information. Most current methods focus on matching. The problems lie in the synthetic images not being human-readable and the dataset performance being insufficient for downstream learning tasks. Moreover, the distillation time can quickly get out of bounds when the number of synthetic images per class increases even slightly. To address this, we train a class conditional latent diffusion model capable of generating realistic synthetic images with labels. The sampling time can be reduced to several tens of images per second. We demonstrate that models can be effectively trained using only a small set of synthetic images and evaluated on a large real test set. Our approach achieved rank \(1\) in The First Dataset Distillation Challenge at ECCV 2024 on the CIFAR100 and TinyImageNet datasets.

Related papers

LoFT: LoRA-fused Training Dataset Generation with Few-shot Guidance [96.6544564242316]
We introduce a novel dataset generation framework named LoFT, LoRA-Fused Training-data Generation with Few-shot Guidance. Our method fine-tunes LoRA weights on individual real images and fuses them at inference time, producing synthetic images that combine the features of real images for improved diversity and fidelity of generated data. Our experiments show that training on LoFT-generated data consistently outperforms other synthetic dataset methods, significantly increasing accuracy as the dataset size increases.
arXiv Detail & Related papers (2025-05-16T21:17:55Z)
CO-SPY: Combining Semantic and Pixel Features to Detect Synthetic Images by AI [58.35348718345307]
Current efforts to distinguish between real and AI-generated images may lack generalization. We propose a novel framework, Co-Spy, that first enhances existing semantic features. We also create Co-Spy-Bench, a comprehensive dataset comprising 5 real image datasets and 22 state-of-the-art generative models.
arXiv Detail & Related papers (2025-03-24T01:59:29Z)
Large-Scale Data-Free Knowledge Distillation for ImageNet via Multi-Resolution Data Generation [53.95204595640208]
Data-Free Knowledge Distillation (DFKD) is an advanced technique that enables knowledge transfer from a teacher model to a student model without relying on original training data. Previous approaches have generated synthetic images at high resolutions without leveraging information from real images. MUSE generates images at lower resolutions while using Class Activation Maps (CAMs) to ensure that the generated images retain critical, class-specific features.
arXiv Detail & Related papers (2024-11-26T02:23:31Z)
Scaling Backwards: Minimal Synthetic Pre-training? [52.78699562832907]
We show that pre-training is effective even with minimal synthetic images. We find that a substantial reduction of synthetic images from 1k to 1 can even lead to an increase in pre-training performance. We extend our method from synthetic images to real images to see if a single real image can show a similar pre-training effect.
arXiv Detail & Related papers (2024-08-01T16:20:02Z)
Zero-Shot Distillation for Image Encoders: How to Make Effective Use of Synthetic Data [40.37396692278567]
We focus on training smaller variants of the image encoder, which suffices for efficient zero-shot classification. The use of synthetic data has shown promise in distilling representations from larger teachers, resulting in strong few-shot and linear probe performance. We find that this approach surprisingly fails in true zero-shot settings when using contrastive losses.
arXiv Detail & Related papers (2024-04-25T14:24:41Z)
Data Distillation Can Be Like Vodka: Distilling More Times For Better Quality [78.6359306550245]
We argue that using just one synthetic subset for distillation will not yield optimal generalization performance. PDD synthesizes multiple small sets of synthetic images, each conditioned on the previous sets, and trains the model on the cumulative union of these subsets. Our experiments show that PDD can effectively improve the performance of existing dataset distillation methods by up to 4.3%.
arXiv Detail & Related papers (2023-10-10T20:04:44Z)
DataDAM: Efficient Dataset Distillation with Attention Matching [15.300968899043498]
Researchers have long tried to minimize training costs in deep learning while maintaining strong generalization across diverse datasets. Emerging research on dataset distillation aims to reduce training costs by creating a small synthetic set that contains the information of a larger real dataset. However, the synthetic data generated by previous methods are not guaranteed to distribute and discriminate as well as the original training data.
arXiv Detail & Related papers (2023-09-29T19:07:48Z)
Image Captions are Natural Prompts for Text-to-Image Models [70.30915140413383]
We analyze the relationship between the training effect of synthetic data and the synthetic data distribution induced by prompts. We propose a simple yet effective method that prompts text-to-image generative models to synthesize more informative and diverse training data. Our method significantly improves the performance of models trained on synthetic training data.
arXiv Detail & Related papers (2023-07-17T14:38:11Z)
Synthetic Data for Object Classification in Industrial Applications [53.180678723280145]
In object classification, capturing a large number of images per object and in different conditions is not always possible. This work explores the creation of artificial images using a game engine to cope with limited data in the training dataset.
arXiv Detail & Related papers (2022-12-09T11:43:04Z)
PennSyn2Real: Training Object Recognition Models without Human Labeling [12.923677573437699]
We propose PennSyn2Real - a synthetic dataset consisting of more than 100,000 4K images of more than 20 types of micro aerial vehicles (MAVs). The dataset can be used to generate arbitrary numbers of training images for high-level computer vision tasks such as MAV detection and classification. We show that synthetic data generated using this framework can be directly used to train CNN models for common object recognition tasks such as detection and segmentation.
arXiv Detail & Related papers (2020-09-22T02:53:40Z)
Syn2Real Transfer Learning for Image Deraining using Gaussian Processes [92.15895515035795]
CNN-based methods for image deraining have achieved excellent performance in terms of reconstruction error as well as visual quality. Due to challenges in obtaining real-world fully-labeled image deraining datasets, existing methods are trained only on synthetically generated data. We propose a Gaussian Process-based semi-supervised learning framework which enables the network to learn to derain using a synthetic dataset.
arXiv Detail & Related papers (2020-06-10T00:33:18Z)
Can Synthetic Data Improve Object Detection Results for Remote Sensing Images? [15.466412729455874]
We propose the use of realistic synthetic data with a wide distribution to improve the performance of remote sensing image aircraft detection. We randomly set the parameters during rendering, such as the size of the instance and the class of background images. In order to make the synthetic images more realistic, we refine the synthetic images at the pixel level using CycleGAN with real unlabeled images.
arXiv Detail & Related papers (2020-06-09T02:23:22Z)

This list is automatically generated from the titles and abstracts of the papers on this site.