Related papers: Dataset Distillation from First Principles: Integrating Core Information Extraction and Purposeful Learning

URL: http://arxiv.org/abs/2409.01410v1
Date: Mon, 2 Sep 2024 18:11:15 GMT
Title: Dataset Distillation from First Principles: Integrating Core Information Extraction and Purposeful Learning
Authors: Vyacheslav Kungurtsev, Yuanfang Peng, Jianyang Gu, Saeed Vahidian, Anthony Quinn, Fadwa Idlahcen, Yiran Chen
Abstract summary: We argue that a precise characterization of the underlying optimization problem must specify the inference task associated with the application of interest. Our formalization reveals novel applications of DD across different modeling environments. We present numerical results for two case studies important in contemporary settings.
Score: 10.116674195405126
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Dataset distillation (DD) is an increasingly important technique that focuses on constructing a synthetic dataset capable of capturing the core information in training data to achieve comparable performance in models trained on the latter. While DD has a wide range of applications, the theory supporting it is less well evolved. New methods of DD are compared on a common set of benchmarks, rather than oriented towards any particular learning task. In this work, we present a formal model of DD, arguing that a precise characterization of the underlying optimization problem must specify the inference task associated with the application of interest. Without this task-specific focus, the DD problem is under-specified, and the selection of a DD algorithm for a particular task is merely heuristic. Our formalization reveals novel applications of DD across different modeling environments. We analyze existing DD methods through this broader lens, highlighting their strengths and limitations in terms of accuracy and faithfulness to optimal DD operation. Finally, we present numerical results for two case studies important in contemporary settings. Firstly, we address a critical challenge in medical data analysis: merging the knowledge from different datasets composed of intersecting, but not identical, sets of features, in order to construct a larger dataset in what is usually a small sample setting. Secondly, we consider out-of-distribution error across boundary conditions for physics-informed neural networks (PINNs), showing the potential for DD to provide more physically faithful data. By establishing this general formulation of DD, we aim to establish a new research paradigm by which DD can be understood and from which new DD techniques can arise.
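
For reference, DD is often written as a bilevel optimization problem. The sketch below uses that common formulation, which is not necessarily the one developed in the paper. The symbols are assumptions introduced here for illustration: $\mathcal{T}$ is the original training set, $\mathcal{S}$ is the synthetic set with $|\mathcal{S}| \ll |\mathcal{T}|$, $f_{\theta}$ is a model with parameters $\theta$, and $\ell$ is a loss function.

```latex
\begin{aligned}
\mathcal{S}^{*} &= \arg\min_{\mathcal{S}} \; \frac{1}{|\mathcal{T}|} \sum_{(x,y)\in\mathcal{T}} \ell\bigl(f_{\theta^{*}(\mathcal{S})}(x),\, y\bigr), \\
\text{s.t.}\quad \theta^{*}(\mathcal{S}) &= \arg\min_{\theta} \; \frac{1}{|\mathcal{S}|} \sum_{(\tilde{x},\tilde{y})\in\mathcal{S}} \ell\bigl(f_{\theta}(\tilde{x}),\, \tilde{y}\bigr).
\end{aligned}
```

Read against the abstract's argument, the outer objective here is a generic training loss. A task-specific formulation would presumably replace it with a criterion tied to the inference task of interest.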

Related papers

CoDA: From Text-to-Image Diffusion Models to Training-Free Dataset Distillation [71.52209438343928]
Core Distribution Alignment (CoDA) is a framework that enables effective Dataset Distillation (DD) using only an off-the-shelf text-to-image model. Our key idea is to first identify the "intrinsic core distribution" of the target dataset using a robust density-based discovery mechanism. By doing so, CoDA effectively bridges the gap between general-purpose generative priors and target semantics.
arXiv Detail & Related papers (2025-12-03T14:45:57Z)
Dynamic-Aware Video Distillation: Optimizing Temporal Resolution Based on Video Semantics [68.85010825225528]
Video datasets present unique challenges due to the presence of temporal information and varying levels of redundancy across different classes. Existing DD approaches assume a uniform level of temporal redundancy across all different video semantics, which limits their effectiveness on video datasets. We propose Dynamic-Aware Video Distillation (DAViD), a Reinforcement Learning (RL) approach to predict the optimal Temporal Resolution of the synthetic videos.
arXiv Detail & Related papers (2025-05-28T11:43:58Z)
DistDD: Distributed Data Distillation Aggregation through Gradient Matching [14.132062317010847]
DistDD is a novel approach within the federated learning framework that reduces the need for repetitive communication by distilling data directly on clients' devices. We provide a detailed convergence proof of the DistDD algorithm, reinforcing its mathematical stability and reliability for practical applications.
arXiv Detail & Related papers (2024-10-11T09:43:35Z)
Not All Samples Should Be Utilized Equally: Towards Understanding and Improving Dataset Distillation [57.6797306341115]
We take an initial step towards understanding various matching-based DD methods from the perspective of sample difficulty. We then extend the neural scaling laws of data pruning to DD to theoretically explain these matching-based methods. We introduce the Sample Difficulty Correction (SDC) approach, designed to predominantly generate easier samples to achieve higher dataset quality.
arXiv Detail & Related papers (2024-08-22T15:20:32Z)
Relative Difficulty Distillation for Semantic Segmentation [54.76143187709987]
We propose a pixel-level KD paradigm for semantic segmentation named Relative Difficulty Distillation (RDD). RDD allows the teacher network to provide effective guidance on learning focus without additional optimization goals. Our research showcases that RDD can integrate with existing KD methods to improve their upper performance bound.
arXiv Detail & Related papers (2024-07-04T08:08:25Z)
Exploring the Impact of Dataset Bias on Dataset Distillation [10.742404631413029]
We investigate the influence of dataset bias on Dataset Distillation (DD). DD is a technique to synthesize a smaller dataset that preserves essential information from the original dataset. Experiments demonstrate that biases present in the original dataset significantly impact the performance of the synthetic dataset.
arXiv Detail & Related papers (2024-03-24T06:10:22Z)
Can pre-trained models assist in dataset distillation? [21.613468512330442]
Pre-trained Models (PTMs) function as knowledge repositories, containing extensive information from the original dataset. This naturally raises a question: Can PTMs effectively transfer knowledge to synthetic datasets, guiding DD accurately? We systematically study different options in PTMs, including initialization parameters, model architecture, training epoch and domain knowledge.
arXiv Detail & Related papers (2023-10-05T03:51:21Z)
Dataset Distillation: A Comprehensive Review [76.26276286545284]
Dataset distillation (DD) aims to derive a much smaller dataset containing synthetic samples, based on which the trained models yield performance comparable with those trained on the original dataset. This paper gives a comprehensive review and summary of recent advances in DD and its application.
arXiv Detail & Related papers (2023-01-17T17:03:28Z)
Deep Unsupervised Domain Adaptation: A Review of Recent Advances and Perspectives [16.68091981866261]
Unsupervised domain adaptation (UDA) is proposed to counter the performance drop on data in a target domain. UDA has yielded promising results on natural image processing, video analysis, natural language processing, time-series data analysis, medical image analysis, etc.
arXiv Detail & Related papers (2022-08-15T20:05:07Z)
Dual-Teacher: Integrating Intra-domain and Inter-domain Teachers for Annotation-efficient Cardiac Segmentation [65.81546955181781]
We propose a novel semi-supervised domain adaptation approach, namely Dual-Teacher. The student model learns the knowledge of unlabeled target data and labeled source data by two teacher models. We demonstrate that our approach is able to concurrently utilize unlabeled data and cross-modality data with superior performance.
arXiv Detail & Related papers (2020-07-13T10:00:44Z)
Generalized ODIN: Detecting Out-of-distribution Image without Learning from Out-of-distribution Data [87.61504710345528]
We propose two strategies for freeing a neural network from tuning with OoD data, while improving its OoD detection performance. We specifically propose to decompose confidence scoring as well as a modified input pre-processing method. Our further analysis on a larger scale image dataset shows that the two types of distribution shifts, specifically semantic shift and non-semantic shift, present a significant difference.
arXiv Detail & Related papers (2020-02-26T04:18:25Z)
Stance Detection Benchmark: How Robust Is Your Stance Detection? [65.91772010586605]
Stance Detection (StD) aims to detect an author's stance towards a certain topic or claim. We introduce a StD benchmark that learns from ten StD datasets of various domains in a multi-dataset learning setting. Within this benchmark setup, we are able to present new state-of-the-art results on five of the datasets.
arXiv Detail & Related papers (2020-01-06T13:37:51Z)