A Survey on Dataset Distillation: Approaches, Applications and Future
Directions
- URL: http://arxiv.org/abs/2305.01975v3
- Date: Thu, 24 Aug 2023 14:50:34 GMT
- Title: A Survey on Dataset Distillation: Approaches, Applications and Future
Directions
- Authors: Jiahui Geng, Zongxiong Chen, Yuandou Wang, Herbert Woisetschlaeger,
Sonja Schimmler, Ruben Mayer, Zhiming Zhao and Chunming Rong
- Abstract summary: By synthesizing datasets with high information density, dataset distillation offers a range of potential applications.
We propose a taxonomy of dataset distillation, characterize existing approaches, and then systematically review the data modalities and related applications.
- Score: 4.906549881313351
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dataset distillation is attracting more attention in machine learning as
training sets continue to grow and the cost of training state-of-the-art models
becomes increasingly high. By synthesizing datasets with high information
density, dataset distillation offers a range of potential applications,
including support for continual learning, neural architecture search, and
privacy protection. Despite recent advances, we lack a holistic understanding
of the approaches and applications. Our survey aims to bridge this gap by first
proposing a taxonomy of dataset distillation, characterizing existing
approaches, and then systematically reviewing the data modalities and related
applications. In addition, we summarize the challenges and discuss future
directions for this field of research.
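To make the distillation objective concrete, the minimal sketch below illustrates one common formulation, gradient matching, in which a handful of synthetic images are optimized so that they induce network gradients similar to those produced by real data. The tiny linear model, the random placeholder tensors standing in for a real training set, and all hyperparameters are illustrative assumptions, not the method of any particular paper covered by this survey.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical toy setup: random tensors stand in for a real training set.
num_classes, ipc = 10, 1                       # ipc = synthetic images per class
real_x = torch.randn(256, 1, 28, 28)           # placeholder "real" images
real_y = torch.randint(0, num_classes, (256,))
syn_x = torch.randn(num_classes * ipc, 1, 28, 28, requires_grad=True)
syn_y = torch.arange(num_classes).repeat_interleave(ipc)
opt_syn = torch.optim.SGD([syn_x], lr=0.1)     # optimizes the synthetic images themselves

def grad_vector(model, x, y):
    # Flattened gradient of the classification loss w.r.t. the model parameters.
    loss = F.cross_entropy(model(x), y)
    grads = torch.autograd.grad(loss, list(model.parameters()), create_graph=True)
    return torch.cat([g.flatten() for g in grads])

for step in range(1000):
    # A fresh randomly initialized network each step, so the synthetic set
    # is not tied to a single initialization.
    model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, num_classes))
    g_real = grad_vector(model, real_x, real_y).detach()
    g_syn = grad_vector(model, syn_x, syn_y)
    match_loss = 1.0 - F.cosine_similarity(g_real, g_syn, dim=0)
    opt_syn.zero_grad()
    match_loss.backward()
    opt_syn.step()

The condensed set is then judged by training a fresh model on syn_x and syn_y alone and comparing its accuracy with that of a model trained on the full dataset.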
Related papers
- A Survey on Data Synthesis and Augmentation for Large Language Models [35.59526251210408]
This paper reviews and summarizes data generation techniques throughout the lifecycle of Large Language Models.
We discuss the current constraints faced by these methods and investigate potential pathways for future development and research.
arXiv Detail & Related papers (2024-10-16T16:12:39Z) - Behaviour Distillation [10.437472004180883]
We formalize behaviour distillation, a setting that aims to discover and condense information required for training an expert policy into a synthetic dataset.
We then introduce Hallucinating datasets with Evolution Strategies (HaDES), a method for behaviour distillation that can discover datasets of just four state-action pairs.
We show that these datasets generalize out of distribution to training policies with a wide range of architectures.
We also demonstrate application to a downstream task, namely training multi-task agents in a zero-shot fashion.
arXiv Detail & Related papers (2024-06-21T10:45:43Z) - Deep Learning for Trajectory Data Management and Mining: A Survey and Beyond [58.63558696061679]
Trajectory computing is crucial in various practical applications such as location services, urban traffic, and public safety.
We present a review of the development and recent advances in deep learning for trajectory computing (DL4Traj).
Notably, we encapsulate recent advancements in Large Language Models (LLMs) that hold potential to augment trajectory computing.
arXiv Detail & Related papers (2024-03-21T05:57:27Z) - A Survey on Data Selection for Language Models [148.300726396877]
Data selection methods aim to determine which data points to include in a training dataset.
Deep learning is mostly driven by empirical evidence, and experimentation on large-scale data is expensive.
Few organizations have the resources for extensive data selection research.
arXiv Detail & Related papers (2024-02-26T18:54:35Z) - Dataset Distillation: A Comprehensive Review [76.26276286545284]
Dataset distillation (DD) aims to derive a much smaller dataset containing synthetic samples, based on which the trained models yield performance comparable with those trained on the original dataset.
This paper gives a comprehensive review and summary of recent advances in DD and its application.
arXiv Detail & Related papers (2023-01-17T17:03:28Z) - A Comprehensive Survey of Dataset Distillation [73.15482472726555]
Deep learning technology has developed unprecedentedly in the last decade, and it has become challenging to handle the unlimited growth of data with limited computing power.
This paper provides a holistic understanding of dataset distillation from multiple aspects.
arXiv Detail & Related papers (2023-01-13T15:11:38Z) - Data Distillation: A Survey [32.718297871027865]
Deep learning has led to the curation of a vast number of massive and multifarious datasets.
Although such models achieve close-to-human performance on individual tasks, training parameter-hungry models on large datasets poses multi-faceted problems.
Data distillation approaches aim to synthesize terse data summaries, which can serve as effective drop-in replacements of the original dataset.
arXiv Detail & Related papers (2023-01-11T02:25:10Z) - Research Trends and Applications of Data Augmentation Algorithms [77.34726150561087]
We identify the main areas of application of data augmentation algorithms, the types of algorithms used, significant research trends, their progression over time, and research gaps in the data augmentation literature.
We expect readers to understand the potential of data augmentation, as well as identify future research directions and open questions within data augmentation research.
arXiv Detail & Related papers (2022-07-18T11:38:32Z) - Deep Learning Schema-based Event Extraction: Literature Review and
Current Trends [60.29289298349322]
Event extraction technology based on deep learning has become a research hotspot.
This paper fills the gap by reviewing the state-of-the-art approaches, focusing on deep learning-based models.
arXiv Detail & Related papers (2021-07-05T16:32:45Z) - Data and its (dis)contents: A survey of dataset development and use in
machine learning research [11.042648980854487]
We survey the many concerns raised about the way we collect and use data in machine learning.
We advocate that a more cautious and thorough understanding of data is necessary to address several of the practical and ethical issues of the field.
arXiv Detail & Related papers (2020-12-09T22:13:13Z)