Data Optimization in Deep Learning: A Survey
- URL: http://arxiv.org/abs/2310.16499v1
- Date: Wed, 25 Oct 2023 09:33:57 GMT
- Title: Data Optimization in Deep Learning: A Survey
- Authors: Ou Wu and Rujing Yao
- Abstract summary: This study aims to organize a wide range of existing data optimization methodologies for deep learning.
The constructed taxonomy considers the diversity of split dimensions, and deep sub-taxonomies are constructed for each dimension.
The constructed taxonomy and the revealed connections will foster a better understanding of existing methods and inform the design of novel data optimization techniques.
- Score: 3.1274367448459253
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale, high-quality data are considered an essential factor for the
successful application of many deep learning techniques. Meanwhile, numerous
real-world deep learning tasks still have to contend with the lack of
sufficient amounts of high-quality data. Additionally, issues such as model
robustness, fairness, and trustworthiness are also closely related to training
data. Consequently, a huge number of studies in the existing literature have
focused on the data aspect in deep learning tasks. Some typical data
optimization techniques include data augmentation, logit perturbation, sample
weighting, and data condensation. These techniques usually come from different
deep learning divisions and their theoretical inspirations or heuristic
motivations may seem unrelated to each other. This study aims to organize a
wide range of existing data optimization methodologies for deep learning from
the previous literature, and makes the effort to construct a comprehensive
taxonomy for them. The constructed taxonomy considers the diversity of split
dimensions, and deep sub-taxonomies are constructed for each dimension. On the
basis of the taxonomy, connections among the extensive data optimization
methods for deep learning are built in terms of four aspects. We also outline
several promising and interesting future directions. The constructed taxonomy
and the revealed connections will foster a better understanding of existing
methods and inform the design of novel data optimization techniques.
Furthermore, our aspiration for this survey is to promote data optimization as
an independent subdivision of deep learning. A curated, up-to-date list of
resources related to data optimization in deep learning is available at
https://github.com/YaoRujing/Data-Optimization.
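Of the techniques the abstract lists (data augmentation, logit perturbation, sample weighting, data condensation), sample weighting is the simplest to illustrate. The sketch below is a hypothetical minimal example, not code from the surveyed papers: it reweights a batch of per-sample losses so that harder (higher-loss) examples contribute more to the training objective, with an assumed exponent `gamma` controlling the emphasis.

```python
import numpy as np

def weighted_loss(losses, gamma=1.0):
    """Reweight per-sample losses so harder (higher-loss) samples
    count more, in the spirit of difficulty-based sample weighting.

    losses: 1-D array of per-sample loss values.
    gamma:  exponent controlling how strongly hard samples are upweighted
            (gamma=0 recovers the plain unweighted mean).
    """
    losses = np.asarray(losses, dtype=float)
    weights = losses ** gamma            # weight grows with sample difficulty
    weights = weights / weights.sum()    # normalize weights to sum to 1
    return float(np.dot(weights, losses))  # weighted mean loss

# A batch where one sample is much harder than the rest:
per_sample = [0.1, 0.2, 2.0]
print(weighted_loss(per_sample, gamma=0.0))  # uniform weighting: plain mean
print(weighted_loss(per_sample, gamma=1.0))  # the hard sample dominates
```

In a real training loop the weights would typically depend on more than the raw loss (e.g. label noise estimates or class frequencies), which is exactly the design space the survey's taxonomy organizes.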
Related papers
- A Survey on Data Synthesis and Augmentation for Large Language Models [35.59526251210408]
This paper reviews and summarizes data generation techniques throughout the lifecycle of Large Language Models.
We discuss the current constraints faced by these methods and investigate potential pathways for future development and research.
arXiv Detail & Related papers (2024-10-16T16:12:39Z)
- A Comprehensive Survey on Data Augmentation [55.355273602421384]
Data augmentation is a technique that generates high-quality artificial data by manipulating existing data samples.
Existing literature surveys only focus on a certain type of specific modality data.
We propose a more enlightening taxonomy that encompasses data augmentation techniques for different common data modalities.
arXiv Detail & Related papers (2024-05-15T11:58:08Z)
- A Survey on Data Selection for Language Models [148.300726396877]
Data selection methods aim to determine which data points to include in a training dataset.
Deep learning is mostly driven by empirical evidence, and experimentation on large-scale data is expensive.
Few organizations therefore have the resources for extensive data selection research.
arXiv Detail & Related papers (2024-02-26T18:54:35Z)
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that builds the necessary reasoning skills for the intended downstream application.
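The core idea behind this kind of influence-based selection can be sketched very simply. The snippet below is a simplified illustration, not the actual LESS implementation (which uses LoRA gradients and optimizer-aware influence estimates): it ranks candidate training examples by the cosine similarity of their (projected) gradients to the gradient of a target task and keeps the top-k.

```python
import numpy as np

def select_by_gradient_similarity(train_grads, target_grad, k):
    """Rank candidate training examples by the cosine similarity of
    their low-dimensional (e.g. randomly projected) gradients to the
    gradient of a target task, and return the indices of the top-k.

    train_grads: (n, d) array, one projected gradient per example.
    target_grad: (d,) array for the intended downstream task.
    """
    train_grads = np.asarray(train_grads, dtype=float)
    target = np.asarray(target_grad, dtype=float)
    # Cosine similarity between each training gradient and the target.
    sims = train_grads @ target
    sims /= np.linalg.norm(train_grads, axis=1) * np.linalg.norm(target) + 1e-12
    return np.argsort(-sims)[:k]  # indices of the k best-aligned examples

# Three candidates; the first points the same way as the target gradient.
grads = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
target = np.array([1.0, 0.1])
print(select_by_gradient_similarity(grads, target, k=2))
```

Examples whose gradients align with the target direction are the ones most likely to reduce the target loss when trained on, which is why a small well-aligned subset can rival the full dataset.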
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
- Deep networks for system identification: a Survey [56.34005280792013]
System identification learns mathematical descriptions of dynamic systems from input-output data.
The main aim of the identified model is to predict new data from previous observations.
We discuss architectures commonly adopted in the literature, like feedforward, convolutional, and recurrent networks.
arXiv Detail & Related papers (2023-01-30T12:38:31Z)
- A Comprehensive Survey of Dataset Distillation [73.15482472726555]
Deep learning technology has developed at an unprecedented pace in the last decade.
It has become challenging to handle the unlimited growth of data with limited computing power.
This paper provides a holistic understanding of dataset distillation from multiple aspects.
arXiv Detail & Related papers (2023-01-13T15:11:38Z)
- A Survey of Learning on Small Data: Generalization, Optimization, and Challenge [101.27154181792567]
Learning on small data that approximates the generalization ability of big data is one of the ultimate purposes of AI.
This survey follows the active sampling theory under a PAC framework to analyze the generalization error and label complexity of learning on small data.
Multiple data applications that may benefit from efficient small data representation are surveyed.
arXiv Detail & Related papers (2022-07-29T02:34:19Z)
- Data Augmentation techniques in time series domain: A survey and taxonomy [0.20971479389679332]
Deep neural networks applied to time series heavily depend on the size and consistency of the datasets used in training.
This work systematically reviews the current state-of-the-art in the area to provide an overview of all available algorithms.
The ultimate aim of this study is to provide a summary of the evolution and performance of areas that produce better results to guide future researchers in this field.
arXiv Detail & Related papers (2022-06-25T17:09:00Z)
- Deep Depth Completion: A Survey [26.09557446012222]
We provide a comprehensive literature review that helps readers better grasp the research trends and clearly understand the current advances.
We investigate the related studies from the design aspects of network architectures, loss functions, benchmark datasets, and learning strategies.
We present a quantitative comparison of model performance on two widely used benchmark datasets, including an indoor and an outdoor dataset.
arXiv Detail & Related papers (2022-05-11T08:24:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed here and is not responsible for any consequences of its use.