Dataset Growth
- URL: http://arxiv.org/abs/2405.18347v2
- Date: Tue, 23 Jul 2024 07:31:18 GMT
- Title: Dataset Growth
- Authors: Ziheng Qin, Zhaopan Xu, Yukun Zhou, Zangwei Zheng, Zebang Cheng, Hao Tang, Lei Shang, Baigui Sun, Xiaojiang Peng, Radu Timofte, Hongxun Yao, Kai Wang, Yang You
- Abstract summary: InfoGrowth is an efficient online algorithm for data cleaning and selection.
It can improve data quality/efficiency on both single-modal and multi-modal tasks.
- Score: 59.68869191071907
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning benefits from the growing abundance of available data. Meanwhile, efficiently dealing with the growing data scale has become a challenge. Data publicly available are from different sources with various qualities, and it is impractical to do manual cleaning against noise and redundancy given today's data scale. There are existing techniques for cleaning/selecting the collected data. However, these methods are mainly proposed for offline settings that target one of the cleanness and redundancy problems. In practice, data are growing exponentially with both problems. This leads to repeated data curation with sub-optimal efficiency. To tackle this challenge, we propose InfoGrowth, an efficient online algorithm for data cleaning and selection, resulting in a growing dataset that keeps up to date with awareness of cleanliness and diversity. InfoGrowth can improve data quality/efficiency on both single-modal and multi-modal tasks, with an efficient and scalable design. Its framework makes it practical for real-world data engines.
Related papers
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
- Exploring Data Redundancy in Real-world Image Classification through Data Selection [20.389636181891515]
Deep learning models often require large amounts of data for training, leading to increased costs.
We present two data valuation metrics based on Synaptic Intelligence and gradient norms, respectively, to study redundancy in real-world image data.
Online and offline data selection algorithms are then proposed via clustering and grouping based on the examined data values.
arXiv Detail & Related papers (2023-06-25T03:31:05Z)
- Active Data Acquisition in Autonomous Driving Simulation [0.0]
This paper proposes the concept of an active data-collecting strategy.
For high-quality data, increasing the collection density can improve the overall quality of the dataset.
arXiv Detail & Related papers (2023-06-24T10:07:35Z)
- A Comprehensive Survey of Dataset Distillation [73.15482472726555]
It has become challenging to handle the unlimited growth of data with limited computing power.
Deep learning technology has developed unprecedentedly in the last decade.
This paper provides a holistic understanding of dataset distillation from multiple aspects.
arXiv Detail & Related papers (2023-01-13T15:11:38Z)
- Advanced Data Augmentation Approaches: A Comprehensive Survey and Future directions [57.30984060215482]
We provide a background of data augmentation, a novel and comprehensive taxonomy of reviewed data augmentation techniques, and the strengths and weaknesses (wherever possible) of each technique.
We also provide comprehensive results of the data augmentation effect on three popular computer vision tasks: image classification, object detection, and semantic segmentation.
arXiv Detail & Related papers (2023-01-07T11:37:32Z)
- The Lean Data Scientist: Recent Advances towards Overcoming the Data Bottleneck [16.18460753647167]
Machine learning (ML) is revolutionizing the world, affecting almost every field of science and industry.
Recent algorithms are increasingly data-hungry, requiring large datasets for training.
However, obtaining quality datasets of such magnitude proves to be a difficult challenge.
arXiv Detail & Related papers (2022-11-15T07:44:56Z)
- Kubric: A scalable dataset generator [73.78485189435729]
Kubric is a Python framework that interfaces with PyBullet and Blender to generate photo-realistic scenes, with rich annotations, and seamlessly scales to large jobs distributed over thousands of machines.
We demonstrate the effectiveness of Kubric by presenting a series of 13 different generated datasets for tasks ranging from studying 3D NeRF models to optical flow estimation.
arXiv Detail & Related papers (2022-03-07T18:13:59Z)
- Autoencoder-based cleaning in probabilistic databases [0.0]
We propose a data-cleaning autoencoder capable of near-automatic data quality improvement.
It learns the structure and dependencies in the data to identify and correct doubtful values.
arXiv Detail & Related papers (2021-06-17T18:46:56Z)
- Diverse Complexity Measures for Dataset Curation in Self-driving [80.55417232642124]
We propose a new data selection method that exploits a diverse set of criteria that quantize interestingness of traffic scenes.
Our experiments show that the proposed curation pipeline is able to select datasets that lead to better generalization and higher performance.
arXiv Detail & Related papers (2021-01-16T23:45:02Z)
- Generative Low-bitwidth Data Free Quantization [44.613912463011545]
We propose Generative Low-bitwidth Data Free Quantization (GDFQ) to remove the data dependence burden.
With the help of generated data, we can quantize a model by learning knowledge from the pre-trained model.
Our method achieves much higher accuracy on 4-bit quantization than the existing data free quantization method.
arXiv Detail & Related papers (2020-03-07T16:38:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.