A Survey of Dataset Refinement for Problems in Computer Vision Datasets
- URL: http://arxiv.org/abs/2210.11717v2
- Date: Fri, 6 Oct 2023 15:17:59 GMT
- Title: A Survey of Dataset Refinement for Problems in Computer Vision Datasets
- Authors: Zhijing Wan, Zhixiang Wang, CheukTing Chung and Zheng Wang
- Abstract summary: Large-scale datasets have played a crucial role in the advancement of computer vision.
They often suffer from problems such as class imbalance, noisy labels, dataset bias, or high resource costs.
Various data-centric solutions have been proposed to address these dataset problems.
They improve the quality of datasets by re-organizing them, which we call dataset refinement.
- Score: 11.45536223418548
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale datasets have played a crucial role in the advancement of
computer vision. However, they often suffer from problems such as class
imbalance, noisy labels, dataset bias, or high resource costs, which can
inhibit model performance and reduce trustworthiness. Motivated by the
data-centric research paradigm, various data-centric solutions have been
proposed to address the dataset problems mentioned above. They improve the quality of
datasets by re-organizing them, which we call dataset refinement. In this
survey, we provide a comprehensive and structured overview of recent advances
in dataset refinement for problematic computer vision datasets. Firstly, we
summarize and analyze the various problems encountered in large-scale computer
vision datasets. Then, we classify the dataset refinement algorithms into three
categories based on the refinement process: data sampling, data subset
selection, and active learning. In addition, we organize these dataset
refinement methods according to the addressed data problems and provide a
systematic comparative description. We point out that these three types of
dataset refinement have distinct advantages and disadvantages for dataset
problems, which informs the choice of the data-centric method appropriate to a
particular research objective. Finally, we summarize the current literature and
propose potential future research topics.
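To make the taxonomy above concrete, here is a minimal, generic sketch of the first category, data sampling, applied to class imbalance. It uses PyTorch's WeightedRandomSampler to over-sample minority-class examples; it is an illustrative recipe, not a method proposed in the survey, and all names and numbers below are hypothetical.

    import torch
    from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

    # toy imbalanced dataset: 900 examples of class 0 and 100 of class 1
    features = torch.randn(1000, 16)
    labels = torch.cat([torch.zeros(900, dtype=torch.long), torch.ones(100, dtype=torch.long)])
    dataset = TensorDataset(features, labels)

    # inverse-frequency weight per example: rare classes are drawn more often
    class_counts = torch.bincount(labels).float()
    sample_weights = 1.0 / class_counts[labels]

    # sample with replacement so each epoch sees a roughly class-balanced stream
    sampler = WeightedRandomSampler(sample_weights, num_samples=len(dataset), replacement=True)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    xb, yb = next(iter(loader))
    print(yb.float().mean())  # close to 0.5 despite the 9:1 class imbalance

Data subset selection and active learning follow the same data-centric pattern but decide which examples to keep or to label, rather than how often to draw them.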
Related papers
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, which degrades training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z) - Data Advisor: Dynamic Data Curation for Safety Alignment of Large Language Models [79.65071553905021]
- Data Advisor: Dynamic Data Curation for Safety Alignment of Large Language Models [79.65071553905021]
We propose Data Advisor, a method for generating data that takes into account the characteristics of the desired dataset.
Data Advisor monitors the status of the generated data, identifies weaknesses in the current dataset, and advises the next iteration of data generation.
arXiv Detail & Related papers (2024-10-07T17:59:58Z) - A Survey on Data Selection for Language Models [148.300726396877]
Data selection methods aim to determine which data points to include in a training dataset.
Deep learning is mostly driven by empirical evidence, and experimentation on large-scale data is expensive.
Few organizations have the resources for extensive data selection research.
arXiv Detail & Related papers (2024-02-26T18:54:35Z) - Exploring Dataset-Scale Indicators of Data Quality [23.017200605976807]
Modern computer vision foundation models are trained on massive amounts of data, incurring large economic and environmental costs.
Recent research has suggested that improving data quality can significantly reduce the need for data quantity.
We posit that the quality of a given dataset can be decomposed into distinct sample-level and dataset-level constituents.
arXiv Detail & Related papers (2023-11-07T14:14:32Z) - Assessing Dataset Quality Through Decision Tree Characteristics in
Autoencoder-Processed Spaces [0.30458514384586394]
We show the profound impact of dataset quality on model training and performance.
Our findings underscore the importance of appropriate feature selection, adequate data volume, and data quality.
This research offers valuable insights into data assessment practices, contributing to the development of more accurate and robust machine learning models.
arXiv Detail & Related papers (2023-06-27T11:33:31Z) - Exploring the Potential of AI-Generated Synthetic Datasets: A Case Study
on Telematics Data with ChatGPT [0.0]
This research delves into the construction and utilization of synthetic datasets, specifically within the telematics sphere, leveraging OpenAI's powerful language model, ChatGPT.
To illustrate this data creation process, a hands-on case study is conducted, focusing on the generation of a synthetic telematics dataset.
arXiv Detail & Related papers (2023-06-23T15:15:13Z) - A Comprehensive Survey of Dataset Distillation [73.15482472726555]
Deep learning technology has developed at an unprecedented pace over the last decade, making it challenging to handle the ever-growing volume of data with limited computing power.
This paper provides a holistic understanding of dataset distillation from multiple aspects.
arXiv Detail & Related papers (2023-01-13T15:11:38Z) - Advanced Data Augmentation Approaches: A Comprehensive Survey and Future
directions [57.30984060215482]
We provide a background of data augmentation, a novel and comprehensive taxonomy of reviewed data augmentation techniques, and the strengths and weaknesses (wherever possible) of each technique.
We also provide comprehensive results of the data augmentation effect on three popular computer vision tasks: image classification, object detection, and semantic segmentation.
arXiv Detail & Related papers (2023-01-07T11:37:32Z) - On The State of Data In Computer Vision: Human Annotations Remain
- On The State of Data In Computer Vision: Human Annotations Remain Indispensable for Developing Deep Learning Models [0.0]
High-quality labeled datasets play a crucial role in fueling the development of machine learning (ML).
Since the emergence of the ImageNet dataset and the AlexNet model in 2012, the size of new open-source labeled vision datasets has remained roughly constant.
Only a minority of publications in the computer vision community tackle supervised learning on datasets that are orders of magnitude larger than ImageNet.
arXiv Detail & Related papers (2021-07-31T00:08:21Z) - Online Coreset Selection for Rehearsal-based Continual Learning [65.85595842458882]
In continual learning, we store a subset of training examples (coreset) to be replayed later to alleviate catastrophic forgetting.
We propose Online Coreset Selection (OCS), a simple yet effective method that selects the most representative and informative coreset at each iteration.
Our proposed method maximizes the model's adaptation to a target dataset while selecting high-affinity samples to past tasks, which directly inhibits catastrophic forgetting.
arXiv Detail & Related papers (2021-06-02T11:39:25Z) - Bringing the People Back In: Contesting Benchmark Machine Learning
- Bringing the People Back In: Contesting Benchmark Machine Learning Datasets [11.00769651520502]
We outline a research program - a genealogy of machine learning data - for investigating how and why these datasets have been created.
We describe the ways in which benchmark datasets in machine learning operate as infrastructure and pose four research questions for these datasets.
arXiv Detail & Related papers (2020-07-14T23:22:13Z)