A Data-Centric AI Paradigm Based on Application-Driven Fine-grained
Dataset Design
- URL: http://arxiv.org/abs/2209.09449v1
- Date: Tue, 20 Sep 2022 03:56:53 GMT
- Title: A Data-Centric AI Paradigm Based on Application-Driven Fine-grained
Dataset Design
- Authors: Huan Hu, Yajie Cui, Zhaoxiang Liu and Shiguo Lian
- Abstract summary: We propose a novel paradigm for fine-grained design of datasets, driven by industrial applications.
We flexibly select positive and negative sample sets according to the essential features of the data and application requirements.
Compared with traditional data design methods, our method achieves better results and effectively reduces false alarms.
- Score: 2.2223262422197907
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep learning has a wide range of applications in industrial scenarios, but
reducing false alarms (FA) remains a major difficulty. Academic work usually
tackles this challenge by optimizing network architecture or network parameters
while ignoring the essential characteristics of data in application scenarios,
which often results in increased FA in new scenarios. In this paper,
we propose a novel paradigm for fine-grained design of datasets, driven by
industrial applications. We flexibly select positive and negative sample sets
according to the essential features of the data and application requirements,
and add the remaining samples to the training set as uncertainty classes. We
collect more than 10,000 mask-wearing recognition samples covering various
application scenarios as our experimental data. Compared with the traditional
data design methods, our method achieves better results and effectively reduces
FA. We make all contributions available to the research community for broader
use. The contributions will be available at
https://github.com/huh30/OpenDatasets.
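The dataset design described in the abstract (pick positive and negative sets by application rules, then keep every remaining sample as an uncertainty class) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `Sample` type, the `mask_coverage` feature, the thresholds, and the rule that a sample matching both predicates falls into the uncertain group are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical class names; the abstract only speaks of positive samples,
# negative samples, and "uncertainty classes".
POSITIVE, NEGATIVE, UNCERTAIN = "positive", "negative", "uncertain"


@dataclass
class Sample:
    path: str
    # Hypothetical feature: fraction of the nose-and-mouth region covered (0..1).
    mask_coverage: float


Rule = Callable[[Sample], bool]


def assign_class(sample: Sample, is_positive: Rule, is_negative: Rule) -> str:
    """Return the class for one sample under application-specific rules."""
    pos, neg = is_positive(sample), is_negative(sample)
    if pos and not neg:
        return POSITIVE
    if neg and not pos:
        return NEGATIVE
    # Assumption: anything matching neither rule (or both) stays in training
    # as an uncertain sample instead of being discarded.
    return UNCERTAIN


def build_training_set(
    samples: List[Sample], is_positive: Rule, is_negative: Rule
) -> Dict[str, List[Sample]]:
    """Group all samples into positive, negative, and uncertain sets."""
    groups: Dict[str, List[Sample]] = {POSITIVE: [], NEGATIVE: [], UNCERTAIN: []}
    for sample in samples:
        groups[assign_class(sample, is_positive, is_negative)].append(sample)
    return groups


# Toy data with made-up thresholds for a mask-wearing application.
samples = [Sample("a.jpg", 0.95), Sample("b.jpg", 0.0), Sample("c.jpg", 0.5)]
groups = build_training_set(
    samples,
    is_positive=lambda s: s.mask_coverage >= 0.9,
    is_negative=lambda s: s.mask_coverage <= 0.1,
)
```

Changing the two predicates is how a different application would redraw the boundary between the three groups; no sample is dropped in the process.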
Related papers
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, imposing a negative impact on training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
- Implicitly Guided Design with PropEn: Match your Data to Follow the Gradient [52.2669490431145]
PropEn is inspired by 'matching', which enables implicit guidance without training a discriminator.
We show that training with a matched dataset approximates the gradient of the property of interest while remaining within the data distribution.
arXiv Detail & Related papers (2024-05-28T11:30:19Z)
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
- Zero-shot Retrieval: Augmenting Pre-trained Models with Search Engines [83.65380507372483]
Large pre-trained models can dramatically reduce the amount of task-specific data required to solve a problem, but they often fail to capture domain-specific nuances out of the box.
This paper shows how to leverage recent advances in NLP and multi-modal learning to augment a pre-trained model with search engine retrieval.
arXiv Detail & Related papers (2023-11-29T05:33:28Z)
- Data Augmentations in Deep Weight Spaces [89.45272760013928]
We introduce a novel augmentation scheme based on the Mixup method.
We evaluate the performance of these techniques on existing benchmarks as well as new benchmarks we generate.
arXiv Detail & Related papers (2023-11-15T10:43:13Z)
- infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization.
infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information.
In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z)
- DATED: Guidelines for Creating Synthetic Datasets for Engineering Design Applications [3.463438487417909]
This study proposes comprehensive guidelines for generating, annotating, and validating synthetic datasets.
The study underscores the importance of thoughtful sampling methods to ensure the appropriate size, diversity, utility, and realism of a dataset.
Overall, this paper offers valuable insights for researchers intending to create and publish synthetic datasets for engineering design.
arXiv Detail & Related papers (2023-05-15T21:00:09Z)
- A Survey on Deep Industrial Transfer Learning in Fault Prognostics [0.0]
This paper aims at establishing best practices for future research in this field.
It is shown that the field is lacking common benchmarks to robustly compare results and facilitate scientific progress.
The data sets utilized in these publications are surveyed as well in order to identify suitable candidates for such benchmark scenarios.
arXiv Detail & Related papers (2023-01-04T17:01:27Z)