How much data do you need? Part 2: Predicting DL class specific training
dataset sizes
- URL: http://arxiv.org/abs/2403.06311v1
- Date: Sun, 10 Mar 2024 21:08:29 GMT
- Title: How much data do you need? Part 2: Predicting DL class specific training
dataset sizes
- Authors: Thomas Mühlenstädt, Jelena Frtunikj
- Abstract summary: This paper targets the question of predicting machine learning classification model performance.
It takes into account the number of training examples per class and not just the overall number of training examples.
An algorithm is suggested which is motivated by special cases of space-filling design of experiments.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper targets the question of predicting machine learning classification
model performance when taking into account the number of training examples per
class, and not just the overall number of training examples. This leads to a
combinatorial question: which combinations of numbers of training examples per
class should be considered, given a fixed overall training dataset size? To
solve this question, an algorithm is suggested which is motivated by special
cases of space-filling design of experiments. The resulting data are modeled
using power-law curves and similar models, extended in the manner of
generalized linear models, i.e. by replacing the overall training dataset size
with a parametrized linear combination of the number of training examples per
label class. The proposed algorithm has been applied to the CIFAR10 and the
EMNIST datasets.
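As a rough illustration of the modeling idea described in the abstract, the sketch below fits a power-law learning curve in which the overall training set size is replaced by a parametrized linear combination of per-class counts. All function names, data values, and the two-class setup are invented for illustration; the paper's actual algorithm and model family may differ.

```python
# Hypothetical sketch (not the authors' code): power-law error curve over a
# weighted "effective" dataset size built from per-class training counts.
import numpy as np
from scipy.optimize import curve_fit

def powerlaw_per_class(counts, a, b, w1, w2):
    # counts is a (2, n) array of per-class training set sizes n1, n2;
    # the usual overall size is replaced by s = w1*n1 + w2*n2.
    n1, n2 = counts
    s = w1 * n1 + w2 * n2
    return a * s ** (-b)

# Invented example data: per-class counts and observed test errors.
n1 = np.array([100.0, 200.0, 400.0, 100.0, 400.0, 800.0])
n2 = np.array([100.0, 200.0, 400.0, 400.0, 100.0, 800.0])
errors = np.array([0.30, 0.23, 0.17, 0.21, 0.20, 0.13])

params, _ = curve_fit(powerlaw_per_class, np.vstack([n1, n2]), errors,
                      p0=[1.0, 0.3, 0.5, 0.5], maxfev=20000)
print("fitted a, b, w1, w2:", params)
```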
Related papers
- Data Shapley in One Training Run [88.59484417202454]
Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts.
Existing approaches require re-training models on different data subsets, which is computationally intensive.
This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest.
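For context, the retraining-based Data Shapley baseline that this summary calls computationally intensive is often approximated by Monte Carlo permutation sampling, as in the hypothetical sketch below; In-Run Data Shapley is designed to avoid exactly this retraining loop. The `utility` callable is an assumed placeholder.

```python
# Hypothetical sketch of retraining-based Data Shapley via Monte Carlo
# permutation sampling (the costly baseline, not In-Run Data Shapley).
import random

def shapley_estimate(i, points, utility, num_perms=20):
    """Average marginal contribution of point `i` over random coalitions.

    `utility(subset)` is an assumed user-supplied function that trains a
    model on `subset` and returns a score such as validation accuracy.
    """
    others = [p for p in points if p != i]
    total = 0.0
    for _ in range(num_perms):
        random.shuffle(others)
        cut = random.randint(0, len(others))  # uniform random position for i
        prefix = others[:cut]
        total += utility(prefix + [i]) - utility(prefix)
    return total / num_perms
```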
arXiv Detail & Related papers (2024-06-16T17:09:24Z)
- A Lightweight Measure of Classification Difficulty from Application Dataset Characteristics [4.220363193932374]
We propose an efficient classification difficulty measure that is calculated from the number of classes and intra- and inter-class similarity metrics of the dataset.
We show how this measure can help a practitioner select a computationally efficient model for a small dataset 6 to 29x faster than through repeated training and testing.
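The summary above does not give the exact formula, so the sketch below is only one plausible instantiation of a similarity-based difficulty measure: the ratio of mean inter-class to mean intra-class cosine similarity over feature vectors. The names and the ratio form are assumptions, not the paper's definition.

```python
# One plausible (assumed) instantiation of a similarity-based difficulty
# measure; the paper's exact formula is not reproduced here.
import numpy as np

def difficulty(features, labels):
    """Larger when classes overlap more: mean inter-class cosine similarity
    relative to mean intra-class cosine similarity."""
    X = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = X @ X.T                               # pairwise cosine similarities
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    intra = sims[same & off_diag].mean()         # within-class similarity
    inter = sims[~same].mean()                   # between-class similarity
    return inter / intra
```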
arXiv Detail & Related papers (2024-04-09T03:27:09Z)
- 'One size doesn't fit all': Learning how many Examples to use for In-Context Learning for Improved Text Classification [18.167541508658417]
In-context learning (ICL) uses a small number of labelled data instances as examples in the prompt.
We propose a novel methodology for dynamically adapting the number of examples according to the data.
Our experiments show that our AICL method improves text classification on several standard datasets.
arXiv Detail & Related papers (2024-03-11T03:28:13Z)
- TRAK: Attributing Model Behavior at Scale [79.56020040993947]
We present TRAK (Tracing with the Randomly-projected After Kernel), a data attribution method that is both effective and computationally tractable for large-scale, differentiable models.
arXiv Detail & Related papers (2023-03-24T17:56:22Z)
- Measuring the Effect of Training Data on Deep Learning Predictions via Randomized Experiments [5.625056584412003]
We develop a principled algorithm for estimating the contribution of training data points to a deep learning model.
Our algorithm estimates the AME, a quantity that measures the expected (average) marginal effect of adding a data point to a subset of the training data.
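A minimal Monte Carlo sketch of the AME quantity described above: average the change in a utility when the point of interest is added to random subsets of the training data. The sampling scheme and the `train_and_score` function are placeholders; the paper's actual estimator is more sophisticated.

```python
# Hypothetical Monte Carlo sketch of the average marginal effect (AME) of
# one training point; not the paper's estimator.
import random

def ame_estimate(point, pool, train_and_score, num_rounds=50, subset_frac=0.5):
    """Average utility gain from adding `point` to random subsets of `pool`.

    `train_and_score(subset)` is an assumed user-supplied function that
    trains a model on `subset` and returns a scalar utility.
    """
    diffs = []
    for _ in range(num_rounds):
        subset = random.sample(pool, int(subset_frac * len(pool)))
        diffs.append(train_and_score(subset + [point]) - train_and_score(subset))
    return sum(diffs) / len(diffs)
```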
arXiv Detail & Related papers (2022-06-20T21:27:18Z)
- CMW-Net: Learning a Class-Aware Sample Weighting Mapping for Robust Deep Learning [55.733193075728096]
Modern deep neural networks can easily overfit to biased training data containing corrupted labels or class imbalance.
Sample re-weighting methods are popularly used to alleviate this data bias issue.
We propose a meta-model capable of adaptively learning an explicit weighting scheme directly from data.
arXiv Detail & Related papers (2022-02-11T13:49:51Z)
- Learning to be a Statistician: Learned Estimator for Number of Distinct Values [54.629042119819744]
Estimating the number of distinct values (NDV) in a column is useful for many tasks in database systems.
In this work, we focus on how to derive accurate NDV estimations from random (online/offline) samples.
We propose to formulate the NDV estimation task in a supervised learning framework, and aim to learn a model as the estimator.
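A hedged sketch of the supervised-learning formulation summarized above: featurize a random sample by its frequency profile (how many values occur once, twice, ...) and regress to the true NDV on synthetic columns. The features, data generator, and model choice are assumptions for illustration, not the paper's design.

```python
# Hypothetical sketch of a learned NDV estimator trained on synthetic columns.
import numpy as np
from collections import Counter
from sklearn.ensemble import GradientBoostingRegressor

def profile(sample, max_freq=10):
    # feature j = number of values occurring exactly j times in the sample
    freq_of_freq = Counter(Counter(sample).values())
    return [freq_of_freq.get(j, 0) for j in range(1, max_freq + 1)]

rng = np.random.default_rng(0)
X, y = [], []
for _ in range(2000):                          # synthetic training columns
    ndv = int(rng.integers(10, 1000))
    column = rng.integers(0, ndv, size=5000)   # column with at most ndv values
    sample = rng.choice(column, size=500)      # small random sample
    X.append(profile(sample.tolist()))
    y.append(len(np.unique(column)))           # true NDV is the label
estimator = GradientBoostingRegressor().fit(X, y)
```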
arXiv Detail & Related papers (2022-02-06T15:42:04Z)
- Datamodels: Predicting Predictions from Training Data [86.66720175866415]
We present a conceptual framework, datamodeling, for analyzing the behavior of a model class in terms of the training data.
We show that even simple linear datamodels can successfully predict model outputs.
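A minimal sketch of the linear-datamodel idea: regress a model output on the 0/1 indicator vector of which training points each run included. Here a noisy synthetic ground truth stands in for real training runs so the snippet executes end-to-end; in practice each output would come from actually training a model on the masked subset.

```python
# Hypothetical sketch of a linear datamodel fitted to training-subset masks.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
num_runs, pool_size = 500, 200
masks = (rng.random((num_runs, pool_size)) < 0.5).astype(float)

# Stand-in for real training runs: a sparse linear ground truth plus noise
# (replace with the model output on a fixed test example per training run).
true_influence = rng.normal(0.0, 1.0, pool_size) * (rng.random(pool_size) < 0.1)
outputs = masks @ true_influence + rng.normal(0.0, 0.1, num_runs)

datamodel = Lasso(alpha=0.01).fit(masks, outputs)
# datamodel.coef_[i] estimates training point i's influence on the output.
```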
arXiv Detail & Related papers (2022-02-01T18:15:24Z)
- Learning a Universal Template for Few-shot Dataset Generalization [25.132729497191047]
Few-shot dataset generalization is a challenging variant of the well-studied few-shot classification problem.
We propose to utilize a diverse training set to construct a universal template that can define a wide array of dataset-specialized models.
Our approach is more parameter-efficient, scalable and adaptable compared to previous methods, and achieves the state-of-the-art on the challenging Meta-Dataset benchmark.
arXiv Detail & Related papers (2021-05-14T18:46:06Z)
- Estimating informativeness of samples with Smooth Unique Information [108.25192785062367]
We measure how much a sample informs the final weights and how much it informs the function computed by the weights.
We give efficient approximations of these quantities using a linearized network.
We apply these measures to several problems, such as dataset summarization.
arXiv Detail & Related papers (2021-01-17T10:29:29Z)
- Data augmentation and feature selection for automatic model recommendation in computational physics [0.0]
This article introduces two algorithms to address the scarcity of training data, its high dimensionality, and the non-applicability of common data augmentation techniques to physics data.
When combined with a stacking ensemble made of six multilayer perceptrons and a ridge logistic regression, they enable reaching an accuracy of 90% on our classification problem for nonlinear structural mechanics (a sketch of such a stacking ensemble follows below).
arXiv Detail & Related papers (2021-01-12T15:09:11Z)
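The sketch below shows the kind of stacking ensemble named in the entry above, using scikit-learn: six multilayer perceptrons blended by an L2-regularized ("ridge") logistic regression. Architectures, hyperparameters, and the toy dataset are invented; the article's exact configuration may differ.

```python
# Hypothetical sketch of a stacking ensemble: six MLPs + ridge logistic
# regression as the blender (illustration only, not the article's setup).
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

base_models = [
    (f"mlp{i}", MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                              random_state=i))
    for i in range(6)  # six base multilayer perceptrons
]
ensemble = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(penalty="l2", C=1.0),  # "ridge" logistic
)

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
print(ensemble.fit(X, y).score(X, y))  # training accuracy on toy data
```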