How much data do you need? Part 2: Predicting DL class specific training
dataset sizes
- URL: http://arxiv.org/abs/2403.06311v1
- Date: Sun, 10 Mar 2024 21:08:29 GMT
- Title: How much data do you need? Part 2: Predicting DL class specific training
dataset sizes
- Authors: Thomas Mühlenstädt, Jelena Frtunikj
- Abstract summary: This paper targets the question of predicting machine learning classification model performance.
It takes into account the number of training examples per class and not just the overall number of training examples.
An algorithm is suggested which is motivated by special cases of space-filling design of experiments.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper targets the question of predicting machine learning classification
model performance when taking into account the number of training examples per
class, and not just the overall number of training examples. This leads to a
combinatorial question: which combinations of numbers of training examples per
class should be considered, given a fixed overall training dataset size? To
solve this question, an algorithm is suggested which is motivated by special
cases of space-filling design of experiments. The resulting data are modeled
using power-law curves and similar models, extended in the manner of
generalized linear models, i.e. by replacing the overall training dataset size
with a parametrized linear combination of the number of training examples per
label class. The proposed algorithm has been applied to the CIFAR10 and the
EMNIST datasets.
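As a rough illustration of the modeling idea described in the abstract, the sketch below fits a power-law learning curve in which the overall training set size is replaced by a parametrized linear combination of per-class counts. All function names, data values, and the two-class setup are invented for illustration; the paper's actual algorithm and model family may differ.

```python
# Hypothetical sketch (not the authors' code): power-law error curve over a
# weighted "effective" dataset size built from per-class training counts.
import numpy as np
from scipy.optimize import curve_fit

def powerlaw_per_class(counts, a, b, w1, w2):
    # counts is a (2, n) array of per-class training set sizes n1, n2;
    # the usual overall size is replaced by s = w1*n1 + w2*n2.
    n1, n2 = counts
    s = w1 * n1 + w2 * n2
    return a * s ** (-b)

# Invented example data: per-class counts and observed test errors.
n1 = np.array([100.0, 200.0, 400.0, 100.0, 400.0, 800.0])
n2 = np.array([100.0, 200.0, 400.0, 400.0, 100.0, 800.0])
errors = np.array([0.30, 0.23, 0.17, 0.21, 0.20, 0.13])

params, _ = curve_fit(powerlaw_per_class, np.vstack([n1, n2]), errors,
                      p0=[1.0, 0.3, 0.5, 0.5], maxfev=20000)
print("fitted a, b, w1, w2:", params)
```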
Related papers
- Data Shapley in One Training Run [88.59484417202454]
Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts.
Existing approaches require re-training models on different data subsets, which is computationally intensive.
This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest.
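For context, the retraining-based Data Shapley baseline that this summary calls computationally intensive is often approximated by Monte Carlo permutation sampling, as in the hypothetical sketch below; In-Run Data Shapley is designed to avoid exactly this retraining loop. The `utility` callable is an assumed placeholder.

```python
# Hypothetical sketch of retraining-based Data Shapley via Monte Carlo
# permutation sampling (the costly baseline, not In-Run Data Shapley).
import random

def shapley_estimate(i, points, utility, num_perms=20):
    """Average marginal contribution of point `i` over random coalitions.

    `utility(subset)` is an assumed user-supplied function that trains a
    model on `subset` and returns a score such as validation accuracy.
    """
    others = [p for p in points if p != i]
    total = 0.0
    for _ in range(num_perms):
        random.shuffle(others)
        cut = random.randint(0, len(others))  # uniform random position for i
        prefix = others[:cut]
        total += utility(prefix + [i]) - utility(prefix)
    return total / num_perms
```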
arXiv Detail & Related papers (2024-06-16T17:09:24Z)
- A Lightweight Measure of Classification Difficulty from Application Dataset Characteristics [4.220363193932374]
We propose an efficient classification difficulty measure that is calculated from the number of classes and intra- and inter-class similarity metrics of the dataset.
We show how this measure can help a practitioner select a computationally efficient model for a small dataset 6 to 29x faster than through repeated training and testing.
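The summary above does not give the exact formula, so the sketch below is only one plausible instantiation of a similarity-based difficulty measure: the ratio of mean inter-class to mean intra-class cosine similarity over feature vectors. The names and the ratio form are assumptions, not the paper's definition.

```python
# One plausible (assumed) instantiation of a similarity-based difficulty
# measure; the paper's exact formula is not reproduced here.
import numpy as np

def difficulty(features, labels):
    """Larger when classes overlap more: mean inter-class cosine similarity
    relative to mean intra-class cosine similarity."""
    X = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = X @ X.T                               # pairwise cosine similarities
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    intra = sims[same & off_diag].mean()         # within-class similarity
    inter = sims[~same].mean()                   # between-class similarity
    return inter / intra
```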
arXiv Detail & Related papers (2024-04-09T03:27:09Z)
- 'One size doesn't fit all': Learning how many Examples to use for In-Context Learning for Improved Text Classification [18.167541508658417]
In-context learning (ICL) uses a small number of labelled data instances as examples in the prompt.
We propose a novel methodology for dynamically adapting the number of examples according to the data.
Our experiments show that our AICL method improves text classification on several standard datasets.
arXiv Detail & Related papers (2024-03-11T03:28:13Z)
- TRAK: Attributing Model Behavior at Scale [79.56020040993947]
We present TRAK (Tracing with the Randomly-projected After Kernel), a data attribution method that is both effective and computationally tractable for large-scale, differentiable models.
arXiv Detail & Related papers (2023-03-24T17:56:22Z)
- Measuring the Effect of Training Data on Deep Learning Predictions via Randomized Experiments [5.625056584412003]
We develop a principled algorithm for estimating the contribution of training data points to a deep learning model.
Our algorithm estimates the AME, a quantity that measures the expected (average) marginal effect of adding a data point to a subset of the training data.
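A minimal Monte Carlo sketch of the AME quantity described above: average the change in a utility when the point of interest is added to random subsets of the training data. The sampling scheme and the `train_and_score` function are placeholders; the paper's actual estimator is more sophisticated.

```python
# Hypothetical Monte Carlo sketch of the average marginal effect (AME) of
# one training point; not the paper's estimator.
import random

def ame_estimate(point, pool, train_and_score, num_rounds=50, subset_frac=0.5):
    """Average utility gain from adding `point` to random subsets of `pool`.

    `train_and_score(subset)` is an assumed user-supplied function that
    trains a model on `subset` and returns a scalar utility.
    """
    diffs = []
    for _ in range(num_rounds):
        subset = random.sample(pool, int(subset_frac * len(pool)))
        diffs.append(train_and_score(subset + [point]) - train_and_score(subset))
    return sum(diffs) / len(diffs)
```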
arXiv Detail & Related papers (2022-06-20T21:27:18Z)
- CMW-Net: Learning a Class-Aware Sample Weighting Mapping for Robust Deep Learning [55.733193075728096]
Modern deep neural networks can easily overfit to biased training data containing corrupted labels or class imbalance.
Sample re-weighting methods are popularly used to alleviate this data bias issue.
We propose a meta-model capable of adaptively learning an explicit weighting scheme directly from data.
arXiv Detail & Related papers (2022-02-11T13:49:51Z)
- Learning to be a Statistician: Learned Estimator for Number of Distinct Values [54.629042119819744]
Estimating the number of distinct values (NDV) in a column is useful for many tasks in database systems.
In this work, we focus on how to derive accurate NDV estimations from random (online/offline) samples.
We propose to formulate the NDV estimation task in a supervised learning framework, and aim to learn a model as the estimator.
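A hedged sketch of the supervised-learning formulation summarized above: featurize a random sample by its frequency profile (how many values occur once, twice, ...) and regress to the true NDV on synthetic columns. The features, data generator, and model choice are assumptions for illustration, not the paper's design.

```python
# Hypothetical sketch of a learned NDV estimator trained on synthetic columns.
import numpy as np
from collections import Counter
from sklearn.ensemble import GradientBoostingRegressor

def profile(sample, max_freq=10):
    # feature j = number of values occurring exactly j times in the sample
    freq_of_freq = Counter(Counter(sample).values())
    return [freq_of_freq.get(j, 0) for j in range(1, max_freq + 1)]

rng = np.random.default_rng(0)
X, y = [], []
for _ in range(2000):                          # synthetic training columns
    ndv = int(rng.integers(10, 1000))
    column = rng.integers(0, ndv, size=5000)   # column with at most ndv values
    sample = rng.choice(column, size=500)      # small random sample
    X.append(profile(sample.tolist()))
    y.append(len(np.unique(column)))           # true NDV is the label
estimator = GradientBoostingRegressor().fit(X, y)
```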
arXiv Detail & Related papers (2022-02-06T15:42:04Z)
- Datamodels: Predicting Predictions from Training Data [86.66720175866415]
We present a conceptual framework, datamodeling, for analyzing the behavior of a model class in terms of the training data.
We show that even simple linear datamodels can successfully predict model outputs.
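A minimal sketch of the linear-datamodel idea: regress a model output on the 0/1 indicator vector of which training points each run included. Here a noisy synthetic ground truth stands in for real training runs so the snippet executes end-to-end; in practice each output would come from actually training a model on the masked subset.

```python
# Hypothetical sketch of a linear datamodel fitted to training-subset masks.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
num_runs, pool_size = 500, 200
masks = (rng.random((num_runs, pool_size)) < 0.5).astype(float)

# Stand-in for real training runs: a sparse linear ground truth plus noise
# (replace with the model output on a fixed test example per training run).
true_influence = rng.normal(0.0, 1.0, pool_size) * (rng.random(pool_size) < 0.1)
outputs = masks @ true_influence + rng.normal(0.0, 0.1, num_runs)

datamodel = Lasso(alpha=0.01).fit(masks, outputs)
# datamodel.coef_[i] estimates training point i's influence on the output.
```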
arXiv Detail & Related papers (2022-02-01T18:15:24Z)
- Learning a Universal Template for Few-shot Dataset Generalization [25.132729497191047]
Few-shot dataset generalization is a challenging variant of the well-studied few-shot classification problem.
We propose to utilize a diverse training set to construct a universal template that can define a wide array of dataset-specialized models.
Our approach is more parameter-efficient, scalable and adaptable compared to previous methods, and achieves the state-of-the-art on the challenging Meta-Dataset benchmark.
arXiv Detail & Related papers (2021-05-14T18:46:06Z)
- Estimating informativeness of samples with Smooth Unique Information [108.25192785062367]
We measure how much a sample informs the final weights and how much it informs the function computed by the weights.
We give efficient approximations of these quantities using a linearized network.
We apply these measures to several problems, such as dataset summarization.
arXiv Detail & Related papers (2021-01-17T10:29:29Z)
- Data augmentation and feature selection for automatic model recommendation in computational physics [0.0]
This article introduces two algorithms to address the scarcity of training data, its high dimensionality, and the non-applicability of common data augmentation techniques to physics data.
When combined with a stacking ensemble made of six multilayer perceptrons and a ridge logistic regression, they enable reaching an accuracy of 90% on our classification problem for nonlinear structural mechanics (a sketch of such a stacking ensemble follows below).
arXiv Detail & Related papers (2021-01-12T15:09:11Z)
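The sketch below shows the kind of stacking ensemble named in the entry above, using scikit-learn: six multilayer perceptrons blended by an L2-regularized ("ridge") logistic regression. Architectures, hyperparameters, and the toy dataset are invented; the article's exact configuration may differ.

```python
# Hypothetical sketch of a stacking ensemble: six MLPs + ridge logistic
# regression as the blender (illustration only, not the article's setup).
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

base_models = [
    (f"mlp{i}", MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                              random_state=i))
    for i in range(6)  # six base multilayer perceptrons
]
ensemble = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(penalty="l2", C=1.0),  # "ridge" logistic
)

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
print(ensemble.fit(X, y).score(X, y))  # training accuracy on toy data
```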