DIVA: Dataset Derivative of a Learning Task
- URL: http://arxiv.org/abs/2111.09785v1
- Date: Thu, 18 Nov 2021 16:33:12 GMT
- Title: DIVA: Dataset Derivative of a Learning Task
- Authors: Yonatan Dukler, Alessandro Achille, Giovanni Paolini, Avinash
Ravichandran, Marzia Polito, Stefano Soatto
- Abstract summary: We present a method to compute the derivative of a learning task with respect to a dataset.
A learning task is a function from a training set to the validation error, which can be represented by a trained deep neural network (DNN).
The "dataset derivative" is a linear operator, computed around the trained model, that informs how perturbations of the weight of each training sample affect the validation error.
- Score: 108.18912044384213
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a method to compute the derivative of a learning task with respect
to a dataset. A learning task is a function from a training set to the
validation error, which can be represented by a trained deep neural network
(DNN). The "dataset derivative" is a linear operator, computed around the
trained model, that informs how perturbations of the weight of each training
sample affect the validation error, usually computed on a separate validation
dataset. Our method, DIVA (Differentiable Validation), hinges on a closed-form
differentiable expression of the leave-one-out cross-validation error around a
pre-trained DNN. Such an expression constitutes the dataset derivative. DIVA could
be used for dataset auto-curation, for example removing samples with faulty
annotations, augmenting a dataset with additional relevant samples, or
rebalancing. More generally, DIVA can be used to optimize the dataset, along
with the parameters of the model, as part of the training process without the
need for a separate validation dataset, unlike bi-level optimization methods
customary in AutoML. To illustrate the flexibility of DIVA, we report
experiments on sample auto-curation tasks such as outlier rejection, dataset
extension, and automatic aggregation of multi-modal data.
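The idea of a dataset derivative can be illustrated on a toy problem. The sketch below (an illustration, not the paper's closed-form DIVA expression) trains a weighted ridge regression, treats each training sample's weight as a variable, and estimates the derivative of the validation error with respect to each weight by central differences; a planted label outlier should receive a large positive derivative, signaling that down-weighting it would reduce validation error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a linear model stands in for a trained DNN.
n_tr, n_va, d = 20, 10, 3
w_true = np.array([1.0, -2.0, 0.5])
X_tr = rng.normal(size=(n_tr, d))
y_tr = X_tr @ w_true + 0.1 * rng.normal(size=n_tr)
X_va = rng.normal(size=(n_va, d))
y_va = X_va @ w_true + 0.1 * rng.normal(size=n_va)
y_tr[0] += 5.0  # corrupt one training label (a planted outlier)

lam = 1e-3  # ridge regularizer

def val_loss(pi):
    """Validation MSE of a ridge model trained with per-sample weights pi."""
    A = X_tr.T @ (pi[:, None] * X_tr) + lam * np.eye(d)
    w = np.linalg.solve(A, X_tr.T @ (pi * y_tr))
    return np.mean((X_va @ w - y_va) ** 2)

# Central-difference "dataset derivative" around uniform weights pi = 1.
# (DIVA obtains this quantity in closed form; finite differences are a
# stand-in here.)
pi0 = np.ones(n_tr)
eps = 1e-5
grad = np.zeros(n_tr)
for i in range(n_tr):
    e = np.zeros(n_tr)
    e[i] = eps
    grad[i] = (val_loss(pi0 + e) - val_loss(pi0 - e)) / (2 * eps)

# Up-weighting the corrupted sample should increase validation error,
# so its derivative should be the largest.
print(grad[0] > 0, int(np.argmax(grad)))
```

For auto-curation, one would then down-weight or drop the samples with the largest positive derivatives; DIVA's closed-form expression makes this differentiation cheap enough to fold into training.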
Related papers
- Derivative-based regularization for regression [3.0408645115035036]
We introduce a novel approach to regularization in multivariable regression problems.
Our regularizer, called DLoss, penalises differences between the model's derivatives and derivatives of the data generating function as estimated from the training data.
arXiv Detail & Related papers (2024-05-01T14:57:59Z)
- Distilled Datamodel with Reverse Gradient Matching [74.75248610868685]
We introduce an efficient framework for assessing data impact, comprising offline training and online evaluation stages.
Our proposed method achieves comparable model behavior evaluation while significantly speeding up the process compared to the direct retraining method.
arXiv Detail & Related papers (2024-04-22T09:16:14Z)
- Dataset Bias Mitigation in Multiple-Choice Visual Question Answering and Beyond [93.96982273042296]
Vision-language (VL) understanding tasks evaluate models' comprehension of complex visual scenes through multiple-choice questions.
We have identified two dataset biases that models can exploit as shortcuts to resolve various VL tasks correctly without proper understanding.
We propose Adversarial Data Synthesis (ADS) to generate synthetic training and debiased evaluation data.
We then introduce Intra-sample Counterfactual Training (ICT) to assist models in utilizing the synthesized training data, particularly the counterfactual data, via focusing on intra-sample differentiation.
arXiv Detail & Related papers (2023-10-23T08:09:42Z)
- MissDiff: Training Diffusion Models on Tabular Data with Missing Values [29.894691645801597]
This work presents a unified and principled diffusion-based framework for learning from data with missing values.
We first observe that the widely adopted "impute-then-generate" pipeline may lead to a biased learning objective.
We prove the proposed method is consistent in learning the score of data distributions, and the proposed training objective serves as an upper bound for the negative likelihood in certain cases.
arXiv Detail & Related papers (2023-07-02T03:49:47Z)
- Data Provenance via Differential Auditing [5.7962871424710665]
We introduce Data Provenance via Differential Auditing (DPDA), a practical framework for auditing data provenance.
We propose two effective auditing function implementations, an additive one and a multiplicative one.
We report evaluations on real-world data sets demonstrating the effectiveness of our proposed auditing technique.
arXiv Detail & Related papers (2022-09-04T06:02:25Z)
- A Penalty Approach for Normalizing Feature Distributions to Build Confounder-Free Models [11.818509522227565]
MetaData Normalization (MDN) estimates the linear relationship between the metadata and each feature based on a non-trainable closed-form solution.
We extend the MDN method by applying a Penalty approach (referred to as PMDN).
We show improvement in model accuracy and greater independence from confounders using PMDN over MDN in a synthetic experiment and a multi-label, multi-site dataset of magnetic resonance images (MRIs).
arXiv Detail & Related papers (2022-07-11T04:02:12Z)
- CAFA: Class-Aware Feature Alignment for Test-Time Adaptation [50.26963784271912]
Test-time adaptation (TTA) aims to address distribution shift by adapting a model to unlabeled data at test time.
We propose a simple yet effective feature alignment loss, termed as Class-Aware Feature Alignment (CAFA), which simultaneously encourages a model to learn target representations in a class-discriminative manner.
arXiv Detail & Related papers (2022-06-01T03:02:07Z)
- Learning to be a Statistician: Learned Estimator for Number of Distinct Values [54.629042119819744]
Estimating the number of distinct values (NDV) in a column is useful for many tasks in database systems.
In this work, we focus on how to derive accurate NDV estimations from random (online/offline) samples.
We propose to formulate the NDV estimation task in a supervised learning framework, and aim to learn a model as the estimator.
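The supervised framing can be sketched as follows, with hypothetical frequency-profile features and a plain least-squares regressor standing in for the paper's learned estimator: generate synthetic columns with known NDV, extract features from a random sample of each, and fit a model mapping features to the true NDV.

```python
import numpy as np

rng = np.random.default_rng(1)

def features(sample, n_total):
    """Hypothetical frequency-profile features of a random sample."""
    _, counts = np.unique(sample, return_counts=True)
    d = len(counts)                # distinct values seen in the sample
    f1 = int((counts == 1).sum())  # values appearing exactly once
    return np.array([1.0, d, f1, len(sample) / n_total])

# Synthetic training tasks: columns whose true NDV is (approximately) known.
Xf, y = [], []
for _ in range(200):
    ndv = int(rng.integers(10, 1000))
    col = rng.integers(0, ndv, size=5000)  # column with ~ndv distinct values
    sample = rng.choice(col, size=500, replace=False)
    Xf.append(features(sample, len(col)))
    y.append(len(np.unique(col)))          # exact NDV as the label
Xf, y = np.array(Xf), np.array(y, dtype=float)

# Learn a linear estimator by least squares (a stand-in for the paper's model).
theta, *_ = np.linalg.lstsq(Xf, y, rcond=None)
pred = Xf @ theta

print(round(float(np.corrcoef(pred, y)[0, 1]), 3))  # predictions track true NDV
```

A learned estimator can, in principle, outperform any single closed-form formula because it adapts to the sampling regime seen during training; the paper's actual model and features are richer than this linear sketch.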
arXiv Detail & Related papers (2022-02-06T15:42:04Z)
- Provably Efficient Causal Reinforcement Learning with Confounded Observational Data [135.64775986546505]
We study how to incorporate the dataset (observational data) collected offline, which is often abundantly available in practice, to improve the sample efficiency in the online setting.
We propose the deconfounded optimistic value iteration (DOVI) algorithm, which incorporates the confounded observational data in a provably efficient manner.
arXiv Detail & Related papers (2020-06-22T14:49:33Z)
- Deep Active Learning for Biased Datasets via Fisher Kernel Self-Supervision [5.352699766206807]
Active learning (AL) aims to minimize labeling efforts for data-demanding deep neural networks (DNNs).
We propose a low-complexity method for feature density matching using a self-supervised Fisher kernel (FK).
Our method outperforms state-of-the-art methods on MNIST, SVHN, and ImageNet classification while requiring only 1/10th of processing.
arXiv Detail & Related papers (2020-03-01T03:56:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.