TRIAGE: Characterizing and auditing training data for improved
regression
- URL: http://arxiv.org/abs/2310.18970v1
- Date: Sun, 29 Oct 2023 10:31:59 GMT
- Title: TRIAGE: Characterizing and auditing training data for improved
regression
- Authors: Nabeel Seedat, Jonathan Crabbé, Zhaozhi Qian, Mihaela van der Schaar
- Abstract summary: We introduce TRIAGE, a novel data characterization framework tailored to regression tasks and compatible with a broad class of regressors.
TRIAGE utilizes conformal predictive distributions to provide a model-agnostic scoring method, the TRIAGE score.
We show that TRIAGE's characterization is consistent and highlight its utility to improve performance via data sculpting/filtering, in multiple regression settings.
- Score: 80.11415390605215
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data quality is crucial for robust machine learning algorithms, with the
recent interest in data-centric AI emphasizing the importance of training data
characterization. However, current data characterization methods are largely
focused on classification settings, with regression settings largely
understudied. To address this, we introduce TRIAGE, a novel data
characterization framework tailored to regression tasks and compatible with a
broad class of regressors. TRIAGE utilizes conformal predictive distributions
to provide a model-agnostic scoring method, the TRIAGE score. We operationalize
the score to analyze individual samples' training dynamics and characterize
samples as under-, over-, or well-estimated by the model. We show that TRIAGE's
characterization is consistent and highlight its utility to improve performance
via data sculpting/filtering, in multiple regression settings. Additionally,
beyond sample level, we show TRIAGE enables new approaches to dataset selection
and feature acquisition. Overall, TRIAGE highlights the value unlocked by data
characterization in real-world regression applications.
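
The scoring idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a split-conformal predictive distribution built from held-out calibration residuals, evaluates that distribution at each sample's true label, and averages the result over training checkpoints. The function names, the 0.25/0.75 thresholds, the midpoint tie-breaking, and the convention that a high score means "under-estimated" are all assumptions made for illustration; the abstract does not specify them.

```python
from bisect import bisect_left
from statistics import mean

# Hypothetical calibration residuals (y - y_hat) from a held-out set, sorted.
CAL_RESIDUALS = sorted([-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0])


def cpd_score(y_true, y_pred, sorted_cal_residuals):
    """Split-conformal predictive distribution evaluated at the true label.

    Counts the calibration residuals strictly smaller than this sample's
    residual and smooths with the midpoint 0.5 (an assumption standing in
    for the usual uniform tie-breaking draw).
    """
    n = len(sorted_cal_residuals)
    below = bisect_left(sorted_cal_residuals, y_true - y_pred)
    return (below + 0.5) / (n + 1)


def characterize(scores_per_checkpoint, low=0.25, high=0.75):
    """Label a sample from its mean score across training checkpoints.

    Assumed convention: a high score means the true label lies in the upper
    tail of the predictive distribution, so the model predicts too low.
    """
    avg = mean(scores_per_checkpoint)
    if avg >= high:
        return "under-estimated"
    if avg <= low:
        return "over-estimated"
    return "well-estimated"


def triage_like_label(y_true, preds_per_checkpoint, sorted_cal_residuals):
    """Score one sample at every checkpoint, then characterize it."""
    scores = [cpd_score(y_true, p, sorted_cal_residuals)
              for p in preds_per_checkpoint]
    return characterize(scores)


if __name__ == "__main__":
    # Three toy samples, each with predictions from three checkpoints.
    print(triage_like_label(10.0, [6.0, 7.0, 7.5], CAL_RESIDUALS))
    print(triage_like_label(5.0, [5.2, 4.9, 5.0], CAL_RESIDUALS))
    print(triage_like_label(1.0, [4.0, 3.5, 3.0], CAL_RESIDUALS))
```

In a filtering setting of the kind the abstract calls data sculpting, one could keep only the samples labeled well-estimated and retrain on that subset. Whether that helps depends on the dataset and the thresholds chosen.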
Related papers
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, imposing a negative impact on training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
- How Hard is this Test Set? NLI Characterization by Exploiting Training Dynamics [49.9329723199239]
We propose a method for the automated creation of a challenging test set without relying on the manual construction of artificial and unrealistic examples.
We categorize the test set of popular NLI datasets into three difficulty levels by leveraging methods that exploit training dynamics.
When our characterization method is applied to the training set, models trained with only a fraction of the data achieve comparable performance to those trained on the full dataset.
arXiv Detail & Related papers (2024-10-04T13:39:21Z)
- Targeted synthetic data generation for tabular data via hardness characterization [0.0]
We introduce a novel augmentation pipeline that generates only high-value training points based on hardness characterization.
We show that synthetic data generators trained on the hardest points outperform non-targeted data augmentation on simulated data and on a large scale credit default prediction task.
arXiv Detail & Related papers (2024-10-01T14:54:26Z)
- Distilled Datamodel with Reverse Gradient Matching [74.75248610868685]
We introduce an efficient framework for assessing data impact, comprising offline training and online evaluation stages.
Our proposed method achieves comparable model behavior evaluation while significantly speeding up the process compared to the direct retraining method.
arXiv Detail & Related papers (2024-04-22T09:16:14Z)
- A Conditioned Unsupervised Regression Framework Attuned to the Dynamic Nature of Data Streams [0.0]
This paper presents an optimal strategy for streaming contexts with limited labeled data, introducing an adaptive technique for unsupervised regression.
The proposed method leverages a sparse set of initial labels and introduces an innovative drift detection mechanism.
To enhance adaptability, we integrate the ADWIN (ADaptive WINdowing) algorithm with error generalization based on Root Mean Square Error (RMSE).
arXiv Detail & Related papers (2023-12-12T19:23:54Z)
- Boosting Differentiable Causal Discovery via Adaptive Sample Reweighting [62.23057729112182]
Differentiable score-based causal discovery methods learn a directed acyclic graph from observational data.
We propose a model-agnostic framework to boost causal discovery performance by dynamically learning the adaptive weights for the Reweighted Score function, ReScore.
arXiv Detail & Related papers (2023-03-06T14:49:59Z)
- Towards Open-World Feature Extrapolation: An Inductive Graph Learning Approach [80.8446673089281]
We propose a new learning paradigm with graph representation and learning.
Our framework contains two modules: 1) a backbone network (e.g., feedforward neural nets) as a lower model takes features as input and outputs predicted labels; 2) a graph neural network as an upper model learns to extrapolate embeddings for new features via message passing over a feature-data graph built from observed data.
arXiv Detail & Related papers (2021-10-09T09:02:45Z)
- Variation-Incentive Loss Re-weighting for Regression Analysis on Biased Data [8.115323786541078]
We aim to improve the accuracy of the regression analysis by addressing the data skewness/bias during model training.
We propose a Variation-Incentive Loss re-weighting method (VILoss) to optimize the gradient descent-based model training for regression analysis.
arXiv Detail & Related papers (2021-09-14T10:22:21Z)
- RENT -- Repeated Elastic Net Technique for Feature Selection [0.46180371154032895]
We present the Repeated Elastic Net Technique (RENT) for Feature Selection.
RENT uses an ensemble of generalized linear models with elastic net regularization, each trained on distinct subsets of the training data.
RENT provides valuable information for model interpretation concerning the identification of objects in the data that are difficult to predict during training.
arXiv Detail & Related papers (2020-09-27T07:55:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.