TRIAGE: Characterizing and auditing training data for improved
regression
- URL: http://arxiv.org/abs/2310.18970v1
- Date: Sun, 29 Oct 2023 10:31:59 GMT
- Title: TRIAGE: Characterizing and auditing training data for improved
regression
- Authors: Nabeel Seedat, Jonathan Crabbé, Zhaozhi Qian, Mihaela van der Schaar
- Abstract summary: We introduce TRIAGE, a novel data characterization framework tailored to regression tasks and compatible with a broad class of regressors.
TRIAGE utilizes conformal predictive distributions to provide a model-agnostic scoring method, the TRIAGE score.
We show that TRIAGE's characterization is consistent and highlight its utility to improve performance via data sculpting/filtering, in multiple regression settings.
- Score: 80.11415390605215
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data quality is crucial for robust machine learning algorithms, with the
recent interest in data-centric AI emphasizing the importance of training data
characterization. However, current data characterization methods are largely
focused on classification settings, with regression settings largely
understudied. To address this, we introduce TRIAGE, a novel data
characterization framework tailored to regression tasks and compatible with a
broad class of regressors. TRIAGE utilizes conformal predictive distributions
to provide a model-agnostic scoring method, the TRIAGE score. We operationalize
the score to analyze individual samples' training dynamics and characterize
samples as under-, over-, or well-estimated by the model. We show that TRIAGE's
characterization is consistent and highlight its utility to improve performance
via data sculpting/filtering, in multiple regression settings. Additionally,
beyond sample level, we show TRIAGE enables new approaches to dataset selection
and feature acquisition. Overall, TRIAGE highlights the value unlocked by data
characterization in real-world regression applications.
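The scoring idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that a conformal predictive distribution is approximated by shifting a prediction by held-out calibration residuals, that the score for a sample is the fraction of that distribution at or below the true label, and that samples are grouped by the mean and spread of their scores across training checkpoints. The function names (`cpd_score`, `characterize`) and the thresholds (`low`, `high`, `max_spread`) are hypothetical choices made for illustration.

```python
from bisect import bisect_right
from statistics import mean, pstdev


def cpd_score(y_true, y_pred, calib_residuals):
    """Empirical predictive distribution evaluated at y_true.

    calib_residuals holds (y - y_hat) values from a held-out calibration
    set (an assumption of this sketch). The predictive distribution for a
    sample is {y_pred + r}; the score is the fraction of it <= y_true.
    """
    sorted_res = sorted(calib_residuals)
    # Residuals r with y_pred + r <= y_true, i.e. r <= y_true - y_pred.
    count = bisect_right(sorted_res, y_true - y_pred)
    return count / len(sorted_res)


def characterize(scores, low=0.25, high=0.75, max_spread=0.2):
    """Group one sample from its per-checkpoint scores.

    Thresholds are illustrative assumptions, not values from the paper.
    A consistently high score means the label sits above most of the
    predictive distribution, so the model under-estimates the sample.
    """
    avg = mean(scores)
    spread = pstdev(scores)
    if spread <= max_spread and avg >= high:
        return "under-estimated"
    if spread <= max_spread and avg <= low:
        return "over-estimated"
    return "well-estimated"


if __name__ == "__main__":
    residuals = [-1.0, -0.5, 0.0, 0.5, 1.0]  # toy calibration residuals
    y = 5.0                                   # true label of one sample
    preds = [2.0, 2.5, 3.0]                   # predictions at 3 checkpoints
    scores = [cpd_score(y, p, residuals) for p in preds]
    print(scores, characterize(scores))
```

For simplicity the sketch reuses one set of calibration residuals for every checkpoint; recomputing them per checkpoint would be the more faithful choice when the model changes substantially during training.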
Related papers
- Distilled Datamodel with Reverse Gradient Matching [74.75248610868685]
We introduce an efficient framework for assessing data impact, comprising offline training and online evaluation stages.
Our proposed method achieves comparable model behavior evaluation while significantly speeding up the process compared to the direct retraining method.
arXiv Detail & Related papers (2024-04-22T09:16:14Z)
- A Conditioned Unsupervised Regression Framework Attuned to the Dynamic Nature of Data Streams [0.0]
This paper presents an optimal strategy for streaming contexts with limited labeled data, introducing an adaptive technique for unsupervised regression.
The proposed method leverages a sparse set of initial labels and introduces an innovative drift detection mechanism.
To enhance adaptability, we integrate the ADWIN (ADaptive WINdowing) algorithm with error generalization based on Root Mean Square Error (RMSE).
arXiv Detail & Related papers (2023-12-12T19:23:54Z)
- Boosting Differentiable Causal Discovery via Adaptive Sample Reweighting [62.23057729112182]
Differentiable score-based causal discovery methods learn a directed acyclic graph from observational data.
We propose a model-agnostic framework to boost causal discovery performance by dynamically learning the adaptive weights for the Reweighted Score function, ReScore.
arXiv Detail & Related papers (2023-03-06T14:49:59Z)
- Temporal Output Discrepancy for Loss Estimation-based Active Learning [65.93767110342502]
We present a novel deep active learning approach that queries the oracle for data annotation when the unlabeled sample is believed to incorporate high loss.
Our approach outperforms state-of-the-art active learning methods on image classification and semantic segmentation tasks.
arXiv Detail & Related papers (2022-12-20T19:29:37Z)
- Towards Open-World Feature Extrapolation: An Inductive Graph Learning Approach [80.8446673089281]
We propose a new learning paradigm with graph representation and learning.
Our framework contains two modules: 1) a backbone network (e.g., feedforward neural nets) as a lower model takes features as input and outputs predicted labels; 2) a graph neural network as an upper model learns to extrapolate embeddings for new features via message passing over a feature-data graph built from observed data.
arXiv Detail & Related papers (2021-10-09T09:02:45Z)
- Variation-Incentive Loss Re-weighting for Regression Analysis on Biased Data [8.115323786541078]
We aim to improve the accuracy of the regression analysis by addressing the data skewness/bias during model training.
We propose a Variation-Incentive Loss re-weighting method (VILoss) to optimize the gradient descent-based model training for regression analysis.
arXiv Detail & Related papers (2021-09-14T10:22:21Z)
- Exploring the Efficacy of Automatically Generated Counterfactuals for Sentiment Analysis [17.811597734603144]
We propose an approach to automatically generating counterfactual data for data augmentation and explanation.
A comprehensive evaluation on several different datasets, using a variety of state-of-the-art benchmarks, demonstrates that our approach can achieve significant improvements in model performance.
arXiv Detail & Related papers (2021-06-29T10:27:01Z)
- RENT -- Repeated Elastic Net Technique for Feature Selection [0.46180371154032895]
We present the Repeated Elastic Net Technique (RENT) for Feature Selection.
RENT uses an ensemble of generalized linear models with elastic net regularization, each trained on distinct subsets of the training data.
RENT provides valuable information for model interpretation concerning the identification of objects in the data that are difficult to predict during training.
arXiv Detail & Related papers (2020-09-27T07:55:52Z)
- S^3-Rec: Self-Supervised Learning for Sequential Recommendation with Mutual Information Maximization [104.87483578308526]
We propose the model S3-Rec, which stands for Self-Supervised learning for Sequential Recommendation.
For our task, we devise four auxiliary self-supervised objectives to learn the correlations among attribute, item, subsequence, and sequence.
Extensive experiments conducted on six real-world datasets demonstrate the superiority of our proposed method over existing state-of-the-art methods.
arXiv Detail & Related papers (2020-08-18T11:44:10Z)