Evaluating Sample Utility for Data Selection by Mimicking Model Weights
- URL: http://arxiv.org/abs/2501.06708v2
- Date: Sun, 02 Feb 2025 18:34:01 GMT
- Title: Evaluating Sample Utility for Data Selection by Mimicking Model Weights
- Authors: Tzu-Heng Huang, Manjot Bilkhu, Frederic Sala, Javier Movellan
- Abstract summary: Foundation models are trained on large-scale web-crawled datasets, which often contain noise, biases, and irrelevant information.
We propose an efficient, model-based approach using the Mimic Score, a new data quality metric.
We develop Grad-Mimic, a framework that prioritizes samples for learning, creates effective filters, and automates data selection.
- Score: 12.056542160711718
- Abstract: Foundation models are trained on large-scale web-crawled datasets, which often contain noise, biases, and irrelevant information. This motivates the use of data selection techniques, which can be divided into model-free variants -- relying on heuristic rules and downstream datasets -- and model-based variants, e.g., those using influence functions. The former can be expensive to design and risk introducing unwanted dependencies, while the latter are often computationally prohibitive. Instead, we propose an efficient, model-based approach using the Mimic Score, a new data quality metric that leverages the weights of a reference model to assess the usefulness of individual samples for training a new model. It relies on the alignment between gradients and a target direction induced by the reference model. Using the derived Mimic Scores, we develop Grad-Mimic, a framework that prioritizes samples for learning, creates effective filters, and automates data selection. Empirically, using Mimic Scores to guide training improves data efficiency, results in consistent performance gains across six image datasets, and enhances CLIP model performance. Moreover, Mimic Score-based filters improve upon existing filtering methods, e.g., cutting 4.7 million samples to train better CLIP models while offering accurate estimation of training dataset quality.
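The abstract's key mechanism, scoring a sample by how well its gradient aligns with a direction induced by a reference model's weights, can be illustrated with a short sketch. The PyTorch snippet below is a minimal illustration only: it assumes the score is the cosine alignment between a sample's negative loss gradient and the vector from the current weights to the flattened reference weights; the function name, the per-sample loop, and the exact normalization are assumptions for exposition, not the paper's implementation.

```python
import torch

def mimic_scores(model, ref_params, xs, ys, loss_fn):
    """Illustrative per-sample scores: alignment between each sample's negative
    loss gradient and the direction from the current weights toward the reference
    weights. (Hypothetical definition; Grad-Mimic's exact normalization may differ.)

    ref_params can be built as, e.g.:
        ref_params = torch.nn.utils.parameters_to_vector(reference_model.parameters()).detach()
    """
    # Target direction induced by the reference model: current weights -> reference weights.
    current = torch.nn.utils.parameters_to_vector(model.parameters()).detach()
    direction = ref_params - current
    direction = direction / (direction.norm() + 1e-12)

    scores = []
    for x, y in zip(xs, ys):  # one sample at a time: explicit rather than efficient
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, list(model.parameters()))
        g = torch.cat([p.reshape(-1) for p in grads])
        # Higher score: a gradient step on this sample moves the weights toward the reference.
        scores.append(torch.dot(-g / (g.norm() + 1e-12), direction).item())
    return scores
```

Samples could then be ranked or thresholded by these scores to prioritize training data or build filters, which is the role the abstract describes for the Mimic Score within Grad-Mimic.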
Related papers
- Attribute-to-Delete: Machine Unlearning via Datamodel Matching [65.13151619119782]
Machine unlearning -- efficiently removing the effect of a small "forget set" of training data from a pre-trained machine learning model -- has recently attracted interest.
Recent research shows that existing machine unlearning techniques do not hold up in challenging settings.
arXiv Detail & Related papers (2024-10-30T17:20:10Z)
- When to Trust Your Data: Enhancing Dyna-Style Model-Based Reinforcement Learning With Data Filter [7.886307329450978]
Dyna-style algorithms combine model-based and model-free reinforcement learning, using simulated data from an estimated environment model to accelerate model-free training.
Previous works address this issue by using model ensembles or pretraining the estimated model with data collected from the real environment.
We introduce an out-of-distribution data filter that removes simulated data from the estimated model that significantly diverges from data collected in the real environment.
arXiv Detail & Related papers (2024-10-16T01:49:03Z)
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, imposing a negative impact on training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
- Distilled Datamodel with Reverse Gradient Matching [74.75248610868685]
We introduce an efficient framework for assessing data impact, comprising offline training and online evaluation stages.
Our proposed method achieves comparable model behavior evaluation while significantly speeding up the process compared to the direct retraining method.
arXiv Detail & Related papers (2024-04-22T09:16:14Z)
- A Two-Phase Recall-and-Select Framework for Fast Model Selection [13.385915962994806]
We propose a two-phase (coarse-recall and fine-selection) model selection framework.
It aims to enhance the efficiency of selecting a robust model by leveraging the models' training performances on benchmark datasets.
The proposed methodology is demonstrated to select a high-performing model about 3x faster than conventional baseline methods.
arXiv Detail & Related papers (2024-03-28T14:44:44Z)
- DsDm: Model-Aware Dataset Selection with Datamodels [81.01744199870043]
Standard practice is to filter for examples that match human notions of data quality.
We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data.
Our framework avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks.
arXiv Detail & Related papers (2024-01-23T17:22:00Z)
- Self-Supervised Dataset Distillation for Transfer Learning [77.4714995131992]
We propose a novel problem of distilling an unlabeled dataset into a set of small synthetic samples for efficient self-supervised learning (SSL).
We first prove that a gradient of synthetic samples with respect to an SSL objective in naive bilevel optimization is biased due to the randomness originating from data augmentations or masking.
We empirically validate the effectiveness of our method on various applications involving transfer learning.
arXiv Detail & Related papers (2023-10-10T10:48:52Z)
- Universal Domain Adaptation from Foundation Models: A Baseline Study [58.51162198585434]
We make empirical studies of state-of-the-art UniDA methods using foundation models.
We introduce CLIP distillation, a parameter-free method specifically designed to distill target knowledge from CLIP models.
Although simple, our method outperforms previous approaches in most benchmark tasks.
arXiv Detail & Related papers (2023-05-18T16:28:29Z)
- Deep Learning Models for Knowledge Tracing: Review and Empirical Evaluation [2.423547527175807]
We review and evaluate a body of deep learning knowledge tracing (DLKT) models with openly available and widely-used data sets.
The evaluated DLKT models were reimplemented to assess the replicability of previously reported results.
arXiv Detail & Related papers (2021-12-30T14:19:27Z)
- Training Experimentally Robust and Interpretable Binarized Regression Models Using Mixed-Integer Programming [3.179831861897336]
We present a model-based approach to training robust and interpretable binarized regression models for multiclass classification tasks.
Our MIP model balances the optimization of prediction margin and model size by using a weighted objective.
We show the effectiveness of training robust and interpretable binarized regression models using MIP.
arXiv Detail & Related papers (2021-12-01T11:53:08Z)
- Model-specific Data Subsampling with Influence Functions [37.64859614131316]
We develop a model-specific data subsampling strategy that improves over random sampling whenever training points have varying influence.
Specifically, we leverage influence functions to guide our selection strategy, proving theoretically and demonstrating empirically that our approach quickly selects high-quality models (a minimal illustrative sketch of the influence-function idea follows this list).
arXiv Detail & Related papers (2020-10-20T12:10:28Z)
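The last entry above relies on influence functions but, as a one-line summary, gives no detail. As a point of contrast with the gradient-alignment sketch earlier, here is a minimal, self-contained sketch of the classic first-order influence approximation for an L2-regularized logistic model. The function name, the logistic setting, and the regularization constant are assumptions chosen for illustration, not the cited paper's implementation.

```python
import numpy as np

def influence_scores(X, y, w, X_val, y_val, reg=1e-3):
    """Illustrative first-order influence of each training point on the validation
    loss of an L2-regularized logistic model (hypothetical setting, not the cited
    paper's code). X: (n, d) training features, y: (n,) labels in {0, 1},
    w: (d,) fitted weights, X_val/y_val: validation data."""
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Per-sample gradients of the regularized training loss at the fitted weights w.
    p = sigmoid(X @ w)
    grads = (p - y)[:, None] * X + reg * w          # shape (n, d)

    # Hessian of the mean regularized training loss at w.
    s = p * (1.0 - p)
    H = (X * s[:, None]).T @ X / len(y) + reg * np.eye(len(w))

    # Gradient of the mean validation loss at w.
    p_val = sigmoid(X_val @ w)
    g_val = (p_val - y_val) @ X_val / len(y_val) + reg * w

    # influence_i ~ -g_val^T H^{-1} g_i: negative values mark samples whose
    # upweighting is predicted to lower the validation loss (i.e., useful samples).
    return -grads @ np.linalg.solve(H, g_val)
```

Under this approximation, samples with strongly negative influence would be kept or upweighted and those with large positive influence become candidates for removal; the cited paper's actual subsampling strategy may differ.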