Related papers: How to Achieve Higher Accuracy with Less Training Points?

How to Achieve Higher Accuracy with Less Training Points?

URL: http://arxiv.org/abs/2504.13586v1
Date: Fri, 18 Apr 2025 09:38:26 GMT
Title: How to Achieve Higher Accuracy with Less Training Points?
Authors: Jinghan Yang, Anupam Pani, Yunchao Zhang,
Abstract summary: We propose a technique based on influence functions to determine which training samples should be included in the training set.<n>Our approach demonstrates performance comparable to that of training on the entire dataset while using only 10% of the data.
Score: 2.1834099301440526
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In the era of large-scale model training, the extensive use of available datasets has resulted in significant computational inefficiencies. To tackle this issue, we explore methods for identifying informative subsets of training data that can achieve comparable or even superior model performance. We propose a technique based on influence functions to determine which training samples should be included in the training set. We conducted empirical evaluations of our method on binary classification tasks utilizing logistic regression models. Our approach demonstrates performance comparable to that of training on the entire dataset while using only 10% of the data. Furthermore, we found that our method achieved even higher accuracy when trained with just 60% of the data.

Related papers

Z0-Inf: Zeroth Order Approximation for Data Influence [47.682602051124235]
We introduce a highly efficient zeroth-order approximation for estimating the influence of training data.<n>Our approach achieves superior accuracy in estimating self-influence and comparable or improved accuracy in estimating train-test influence for fine-tuned large language models.
arXiv Detail & Related papers (2025-10-13T18:30:37Z)
Effective Data Pruning through Score Extrapolation [40.61665742457229]
We introduce a novel importance score extrapolation framework that requires training on only a small subset of data.<n>We present two initial approaches in this framework to accurately predict sample importance for the entire dataset using patterns learned from this minimal subset.<n>Our results indicate that score extrapolation is a promising direction to scale expensive score calculation methods, such as pruning, data attribution, or other tasks.
arXiv Detail & Related papers (2025-06-10T17:38:49Z)
Capturing the Temporal Dependence of Training Data Influence [100.91355498124527]
We formalize the concept of trajectory-specific leave-one-out influence, which quantifies the impact of removing a data point during training.<n>We propose data value embedding, a novel technique enabling efficient approximation of trajectory-specific LOO.<n>As data value embedding captures training data ordering, it offers valuable insights into model training dynamics.
arXiv Detail & Related papers (2024-12-12T18:28:55Z)
What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy. By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z)
Distilled Datamodel with Reverse Gradient Matching [74.75248610868685]
We introduce an efficient framework for assessing data impact, comprising offline training and online evaluation stages. Our proposed method achieves comparable model behavior evaluation while significantly speeding up the process compared to the direct retraining method.
arXiv Detail & Related papers (2024-04-22T09:16:14Z)
How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training language models (LLMs) We find that Ask-LLM and Density sampling are the best methods in their respective categories. In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density are the best methods in their respective categories.
arXiv Detail & Related papers (2024-02-15T02:27:57Z)
Exploring Learning Complexity for Efficient Downstream Dataset Pruning [8.990878450631596]
Existing dataset pruning methods require training on the entire dataset. We propose a straightforward, novel, and training-free hardness score named Distorting-based Learning Complexity (DLC) Our method is motivated by the observation that easy samples learned faster can also be learned with fewer parameters.
arXiv Detail & Related papers (2024-02-08T02:29:33Z)
Online Importance Sampling for Stochastic Gradient Optimization [33.42221341526944]
We propose a practical algorithm that efficiently computes data importance on-the-fly during training.<n>We also introduce a novel metric based on the derivative of the loss w.r.t. the network output, designed for mini-batch importance sampling.
arXiv Detail & Related papers (2023-11-24T13:21:35Z)
Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences. We formulate each task as a sequence-to-sequence problem and perform multi-task training. We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z)
Topological Quality of Subsets via Persistence Matching Diagrams [0.196629787330046]
We measure the quality of a subset concerning the dataset it represents using topological data analysis techniques. In particular, this approach enables us to explain why the chosen subset is likely to result in poor performance of a supervised learning model.
arXiv Detail & Related papers (2023-06-04T17:08:41Z)
Dataset Pruning: Reducing Training Data by Examining Generalization Influence [30.30255670341501]
Do all training data contribute to model's performance? How to construct a smallest subset from the entire training data as a proxy training set without significantly sacrificing the model's performance?
arXiv Detail & Related papers (2022-05-19T05:36:35Z)
Learning to be a Statistician: Learned Estimator for Number of Distinct Values [54.629042119819744]
Estimating the number of distinct values (NDV) in a column is useful for many tasks in database systems. In this work, we focus on how to derive accurate NDV estimations from random (online/offline) samples. We propose to formulate the NDV estimation task in a supervised learning framework, and aim to learn a model as the estimator.
arXiv Detail & Related papers (2022-02-06T15:42:04Z)
DIVINE: Diverse Influential Training Points for Data Visualization and Model Refinement [32.045420977032926]
We propose a method to select a set of DIVerse INfluEntial (DIVINE) training points as a useful explanation of model behavior. Our method can identify unfairness-inducing training points, which can be removed to improve fairness outcomes.
arXiv Detail & Related papers (2021-07-13T10:50:58Z)

This list is automatically generated from the titles and abstracts of the papers in this site.