Measuring Sample Importance in Data Pruning for Training LLMs from a Data Compression Perspective
- URL: http://arxiv.org/abs/2406.14124v2
- Date: Fri, 21 Jun 2024 02:30:32 GMT
- Title: Measuring Sample Importance in Data Pruning for Training LLMs from a Data Compression Perspective
- Authors: Minsang Kim, Seungjun Baek,
- Abstract summary: We consider data pruning as a method of data-efficient training of large language models.
We leverage log-likelihood function of trained models as a surrogate to measure information content of samples.
- Score: 4.079147243688765
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Compute-efficient training of large language models (LLMs) has become an important research problem. In this work, we consider data pruning as a method of data-efficient training of LLMs, where we take a data compression view on data pruning. We argue that the amount of information of a sample, or the achievable compression on its description length, represents its sample importance. The key idea is that, less informative samples are likely to contain redundant information, and thus should be pruned first. We leverage log-likelihood function of trained models as a surrogate to measure information content of samples. Experiments reveal a surprising insight that information-based pruning can enhance the generalization capability of the model, improves upon language modeling and downstream tasks as compared to the model trained on the entire dataset.
Related papers
- TinyHelen's First Curriculum: Training and Evaluating Tiny Language Models in a Simpler Language Environment [30.93798042712827]
Training language models (LMs) and their application agents is increasingly costly due to large datasets and models.
We propose a pipeline to refine text data by eliminating noise, minimizing vocabulary, and maintaining genre-specific patterns.
Our experiments show that leaner pre-training boosts LM learning efficiency.
arXiv Detail & Related papers (2024-12-31T16:08:15Z) - Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z) - Beyond Efficiency: Molecular Data Pruning for Enhanced Generalization [30.738229850748137]
MolPeg is a Molecular data Pruning framework for enhanced Generalization.
It focuses on the source-free data pruning scenario, where data pruning is applied with pretrained models.
It consistently outperforms existing DP methods across four downstream tasks.
arXiv Detail & Related papers (2024-09-02T09:06:04Z) - Scaling Retrieval-Based Language Models with a Trillion-Token Datastore [85.4310806466002]
We find that increasing the size of the datastore used by a retrieval-based LM monotonically improves language modeling and several downstream tasks without obvious saturation.
By plotting compute-optimal scaling curves with varied datastore, model, and pretraining data sizes, we show that using larger datastores can significantly improve model performance for the same training compute budget.
arXiv Detail & Related papers (2024-07-09T08:27:27Z) - Entropy Law: The Story Behind Data Compression and LLM Performance [115.70395740286422]
We find that model performance is negatively correlated to the compression ratio of training data, which usually yields a lower training loss.
Based on the findings of the entropy law, we propose a quite efficient and universal data selection method.
We also present an interesting application of entropy law that can detect potential performance risks at the beginning of model training.
arXiv Detail & Related papers (2024-07-09T08:14:29Z) - Retrieval-Augmented Data Augmentation for Low-Resource Domain Tasks [66.87070857705994]
In low-resource settings, the amount of seed data samples to use for data augmentation is very small.
We propose a novel method that augments training data by incorporating a wealth of examples from other datasets.
This approach can ensure that the generated data is not only relevant but also more diverse than what could be achieved using the limited seed data alone.
arXiv Detail & Related papers (2024-02-21T02:45:46Z) - How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training language models (LLMs)
We find that Ask-LLM and Density sampling are the best methods in their respective categories.
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density are the best methods in their respective categories.
arXiv Detail & Related papers (2024-02-15T02:27:57Z) - Critical Data Size of Language Models from a Grokking Perspective [35.029074833552656]
We formalize the phase transition under the grokking configuration into the Data Efficiency Hypothesis.
We show that generalization occurs only when language models reach a critical size.
Our results deepen the understanding of language model training, offering a novel perspective on the role of data in the learning mechanism of language models.
arXiv Detail & Related papers (2024-01-19T03:24:36Z) - Influence Scores at Scale for Efficient Language Data Sampling [3.072340427031969]
"influence scores" are used to identify important subsets of data.
In this paper, we explore the applicability of influence scores in language classification tasks.
arXiv Detail & Related papers (2023-11-27T20:19:22Z) - Delving Deeper into Data Scaling in Masked Image Modeling [145.36501330782357]
We conduct an empirical study on the scaling capability of masked image modeling (MIM) methods for visual recognition.
Specifically, we utilize the web-collected Coyo-700M dataset.
Our goal is to investigate how the performance changes on downstream tasks when scaling with different sizes of data and models.
arXiv Detail & Related papers (2023-05-24T15:33:46Z) - Training Language Models with Language Feedback at Scale [50.70091340506957]
We introduce learning from Language Feedback (ILF), a new approach that utilizes more informative language feedback.
ILF consists of three steps that are applied iteratively: first, conditioning the language model on the input, an initial LM output, and feedback to generate refinements.
We show theoretically that ILF can be viewed as Bayesian Inference, similar to Reinforcement Learning from human feedback.
arXiv Detail & Related papers (2023-03-28T17:04:15Z) - Eeny, meeny, miny, moe. How to choose data for morphological inflection [8.914777617216862]
This paper explores four sampling strategies for the task of morphological inflection using a Transformer model.
We investigate the robustness of each strategy across 30 typologically diverse languages.
Our results show a clear benefit to selecting data based on model confidence and entropy.
arXiv Detail & Related papers (2022-10-26T04:33:18Z) - Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z) - Improving Classifier Training Efficiency for Automatic Cyberbullying
Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments.
The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
arXiv Detail & Related papers (2021-11-02T15:48:28Z) - Efficient Nearest Neighbor Language Models [114.40866461741795]
Non-parametric neural language models (NLMs) learn predictive distributions of text utilizing an external datastore.
We show how to achieve up to a 6x speed-up in inference speed while retaining comparable performance.
arXiv Detail & Related papers (2021-09-09T12:32:28Z) - Data Impressions: Mining Deep Models to Extract Samples for Data-free
Applications [26.48630545028405]
"Data Impressions" act as proxy to the training data and can be used to realize a variety of tasks.
We show the applicability of data impressions in solving several computer vision tasks.
arXiv Detail & Related papers (2021-01-15T11:37:29Z) - Fine-tuning BERT for Low-Resource Natural Language Understanding via
Active Learning [30.5853328612593]
In this work, we explore fine-tuning methods of BERT -- a pre-trained Transformer based language model.
Our experimental results show an advantage in model performance by maximizing the approximate knowledge gain of the model.
We analyze the benefits of freezing layers of the language model during fine-tuning to reduce the number of trainable parameters.
arXiv Detail & Related papers (2020-12-04T08:34:39Z) - Comparison of Interactive Knowledge Base Spelling Correction Models for
Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work shows a comparison of a neural model and character language models with varying amounts on target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z) - Parameter Space Factorization for Zero-Shot Learning across Tasks and
Languages [112.65994041398481]
We propose a Bayesian generative model for the space of neural parameters.
We infer the posteriors over such latent variables based on data from seen task-language combinations.
Our model yields comparable or better results than state-of-the-art, zero-shot cross-lingual transfer methods.
arXiv Detail & Related papers (2020-01-30T16:58:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.