Related papers: Self-Influence Guided Data Reweighting for Language Model Pre-training

URL: http://arxiv.org/abs/2311.00913v1
Date: Thu, 2 Nov 2023 01:00:46 GMT
Title: Self-Influence Guided Data Reweighting for Language Model Pre-training
Authors: Megh Thakkar, Tolga Bolukbasi, Sriram Ganapathy, Shikhar Vashishth, Sarath Chandar, Partha Talukdar
Abstract summary: Language Models (LMs) pre-trained with self-supervision on large text corpora have become the default starting point for developing models for various NLP tasks. All data samples in the corpus are treated with equal importance during LM pre-training. Due to varying levels of relevance and quality of data, equal importance to all the data samples may not be the optimal choice. We propose PRESENCE, a method for jointly reweighting samples by leveraging self-influence (SI) scores as an indicator of sample importance and pre-training.
Score: 46.57714637505164
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Language Models (LMs) pre-trained with self-supervision on large text corpora have become the default starting point for developing models for various NLP tasks. Once the pre-training corpus has been assembled, all data samples in the corpus are treated with equal importance during LM pre-training. However, due to varying levels of relevance and quality of data, equal importance to all the data samples may not be the optimal choice. While data reweighting has been explored in the context of task-specific supervised learning and LM fine-tuning, model-driven reweighting for pre-training data has not been explored. We fill this important gap and propose PRESENCE, a method for jointly reweighting samples by leveraging self-influence (SI) scores as an indicator of sample importance and pre-training. PRESENCE promotes novelty and stability for model pre-training. Through extensive analysis spanning multiple model sizes, datasets, and tasks, we present PRESENCE as an important first step in the research direction of sample reweighting for pre-training language models.

Related papers

ToReMi: Topic-Aware Data Reweighting for Dynamic Pre-Training Data Selection [28.75333303894706]
ToReMi is a novel framework that adjusts training sample weights according to their topical associations and observed learning patterns. Our experiments reveal that ToReMi variants consistently achieve superior performance over conventional pre-training approaches.
arXiv Detail & Related papers (2025-04-01T12:06:42Z)
Dynamic Loss-Based Sample Reweighting for Improved Large Language Model Pretraining [55.262510814326035]
Existing reweighting strategies primarily focus on group-level data importance. We introduce novel algorithms for dynamic, instance-level data reweighting. Our framework allows us to devise reweighting strategies deprioritizing redundant or uninformative data.
arXiv Detail & Related papers (2025-02-10T17:57:15Z)
Beyond Scaling: Measuring and Predicting the Upper Bound of Knowledge Retention in Language Model Pre-Training [51.41246396610475]
This paper aims to predict performance in closed-book question answering (QA) without the help of external tools. We conduct large-scale retrieval and semantic analysis across the pre-training corpora of 21 publicly available and 3 custom-trained large language models. Building on these foundations, we propose Size-dependent Mutual Information (SMI), an information-theoretic metric that linearly correlates pre-training data characteristics.
arXiv Detail & Related papers (2025-02-06T13:23:53Z)
Analyzing Similarity Metrics for Data Selection for Language Model Pretraining [39.02299450717135]
Measuring similarity between training examples is critical for curating high-quality and diverse pretraining datasets for language models. Standard off-the-shelf embedding models are not well-suited for the pretraining data curation setting.
arXiv Detail & Related papers (2025-02-04T17:09:44Z)
Measuring Pre-training Data Quality without Labels for Time Series Foundation Models [10.64362760848387]
We introduce contrastive accuracy, a new measure to evaluate the quality of the representation space learned by the foundation model. Our experiments reveal the positive correlation between the proposed measure and the accuracy of the model on a collection of downstream tasks.
arXiv Detail & Related papers (2024-12-09T10:38:30Z)
Data Shapley in One Training Run [88.59484417202454]
Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts. Existing approaches require re-training models on different data subsets, which is computationally intensive. This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest.
arXiv Detail & Related papers (2024-06-16T17:09:24Z)
How Many Pretraining Tasks Are Needed for In-Context Learning of Linear Regression? [92.90857135952231]
Transformers pretrained on diverse tasks exhibit remarkable in-context learning (ICL) capabilities. We study ICL in one of its simplest setups: pretraining a linearly parameterized single-layer linear attention model for linear regression.
arXiv Detail & Related papers (2023-10-12T15:01:43Z)
Examining the Effect of Pre-training on Time Series Classification [21.38211396933795]
This study investigates the impact of pre-training followed by fine-tuning on the fine-tuning process. We conducted a thorough examination of 150 classification datasets. We find that pre-training can only help improve the optimization process for models that fit the data poorly. Adding more pre-training data does not improve generalization, but it can strengthen the advantage of pre-training on the original data volume.
arXiv Detail & Related papers (2023-09-11T06:26:57Z)
SEPT: Towards Scalable and Efficient Visual Pre-Training [11.345844145289524]
Self-supervised pre-training has shown great potential in leveraging large-scale unlabeled data to improve downstream task performance. We build a task-specific self-supervised pre-training framework based on a simple hypothesis that pre-training on the unlabeled samples with similar distribution to the target task can bring substantial performance gains.
arXiv Detail & Related papers (2022-12-11T11:02:11Z)
Self-Distillation for Further Pre-training of Transformers [83.84227016847096]
We propose self-distillation as a regularization for a further pre-training stage. We empirically validate the efficacy of self-distillation on a variety of benchmark datasets for image and text classification tasks.
arXiv Detail & Related papers (2022-09-30T02:25:12Z)
Improved Fine-tuning by Leveraging Pre-training Data: Theory and Practice [52.11183787786718]
Fine-tuning a pre-trained model on the target data is widely used in many deep learning applications. Recent studies have empirically shown that training from scratch achieves final performance no worse than this pre-training strategy. We propose a novel selection strategy to select a subset from pre-training data to help improve the generalization on the target task.
arXiv Detail & Related papers (2021-11-24T06:18:32Z)
On the Transferability of Pre-trained Language Models: A Study from Artificial Datasets [74.11825654535895]
Pre-training language models (LMs) on large-scale unlabeled text data makes it much easier for the model to achieve exceptional downstream performance. We study what specific traits in the pre-training data, other than the semantics, make a pre-trained LM superior to its counterparts trained from scratch on downstream tasks.
arXiv Detail & Related papers (2021-09-08T10:39:57Z)
Reinforced Curriculum Learning on Pre-trained Neural Machine Translation Models [20.976165305749777]
We learn a curriculum for improving a pre-trained NMT model by re-selecting influential data samples from the original training set. We propose a data selection framework based on Deterministic Actor-Critic, in which a critic network predicts the expected change of model performance.
arXiv Detail & Related papers (2020-04-13T03:40:44Z)

This list is automatically generated from the titles and abstracts of the papers on this site.