To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis
- URL: http://arxiv.org/abs/2305.13230v2
- Date: Thu, 5 Oct 2023 14:58:02 GMT
- Title: To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis
- Authors: Fuzhao Xue, Yao Fu, Wangchunshu Zhou, Zangwei Zheng, Yang You
- Abstract summary: Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs.
We investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting.
We also examine the key factors contributing to multi-epoch degradation, finding that significant factors include dataset size, model parameters, and training objectives.
- Score: 50.31589712761807
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent research has highlighted the importance of dataset size in scaling
language models. However, large language models (LLMs) are notoriously
token-hungry during pre-training, and high-quality text data on the web is
approaching its scaling limit for LLMs. To further enhance LLMs, a
straightforward approach is to repeat the pre-training data for additional
epochs. In this study, we empirically investigate three key aspects under this
approach. First, we explore the consequences of repeating pre-training data,
revealing that the model is susceptible to overfitting, leading to multi-epoch
degradation. Second, we examine the key factors contributing to multi-epoch
degradation, finding that significant factors include dataset size, model
parameters, and training objectives, while less influential factors consist of
dataset quality and model FLOPs. Finally, we explore whether widely used
regularization can alleviate multi-epoch degradation. Most regularization
techniques do not yield significant improvements, except for dropout, which
demonstrates remarkable effectiveness but requires careful tuning when scaling
up the model size. Additionally, we discover that leveraging mixture-of-experts
(MoE) enables cost-effective and efficient hyper-parameter tuning for
computationally intensive dense LLMs with comparable trainable parameters,
potentially impacting efficient LLM development on a broader scale.
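
The data-repetition setup described in the abstract can be illustrated with a minimal sketch. It assumes a fixed token budget that exceeds the number of unique tokens available, so the data must be passed over more than once. The function names `epochs_needed` and `repeated_stream` and all numbers are hypothetical, chosen only for illustration; they are not taken from the paper.

```python
import random
from typing import Iterator, List, Sequence


def epochs_needed(token_budget: int, unique_tokens: int) -> int:
    """Passes over the data needed to reach the token budget (rounded up)."""
    if unique_tokens <= 0:
        raise ValueError("unique_tokens must be positive")
    # Integer ceiling division avoids float rounding on very large counts.
    return -(-token_budget // unique_tokens)


def repeated_stream(
    samples: Sequence[str], num_epochs: int, seed: int = 0
) -> Iterator[str]:
    """Yield every sample once per epoch, reshuffling the order each epoch."""
    rng = random.Random(seed)
    order: List[int] = list(range(len(samples)))
    for _ in range(num_epochs):
        rng.shuffle(order)
        for i in order:
            yield samples[i]


# Hypothetical example: a budget of 10 tokens over 4 unique tokens.
n_epochs = epochs_needed(token_budget=10, unique_tokens=4)
stream = list(repeated_stream(["doc_a", "doc_b"], num_epochs=n_epochs))
```

Every sample is seen exactly `num_epochs` times, which is the repetition regime in which the abstract reports overfitting and multi-epoch degradation.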
Related papers
- LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive.
Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones.
We propose LESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z)
- LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws [21.053622641336744]
Loss-to-loss scaling laws relate losses across pretraining datasets and downstream tasks.
Our experiments reveal that the pretraining data and tokenizer determine the scaling trend.
arXiv Detail & Related papers (2025-02-17T18:45:25Z)
- Dynamic Loss-Based Sample Reweighting for Improved Large Language Model Pretraining [55.262510814326035]
Existing reweighting strategies primarily focus on group-level data importance.
We introduce novel algorithms for dynamic, instance-level data reweighting.
Our framework allows us to devise reweighting strategies deprioritizing redundant or uninformative data.
arXiv Detail & Related papers (2025-02-10T17:57:15Z)
- Curriculum-style Data Augmentation for LLM-based Metaphor Detection [3.864321514889099]
We propose a method for metaphor detection by fine-tuning open-source LLMs.
Our method achieves state-of-the-art performance across all baselines.
arXiv Detail & Related papers (2024-12-04T02:05:21Z)
- Accelerating Large Language Model Pretraining via LFR Pedagogy: Learn, Focus, and Review [50.78587571704713]
Learn-Focus-Review (LFR) is a dynamic training approach that adapts to the model's learning progress.
LFR tracks the model's learning performance across data blocks (sequences of tokens) and prioritizes revisiting challenging regions of the dataset.
Compared to baseline models trained on the full datasets, LFR consistently achieved lower perplexity and higher accuracy.
arXiv Detail & Related papers (2024-09-10T00:59:18Z)
- Breaking Language Barriers: Cross-Lingual Continual Pre-Training at Scale [18.015805664219673]
We explore an alternative approach to constructing a Large Language Model by continually pretraining (CPT) from existing pretrained LLMs.
We find that CPT converges faster and saves significant resources in a scalable manner.
The effectiveness of transfer at scale is influenced by training duration and linguistic properties, while robust to data replaying.
arXiv Detail & Related papers (2024-07-02T10:06:41Z)
- Curated LLM: Synergy of LLMs and Data Curation for tabular augmentation in low-data regimes [57.62036621319563]
We introduce CLLM, which leverages the prior knowledge of Large Language Models (LLMs) for data augmentation in the low-data regime.
We demonstrate the superior performance of CLLM in the low-data regime compared to conventional generators.
arXiv Detail & Related papers (2023-12-19T12:34:46Z)
- Scaling Relationship on Learning Mathematical Reasoning with Large Language Models [75.29595679428105]
We investigate how the pre-training loss, supervised data amount, and augmented data amount influence the reasoning performances of a supervised LLM.
We find that rejection samples from multiple models push LLaMA-7B to an accuracy of 49.3% on GSM8K, significantly outperforming the supervised fine-tuning (SFT) accuracy of 35.9%.
arXiv Detail & Related papers (2023-08-03T15:34:01Z)