To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis
- URL: http://arxiv.org/abs/2305.13230v2
- Date: Thu, 5 Oct 2023 14:58:02 GMT
- Title: To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis
- Authors: Fuzhao Xue, Yao Fu, Wangchunshu Zhou, Zangwei Zheng, Yang You
- Abstract summary: Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs.
First, we investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting, leading to multi-epoch degradation.
Second, we examine the key factors contributing to multi-epoch degradation, finding that significant factors include dataset size, model parameters, and training objectives.
- Score: 50.31589712761807
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent research has highlighted the importance of dataset size in scaling
language models. However, large language models (LLMs) are notoriously
token-hungry during pre-training, and high-quality text data on the web is
approaching its scaling limit for LLMs. To further enhance LLMs, a
straightforward approach is to repeat the pre-training data for additional
epochs. In this study, we empirically investigate three key aspects under this
approach. First, we explore the consequences of repeating pre-training data,
revealing that the model is susceptible to overfitting, leading to multi-epoch
degradation. Second, we examine the key factors contributing to multi-epoch
degradation, finding that significant factors include dataset size, model
parameters, and training objectives, while less influential factors consist of
dataset quality and model FLOPs. Finally, we explore whether widely used
regularization can alleviate multi-epoch degradation. Most regularization
techniques do not yield significant improvements, except for dropout, which
demonstrates remarkable effectiveness but requires careful tuning when scaling
up the model size. Additionally, we discover that leveraging mixture-of-experts
(MoE) enables cost-effective and efficient hyper-parameter tuning for
computationally intensive dense LLMs with comparable trainable parameters,
potentially impacting efficient LLM development on a broader scale.
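The setup the abstract describes is easy to picture in code. The following is a minimal sketch, not the authors' released code: a tiny Transformer language model trained for several epochs over the same small fixed corpus, with dropout, the one regularizer the paper reports as notably effective against multi-epoch degradation, exposed as an explicit knob. All model sizes, data, and hyper-parameters below are illustrative placeholders.

```python
# Minimal sketch: repeated epochs over a fixed corpus, regularized with dropout.
import torch
import torch.nn as nn

vocab_size, d_model, seq_len = 1000, 128, 64

class TinyLM(nn.Module):
    def __init__(self, dropout=0.1):          # dropout is the key knob here
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4,
                                           dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        x = self.embed(tokens)
        # Causal mask so each position only attends to earlier tokens.
        sz = tokens.size(1)
        mask = torch.triu(torch.full((sz, sz), float("-inf")), diagonal=1)
        return self.head(self.encoder(x, mask=mask))

model = TinyLM(dropout=0.1)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

# A fixed, small "corpus": the same batches are revisited every epoch,
# mimicking data repetition under a limited token budget.
corpus = [torch.randint(0, vocab_size, (8, seq_len)) for _ in range(10)]

for epoch in range(4):                        # repeating data = multiple epochs
    for tokens in corpus:
        logits = model(tokens[:, :-1])        # next-token prediction
        loss = loss_fn(logits.reshape(-1, vocab_size),
                       tokens[:, 1:].reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: last-batch loss {loss.item():.3f}")
```

In the paper's framing, turning dropout up is what keeps the gap between training and validation loss from widening as the same tokens are revisited, though the abstract notes this requires careful tuning at larger model sizes.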
Related papers
- Scaling Laws for Predicting Downstream Performance in LLMs [75.28559015477137]
This work focuses on pre-training loss as a more efficient metric for performance estimation.
We extend the power law analytical function to predict domain-specific pre-training loss based on FLOPs across data sources.
We employ a two-layer neural network to model the non-linear relationship between multiple domain-specific losses and downstream performance.
arXiv Detail & Related papers (2024-10-11T04:57:48Z)
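The first step in the entry above, extrapolating pre-training loss from FLOPs with a power law, can be sketched with a standard curve fit. The saturating form L(C) = a * C^(-b) + c and every data point below are assumptions chosen for illustration, not the cited paper's reported fit; the paper's second stage, mapping several domain-specific losses to downstream performance with a small two-layer network, is not shown.

```python
# Hypothetical sketch: fit a saturating power law relating compute to loss.
# The data points are synthetic and lie exactly on a known curve.
import numpy as np
from scipy.optimize import curve_fit

def power_law(compute, a, b, c):
    # L(C) = a * C^(-b) + c, with C measured in units of 1e18 FLOPs
    return a * compute ** (-b) + c

compute = np.array([1.0, 10.0, 100.0, 1000.0, 10000.0])  # 1e18 .. 1e22 FLOPs
loss = np.array([3.30, 2.40, 1.95, 1.73, 1.61])          # synthetic losses

(a, b, c), _ = curve_fit(power_law, compute, loss, p0=[1.0, 0.5, 1.0])
print(f"L(C) ~= {a:.2f} * C^(-{b:.2f}) + {c:.2f}")
print("extrapolated loss at 1e23 FLOPs:", round(power_law(1e5, a, b, c), 3))
```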
- Empirical Insights on Fine-Tuning Large Language Models for Question-Answering [50.12622877002846]
Large language models (LLMs) encode extensive world knowledge through pre-training on massive datasets, which can be fine-tuned for the question-answering (QA) task.
We categorize supervised fine-tuning (SFT) data based on the extent of knowledge memorized by the pretrained LLMs.
Our experiments show that as few as 60 data points during the SFT stage can activate the knowledge encoded during pre-training, enabling LLMs to perform the QA task.
arXiv Detail & Related papers (2024-09-24T07:38:38Z)
- Breaking Language Barriers: Cross-Lingual Continual Pre-Training at Scale [18.015805664219673]
We explore an alternative approach to constructing a Large Language Model by continual pre-training (CPT) from existing pretrained LLMs.
We find that CPT converges faster and saves significant resources in a scalable manner.
The effectiveness of transfer at scale is influenced by training duration and linguistic properties, while remaining robust to data replaying.
arXiv Detail & Related papers (2024-07-02T10:06:41Z)
- Self-training Large Language Models through Knowledge Detection [26.831873737733737]
Large language models (LLMs) often necessitate extensive labeled datasets and training compute to achieve impressive performance across downstream tasks.
This paper explores a self-training paradigm, where the LLM autonomously curates its own labels and selectively trains on unknown data samples.
Empirical evaluations demonstrate significant improvements in reducing hallucination in generation across multiple subjects.
arXiv Detail & Related papers (2024-06-17T07:25:09Z)
- Curated LLM: Synergy of LLMs and Data Curation for tabular augmentation in low-data regimes [57.62036621319563]
We introduce CLLM, which leverages the prior knowledge of Large Language Models (LLMs) for data augmentation in the low-data regime.
We demonstrate the superior performance of CLLM in the low-data regime compared to conventional generators.
arXiv Detail & Related papers (2023-12-19T12:34:46Z)
- Scaling Relationship on Learning Mathematical Reasoning with Large Language Models [75.29595679428105]
We investigate how the pre-training loss, supervised data amount, and augmented data amount influence the reasoning performances of a supervised LLM.
We find that rejection sampling from multiple models pushes LLaMA-7B to 49.3% accuracy on GSM8K, significantly outperforming the supervised fine-tuning (SFT) accuracy of 35.9%.
arXiv Detail & Related papers (2023-08-03T15:34:01Z)
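The rejection-sampling recipe in the entry above is simple to sketch. In the toy code below, `generate` is a stand-in for any LLM sampling call, the answer format follows GSM8K's "####" marker convention, and everything else is an illustrative assumption rather than the cited paper's implementation.

```python
# Toy sketch of rejection sampling for SFT data (not the cited paper's code):
# sample several candidate solutions per problem, keep only those whose final
# answer matches the reference, and fine-tune on the survivors.
import random
import re

def generate(question):
    """Stand-in for an LLM sampling call returning a chain-of-thought solution."""
    answer = random.choice([14, 15, 16])            # pretend reasoning
    return f"Step-by-step reasoning for '{question}' ... #### {answer}"

def final_answer(solution):
    match = re.search(r"####\s*(-?\d+)", solution)  # GSM8K-style answer marker
    return match.group(1) if match else None

def rejection_sample(question, reference, k=8):
    """Keep only sampled solutions whose final answer matches the reference."""
    candidates = [generate(question) for _ in range(k)]
    return [s for s in candidates if final_answer(s) == reference]

kept = rejection_sample("Tom has 7 apples and buys 8 more. How many now?", "15")
print(f"kept {len(kept)} of 8 samples as fine-tuning data")
```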
- Improving Classifier Training Efficiency for Automatic Cyberbullying Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments.
The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
arXiv Detail & Related papers (2021-11-02T15:48:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.