Towards Effective and Efficient Continual Pre-training of Large Language Models
- URL: http://arxiv.org/abs/2407.18743v1
- Date: Fri, 26 Jul 2024 13:55:21 GMT
- Title: Towards Effective and Efficient Continual Pre-training of Large Language Models
- Authors: Jie Chen, Zhipeng Chen, Jiapeng Wang, Kun Zhou, Yutao Zhu, Jinhao Jiang, Yingqian Min, Wayne Xin Zhao, Zhicheng Dou, Jiaxin Mao, Yankai Lin, Ruihua Song, Jun Xu, Xu Chen, Rui Yan, Zhewei Wei, Di Hu, Wenbing Huang, Ji-Rong Wen
- Abstract summary: Continual pre-training (CPT) has been an important approach for adapting language models to specific domains or tasks.
This paper presents a technical report for continually pre-training Llama-3 (8B).
It significantly enhances the Chinese language ability and scientific reasoning ability of the backbone model.
- Score: 163.34610964970258
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Continual pre-training (CPT) has been an important approach for adapting language models to specific domains or tasks. To make the CPT approach more traceable, this paper presents a technical report for continually pre-training Llama-3 (8B), which significantly enhances the Chinese language ability and scientific reasoning ability of the backbone model. To enhance the new abilities while retaining the original abilities, we design specific data mixture and curriculum strategies by utilizing existing datasets and synthesizing high-quality datasets. Specifically, we synthesize multidisciplinary scientific question and answer (QA) pairs based on related web pages, and subsequently incorporate these synthetic data to improve the scientific reasoning ability of Llama-3. We refer to the model after CPT as Llama-3-SynE (Synthetic data Enhanced Llama-3). We also present the tuning experiments with a relatively small model -- TinyLlama, and employ the derived findings to train the backbone model. Extensive experiments on a number of evaluation benchmarks show that our approach can largely improve the performance of the backbone models, including both the general abilities (+8.81 on C-Eval and +6.31 on CMMLU) and the scientific reasoning abilities (+12.00 on MATH and +4.13 on SciEval), without hurting the original capacities. Our model, data, and codes are available at https://github.com/RUC-GSAI/Llama-3-SynE.
Related papers
- Multi-Armed Bandit Approach for Optimizing Training on Synthetic Data [7.603659241572307]
We propose a novel UCB-based training procedure combined with a dynamic usability metric.
Our proposed metric integrates low-level and high-level information from synthetic images and their corresponding real and synthetic datasets.
We show that our metric is an effective way to rank synthetic images based on their usability.
arXiv Detail & Related papers (2024-12-06T23:36:36Z)
- Evaluating Language Models as Synthetic Data Generators [74.80905172696366]
AgoraBench is a benchmark that provides standardized settings and metrics to evaluate LMs' data generation abilities.
Through synthesizing 1.26 million training instances using 6 LMs and training 99 student models, we uncover key insights about LMs' data generation capabilities.
arXiv Detail & Related papers (2024-12-04T19:20:32Z)
- Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z)
- Enhancing SLM via ChatGPT and Dataset Augmentation [0.3844771221441211]
We employ knowledge distillation-based techniques and synthetic dataset augmentation to bridge the performance gap between large language models (LLMs) and small language models (SLMs).
Our methods involve two forms of rationale generation--information extraction and informed reasoning--to enrich the ANLI dataset.
Our findings reveal that the incorporation of synthetic rationales significantly improves the model's ability to comprehend natural language, leading to 1.3% and 2.3% higher classification accuracy, respectively, on the ANLI dataset.
arXiv Detail & Related papers (2024-09-19T09:24:36Z)
- Accelerating Large Language Model Pretraining via LFR Pedagogy: Learn, Focus, and Review [50.78587571704713]
Learn-Focus-Review (LFR) is a dynamic training approach that adapts to the model's learning progress.
LFR tracks the model's learning performance across data blocks (sequences of tokens) and prioritizes revisiting challenging regions of the dataset.
Compared to baseline models trained on the full datasets, LFR consistently achieved lower perplexity and higher accuracy.
arXiv Detail & Related papers (2024-09-10T00:59:18Z)
- POINTS: Improving Your Vision-language Model with Affordable Strategies [28.611705477757454]
We train a robust baseline model using the latest advancements in vision-language models.
We filter pre-training data using perplexity, selecting the lowest perplexity data for training.
During visual instruction tuning, we used model soup on different datasets when adding more datasets yielded marginal improvements.
arXiv Detail & Related papers (2024-09-07T13:41:37Z)
- On Machine Learning Approaches for Protein-Ligand Binding Affinity Prediction [2.874893537471256]
This study evaluates the performance of classical tree-based models and advanced neural networks in protein-ligand binding affinity prediction.
We show that combining 2D and 3D model strengths improves active learning outcomes beyond current state-of-the-art approaches.
arXiv Detail & Related papers (2024-07-15T13:06:00Z)
- The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z)
- The effect of data augmentation and 3D-CNN depth on Alzheimer's Disease detection [51.697248252191265]
This work summarizes and strictly observes best practices regarding data handling, experimental design, and model evaluation.
We focus on Alzheimer's Disease (AD) detection, which serves as a paradigmatic example of a challenging problem in healthcare.
Within this framework, we train 15 predictive models, considering three different data augmentation strategies and five distinct 3D CNN architectures.
arXiv Detail & Related papers (2023-09-13T10:40:41Z)
- How much pretraining data do language models need to learn syntax? [12.668478784932878]
Transformer-based pretrained language models achieve outstanding results in many well-known NLU benchmarks.
We study the impact of pretraining data size on the knowledge of the models using RoBERTa.
arXiv Detail & Related papers (2021-09-07T15:51:39Z)