Experience of Training a 1.7B-Parameter LLaMa Model From Scratch
- URL: http://arxiv.org/abs/2412.13335v2
- Date: Fri, 20 Dec 2024 16:00:33 GMT
- Title: Experience of Training a 1.7B-Parameter LLaMa Model From Scratch
- Authors: Miles Q. Li, Benjamin C. M. Fung, Shih-Chia Huang,
- Abstract summary: We share insights gained from training DMaS-LLaMa-Lite on approximately 20 billion tokens of data.
We chronicle the full training trajectory, documenting how evolving validation loss levels and downstream benchmarks reflect transitions from incoherent text to fluent, contextually grounded output.
By detailing these experiences and offering training logs, checkpoints, and sample outputs, we aim to guide future researchers and practitioners in refining their pretraining strategies.
- Score: 10.39475177812483
- License:
- Abstract: Pretraining large language models is a complex endeavor influenced by multiple factors, including model architecture, data quality, training continuity, and hardware constraints. In this paper, we share insights gained from the experience of training DMaS-LLaMa-Lite, a fully open source, 1.7-billion-parameter, LLaMa-based model, on approximately 20 billion tokens of carefully curated data. We chronicle the full training trajectory, documenting how evolving validation loss levels and downstream benchmarks reflect transitions from incoherent text to fluent, contextually grounded output. Beyond pretraining, we extend our analysis to include a post-training phase focused on instruction tuning, where the model was refined to produce more contextually appropriate, user-aligned responses. We highlight practical considerations such as the importance of restoring optimizer states when resuming from checkpoints, and the impact of hardware changes on training stability and throughput. While qualitative evaluation provides an intuitive understanding of model improvements, our analysis extends to various performance benchmarks, demonstrating how high-quality data and thoughtful scaling enable competitive results with significantly fewer training tokens. By detailing these experiences and offering training logs, checkpoints, and sample outputs, we aim to guide future researchers and practitioners in refining their pretraining strategies. The training script is available on Github at https://github.com/McGill-DMaS/DMaS-LLaMa-Lite-Training-Code. The model checkpoints are available on Huggingface at https://huggingface.co/collections/McGill-DMaS/dmas-llama-lite-6761d97ba903f82341954ceb.
Related papers
- What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy.
By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z) - Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate [118.37653302885607]
We present the Modality Integration Rate (MIR), an effective, robust, and generalized metric to indicate the multi-modal pre-training quality of Large Vision Language Models (LVLMs)
MIR is indicative about training data selection, training strategy schedule, and model architecture design to get better pre-training results.
arXiv Detail & Related papers (2024-10-09T17:59:04Z) - Accelerating Large Language Model Pretraining via LFR Pedagogy: Learn, Focus, and Review [50.78587571704713]
Learn-Focus-Review (LFR) is a dynamic training approach that adapts to the model's learning progress.
LFR tracks the model's learning performance across data blocks (sequences of tokens) and prioritizes revisiting challenging regions of the dataset.
Compared to baseline models trained on the full datasets, LFR consistently achieved lower perplexity and higher accuracy.
arXiv Detail & Related papers (2024-09-10T00:59:18Z) - The Fine Line: Navigating Large Language Model Pretraining with Down-streaming Capability Analysis [27.310894780313618]
This paper undertakes a comprehensive comparison of model capabilities at various pretraining intermediate checkpoints.
We confirm that specific downstream metrics exhibit similar training dynamics across models of different sizes.
In addition to our core findings, we've reproduced Amber and OpenLLaMA, releasing their intermediate checkpoints.
arXiv Detail & Related papers (2024-04-01T16:00:01Z) - Take the Bull by the Horns: Hard Sample-Reweighted Continual Training
Improves LLM Generalization [165.98557106089777]
A key challenge is to enhance the capabilities of large language models (LLMs) amid a looming shortage of high-quality training data.
Our study starts from an empirical strategy for the light continual training of LLMs using their original pre-training data sets.
We then formalize this strategy into a principled framework of Instance-Reweighted Distributionally Robust Optimization.
arXiv Detail & Related papers (2024-02-22T04:10:57Z) - Architecture, Dataset and Model-Scale Agnostic Data-free Meta-Learning [117.48444197402858]
We propose ePisode cUrriculum inveRsion (ECI) during data-free meta training and invErsion calibRation following inner loop (ICFIL) during meta testing.
ECI adaptively increases the difficulty level of pseudo episodes according to the real-time feedback of the meta model.
We formulate the optimization process of meta training with ECI as an adversarial form in an end-to-end manner.
arXiv Detail & Related papers (2023-03-20T15:10:41Z) - Feeding What You Need by Understanding What You Learned [54.400455868448695]
Machine Reading (MRC) reveals the ability to understand a given text passage and answer questions based on it.
Existing research works in MRC rely heavily on large-size models and corpus to improve the performance evaluated by metrics such as Exact Match.
We argue that a deep understanding of model capabilities and data properties can help us feed a model with appropriate training data.
arXiv Detail & Related papers (2022-03-05T14:15:59Z) - Training Dynamics for Text Summarization Models [45.62439188988816]
We analyze the training dynamics for generation models, focusing on news summarization.
Across different datasets (CNN/DM, XSum, MediaSum) and summary properties, we study what the model learns at different stages of its fine-tuning process.
We find that properties such as copy behavior are learnt earlier in the training process and these observations are robust across domains.
On the other hand, factual errors, such as hallucination of unsupported facts, are learnt in the later stages, and this behavior is more varied across domains.
arXiv Detail & Related papers (2021-10-15T21:13:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.