Training Foundation Models on a Full-Stack AMD Platform: Compute, Networking, and System Design
- URL: http://arxiv.org/abs/2511.17127v1
- Date: Fri, 21 Nov 2025 10:44:02 GMT
- Title: Training Foundation Models on a Full-Stack AMD Platform: Compute, Networking, and System Design
- Authors: Quentin Anthony, Yury Tokpanov, Skyler Szot, Srivatsan Rajagopal, Praneeth Medepalli, Rishi Iyer, Vasu Shyam, Anna Golubeva, Ansh Chaurasia, Xiao Yang, Tomas Figliolia, Robert Washbourne, Drew Thorstensen, Amartey Pearson, Zack Grossbart, Jason van Patten, Emad Barsoum, Zhenyu Gu, Yao Fu, Beren Millidge,
- Abstract summary: We report on the first large-scale mixture-of-experts (MoE) pretraining study on pure AMD hardware.<n>We distill practical guidance for both systems and model design.<n>ZAYA1-base performance comparable to leading base models such as Qwen3-4B and Gemma3-12B at its scale and larger.
- Score: 26.12152103450326
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We report on the first large-scale mixture-of-experts (MoE) pretraining study on pure AMD hardware, utilizing both MI300X GPUs with Pollara interconnect. We distill practical guidance for both systems and model design. On the systems side, we deliver a comprehensive cluster and networking characterization: microbenchmarks for all core collectives (all-reduce, reduce-scatter, all-gather, broadcast) across message sizes and GPU counts on Pollara. To our knowledge, this is the first at this scale. We further provide MI300X microbenchmarks on kernel sizing and memory bandwidth to inform model design. On the modeling side, we introduce and apply MI300X-aware transformer sizing rules for attention and MLP blocks and justify MoE widths that jointly optimize training throughput and inference latency. We describe our training stack in depth, including often-ignored utilities such as fault-tolerance and checkpoint-reshaping, as well as detailed information on our training recipe. We also provide a preview of our model architecture and base model - ZAYA1 (760M active, 8.3B total parameters MoE) - which will be further improved upon in forthcoming papers. ZAYA1-base achieves performance comparable to leading base models such as Qwen3-4B and Gemma3-12B at its scale and larger, and outperforms models including Llama-3-8B and OLMoE across reasoning, mathematics, and coding benchmarks. Together, these results demonstrate that the AMD hardware, network, and software stack are mature and optimized enough for competitive large-scale pretraining.
Related papers
- INTELLECT-3: Technical Report [5.3998786788822]
INTELLECT-3 is a Mixture-of-Experts model (12B active) trained with large-scale reinforcement learning.<n>We open-source the model together with the full infrastructure stack used to create it, including RL frameworks.<n>We introduce prime-rl, an open framework for large-scale asynchronous reinforcement learning.
arXiv Detail & Related papers (2025-12-18T03:57:01Z) - Efficient Fine-Grained GPU Performance Modeling for Distributed Deep Learning of LLM [11.87842612818933]
Training Large Language Models (LLMs) is one of the most compute-intensive tasks in high-performance computing.<n>We present a framework to predict end-to-end training time for multi-billion parameter models distributed across hundreds of GPU.<n>Our framework achieves low average prediction errors-4.98% on Perlmutter(A100) and 9.38% on Vista(GH200)-for models up to 20B parameters across 128 GPU.
arXiv Detail & Related papers (2025-09-26T18:38:25Z) - MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining [60.02032710118597]
We present MiMo-7B, a large language model born for reasoning tasks, with optimization across both pre-training and post-training stages.<n>MiMo-7B-Base is pre-trained on 25 trillion tokens, with additional Multi-Token Prediction objective for enhanced performance and accelerated inference speed.<n>The final RL-tuned model, MiMo-7B-RL, achieves superior performance on mathematics, code and general reasoning tasks, surpassing the performance of OpenAI o1-mini.
arXiv Detail & Related papers (2025-05-12T14:30:11Z) - Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs [111.69640966866059]
Sparse large language models (LLMs) with Mixture of Experts (MoE) and close to a trillion parameters are dominating the realm of most capable language models.<n>In this paper, we aim to uncover a recipe to harness such scale on Ascend NPUs.<n>The key goals are better usage of the computing resources under the dynamic sparse model structures and materializing the expected performance gain on the actual hardware.
arXiv Detail & Related papers (2025-05-07T15:46:36Z) - 2 OLMo 2 Furious [154.15728448754854]
We present OLMo 2, the next generation of our fully open language models.<n> OLMo 2 includes a family of dense autoregressive language models at 7B, 13B and 32B scales.<n>We describe our modified model architecture and training recipe.
arXiv Detail & Related papers (2024-12-31T21:55:10Z) - OmniBal: Towards Fast Instruction-Tuning for Vision-Language Models via Omniverse Computation Balance [67.37017498784748]
Large-scale 3D parallel training on vision-language instruction-tuning models leads to an imbalanced computation load across different devices.<n>We rebalance the computational load from data, model, and memory perspectives, achieving more balanced computation across devices.<n>Our method's efficacy and generalizability are further validated across various models and datasets.
arXiv Detail & Related papers (2024-07-30T12:02:58Z) - Compact Language Models via Pruning and Knowledge Distillation [61.56557874432008]
Minitron models exhibit up to a 16% improvement in MMLU scores compared to training from scratch.
Deriving 8B and 4B models from an already pretrained 15B model using our approach requires up to 40x fewer training tokens per model compared to training from scratch.
arXiv Detail & Related papers (2024-07-19T21:47:57Z) - Asymmetric Masked Distillation for Pre-Training Small Foundation Models [52.56257450614992]
Self-supervised foundation models have shown great potential in computer vision thanks to the pre-training paradigm of masked autoencoding.
This paper focuses on pre-training relatively small vision transformer models that could be efficiently adapted to downstream tasks.
We propose a new asymmetric masked distillation (AMD) framework for pre-training relatively small models with autoencoding.
arXiv Detail & Related papers (2023-11-06T14:44:34Z) - MAD Max Beyond Single-Node: Enabling Large Machine Learning Model Acceleration on Distributed Systems [6.8519529064678375]
Training and deploying large-scale machine learning models is time-consuming, requires significant distributed computing infrastructures, and incurs high operational costs.
To minimize this outstanding communication latency and other inherent at-scale inefficiencies, we introduce an agile performance modeling framework, MAD-Max.
This framework is designed to optimize parallelization strategies and facilitate hardware-software co-design opportunities.
arXiv Detail & Related papers (2023-10-04T13:00:53Z) - MoESys: A Distributed and Efficient Mixture-of-Experts Training and Inference System for Internet Services [32.278096820269816]
We present a novel MoESys that boosts efficiency in both large-scale training and inference.
Specifically, in the training procedure, the proposed MoESys adopts an Elastic MoE training strategy with 2D prefetch and Fusion communication over Hierarchical storage.
For scalable inference in a single node, MoESys builds the CPU-GPU memory jointly into a ring of sections to load the model, and executes the computation tasks across the memory sections in a round-robin manner for efficient inference.
arXiv Detail & Related papers (2022-05-20T09:09:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.