AXLearn: Modular Large Model Training on Heterogeneous Infrastructure
- URL: http://arxiv.org/abs/2507.05411v2
- Date: Wed, 09 Jul 2025 20:10:51 GMT
- Title: AXLearn: Modular Large Model Training on Heterogeneous Infrastructure
- Authors: Mark Lee, Tom Gunter, Chang Lan, John Peebles, Hanzhi Zhou, Kelvin Zou, Sneha Bangalore, Chung-Cheng Chiu, Nan Du, Xianzhi Du, Philipp Dufter, Ruixuan Hou, Haoshuo Huang, Dongseong Hwang, Xiang Kong, Jinhao Lei, Tao Lei, Meng Li, Li Li, Jiarui Lu, Zhiyun Lu, Yiping Ma, David Qiu, Vivek Rathod, Senyu Tong, Zhucheng Tu, Jianyu Wang, Yongqiang Wang, Zirui Wang, Floris Weers, Sam Wiseman, Guoli Yin, Bowen Zhang, Xiyou Zhou, Danyang Zhuo, Cheng Leong, Ruoming Pang,
- Abstract summary: AXLearn is a production deep learning system that facilitates scalable and high-performance training of large deep learning models. Compared to other state-of-the-art deep learning systems, AXLearn has a unique focus on modularity and support for heterogeneous hardware infrastructure.
- Score: 64.33868455931301
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We design and implement AXLearn, a production deep learning system that facilitates scalable and high-performance training of large deep learning models. Compared to other state-of-the-art deep learning systems, AXLearn has a unique focus on modularity and support for heterogeneous hardware infrastructure. AXLearn's internal interfaces between software components follow strict encapsulation, allowing different components to be assembled to facilitate rapid model development and experimentation on heterogeneous compute infrastructure. We introduce a novel method of quantifying modularity via Lines-of-Code (LoC)-complexity, which demonstrates how our system maintains constant complexity as we scale the components in the system, compared to linear or quadratic complexity in other systems. This allows integrating features such as Rotary Position Embeddings (RoPE) into AXLearn across hundreds of modules with just 10 lines of code, compared to the hundreds of lines required in other systems. At the same time, AXLearn maintains performance equivalent to state-of-the-art training systems. Finally, we share our experience in the development and operation of AXLearn.
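The constant LoC-complexity claim follows from the encapsulation the abstract describes: a cross-cutting feature such as RoPE is added as one new component plus a config override, rather than as edits to every attention implementation. The sketch below is a minimal, hypothetical illustration of that composition-via-configuration pattern, not AXLearn's actual API; the names (PosEmbedding, RotaryEmbedding, AttentionCfg, pos_emb) are assumptions invented for this example.

```python
# Hypothetical sketch of composition via configuration, in the spirit of
# AXLearn's strict encapsulation. All names here are illustrative
# assumptions, not the real AXLearn API.
from dataclasses import dataclass, field

import numpy as np


class PosEmbedding:
    """Interface: map queries plus positions to position-aware queries."""

    def apply(self, q: np.ndarray, positions: np.ndarray) -> np.ndarray:
        raise NotImplementedError


class NoPosEmbedding(PosEmbedding):
    """Default child component: injects no positional information."""

    def apply(self, q, positions):
        return q


class RotaryEmbedding(PosEmbedding):
    """Minimal RoPE: rotates feature pairs by position-dependent angles.

    Assumes an even feature dimension.
    """

    def __init__(self, base: float = 10000.0):
        self.base = base

    def apply(self, q, positions):
        half = q.shape[-1] // 2
        freqs = self.base ** (-np.arange(half) / half)  # (half,)
        angles = positions[:, None] * freqs[None, :]    # (seq, half)
        cos, sin = np.cos(angles), np.sin(angles)
        q1, q2 = q[..., :half], q[..., half:]
        return np.concatenate(
            [q1 * cos - q2 * sin, q1 * sin + q2 * cos], axis=-1
        )


@dataclass
class AttentionCfg:
    """Attention config whose positional embedding is a swappable child."""

    dim: int = 64
    pos_emb: PosEmbedding = field(default_factory=NoPosEmbedding)

    def build(self) -> "AttentionLayer":
        return AttentionLayer(self)


class AttentionLayer:
    def __init__(self, cfg: AttentionCfg):
        self.cfg = cfg

    def position_aware_queries(self, q, positions):
        # The layer never special-cases RoPE; it only calls the interface.
        return self.cfg.pos_emb.apply(q, positions)


# Integrating RoPE is one config override, however many modules reuse the layer.
cfg = AttentionCfg(dim=64, pos_emb=RotaryEmbedding())
layer = cfg.build()
out = layer.position_aware_queries(np.ones((8, 64)), np.arange(8))
print(out.shape)  # (8, 64)
```

Under this pattern, enabling RoPE touches only the new class and a single config line, regardless of how many modules reuse AttentionCfg; a system that inlines positional logic into each attention implementation instead pays a cost proportional to the number of call sites, which matches the constant-versus-linear contrast the abstract draws.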
Related papers
- XxaCT-NN: Structure Agnostic Multimodal Learning for Materials Science [0.27185251060695437]
We propose a scalable framework that learns directly from elemental composition and X-ray diffraction (XRD). Our architecture integrates modality-specific encoders with a cross-attention fusion module and is trained on the 5-million-sample Alexandria dataset. Our results establish a path toward structure-free, experimentally grounded foundation models for materials science.
arXiv Detail & Related papers (2025-06-27T21:45:56Z)
- Adaptive Orchestration of Modular Generative Information Access Systems [59.102816309859584]
We argue that the architecture of future modular generative information access systems will not just assemble powerful components, but enable a self-organizing system. This perspective urges the IR community to rethink modular system designs for developing adaptive, self-optimizing, and future-ready architectures.
arXiv Detail & Related papers (2025-04-24T11:35:43Z)
- AutoHete: An Automatic and Efficient Heterogeneous Training System for LLMs [68.99086112477565]
Transformer-based large language models (LLMs) have demonstrated exceptional capabilities in sequence modeling and text generation. Existing heterogeneous training methods significantly expand the scale of trainable models but introduce substantial communication overheads and CPU workloads. We propose AutoHete, an automatic and efficient heterogeneous training system compatible with both single-GPU and multi-GPU environments.
arXiv Detail & Related papers (2025-02-27T14:46:22Z)
- Hierarchical and Decoupled BEV Perception Learning Framework for Autonomous Driving [52.808273563372126]
This paper proposes a novel hierarchical BEV perception paradigm, aiming to provide a library of fundamental perception modules and a user-friendly graphical interface.
We adopt a Pretrain-Finetune strategy to effectively utilize large-scale public datasets and streamline development processes.
We also present a Multi-Module Learning (MML) approach, enhancing performance through synergistic and iterative training of multiple models.
arXiv Detail & Related papers (2024-07-17T11:17:20Z)
- Modular Deep Learning [120.36599591042908]
Transfer learning has recently become the dominant paradigm of machine learning.
It remains unclear how to develop models that specialise towards multiple tasks without incurring negative interference.
Modular deep learning has emerged as a promising solution to these challenges.
arXiv Detail & Related papers (2023-02-22T18:11:25Z)
- Learning Modular Simulations for Homogeneous Systems [23.355189771765644]
We present a modular simulation framework for modeling homogeneous multibody dynamical systems.
An arbitrary number of modules can be combined to simulate systems of a variety of coupling topologies.
We show that our models can be transferred to new system configurations with lower data requirements and training effort, compared to those trained from scratch.
arXiv Detail & Related papers (2022-10-28T17:48:01Z)
- ModLaNets: Learning Generalisable Dynamics via Modularity and Physical Inductive Bias [14.474273671369584]
We propose a structural neural network framework with modularity and physical inductive bias.
This framework models the energy of each element using modularity and then constructs the target dynamical system via Lagrangian mechanics.
We examine our framework for modelling double-pendulum or three-body systems with small training datasets.
arXiv Detail & Related papers (2022-06-24T14:54:25Z)
- A unified software/hardware scalable architecture for brain-inspired computing based on self-organizing neural models [6.072718806755325]
We develop an original brain-inspired neural model associating Self-Organizing Maps (SOM) and Hebbian learning in the Reentrant SOM (ReSOM) model.
This work also demonstrates the distributed and scalable nature of the model through both simulation results and hardware execution on a dedicated FPGA-based platform.
arXiv Detail & Related papers (2022-01-06T22:02:19Z)
- XY Neural Networks [0.0]
We show how to build complex structures for machine learning based on the XY model's nonlinear blocks.
The final target is to reproduce deep learning architectures that can perform complicated tasks.
arXiv Detail & Related papers (2021-03-31T17:47:10Z)
- S2RMs: Spatially Structured Recurrent Modules [105.0377129434636]
We take a step towards exploiting dynamic structure that is capable of simultaneously exploiting both modular and spatiotemporal structures.
We find our models to be robust to the number of available views and better capable of generalization to novel tasks without additional training.
arXiv Detail & Related papers (2020-07-13T17:44:30Z)