veScale-FSDP: Flexible and High-Performance FSDP at Scale
- URL: http://arxiv.org/abs/2602.22437v2
- Date: Fri, 27 Feb 2026 03:28:11 GMT
- Title: veScale-FSDP: Flexible and High-Performance FSDP at Scale
- Authors: Zezhou Wang, Youjie Li, Zhiqi Lin, Jiacheng Yang, Cong Xie, Guanyu Feng, Zheng Zhong, Ziyue Huang, Hongyu Zhu, Zhi Zhang, Yanghua Peng, Xin Liu
- Abstract summary: We introduce veScale-FSDP, a redesigned FSDP system that couples a flexible sharding format, RaggedShard, with a structure-aware planning algorithm. veScale-FSDP achieves 5-66% higher throughput and 16-30% lower memory usage than existing FSDP systems.
- Score: 14.906460825287162
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Fully Sharded Data Parallel (FSDP), also known as ZeRO, is widely used for training large-scale models, valued for its flexibility and minimal intrusion on model code. However, current FSDP systems struggle with structure-aware training methods (e.g., block-wise quantized training) and with non-element-wise optimizers (e.g., Shampoo and Muon) used in cutting-edge models (e.g., Gemini, Kimi K2): FSDP's fixed element- or row-wise sharding formats conflict with these block-structured computations. In addition, today's implementations fall short in communication and memory efficiency, limiting scaling to tens of thousands of GPUs. We introduce veScale-FSDP, a redesigned FSDP system that couples a flexible sharding format, RaggedShard, with a structure-aware planning algorithm to deliver both flexibility and performance at scale. veScale-FSDP natively supports the efficient data placement required by FSDP, empowering block-wise quantization and non-element-wise optimizers. As a result, veScale-FSDP achieves 5-66% higher throughput and 16-30% lower memory usage than existing FSDP systems, while scaling efficiently to tens of thousands of GPUs.
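To make the sharding conflict concrete, the sketch below contrasts classic flat row-/element-wise FSDP sharding, which can slice a quantization tile across ranks, with a block-aligned layout of the kind a RaggedShard-style format would provide. This is a minimal illustration under assumed names and a round-robin tile placement, not veScale-FSDP's actual implementation.

```python
# Minimal sketch (not veScale-FSDP's actual API): classic flat sharding can
# cut a quantization tile across ranks, while a block-aligned layout keeps
# every tile whole on one rank, as block-wise quantization requires.
import torch

def rowwise_shard(weight: torch.Tensor, world_size: int, rank: int) -> torch.Tensor:
    """Classic FSDP/ZeRO-style sharding: flatten and split evenly.

    A quantization tile (e.g., a 128x128 block) may straddle two ranks,
    so its block-wise scale cannot be computed locally.
    """
    flat = weight.reshape(-1)
    shard_len = (flat.numel() + world_size - 1) // world_size
    return flat[rank * shard_len : (rank + 1) * shard_len]

def block_aligned_shard(weight: torch.Tensor, world_size: int, rank: int,
                        block: int = 128) -> list[torch.Tensor]:
    """Block-aligned sharding: distribute whole (block x block) tiles.

    Every tile lives entirely on one rank, so block-wise quantization and
    structure-aware optimizers always see complete blocks. Per-rank tile
    counts may be ragged when tiles do not divide evenly across ranks.
    """
    rows, cols = weight.shape
    tiles = [weight[i:i + block, j:j + block]
             for i in range(0, rows, block)
             for j in range(0, cols, block)]
    return tiles[rank::world_size]  # round-robin whole tiles to ranks

w = torch.randn(256, 384)
print(rowwise_shard(w, world_size=4, rank=0).shape)       # flat slice; cuts tiles
print(len(block_aligned_shard(w, world_size=4, rank=0)))  # whole tiles only
```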
Related papers
- PT$^2$-LLM: Post-Training Ternarization for Large Language Models [52.4629647715623]
Large Language Models (LLMs) have shown impressive capabilities across diverse tasks, but their large memory and compute demands hinder deployment. We propose PT$^2$-LLM, a post-training ternarization framework tailored for LLMs. At its core is an Asymmetric Ternary Quantizer equipped with a two-stage refinement pipeline.
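As a rough illustration of what a ternarizer does, the sketch below maps weights to three levels {-a_neg, 0, +a_pos} with separate positive and negative scales; the TWN-style threshold heuristic and the scale formulas are assumptions, and PT$^2$-LLM's actual Asymmetric Ternary Quantizer and two-stage refinement are more involved.

```python
# Hedged sketch of asymmetric ternarization (illustrative only).
import torch

def ternarize_asymmetric(w: torch.Tensor, delta_ratio: float = 0.7):
    """Map weights to {-a_neg, 0, +a_pos}.

    delta_ratio * mean(|w|) is a common ternarization threshold heuristic;
    the per-sign scales are the mean magnitudes of the retained weights.
    """
    delta = delta_ratio * w.abs().mean()
    pos, neg = w > delta, w < -delta
    a_pos = w[pos].mean() if pos.any() else w.new_zeros(())
    a_neg = w[neg].abs().mean() if neg.any() else w.new_zeros(())
    w_q = torch.zeros_like(w)
    w_q[pos], w_q[neg] = a_pos, -a_neg
    return w_q, (a_pos, a_neg, delta)

w = torch.randn(4, 8)
w_q, _ = ternarize_asymmetric(w)
print((w - w_q).pow(2).mean())  # quantization error to be refined further
```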
arXiv Detail & Related papers (2025-09-27T03:01:48Z) - FlashDP: Private Training Large Language Models with Efficient DP-SGD [11.39146065534273]
This paper introduces FlashDP, an innovative cache-friendly per-layer DP-SGD that consolidates the necessary operations into a single task, calculating gradients only once in a fused manner.
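For context, the sketch below shows the naive per-sample DP-SGD step that fused implementations like FlashDP accelerate: one backward pass per example, per-sample gradient clipping, then calibrated Gaussian noise. This is a baseline illustration, not FlashDP's fused kernel, and the model and hyperparameters are placeholders.

```python
# Baseline DP-SGD (naive: one backward per example). Fused systems such as
# FlashDP avoid this per-example cost by computing gradients once.
import torch

def dp_sgd_step(model, loss_fn, xs, ys, clip: float = 1.0,
                sigma: float = 0.5, lr: float = 0.1):
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]
    for x, y in zip(xs, ys):  # per-example gradients
        grads = torch.autograd.grad(loss_fn(model(x[None]), y[None]), params)
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = (clip / (norm + 1e-6)).clamp(max=1.0)  # clip L2 norm to <= clip
        for s, g in zip(summed, grads):
            s += g * scale
    with torch.no_grad():  # noise calibrated to the clipping bound
        for p, s in zip(params, summed):
            p -= lr * (s + sigma * clip * torch.randn_like(s)) / len(xs)

model = torch.nn.Linear(10, 1)
xs, ys = torch.randn(4, 10), torch.randn(4, 1)
dp_sgd_step(model, torch.nn.functional.mse_loss, xs, ys)
```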
arXiv Detail & Related papers (2025-07-01T19:28:37Z) - MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism [26.923312725688735]
Mixture-of-Experts (MoE) showcases tremendous potential to scale large language models (LLMs) with enhanced performance and reduced computational complexity. We present MegaScale-Infer, an efficient and cost-effective system for serving large-scale MoE models.
arXiv Detail & Related papers (2025-04-03T04:20:44Z) - FLAMES: A Hybrid Spiking-State Space Model for Adaptive Memory Retention in Event-Based Learning [16.60622265961373]
FLAMES is a hybrid framework integrating structured state-space dynamics with event-driven computation. By bridging neuromorphic computing and structured sequence modeling, FLAMES enables scalable long-range reasoning in event-driven systems.
arXiv Detail & Related papers (2025-04-02T00:08:19Z) - Democratizing AI: Open-source Scalable LLM Training on GPU-based Supercomputers [65.35142508909892]
We present a novel four-dimensional hybrid parallel algorithm implemented in a highly scalable, portable, open-source framework called AxoNN. We demonstrate fine-tuning of a 405-billion-parameter LLM using AxoNN on Frontier.
arXiv Detail & Related papers (2025-02-12T06:05:52Z) - SimpleFSDP: Simpler Fully Sharded Data Parallel with torch.compile [7.544642148576768]
SimpleFSDP is a PyTorch-native compiler-based Fully Sharded Data Parallel (FSDP) framework.
It has a simple implementation for maintainability and composability, allows full computation-communication graph tracing, and brings performance enhancements via compiler backend optimizations.
It also features the first-of-its-kind intermediate representation (IR) nodes bucketing and reordering in the TorchInductor backend for effective computation-communication overlapping.
arXiv Detail & Related papers (2024-11-01T00:43:54Z) - Transforming Image Super-Resolution: A ConvFormer-based Efficient Approach [58.57026686186709]
We introduce the Convolutional Transformer layer (ConvFormer) and propose a ConvFormer-based Super-Resolution network (CFSR).
CFSR inherits the advantages of both convolution-based and transformer-based approaches.
Experiments demonstrate that CFSR strikes an optimal balance between computational cost and performance.
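The blurb does not spell out the layer design, but a common way to hybridize convolutions and transformers is a depthwise-conv token mixer in place of self-attention plus a transformer-style channel MLP; the sketch below is that generic recipe, an assumption rather than CFSR's actual architecture.

```python
# Illustrative ConvFormer-style block (generic conv/transformer hybrid;
# CFSR's actual layer design may differ).
import torch
import torch.nn as nn

class ConvFormerBlock(nn.Module):
    def __init__(self, dim: int, kernel: int = 7, mlp_ratio: int = 2):
        super().__init__()
        self.norm1 = nn.BatchNorm2d(dim)
        # Large-kernel depthwise conv mixes spatial tokens cheaply.
        self.mixer = nn.Conv2d(dim, dim, kernel, padding=kernel // 2, groups=dim)
        self.norm2 = nn.BatchNorm2d(dim)
        self.mlp = nn.Sequential(  # transformer-style channel MLP as 1x1 convs
            nn.Conv2d(dim, dim * mlp_ratio, 1), nn.GELU(),
            nn.Conv2d(dim * mlp_ratio, dim, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.mixer(self.norm1(x))   # residual token mixing
        return x + self.mlp(self.norm2(x))  # residual channel mixing

x = torch.randn(1, 32, 48, 48)
print(ConvFormerBlock(32)(x).shape)  # torch.Size([1, 32, 48, 48])
```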
arXiv Detail & Related papers (2024-01-11T03:08:00Z) - Towards A Unified View of Sparse Feed-Forward Network in Pretraining Large Language Model [58.9100867327305]
Large and sparse feed-forward layers (S-FFN) have proven effective in scaling up Transformer model size for pretraining large language models.
We analyzed two major design choices of S-FFN: the memory block (a.k.a. expert) size and the memory block selection method.
We found that a simpler selection method, Avg-K, which selects blocks through their mean aggregated hidden states, achieves lower perplexity in language model pretraining.
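One plausible reading of Avg-K, sketched below under assumed shapes: score each memory block by the dot product of the input with the mean of that block's hidden (key) vectors, then route to the top-k blocks; the paper's exact formulation may differ.

```python
# Hedged sketch of Avg-K-style block selection (shapes are assumptions).
import torch

def avg_k_select(x: torch.Tensor, keys: torch.Tensor, k: int) -> torch.Tensor:
    """x: (d,) input token; keys: (n_blocks, block_size, d) memory keys.

    Each block is scored by x . mean(block keys); the top-k blocks activate.
    """
    block_means = keys.mean(dim=1)  # (n_blocks, d)
    scores = block_means @ x        # (n_blocks,)
    return scores.topk(k).indices   # indices of selected blocks

x = torch.randn(64)
keys = torch.randn(16, 32, 64)      # 16 memory blocks of 32 keys each
print(avg_k_select(x, keys, k=2))
```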
arXiv Detail & Related papers (2023-05-23T12:28:37Z) - PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel [19.24542340170026]
We introduce PyTorch Fully Sharded Data Parallel (FSDP) as an industry-grade solution for large model training.
FSDP provides support for significantly larger models with near-linear scalability in terms of TFLOPS.
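Since the subject here is the torch.distributed.fsdp API itself, a minimal usage sketch may help; it assumes a torchrun launch with one GPU per rank, and the toy model, optimizer, and data are placeholders.

```python
# Minimal PyTorch FSDP usage (run with: torchrun --nproc_per_node=N this.py).
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")  # torchrun supplies rank/world-size env vars
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
model = FSDP(model.cuda())  # parameters sharded across ranks

opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 1024, device="cuda")
loss = model(x).pow(2).mean()  # placeholder objective
loss.backward()                # gradients reduce-scattered across ranks
opt.step()
dist.destroy_process_group()
```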
arXiv Detail & Related papers (2023-04-21T23:52:27Z) - Efficient Parallel Split Learning over Resource-constrained Wireless Edge Networks [44.37047471448793]
In this paper, we advocate the integration of the edge computing paradigm and parallel split learning (PSL).
We propose an innovative PSL framework, namely, efficient parallel split learning (EPSL) to accelerate model training.
We show that the proposed EPSL framework significantly decreases the training latency needed to achieve a target accuracy.
arXiv Detail & Related papers (2023-03-26T16:09:48Z) - Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization enabling maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z) - Joint Superposition Coding and Training for Federated Learning over Multi-Width Neural Networks [52.93232352968347]
This paper aims to integrate two synergetic technologies: federated learning (FL) and width-adjustable slimmable neural networks (SNNs).
FL preserves data privacy by exchanging the locally trained models of mobile devices. Training SNNs under FL is non-trivial, however, particularly over wireless connections with time-varying channel conditions.
We propose a communication and energy-efficient SNN-based FL (named SlimFL) that jointly utilizes superposition coding (SC) for global model aggregation and superposition training (ST) for updating local models.
arXiv Detail & Related papers (2021-12-05T11:17:17Z)