SimpleFSDP: Simpler Fully Sharded Data Parallel with torch.compile
- URL: http://arxiv.org/abs/2411.00284v2
- Date: Tue, 05 Nov 2024 18:36:14 GMT
- Title: SimpleFSDP: Simpler Fully Sharded Data Parallel with torch.compile
- Authors: Ruisi Zhang, Tianyu Liu, Will Feng, Andrew Gu, Sanket Purandare, Wanchao Liang, Francisco Massa,
- Abstract summary: SimpleFSDP is a PyTorch-native compiler-based Fully Sharded Data Parallel (FSDP) framework.
It has a simple implementation for maintenance and computationsability, allows full compo-communication graph tracing, and brings performance enhancement via compiler backend optimizations.
It also features the first-of-its-kind intermediate representation (IR) nodes bucketing and reordering in the TorchInductor backend for effective computation-communication overlapping.
- Score: 7.544642148576768
- License:
- Abstract: Distributed training of large models consumes enormous computation resources and requires substantial engineering efforts to compose various training techniques. This paper presents SimpleFSDP, a PyTorch-native compiler-based Fully Sharded Data Parallel (FSDP) framework, which has a simple implementation for maintenance and composability, allows full computation-communication graph tracing, and brings performance enhancement via compiler backend optimizations. SimpleFSDP's novelty lies in its unique $torch.compile$-friendly implementation of collective communications using existing PyTorch primitives, namely parametrizations, selective activation checkpointing, and DTensor. It also features the first-of-its-kind intermediate representation (IR) nodes bucketing and reordering in the TorchInductor backend for effective computation-communication overlapping. As a result, users can employ the aforementioned optimizations to automatically or manually wrap model components for minimal communication exposure. Extensive evaluations of SimpleFSDP on Llama 3 models (including the ultra-large 405B) using TorchTitan demonstrate up to 28.54% memory reduction and 68.67% throughput improvement compared to the most widely adopted FSDP2 eager framework, when composed with other distributed training techniques.
Related papers
- EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE.
Our results demonstrate an average 21% improvement in prefill throughput over existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z) - Enabling Efficient On-Device Fine-Tuning of LLMs Using Only Inference Engines [17.539008562641303]
Large Language Models (LLMs) are currently pre-trained and fine-tuned on large cloud servers.
Next frontier is LLM personalization, where a foundation model can be fine-tuned with user/task-specific data.
Fine-tuning on resource-constrained edge devices presents significant challenges due to substantial memory and computational demands.
arXiv Detail & Related papers (2024-09-23T20:14:09Z) - SpaFL: Communication-Efficient Federated Learning with Sparse Models and Low computational Overhead [75.87007729801304]
SpaFL: a communication-efficient FL framework is proposed to optimize sparse model structures with low computational overhead.
Experiments show that SpaFL improves accuracy while requiring much less communication and computing resources compared to sparse baselines.
arXiv Detail & Related papers (2024-06-01T13:10:35Z) - Transforming Image Super-Resolution: A ConvFormer-based Efficient Approach [58.57026686186709]
We introduce the Convolutional Transformer layer (ConvFormer) and propose a ConvFormer-based Super-Resolution network (CFSR)
CFSR inherits the advantages of both convolution-based and transformer-based approaches.
Experiments demonstrate that CFSR strikes an optimal balance between computational cost and performance.
arXiv Detail & Related papers (2024-01-11T03:08:00Z) - RTP: Rethinking Tensor Parallelism with Memory Deduplication [3.036340414461332]
Rotated Parallelism (RTP) is an innovative approach that focuses on memory deduplication in distributed training environments.
Our empirical evaluations underscore RTP's efficiency, revealing that its memory consumption during distributed system training is remarkably close to the optimal.
arXiv Detail & Related papers (2023-11-02T23:12:42Z) - UniPT: Universal Parallel Tuning for Transfer Learning with Efficient
Parameter and Memory [69.33445217944029]
PETL is an effective strategy for adapting pre-trained models to downstream domains.
Recent PETL works focus on the more valuable memory-efficient characteristic.
We propose a new memory-efficient PETL strategy, Universal Parallel Tuning (UniPT)
arXiv Detail & Related papers (2023-08-28T05:38:43Z) - ParaFormer: Parallel Attention Transformer for Efficient Feature
Matching [8.552303361149612]
This paper proposes a novel parallel attention model entitled ParaFormer.
It fuses features and keypoint positions through the concept of amplitude and phase, and integrates self- and cross-attention in a parallel manner.
Experiments on various applications, including homography estimation, pose estimation, and image matching, demonstrate that ParaFormer achieves state-of-the-art performance.
The efficient ParaFormer-U variant achieves comparable performance with less than 50% FLOPs of the existing attention-based models.
arXiv Detail & Related papers (2023-03-02T03:29:16Z) - Training Recommender Systems at Scale: Communication-Efficient Model and
Data Parallelism [56.78673028601739]
We propose a compression framework called Dynamic Communication Thresholding (DCT) for communication-efficient hybrid training.
DCT reduces communication by at least $100times$ and $20times$ during DP and MP, respectively.
It improves end-to-end training time for a state-of-the-art industrial recommender model by 37%, without any loss in performance.
arXiv Detail & Related papers (2020-10-18T01:44:42Z) - PyTorch Distributed: Experiences on Accelerating Data Parallel Training [11.393654219774444]
PyTorch is a widely-adopted scientific computing package used in deep learning research and applications.
This paper presents the design, implementation, and evaluation of the PyTorch distributed data parallel module.
arXiv Detail & Related papers (2020-06-28T20:39:45Z) - Einsum Networks: Fast and Scalable Learning of Tractable Probabilistic
Circuits [99.59941892183454]
We propose Einsum Networks (EiNets), a novel implementation design for PCs.
At their core, EiNets combine a large number of arithmetic operations in a single monolithic einsum-operation.
We show that the implementation of Expectation-Maximization (EM) can be simplified for PCs, by leveraging automatic differentiation.
arXiv Detail & Related papers (2020-04-13T23:09:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.