PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
- URL: http://arxiv.org/abs/2304.11277v2
- Date: Tue, 12 Sep 2023 16:28:00 GMT
- Title: PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
- Authors: Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min
Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison,
Can Balioglu, Pritam Damania, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, Ajit
Mathews and Shen Li
- Abstract summary: We introduce PyTorch Fully Sharded Data Parallel (FSDP) as an industry-grade solution for large model training.
FSDP provides support for significantly larger models with near-linear scalability in terms of TFLOPS.
- Score: 19.24542340170026
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: It is widely acknowledged that large models have the potential to deliver
superior performance across a broad range of domains. Despite the remarkable
progress made in the field of machine learning systems research, which has
enabled the development and exploration of large models, such abilities remain
confined to a small group of advanced users and industry leaders, resulting in
an implicit technical barrier for the wider community to access and leverage
these technologies. In this paper, we introduce PyTorch Fully Sharded Data
Parallel (FSDP) as an industry-grade solution for large model training. FSDP
has been closely co-designed with several key PyTorch core components including
Tensor implementation, dispatcher system, and CUDA memory caching allocator, to
provide non-intrusive user experiences and high training efficiency.
Additionally, FSDP natively incorporates a range of techniques and settings to
optimize resource utilization across a variety of hardware configurations. The
experimental results demonstrate that FSDP is capable of achieving comparable
performance to Distributed Data Parallel while providing support for
significantly larger models with near-linear scalability in terms of TFLOPS.
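To make the user-facing workflow concrete, below is a minimal sketch of wrapping a model with the public torch.distributed.fsdp API. The model, dimensions, launch setup, and training step are illustrative assumptions for this summary, not details taken from the paper.

```python
# Minimal usage sketch (illustrative assumptions, not from the paper):
# wrapping a small Transformer with FullyShardedDataParallel.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy


def main():
    # One process per GPU, typically launched with torchrun.
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    layer = torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16)
    model = torch.nn.TransformerEncoder(layer, num_layers=12).cuda()

    # FULL_SHARD shards parameters, gradients, and optimizer state across
    # ranks, gathering full parameters only around each forward/backward pass.
    model = FSDP(model, sharding_strategy=ShardingStrategy.FULL_SHARD)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # One illustrative training step on random data.
    batch = torch.randn(128, 8, 1024, device="cuda")  # (seq_len, batch, d_model)
    loss = model(batch).sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

A run of this kind would typically be launched with torchrun (one process per GPU); further options discussed in the paper, such as auto-wrapping policies, mixed precision, and CPU offload, are exposed through additional FSDP constructor arguments.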
Related papers
- SimpleFSDP: Simpler Fully Sharded Data Parallel with torch.compile [7.544642148576768]
SimpleFSDP is a PyTorch-native compiler-based Fully Sharded Data Parallel (FSDP) framework.
It has a simple implementation for maintenance and composability, allows full computation-communication graph tracing, and brings performance enhancements via compiler backend optimizations.
It also features the first-of-its-kind intermediate representation (IR) nodes bucketing and reordering in the TorchInductor backend for effective computation-communication overlapping.
arXiv Detail & Related papers (2024-11-01T00:43:54Z)
- Enabling High Data Throughput Reinforcement Learning on GPUs: A Domain Agnostic Framework for Data-Driven Scientific Research [90.91438597133211]
We introduce WarpSci, a framework designed to overcome crucial system bottlenecks in the application of reinforcement learning.
We eliminate the need for data transfer between the CPU and GPU, enabling the concurrent execution of thousands of simulations.
arXiv Detail & Related papers (2024-08-01T21:38:09Z)
- fVDB: A Deep-Learning Framework for Sparse, Large-Scale, and High-Performance Spatial Intelligence [50.417261057533786]
fVDB is a novel framework for deep learning on large-scale 3D data.
Our framework is fully integrated with PyTorch enabling interoperability with existing pipelines.
arXiv Detail & Related papers (2024-07-01T20:20:33Z)
- DPOT: Auto-Regressive Denoising Operator Transformer for Large-Scale PDE Pre-Training [87.90342423839876]
We present a new auto-regressive denoising pre-training strategy, which allows for more stable and efficient pre-training on PDE data.
We train our PDE foundation model with up to 0.5B parameters on 10+ PDE datasets with more than 100k trajectories.
arXiv Detail & Related papers (2024-03-06T08:38:34Z)
- SLoRA: Federated Parameter Efficient Fine-Tuning of Language Models [28.764782216513037]
Federated Learning (FL) can benefit from the distributed and private data of edge clients for fine-tuning.
We propose a method called SLoRA, which overcomes the key limitations of LoRA in highly heterogeneous data scenarios.
Our experimental results demonstrate that SLoRA achieves performance comparable to full fine-tuning.
arXiv Detail & Related papers (2023-08-12T10:33:57Z)
- Pointerformer: Deep Reinforced Multi-Pointer Transformer for the Traveling Salesman Problem [67.32731657297377]
The Traveling Salesman Problem (TSP) is a classic routing optimization problem that originally arose in the domain of transportation and logistics.
Recently, Deep Reinforcement Learning has been increasingly employed to solve TSP due to its high inference efficiency.
We propose a novel end-to-end DRL approach, referred to as Pointerformer, based on multi-pointer Transformer.
arXiv Detail & Related papers (2023-04-19T03:48:32Z)
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization that enables maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
- Hardware-Efficient Deconvolution-Based GAN for Edge Computing [1.5229257192293197]
Generative Adversarial Networks (GANs) are cutting-edge algorithms for generating new data samples based on the learned data distribution.
We propose an HW/SW co-design approach for training a quantized deconvolution GAN (QDCGAN) implemented on an FPGA using a scalable streaming dataflow architecture.
Various precisions, datasets, and network scalability were analyzed for low-power inference on resource-constrained platforms.
arXiv Detail & Related papers (2022-01-18T11:16:59Z)
- SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
Basic cross-platform tensor frameworks and script language engines alone do not supply the procedures and pipelines needed to deploy machine learning capabilities in real production-grade systems.
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor framework and script language engines.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
- High-performance, Distributed Training of Large-scale Deep Learning Recommendation Models [18.63017668881868]
Deep learning recommendation models (DLRMs) are used across many business-critical services at Facebook.
In this paper we discuss the SW/HW co-designed solution for high-performance distributed training of large-scale DLRMs.
We demonstrate the capability to train very large DLRMs with up to 12 trillion parameters and show that we can attain a 40x speedup in time to solution over previous systems.
arXiv Detail & Related papers (2021-04-12T02:15:55Z)
- Towards a Scalable and Distributed Infrastructure for Deep Learning Applications [4.4979162962108905]
We present Phylanx, which has the potential to alleviate shortcomings in distributed deep learning frameworks.
Phylanx offers a productivity-oriented execution tree that can be executed on multiple nodes.
arXiv Detail & Related papers (2020-10-06T20:38:47Z)