Memory-efficient array redistribution through portable collective communication
- URL: http://arxiv.org/abs/2112.01075v1
- Date: Thu, 2 Dec 2021 09:32:07 GMT
- Title: Memory-efficient array redistribution through portable collective communication
- Authors: Norman A. Rink, Adam Paszke, Dimitrios Vytiniotis, Georg Stefan Schmid
- Abstract summary: We present a type-directed approach to synthesizing array redistributions as sequences of MPI-style collective operations.
We prove formally that our synthesized redistributions are memory-efficient and perform no excessive data transfers.
We evaluate our approach against the XLA implementation and find that it delivers a geometric mean speedup of $1.22\times$, with maximum speedups as high as $5.7\times$.
- Score: 0.4096453902709291
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern large-scale deep learning workloads highlight the need for parallel
execution across many devices in order to fit model data into hardware
accelerator memories. In these settings, array redistribution may be required
during a computation, but can also become a bottleneck if not done efficiently.
In this paper we address the problem of redistributing multi-dimensional array
data in SPMD computations, the most prevalent form of parallelism in deep
learning. We present a type-directed approach to synthesizing array
redistributions as sequences of MPI-style collective operations. We prove
formally that our synthesized redistributions are memory-efficient and perform
no excessive data transfers. Array redistribution for SPMD computations using
collective operations has also been implemented in the context of the XLA SPMD
partitioner, a production-grade tool for partitioning programs across
accelerator systems. We evaluate our approach against the XLA implementation
and find that our approach delivers a geometric mean speedup of $1.22\times$,
with maximum speedups as high as $5.7\times$, while offering provable memory
guarantees, making our system particularly appealing for large-scale models.
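To make the setting concrete, here is a minimal mpi4py sketch of one such redistribution: switching a matrix from a row-sharded to a column-sharded layout with a single all-to-all. This illustrates the MPI-style collectives the paper targets, not its type-directed synthesis algorithm; all names and sizes are illustrative.

```python
# Minimal sketch (illustrative, not the paper's synthesis algorithm):
# redistribute a matrix from row shards to column shards with one
# all-to-all, so no rank ever materializes the full array.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
P = comm.Get_size()          # number of devices/ranks
rank = comm.Get_rank()

# Global array is (P*ROWS) x (P*COLS); each rank starts with a row block.
ROWS, COLS = 2, 3
local = np.full((ROWS, P * COLS), rank, dtype=np.float64)  # this rank's row shard

# Split the local row block into P column chunks, chunk j destined for rank j.
send = np.ascontiguousarray(
    local.reshape(ROWS, P, COLS).transpose(1, 0, 2)        # (P, ROWS, COLS)
)
recv = np.empty_like(send)

# One collective moves exactly the data that changes owner: every rank
# sends ROWS*COLS elements to each peer and receives the same amount.
comm.Alltoall(send, recv)

# Chunks arrive in source-rank order, so stacking them row-wise yields
# this rank's (P*ROWS) x COLS column shard.
col_shard = recv.reshape(P * ROWS, COLS)
```

Run with, e.g., `mpirun -n 4 python redistribute.py`. The peak extra memory per rank is one send buffer plus one receive buffer of shard size; bounding this kind of temporary footprint across arbitrary resharding pairs is what the paper's formal guarantees are about.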
Related papers
- Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores [3.6385567224218556]
Large language models (LLMs) have been widely applied but face challenges in efficient inference.
We introduce a novel bipolar-INT data format that facilitates parallel computing and supports symmetric quantization.
We implement an arbitrary precision matrix multiplication scheme that decomposes and recovers at the bit level, enabling flexible precision.
arXiv Detail & Related papers (2024-09-26T14:17:58Z)
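The decompose-and-recover idea above can be sketched in plain numpy. This is a hedged illustration of bit-level matrix multiplication, not the paper's bipolar-INT encoding or its tensor-core kernels:

```python
# Toy bit-serial matmul: decompose unsigned integer matrices into bit
# planes, multiply the 1-bit planes, and recover magnitudes by shifting.
import numpy as np

def bitserial_matmul(A, B, bits=4):
    """A, B: unsigned integer matrices with entries < 2**bits."""
    acc = np.zeros((A.shape[0], B.shape[1]), dtype=np.int64)
    for i in range(bits):                     # bit planes of A
        a_plane = ((A >> i) & 1).astype(np.int64)
        for j in range(bits):                 # bit planes of B
            b_plane = ((B >> j) & 1).astype(np.int64)
            # Each plane product is a 1-bit matmul; its contribution
            # carries the combined bit weight 2**(i+j).
            acc += (a_plane @ b_plane) << (i + j)
    return acc

A = np.random.randint(0, 16, size=(8, 8), dtype=np.uint8)
B = np.random.randint(0, 16, size=(8, 8), dtype=np.uint8)
assert np.array_equal(bitserial_matmul(A, B),
                      A.astype(np.int64) @ B.astype(np.int64))
```

Since any precision is just a different number of bit planes, the same loop structure supports flexible, arbitrary precision.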
- AcceleratedLiNGAM: Learning Causal DAGs at the speed of GPUs [57.12929098407975]
We show that by efficiently parallelizing existing causal discovery methods, we can scale them to thousands of dimensions.
Specifically, we focus on the causal ordering subprocedure in DirectLiNGAM and implement GPU kernels to accelerate it.
This allows us to apply DirectLiNGAM to causal inference on large-scale gene expression data with genetic interventions, yielding competitive results.
arXiv Detail & Related papers (2024-03-06T15:06:11Z)
- Decreasing the Computing Time of Bayesian Optimization using Generalizable Memory Pruning [56.334116591082896]
Running BO on high-dimensional or massive data sets becomes intractable due to the time complexity of the surrogate model.
We present a wrapper of memory pruning and bounded optimization that can be used with any surrogate model and acquisition function.
All model implementations are run on the MIT SuperCloud, state-of-the-art computing hardware.
arXiv Detail & Related papers (2023-09-08T14:05:56Z)
- In Situ Framework for Coupling Simulation and Machine Learning with Application to CFD [51.04126395480625]
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamic computations.
As simulations grow, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks.
This work offers a solution by simplifying this coupling and enabling in situ training and inference on heterogeneous clusters.
arXiv Detail & Related papers (2023-06-22T14:07:54Z)
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization that achieves maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
- Partitioning Distributed Compute Jobs with Reinforcement Learning and Graph Neural Networks [58.720142291102135]
Large-scale machine learning models are bringing advances to a broad range of fields.
Many of these models are too large to be trained on a single machine, and must be distributed across multiple devices.
We show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate.
arXiv Detail & Related papers (2023-01-31T17:41:07Z)
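A toy cost model (entirely illustrative; not the paper's learned RL policy) shows why maximum parallelisation can be sub-optimal once synchronization and blocking costs grow with worker count:

```python
# Toy model: per-worker compute shrinks as t/w, but synchronization /
# blocking overhead grows with the number of workers w.
def completion_time(w, t=100.0, sync=2.0):
    return t / w + sync * (w - 1)

for w in (1, 2, 4, 8, 16, 32):
    print(w, completion_time(w))
# Completion time falls until moderate w, then rises again as the
# overhead term dominates: more devices is not always faster.
```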
- NumS: Scalable Array Programming for the Cloud [82.827921577004]
We present NumS, an array programming library which optimizes NumPy-like expressions on task-based distributed systems.
This is achieved through a novel scheduler called Load Simulated Hierarchical Scheduling (LSHS).
We show that LSHS enhances performance on Ray by decreasing network load by a factor of 2x, requiring 4x less memory, and reducing execution time by 10x on the logistic regression problem.
arXiv Detail & Related papers (2022-06-28T20:13:40Z)
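The scheduling idea suggests a simple greedy placement loop. The following is a toy illustration under our own assumptions; NumS's actual LSHS also simulates memory and schedules hierarchically over nodes and workers:

```python
# Toy load-simulated placement: put each block operation on the node
# whose simulated compute + network cost is lowest, favouring nodes
# that already hold the operands (less data moves over the network).
def place(op, nodes, load, locations):
    def cost(n):
        # Remote operands must be moved: simulated network cost.
        net = sum(size for blk, size in op["inputs"] if locations[blk] != n)
        return load[n] + op["flops"] + net

    best = min(nodes, key=cost)
    load[best] += op["flops"]            # update the simulated load
    locations[op["output"]] = best
    return best

nodes = ["n0", "n1"]
load = {"n0": 0.0, "n1": 0.0}
locations = {"A0": "n0", "B0": "n1"}
op = {"inputs": [("A0", 8.0), ("B0", 8.0)], "flops": 4.0, "output": "C0"}
print(place(op, nodes, load, locations))
```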
- SplitBrain: Hybrid Data and Model Parallel Deep Learning [11.63431725146897]
This paper presents SplitBrain, a high performance distributed deep learning framework supporting hybrid data and model parallelism.
Specifically, SplitBrain provides layer-specific partitioning that co-locates compute intensive convolutional layers while sharding memory demanding layers.
Results show that SplitBrain can achieve nearly linear speedup while saving up to 67% of memory consumption for data and model parallel VGG over CIFAR-10.
arXiv Detail & Related papers (2021-12-31T06:25:38Z)
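The layer-specific partitioning rule can be sketched as a plan chooser; the threshold and layer statistics below are hypothetical, not SplitBrain's actual heuristics:

```python
# Toy layer-specific partitioning: replicate compute-heavy layers
# (data parallel) and shard parameter-heavy layers (model parallel).
def choose_partitioning(layers, shard_threshold_mb=64):
    """layers: list of dicts with 'name', 'param_mb', 'flops'."""
    plan = {}
    for layer in layers:
        if layer["param_mb"] >= shard_threshold_mb:
            plan[layer["name"]] = "model_parallel"  # shard weights across devices
        else:
            plan[layer["name"]] = "data_parallel"   # replicate, split the batch
    return plan

vgg_like = [
    {"name": "conv1", "param_mb": 1,   "flops": 2e9},
    {"name": "conv5", "param_mb": 9,   "flops": 4e9},
    {"name": "fc6",   "param_mb": 392, "flops": 0.2e9},  # memory-demanding
]
print(choose_partitioning(vgg_like))
# {'conv1': 'data_parallel', 'conv5': 'data_parallel', 'fc6': 'model_parallel'}
```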
- MAFAT: Memory-Aware Fusing and Tiling of Neural Networks for Accelerated Edge Inference [1.7894377200944507]
Machine learning networks can easily exceed available memory, increasing latency due to excessive OS swapping.
We propose a memory usage predictor coupled with a search algorithm to provide optimized fusing and tiling configurations.
Results show that our approach can run in less than half the memory, with a speedup of up to 2.78x under severe memory constraints.
arXiv Detail & Related papers (2021-07-14T19:45:49Z)
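A minimal sketch of the predictor-plus-search pattern the summary describes; both cost models below are made-up placeholders, not MAFAT's fitted memory predictor:

```python
# Toy search: among fusing/tiling configurations whose predicted peak
# memory fits the budget, pick the one with the best predicted latency.
from itertools import product

def predicted_peak_memory(tiles, fuse_depth, layer_mb=256.0):
    # Placeholder model: tiling divides activation memory; fusing adds
    # a small working-set overhead per fused layer.
    return layer_mb / tiles + 8.0 * fuse_depth

def predicted_latency(tiles, fuse_depth, base_ms=100.0):
    # Placeholder model: more tiles add per-tile overhead; fusing
    # removes intermediate reads and writes.
    return base_ms * (1 + 0.05 * tiles) / (1 + 0.2 * fuse_depth)

def best_config(memory_budget_mb):
    feasible = [
        (predicted_latency(t, f), t, f)
        for t, f in product([1, 2, 4, 8], [1, 2, 3])
        if predicted_peak_memory(t, f) <= memory_budget_mb
    ]
    return min(feasible, default=None)

print(best_config(80.0))   # (latency_ms, tiles, fuse_depth) under the budget
```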
- HeAT -- a Distributed and GPU-accelerated Tensor Framework for Data Analytics [0.0]
HeAT is an array-based numerical programming framework for large-scale parallel processing with an easy-to-use NumPy-like API.
HeAT utilizes PyTorch as a node-local eager execution engine and distributes the workload on arbitrarily large high-performance computing systems via MPI.
When compared to similar frameworks, HeAT achieves speedups of up to two orders of magnitude.
arXiv Detail & Related papers (2020-07-27T13:33:17Z)
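For flavour, a minimal usage sketch of the NumPy-like API the summary describes (hedged: based on HeAT's documented interface; exact signatures may differ between versions):

```python
# Run under MPI, e.g. `mpirun -n 4 python heat_demo.py`; `split=0`
# distributes the first axis across processes, and node-local chunks
# are PyTorch tensors, so they can be placed on GPUs.
import heat as ht

x = ht.random.randn(10000, 100, split=0)   # row-distributed array
mean = ht.mean(x, axis=0)                  # collective reduction
print(mean.shape)
```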
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented (including all content) and is not responsible for any consequences of its use.