Synthesizing Optimal Parallelism Placement and Reduction Strategies on
Hierarchical Systems for Deep Learning
- URL: http://arxiv.org/abs/2110.10548v1
- Date: Wed, 20 Oct 2021 13:05:49 GMT
- Title: Synthesizing Optimal Parallelism Placement and Reduction Strategies on
Hierarchical Systems for Deep Learning
- Authors: Ningning Xie, Tamara Norman, Dominik Grewe, Dimitrios Vytiniotis
- Abstract summary: We present a novel characterization of the mapping of multiple parallelism forms onto hierarchical accelerator systems.
We experimentally verify the substantial effect of these mappings on all-reduce performance (up to 448x).
We offer a novel syntax-guided program synthesis framework that is able to decompose reductions over one or more parallelism axes to sequences of collectives.
- Score: 0.3345437353879254
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a novel characterization of the mapping of multiple parallelism
forms (e.g. data and model parallelism) onto hierarchical accelerator systems
that is hierarchy-aware and greatly reduces the space of software-to-hardware
mapping. We experimentally verify the substantial effect of these mappings on
all-reduce performance (up to 448x). We offer a novel syntax-guided program
synthesis framework that is able to decompose reductions over one or more
parallelism axes to sequences of collectives in a hierarchy- and mapping-aware
way. For 69% of parallelism placements and user-requested reductions, our
framework synthesizes programs that outperform the default all-reduce
implementation when evaluated on different GPU hierarchies (max 2.04x, average
1.27x). We complement our synthesis tool with a simulator exceeding 90% top-10
accuracy, which therefore reduces the need for massive evaluations of synthesis
results to determine a small set of optimal programs and mappings.
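To make the decomposition idea concrete, the following is a minimal numpy sketch, not the paper's synthesis framework or one of its generated programs: a single flat all-reduce over a nodes x gpus_per_node device grid is rewritten as a reduce-scatter within each node, an all-reduce of each shard across nodes, and an all-gather within each node, and the result is checked against the flat reduction. The hierarchy sizes and variable names are assumptions for illustration.

# Minimal sketch (assumed hierarchy, not the paper's framework): decompose one
# flat all-reduce over a two-level hierarchy into per-level collectives.
import numpy as np

nodes, gpus_per_node, dim = 4, 8, 1024            # hypothetical hierarchy sizes
rng = np.random.default_rng(0)
# data[n, g] is the vector held by GPU g on node n
data = rng.standard_normal((nodes, gpus_per_node, dim))

# Reference result: a flat all-reduce over all devices.
expected = data.sum(axis=(0, 1))

# (1) Reduce-scatter within each node: GPU s of node n keeps shard s of the node sum.
shards = data.reshape(nodes, gpus_per_node, gpus_per_node, -1).sum(axis=1)

# (2) All-reduce each shard across nodes: every node's GPU s now holds the global shard s.
shards = np.broadcast_to(shards.sum(axis=0), shards.shape)

# (3) All-gather within each node to reassemble the fully reduced vector on every GPU.
result = shards.reshape(nodes, -1)

assert np.allclose(result[0], expected)
print("hierarchical decomposition matches the flat all-reduce")

On real hardware each of the three phases would run over a different interconnect level (e.g. fast intra-node links versus slower inter-node links), which is why the choice of software-to-hardware mapping has such a large effect on all-reduce performance.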
Related papers
- Theoretical Guarantees for LT-TTD: A Unified Transformer-based Architecture for Two-Level Ranking Systems [0.0]
LT-TTD (Listwise Transformer with Two-Tower Distillation) is a novel unified architecture that bridges retrieval and ranking phases. We show that LT-TTD reduces the upper limit on irretrievable relevant items by a factor that depends on the knowledge distillation strength. We also introduce UPQE, a novel evaluation metric specifically designed for unified ranking architectures.
arXiv Detail & Related papers (2025-05-07T14:01:22Z) - Automatic Operator-level Parallelism Planning for Distributed Deep Learning -- A Mixed-Integer Programming Approach [6.449961842220686]
We propose a bi-level solution framework balancing optimality with computational efficiency.
Our framework achieves comparable or superior performance, reducing computational bubbles by half under the same memory constraints.
Such capabilities position our solution as both a valuable research tool for exploring optimal parallelization strategies and a practical industrial solution for large-scale AI deployment.
arXiv Detail & Related papers (2025-03-12T13:00:29Z) - A hierarchy of eigencomputations for polynomial optimization on the sphere [0.0]
We introduce a convergent hierarchy of lower bounds on the minimum value of a real form over the unit sphere.
The main practical advantage of our hierarchy over the real sum-of-squares hierarchy is that the lower bound at each level is obtained by a minimum eigenvalue.
arXiv Detail & Related papers (2023-10-27T00:28:12Z) - Make Deep Networks Shallow Again [6.647569337929869]
A breakthrough has been achieved by the concept of residual connections.
A stack of residual connection layers can be expressed as an expansion of terms similar to a Taylor expansion; a worked version of this expansion is sketched after this list.
In other words, a sequential deep architecture is substituted by a parallel shallow one.
arXiv Detail & Related papers (2023-09-15T14:18:21Z) - ExeDec: Execution Decomposition for Compositional Generalization in Neural Program Synthesis [54.18659323181771]
We characterize several different forms of compositional generalization that are desirable in program synthesis.
We propose ExeDec, a novel decomposition-based strategy that predicts execution subgoals to solve problems step-by-step informed by program execution at each step.
arXiv Detail & Related papers (2023-07-26T01:07:52Z) - Parallel Tree Kernel Computation [0.0]
We devise a parallel implementation of the sequential algorithm for the computation of some tree kernels of two finite sets of trees.
Results show that the proposed parallel algorithm outperforms the sequential one in terms of latency.
arXiv Detail & Related papers (2023-05-12T18:16:45Z) - Performance Embeddings: A Similarity-based Approach to Automatic
Performance Optimization [71.69092462147292]
Performance embeddings enable knowledge transfer of performance tuning between applications.
We demonstrate this transfer tuning approach on case studies in deep neural networks, dense and sparse linear algebra compositions, and numerical weather prediction stencils.
arXiv Detail & Related papers (2023-03-14T15:51:35Z) - Auto-Parallelizing Large Models with Rhino: A Systematic Approach on
Production AI Platform [15.606647290942563]
Rhino is a system for accelerating tensor programs with automatic parallelization on an AI platform for real production environments.
It transforms a tensor program written for a single device into an equivalent distributed program that is capable of scaling up to thousands of devices with no user configuration.
arXiv Detail & Related papers (2023-02-16T08:19:56Z) - Multi-Task Off-Policy Learning from Bandit Feedback [54.96011624223482]
We propose a hierarchical off-policy optimization algorithm (HierOPO), which estimates the parameters of the hierarchical model and then acts pessimistically with respect to them.
We prove per-task bounds on the suboptimality of the learned policies, which show a clear improvement over not using the hierarchical model.
Our theoretical and empirical results show a clear advantage of using the hierarchy over solving each task independently.
arXiv Detail & Related papers (2022-12-09T08:26:27Z) - Condensing Graphs via One-Step Gradient Matching [50.07587238142548]
We propose a one-step gradient matching scheme, which performs gradient matching for only one single step without training the network weights.
Our theoretical analysis shows this strategy can generate synthetic graphs that lead to lower classification loss on real graphs.
In particular, we are able to reduce the dataset size by 90% while approximating up to 98% of the original performance.
arXiv Detail & Related papers (2022-06-15T18:20:01Z) - Matching Pursuit Based Scheduling for Over-the-Air Federated Learning [67.59503935237676]
This paper develops a class of low-complexity device scheduling algorithms for over-the-air federated learning via the method of matching pursuit.
Compared to the state-of-the-art scheme, the proposed algorithms have drastically lower computational complexity.
The efficiency of the proposed scheme is confirmed via experiments on the CIFAR dataset.
arXiv Detail & Related papers (2022-06-14T08:14:14Z) - Pruning-as-Search: Efficient Neural Architecture Search via Channel
Pruning and Structural Reparameterization [50.50023451369742]
Pruning-as-Search (PaS) is an end-to-end channel pruning method to search out desired sub-network automatically and efficiently.
Our proposed architecture outperforms prior arts by around 1.0% top-1 accuracy on the ImageNet-1000 classification task.
arXiv Detail & Related papers (2022-06-02T17:58:54Z) - Restructuring, Pruning, and Adjustment of Deep Models for Parallel
Distributed Inference [15.720414948573753]
We consider the parallel implementation of an already-trained deep model on multiple processing nodes (a.k.a. workers).
We propose RePurpose, a layer-wise model restructuring and pruning technique that guarantees the performance of the overall parallelized model.
We show that, compared to the existing methods, RePurpose significantly improves the efficiency of the distributed inference via parallel implementation.
arXiv Detail & Related papers (2020-08-19T06:44:41Z) - BUSTLE: Bottom-Up Program Synthesis Through Learning-Guided Exploration [72.88493072196094]
We present a new synthesis approach that leverages learning to guide a bottom-up search over programs.
In particular, we train a model to prioritize compositions of intermediate values during search conditioned on a set of input-output examples.
We show that the combination of learning and bottom-up search is remarkably effective, even with simple supervised learning approaches.
arXiv Detail & Related papers (2020-07-28T17:46:18Z)
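As noted in the "Make Deep Networks Shallow Again" entry above, the expansion can be made concrete. This is a sketch of the general argument, not that paper's exact derivation: write each residual block as $x \mapsto x + F_i(x)$; unrolling two stacked blocks and treating $F_2$ to first order (the Taylor-expansion step) gives

$(I + F_2) \circ (I + F_1)(x) = x + F_1(x) + F_2\bigl(x + F_1(x)\bigr) \approx x + F_1(x) + F_2(x) + F_2(F_1(x)),$

so an $L$-block residual stack behaves approximately like a sum of $2^L$ parallel paths, most of them short, which is the sense in which a sequential deep architecture is traded for a parallel shallow one.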