A Vertex Cut based Framework for Load Balancing and Parallelism
Optimization in Multi-core Systems
- URL: http://arxiv.org/abs/2010.04414v1
- Date: Fri, 9 Oct 2020 07:54:28 GMT
- Title: A Vertex Cut based Framework for Load Balancing and Parallelism
Optimization in Multi-core Systems
- Authors: Guixiang Ma, Yao Xiao, Theodore L. Willke, Nesreen K. Ahmed, Shahin
Nazarian, Paul Bogdan
- Abstract summary: High-level applications, such as machine learning, are evolving from simple models based on multilayer perceptrons for simple image recognition to much deeper and more complex neural networks for self-driving vehicle control systems.
Parallel programs running on high-performance computers often suffer from data communication bottlenecks, limited memory bandwidth, and synchronization overhead due to irregular critical sections.
We propose a framework to reduce the data communication and improve the scalability and performance of these applications in multi-core systems.
- Score: 15.913119724815733
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: High-level applications, such as machine learning, are evolving from simple
models based on multilayer perceptrons for simple image recognition to much
deeper and more complex neural networks for self-driving vehicle control
systems. The rapid increase in the consumption of memory and computational
resources by these models demands the use of multi-core parallel systems to
scale the execution of the complex emerging applications that depend on them.
However, parallel programs running on high-performance computers often suffer
from data communication bottlenecks, limited memory bandwidth, and
synchronization overhead due to irregular critical sections. In this paper, we
propose a framework to reduce the data communication and improve the
scalability and performance of these applications in multi-core systems. We
design a vertex cut framework for partitioning LLVM IR graphs into clusters
while taking into consideration the data communication and workload balance
among clusters. First, we construct LLVM graphs by compiling high-level
programs into LLVM IR, instrumenting code to obtain the execution order of
basic blocks and the execution time for each memory operation, and analyzing data
dependencies in dynamic LLVM traces. Next, we formulate the problem as Weight
Balanced $p$-way Vertex Cut, and propose a generic and flexible framework,
wherein four different greedy algorithms are proposed for solving this problem.
Lastly, we propose a memory-centric run-time mapping algorithm of linear time
complexity to map clusters generated from the vertex cut algorithms onto a
multi-core platform. We conclude that our best algorithm, WB-Libra, provides
performance improvements of 1.56x and 1.86x over existing state-of-the-art
approaches for 8 and 1024 clusters running on a multi-core platform,
respectively.
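The abstract does not spell out the four greedy algorithms, so the following is a minimal, hedged sketch of one plausible greedy heuristic for the Weight Balanced $p$-way Vertex Cut: edges of the dependency graph are placed into $p$ clusters, a vertex whose incident edges land in several clusters is replicated (cut), and a soft cap of roughly $(1+\epsilon) \cdot W/p$ on per-cluster edge weight enforces balance. The function name, scoring rule, and balance slack below are illustrative assumptions, not the paper's WB-Libra algorithm.

```python
# Hedged sketch (not the paper's WB-Libra): greedy weight-balanced p-way vertex cut.
# Edges of a basic-block dependency graph are assigned to p clusters; a vertex whose
# incident edges land in several clusters is replicated ("cut"), which models the
# data communication that the framework tries to minimize.
from collections import defaultdict


def greedy_vertex_cut(edges, p, balance_slack=1.05):
    """edges: list of (u, v, weight); p: number of clusters; returns an edge assignment."""
    edges = list(edges)
    load = [0.0] * p                      # accumulated edge weight per cluster
    replicas = defaultdict(set)           # vertex -> clusters holding a replica of it
    assignment = {}                       # (u, v) -> cluster id
    cap = balance_slack * sum(w for _, _, w in edges) / p   # soft balance cap

    for u, v, w in sorted(edges, key=lambda e: -e[2]):       # heaviest edges first
        common = replicas[u] & replicas[v]
        candidates = common or (replicas[u] | replicas[v]) or set(range(p))
        # prefer clusters that keep the weight balance, if any are available
        feasible = {c for c in candidates if load[c] + w <= cap} or candidates
        c = min(feasible, key=lambda k: load[k])              # least-loaded feasible cluster
        assignment[(u, v)] = c
        load[c] += w
        replicas[u].add(c)
        replicas[v].add(c)

    # replication factor: average number of clusters each vertex appears in
    replication = sum(len(s) for s in replicas.values()) / max(len(replicas), 1)
    return assignment, load, replication


if __name__ == "__main__":
    # toy dependency graph: (producer block, consumer block, bytes communicated)
    demo = [("a", "b", 8), ("b", "c", 4), ("a", "c", 2), ("c", "d", 6)]
    print(greedy_vertex_cut(demo, p=2))
```

Placing an edge in a cluster that already holds a replica of one of its endpoints keeps the replication factor, and hence cross-cluster communication, low, while the weight cap keeps the clusters balanced.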
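The memory-centric run-time mapping is likewise only described at a high level. The sketch below assumes a single linear pass that places each cluster on the core whose already-mapped clusters occupy the least memory; the function name and per-cluster footprints are hypothetical.

```python
# Hedged sketch of a linear-time, memory-centric mapping of clusters onto cores.
# Assumption: each cluster's memory footprint is known from the instrumented trace.
def map_clusters_to_cores(cluster_mem_bytes, num_cores):
    """cluster_mem_bytes: one footprint per cluster; returns (cluster -> core, core loads)."""
    core_mem = [0] * num_cores            # bytes already mapped to each core
    mapping = []                          # mapping[i] = core chosen for cluster i
    for mem in cluster_mem_bytes:
        core = min(range(num_cores), key=lambda c: core_mem[c])   # least-loaded core
        mapping.append(core)
        core_mem[core] += mem
    return mapping, core_mem


if __name__ == "__main__":
    footprints = [512, 128, 256, 64, 1024]   # hypothetical per-cluster footprints in bytes
    print(map_clusters_to_cores(footprints, num_cores=2))
```

With a fixed core count the loop does constant work per cluster, consistent with the linear time complexity stated in the abstract.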
Related papers
- Stochastic Communication Avoidance for Recommendation Systems [27.616664288148232]
We propose a theoretical framework that analyses the communication costs of arbitrary distributed systems that use lookup tables.
We use this framework to propose algorithms that maximize throughput subject to memory, computation, and communication constraints.
We implement our framework and algorithms in PyTorch and achieve up to 6x increases in training throughput on GPU systems over baselines.
arXiv Detail & Related papers (2024-11-03T15:37:37Z)
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- Support Vector Machine Implementation on MPI-CUDA and Tensorflow Framework [0.0]
Support Vector Machine (SVM) algorithm requires a high computational cost to solve a complex quadratic programming (QP) optimization problem.
Parallel multi-architecture computing, available in both multi-core CPUs and highly scalable GPUs, emerges as a promising solution to enhance algorithm performance.
This paper presents a comparative study that implements the SVM algorithm on different parallel architecture frameworks.
arXiv Detail & Related papers (2023-11-25T02:52:37Z)
- Memory-aware Scheduling for Complex Wired Networks with Iterative Graph Optimization [4.614780125575351]
We propose an efficient memory-aware scheduling framework based on iterative graph optimization.
Our framework features an iterative graph fusion algorithm that simplifies the graph while preserving the scheduling optimality.
arXiv Detail & Related papers (2023-08-26T14:52:02Z)
- In Situ Framework for Coupling Simulation and Machine Learning with Application to CFD [51.04126395480625]
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamic computations.
As simulations grow, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks.
This work offers a solution by simplifying this coupling and enabling in situ training and inference on heterogeneous clusters.
arXiv Detail & Related papers (2023-06-22T14:07:54Z)
- Scalable Graph Convolutional Network Training on Distributed-Memory Systems [5.169989177779801]
Graph Convolutional Networks (GCNs) are extensively utilized for deep learning on graphs.
Since the convolution operation on graphs induces irregular memory access patterns, designing a memory- and communication-efficient parallel algorithm for GCN training poses unique challenges.
We propose a highly parallel training algorithm that scales to large processor counts.
arXiv Detail & Related papers (2022-12-09T17:51:13Z)
- Late Fusion Multi-view Clustering via Global and Local Alignment Maximization [61.89218392703043]
Multi-view clustering (MVC) optimally integrates complementary information from different views to improve clustering performance.
Most existing approaches directly fuse multiple pre-specified similarities to learn an optimal similarity matrix for clustering.
We propose late fusion MVC via alignment to address these issues.
arXiv Detail & Related papers (2022-08-02T01:49:31Z)
- NumS: Scalable Array Programming for the Cloud [82.827921577004]
We present NumS, an array programming library that optimizes NumPy-like expressions on task-based distributed systems.
This is achieved through a novel scheduler called Load Simulated Hierarchical Scheduling (LSHS).
We show that LSHS enhances performance on Ray by decreasing network load by a factor of 2x, requiring 4x less memory, and reducing execution time by 10x on the logistic regression problem.
arXiv Detail & Related papers (2022-06-28T20:13:40Z)
- Asynchronous Parallel Incremental Block-Coordinate Descent for Decentralized Machine Learning [55.198301429316125]
Machine learning (ML) is a key technique for big-data-driven modelling and analysis of massive Internet of Things (IoT) based intelligent and ubiquitous computing.
For fast-increasing applications and data amounts, distributed learning is a promising emerging paradigm since it is often impractical or inefficient to share/aggregate data.
This paper studies the problem of training an ML model over decentralized systems, where data are distributed over many user devices.
arXiv Detail & Related papers (2022-02-07T15:04:15Z)
- Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA [58.040931661693925]
We propose a strategy that combines redundant recomputing and out-of-core methods.
We achieve an average of 1.52x speedup in six different models over the state-of-the-art out-of-core methods.
Our data parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g. Megatron-LM and Turing-NLG.
arXiv Detail & Related papers (2020-08-26T07:24:34Z)
- Dataflow Aware Mapping of Convolutional Neural Networks Onto Many-Core Platforms With Network-on-Chip Interconnect [0.0764671395172401]
Machine intelligence, especially using convolutional neural networks (CNNs), has become a large area of research over the past years.
Many-core platforms consisting of several homogeneous cores can alleviate limitations with regard to physical implementation at the expense of an increased dataflow mapping effort.
This work presents an automated mapping strategy starting at the single-core level with different optimization targets for minimal runtime and minimal off-chip memory accesses.
The strategy is then extended towards a suitable many-core mapping scheme and evaluated using a scalable system-level simulation with a network-on-chip interconnect.
arXiv Detail & Related papers (2020-06-18T17:13:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.