A Vertex Cut based Framework for Load Balancing and Parallelism
Optimization in Multi-core Systems
- URL: http://arxiv.org/abs/2010.04414v1
- Date: Fri, 9 Oct 2020 07:54:28 GMT
- Title: A Vertex Cut based Framework for Load Balancing and Parallelism
Optimization in Multi-core Systems
- Authors: Guixiang Ma, Yao Xiao, Theodore L. Willke, Nesreen K. Ahmed, Shahin
Nazarian, Paul Bogdan
- Abstract summary: High-level applications, such as machine learning, are evolving from simple models based on multilayer perceptrons for simple image recognition to much deeper and more complex neural networks for self-driving vehicle control systems.
Parallel programs running on high-performance computers often suffer from data communication bottlenecks, limited memory bandwidth, and synchronization overhead due to irregular critical sections.
We propose a framework to reduce the data communication and improve the scalability and performance of these applications in multi-core systems.
- Score: 15.913119724815733
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: High-level applications, such as machine learning, are evolving from simple
models based on multilayer perceptrons for simple image recognition to much
deeper and more complex neural networks for self-driving vehicle control
systems. The rapid increase in the consumption of memory and computational
resources by these models demands the use of multi-core parallel systems to
scale the execution of the complex emerging applications that depend on them.
However, parallel programs running on high-performance computers often suffer
from data communication bottlenecks, limited memory bandwidth, and
synchronization overhead due to irregular critical sections. In this paper, we
propose a framework to reduce the data communication and improve the
scalability and performance of these applications in multi-core systems. We
design a vertex cut framework for partitioning LLVM IR graphs into clusters
while taking into consideration the data communication and workload balance
among clusters. First, we construct LLVM graphs by compiling high-level
programs into LLVM IR, instrumenting code to obtain the execution order of
basic blocks and the execution time for each memory operation, and analyzing data
dependencies in dynamic LLVM traces. Next, we formulate the problem as Weight
Balanced $p$-way Vertex Cut, and propose a generic and flexible framework,
wherein four different greedy algorithms are proposed for solving this problem.
Lastly, we propose a memory-centric run-time mapping algorithm of linear time
complexity to map clusters generated from the vertex cut algorithms onto a
multi-core platform. We conclude that our best algorithm, WB-Libra, provides
performance improvements of 1.56x and 1.86x over existing state-of-the-art
approaches for 8 and 1024 clusters running on a multi-core platform,
respectively.
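The abstract does not spell out the four greedy algorithms, so the following is a minimal, hedged sketch of one plausible greedy heuristic for the Weight Balanced $p$-way Vertex Cut: edges of the dependency graph are placed into $p$ clusters, a vertex whose incident edges land in several clusters is replicated (cut), and a soft cap of roughly $(1+\epsilon) \cdot W/p$ on per-cluster edge weight enforces balance. The function name, scoring rule, and balance slack below are illustrative assumptions, not the paper's WB-Libra algorithm.

```python
# Hedged sketch (not the paper's WB-Libra): greedy weight-balanced p-way vertex cut.
# Edges of a basic-block dependency graph are assigned to p clusters; a vertex whose
# incident edges land in several clusters is replicated ("cut"), which models the
# data communication that the framework tries to minimize.
from collections import defaultdict


def greedy_vertex_cut(edges, p, balance_slack=1.05):
    """edges: list of (u, v, weight); p: number of clusters; returns an edge assignment."""
    edges = list(edges)
    load = [0.0] * p                      # accumulated edge weight per cluster
    replicas = defaultdict(set)           # vertex -> clusters holding a replica of it
    assignment = {}                       # (u, v) -> cluster id
    cap = balance_slack * sum(w for _, _, w in edges) / p   # soft balance cap

    for u, v, w in sorted(edges, key=lambda e: -e[2]):       # heaviest edges first
        common = replicas[u] & replicas[v]
        candidates = common or (replicas[u] | replicas[v]) or set(range(p))
        # prefer clusters that keep the weight balance, if any are available
        feasible = {c for c in candidates if load[c] + w <= cap} or candidates
        c = min(feasible, key=lambda k: load[k])              # least-loaded feasible cluster
        assignment[(u, v)] = c
        load[c] += w
        replicas[u].add(c)
        replicas[v].add(c)

    # replication factor: average number of clusters each vertex appears in
    replication = sum(len(s) for s in replicas.values()) / max(len(replicas), 1)
    return assignment, load, replication


if __name__ == "__main__":
    # toy dependency graph: (producer block, consumer block, bytes communicated)
    demo = [("a", "b", 8), ("b", "c", 4), ("a", "c", 2), ("c", "d", 6)]
    print(greedy_vertex_cut(demo, p=2))
```

Placing an edge in a cluster that already holds a replica of one of its endpoints keeps the replication factor, and hence cross-cluster communication, low, while the weight cap keeps the clusters balanced.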
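The memory-centric run-time mapping is likewise only described at a high level. The sketch below assumes a single linear pass that places each cluster on the core whose already-mapped clusters occupy the least memory; the function name and per-cluster footprints are hypothetical.

```python
# Hedged sketch of a linear-time, memory-centric mapping of clusters onto cores.
# Assumption: each cluster's memory footprint is known from the instrumented trace.
def map_clusters_to_cores(cluster_mem_bytes, num_cores):
    """cluster_mem_bytes: one footprint per cluster; returns (cluster -> core, core loads)."""
    core_mem = [0] * num_cores            # bytes already mapped to each core
    mapping = []                          # mapping[i] = core chosen for cluster i
    for mem in cluster_mem_bytes:
        core = min(range(num_cores), key=lambda c: core_mem[c])   # least-loaded core
        mapping.append(core)
        core_mem[core] += mem
    return mapping, core_mem


if __name__ == "__main__":
    footprints = [512, 128, 256, 64, 1024]   # hypothetical per-cluster footprints in bytes
    print(map_clusters_to_cores(footprints, num_cores=2))
```

With a fixed core count the loop does constant work per cluster, consistent with the linear time complexity stated in the abstract.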
Related papers
- Stochastic Communication Avoidance for Recommendation Systems [27.616664288148232]
We propose a theoretical framework that analyses the communication costs of arbitrary distributed systems that use lookup tables.
We use this framework to propose algorithms that maximize throughput subject to memory, computation, and communication constraints.
We implement our framework and algorithms in PyTorch and achieve up to 6x increases in training throughput on GPU systems over baselines.
arXiv Detail & Related papers (2024-11-03T15:37:37Z)
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- Support Vector Machine Implementation on MPI-CUDA and Tensorflow Framework [0.0]
Support Vector Machine (SVM) algorithm requires a high computational cost to solve a complex quadratic programming (QP) optimization problem.
Parallel multi-architecture computing, available in both multi-core CPUs and highly scalable GPUs, emerges as a promising solution to enhance algorithm performance.
This paper presents a comparative study that implements the SVM algorithm on different parallel architecture frameworks.
arXiv Detail & Related papers (2023-11-25T02:52:37Z)
- Memory-aware Scheduling for Complex Wired Networks with Iterative Graph Optimization [4.614780125575351]
We propose an efficient memory-aware scheduling framework based on iterative graph optimization.
Our framework features an iterative graph fusion algorithm that simplifies the graph while preserving the scheduling optimality.
arXiv Detail & Related papers (2023-08-26T14:52:02Z)
- In Situ Framework for Coupling Simulation and Machine Learning with Application to CFD [51.04126395480625]
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamic computations.
As simulations grow, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks.
This work offers a solution by simplifying this coupling and enabling in situ training and inference on heterogeneous clusters.
arXiv Detail & Related papers (2023-06-22T14:07:54Z)
- Scalable Graph Convolutional Network Training on Distributed-Memory Systems [5.169989177779801]
Graph Convolutional Networks (GCNs) are extensively utilized for deep learning on graphs.
Since the convolution operation on graphs induces irregular memory access patterns, designing a memory- and communication-efficient parallel algorithm for GCN training poses unique challenges.
We propose a highly parallel training algorithm that scales to large processor counts.
arXiv Detail & Related papers (2022-12-09T17:51:13Z)
- Late Fusion Multi-view Clustering via Global and Local Alignment Maximization [61.89218392703043]
Multi-view clustering (MVC) optimally integrates complementary information from different views to improve clustering performance.
Most existing approaches directly fuse multiple pre-specified similarities to learn an optimal similarity matrix for clustering.
We propose late fusion MVC via alignment to address these issues.
arXiv Detail & Related papers (2022-08-02T01:49:31Z)
- NumS: Scalable Array Programming for the Cloud [82.827921577004]
We present NumS, an array programming library that optimizes NumPy-like expressions on task-based distributed systems.
This is achieved through a novel scheduler called Load Simulated Hierarchical Scheduling (LSHS).
We show that LSHS enhances performance on Ray by decreasing network load by a factor of 2x, requiring 4x less memory, and reducing execution time by 10x on the logistic regression problem.
arXiv Detail & Related papers (2022-06-28T20:13:40Z)
- Asynchronous Parallel Incremental Block-Coordinate Descent for Decentralized Machine Learning [55.198301429316125]
Machine learning (ML) is a key technique for big-data-driven modelling and analysis of massive Internet of Things (IoT) based intelligent and ubiquitous computing.
For fast-increasing applications and data amounts, distributed learning is a promising emerging paradigm since it is often impractical or inefficient to share/aggregate data.
This paper studies the problem of training an ML model over decentralized systems, where data are distributed over many user devices.
arXiv Detail & Related papers (2022-02-07T15:04:15Z)
- Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA [58.040931661693925]
We propose a strategy that combines redundant recomputing and out-of-core methods.
We achieve an average of 1.52x speedup in six different models over the state-of-the-art out-of-core methods.
Our data parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g. Megatron-LM and Turing-NLG.
arXiv Detail & Related papers (2020-08-26T07:24:34Z)
- Dataflow Aware Mapping of Convolutional Neural Networks Onto Many-Core Platforms With Network-on-Chip Interconnect [0.0764671395172401]
Machine intelligence, especially using convolutional neural networks (CNNs), has become a large area of research over the past years.
Many-core platforms consisting of several homogeneous cores can alleviate limitations with regard to physical implementation at the expense of an increased dataflow mapping effort.
This work presents an automated mapping strategy starting at the single-core level with different optimization targets for minimal runtime and minimal off-chip memory accesses.
The strategy is then extended towards a suitable many-core mapping scheme and evaluated using a scalable system-level simulation with a network-on-chip interconnect.
arXiv Detail & Related papers (2020-06-18T17:13:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.