A Transferable Approach for Partitioning Machine Learning Models on
Multi-Chip-Modules
- URL: http://arxiv.org/abs/2112.04041v1
- Date: Tue, 7 Dec 2021 23:40:28 GMT
- Authors: Xinfeng Xie, Prakash Prabhu, Ulysse Beaugnon, Phitchaya Mangpo
Phothilimthana, Sudip Roy, Azalia Mirhoseini, Eugene Brevdo, James Laudon,
Yanqi Zhou
- Abstract summary: Multi-Chip-Modules (MCMs) reduce the design and fabrication cost of machine learning accelerators.
We present a strategy using a deep reinforcement learning framework to emit a possibly invalid candidate partition that is then corrected by a constraint solver.
Our evaluation of a production-scale model, BERT, on real hardware reveals that the partitioning generated using RL policy achieves 6.11% and 5.85% higher throughput than random search and simulated annealing, respectively.
- Score: 8.224904698490626
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-Chip-Modules (MCMs) reduce the design and fabrication cost of machine
learning (ML) accelerators while delivering performance and energy efficiency
on par with a monolithic large chip. However, ML compilers targeting MCMs need
to solve complex optimization problems optimally and efficiently to achieve
this high performance. One such problem is the multi-chip partitioning problem
where compilers determine the optimal partitioning and placement of operations
in tensor computation graphs on chiplets in MCMs. Partitioning ML graphs for
MCMs is particularly hard as the search space grows exponentially with the
number of chiplets available and the number of nodes in the neural network.
Furthermore, the constraints imposed by the underlying hardware produce a
search space where valid solutions are extremely sparse. In this paper, we
present a strategy using a deep reinforcement learning (RL) framework to emit a
possibly invalid candidate partition that is then corrected by a constraint
solver. Using the constraint solver ensures that RL encounters valid solutions
in the sparse space frequently enough to converge with fewer samples as
compared to non-learned strategies. The architectural choices we make for the
policy network allow us to generalize across different ML graphs. Our
evaluation of a production-scale model, BERT, on real hardware reveals that the
partitioning generated using RL policy achieves 6.11% and 5.85% higher
throughput than random search and simulated annealing, respectively. In
addition, fine-tuning
the pre-trained RL policy reduces the search time from 3 hours to only 9
minutes, while achieving the same throughput as training RL policy from
scratch.
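The emit-then-correct loop described in the abstract can be illustrated with a minimal Python sketch. Everything below is a hypothetical stand-in, not the paper's actual method: the policy is replaced by random sampling, the hardware constraint is a toy "every chiplet must be used" rule (real MCM constraints involve memory capacity and inter-chiplet bandwidth, which is what makes valid solutions so sparse), and the solver is a greedy repair rather than a real constraint solver.

```python
import random
from collections import Counter

# Hypothetical toy sizes for illustration only.
NUM_CHIPLETS = 4
NUM_NODES = 10

# The naive search space grows as chiplets**nodes: each graph node can be
# placed on any chiplet independently.
search_space = NUM_CHIPLETS ** NUM_NODES

def is_valid(partition):
    """Toy stand-in constraint: every chiplet must host at least one node."""
    return len(set(partition)) == NUM_CHIPLETS

def policy_sample():
    """Stand-in for the RL policy: emits a candidate partition that may
    violate the hardware constraints."""
    return [random.randrange(NUM_CHIPLETS) for _ in range(NUM_NODES)]

def solver_correct(partition):
    """Stand-in for the constraint solver: greedily repairs an invalid
    candidate by moving a node from an over-used chiplet to each unused
    one, so training always proceeds on a valid partition."""
    fixed = list(partition)
    counts = Counter(fixed)
    missing = [c for c in range(NUM_CHIPLETS) if counts[c] == 0]
    for chiplet in missing:
        idx = next(i for i, c in enumerate(fixed) if counts[c] > 1)
        counts[fixed[idx]] -= 1
        fixed[idx] = chiplet
        counts[chiplet] += 1
    return fixed

candidate = policy_sample()
repaired = candidate if is_valid(candidate) else solver_correct(candidate)
assert is_valid(repaired)  # the RL loop only ever scores valid partitions
```

The point of the sketch is the control flow, not the repair heuristic: because invalid candidates are corrected rather than discarded, every sample contributes a valid, scoreable partition, which is why the paper's approach converges with fewer samples than non-learned search.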
Related papers
- Duo-LLM: A Framework for Studying Adaptive Computation in Large Language Models [16.16372459671255]
Large Language Models (LLMs) typically generate outputs token by token using a fixed compute budget.
We propose a novel framework that integrates smaller auxiliary modules within each Feed-Forward Network layer of the LLM.
We show that trained routers operate differently from oracles and often yield suboptimal solutions.
arXiv Detail & Related papers (2024-10-01T16:10:21Z)
- Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization [0.6445087473595953]
Large language models (LLMs) demonstrate outstanding performance in various tasks in machine learning.
However, deploying LLM inference poses challenges due to the high compute and memory requirements.
We present Tender, an algorithm-hardware co-design solution that enables efficient deployment of LLM inference at low precision.
arXiv Detail & Related papers (2024-06-16T09:51:55Z)
- Distributed Inference and Fine-tuning of Large Language Models Over The Internet [91.00270820533272]
Large language models (LLMs) are useful in many NLP tasks and become more capable with size.
These models require high-end hardware, making them inaccessible to most researchers.
We develop fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices to maximize the total system throughput.
arXiv Detail & Related papers (2023-12-13T18:52:49Z)
- Deep Model Predictive Optimization [21.22047409735362]
A major challenge in robotics is to design robust policies which enable complex and agile behaviors in the real world.
We propose Deep Model Predictive Optimization (DMPO), which learns the inner-loop of an MPC optimization algorithm directly via experience.
DMPO can outperform the best MPC algorithm by up to 27% with fewer samples, and an end-to-end policy trained with MFRL by 19%.
arXiv Detail & Related papers (2023-10-06T21:11:52Z)
- A Multi-Head Ensemble Multi-Task Learning Approach for Dynamical Computation Offloading [62.34538208323411]
We propose a multi-head ensemble multi-task learning (MEMTL) approach with a shared backbone and multiple prediction heads (PHs).
MEMTL outperforms benchmark methods in both the inference accuracy and mean square error without requiring additional training data.
arXiv Detail & Related papers (2023-09-02T11:01:16Z)
- Maximize to Explore: One Objective Function Fusing Estimation, Planning, and Exploration [87.53543137162488]
We propose an easy-to-implement online reinforcement learning (online RL) framework called MEX.
MEX integrates estimation and planning components while automatically balancing exploration and exploitation.
It can outperform baselines by a stable margin in various MuJoCo environments with sparse rewards.
arXiv Detail & Related papers (2023-05-29T17:25:26Z)
- Combining Multi-Objective Bayesian Optimization with Reinforcement Learning for TinyML [4.2019872499238256]
We propose a novel strategy for deploying Deep Neural Networks on microcontrollers (TinyML) based on Multi-Objective Bayesian optimization (MOBOpt).
Our methodology aims at efficiently finding tradeoffs between a DNN's predictive accuracy, memory consumption on a given target system, and computational complexity.
arXiv Detail & Related papers (2023-05-23T14:31:52Z)
- Partitioning Distributed Compute Jobs with Reinforcement Learning and Graph Neural Networks [58.720142291102135]
Large-scale machine learning models are bringing advances to a broad range of fields.
Many of these models are too large to be trained on a single machine, and must be distributed across multiple devices.
We show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate.
arXiv Detail & Related papers (2023-01-31T17:41:07Z)
- Towards Efficient Post-training Quantization of Pre-trained Language Models [85.68317334241287]
We study post-training quantization (PTQ) of PLMs, and propose module-wise quantization error minimization (MREM), an efficient solution to mitigate these issues.
Experiments on GLUE and SQuAD benchmarks show that our proposed PTQ solution not only performs close to QAT, but also enjoys significant reductions in training time, memory overhead, and data consumption.
arXiv Detail & Related papers (2021-09-30T12:50:06Z)
- Memory-Based Optimization Methods for Model-Agnostic Meta-Learning and Personalized Federated Learning [56.17603785248675]
Model-agnostic meta-learning (MAML) has become a popular research area.
Existing MAML algorithms rely on the 'episode' idea by sampling a few tasks and data points to update the meta-model at each iteration.
This paper proposes memory-based algorithms for MAML that converge with vanishing error.
arXiv Detail & Related papers (2021-06-09T08:47:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.