A Transferable Approach for Partitioning Machine Learning Models on
Multi-Chip-Modules
- URL: http://arxiv.org/abs/2112.04041v1
- Date: Tue, 7 Dec 2021 23:40:28 GMT
- Authors: Xinfeng Xie, Prakash Prabhu, Ulysse Beaugnon, Phitchaya Mangpo
Phothilimthana, Sudip Roy, Azalia Mirhoseini, Eugene Brevdo, James Laudon,
Yanqi Zhou
- Abstract summary: Multi-Chip-Modules (MCMs) reduce the design and fabrication cost of machine learning accelerators.
We present a strategy using a deep reinforcement learning framework to emit a possibly invalid candidate partition that is then corrected by a constraint solver.
Our evaluation of a production-scale model, BERT, on real hardware reveals that the partitioning generated using RL policy achieves 6.11% and 5.85% higher throughput than random search and simulated annealing, respectively.
- Score: 8.224904698490626
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-Chip-Modules (MCMs) reduce the design and fabrication cost of machine
learning (ML) accelerators while delivering performance and energy efficiency
on par with a monolithic large chip. However, ML compilers targeting MCMs need
to solve complex optimization problems optimally and efficiently to achieve
this high performance. One such problem is the multi-chip partitioning problem
where compilers determine the optimal partitioning and placement of operations
in tensor computation graphs on chiplets in MCMs. Partitioning ML graphs for
MCMs is particularly hard as the search space grows exponentially with the
number of chiplets available and the number of nodes in the neural network.
Furthermore, the constraints imposed by the underlying hardware produce a
search space where valid solutions are extremely sparse. In this paper, we
present a strategy using a deep reinforcement learning (RL) framework to emit a
possibly invalid candidate partition that is then corrected by a constraint
solver. Using the constraint solver ensures that RL encounters valid solutions
in the sparse space frequently enough to converge with fewer samples as
compared to non-learned strategies. The architectural choices we make for the
policy network allow us to generalize across different ML graphs. Our
evaluation of a production-scale model, BERT, on real hardware reveals that the
partitioning generated using RL policy achieves 6.11% and 5.85% higher
throughput than random search and simulated annealing, respectively. In
addition, fine-tuning
the pre-trained RL policy reduces the search time from 3 hours to only 9
minutes, while achieving the same throughput as training RL policy from
scratch.
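The emit-then-correct loop described in the abstract can be illustrated with a minimal Python sketch. Everything below is a hypothetical stand-in, not the paper's actual method: the policy is replaced by random sampling, the hardware constraint is a toy "every chiplet must be used" rule (real MCM constraints involve memory capacity and inter-chiplet bandwidth, which is what makes valid solutions so sparse), and the solver is a greedy repair rather than a real constraint solver.

```python
import random
from collections import Counter

# Hypothetical toy sizes for illustration only.
NUM_CHIPLETS = 4
NUM_NODES = 10

# The naive search space grows as chiplets**nodes: each graph node can be
# placed on any chiplet independently.
search_space = NUM_CHIPLETS ** NUM_NODES

def is_valid(partition):
    """Toy stand-in constraint: every chiplet must host at least one node."""
    return len(set(partition)) == NUM_CHIPLETS

def policy_sample():
    """Stand-in for the RL policy: emits a candidate partition that may
    violate the hardware constraints."""
    return [random.randrange(NUM_CHIPLETS) for _ in range(NUM_NODES)]

def solver_correct(partition):
    """Stand-in for the constraint solver: greedily repairs an invalid
    candidate by moving a node from an over-used chiplet to each unused
    one, so training always proceeds on a valid partition."""
    fixed = list(partition)
    counts = Counter(fixed)
    missing = [c for c in range(NUM_CHIPLETS) if counts[c] == 0]
    for chiplet in missing:
        idx = next(i for i, c in enumerate(fixed) if counts[c] > 1)
        counts[fixed[idx]] -= 1
        fixed[idx] = chiplet
        counts[chiplet] += 1
    return fixed

candidate = policy_sample()
repaired = candidate if is_valid(candidate) else solver_correct(candidate)
assert is_valid(repaired)  # the RL loop only ever scores valid partitions
```

The point of the sketch is the control flow, not the repair heuristic: because invalid candidates are corrected rather than discarded, every sample contributes a valid, scoreable partition, which is why the paper's approach converges with fewer samples than non-learned search.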
Related papers
- Duo-LLM: A Framework for Studying Adaptive Computation in Large Language Models [16.16372459671255]
Large Language Models (LLMs) typically generate outputs token by token using a fixed compute budget.
We propose a novel framework that integrates smaller auxiliary modules within each Feed-Forward Network layer of the LLM.
We show that trained routers operate differently from oracles and often yield suboptimal solutions.
arXiv Detail & Related papers (2024-10-01T16:10:21Z)
- Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization [0.6445087473595953]
Large language models (LLMs) demonstrate outstanding performance in various tasks in machine learning.
However, deploying LLM inference poses challenges due to the high compute and memory requirements.
We present Tender, an algorithm-hardware co-design solution that enables efficient deployment of LLM inference at low precision.
arXiv Detail & Related papers (2024-06-16T09:51:55Z)
- Distributed Inference and Fine-tuning of Large Language Models Over The Internet [91.00270820533272]
Large language models (LLMs) are useful in many NLP tasks and become more capable with size.
These models require high-end hardware, making them inaccessible to most researchers.
We develop fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices to maximize the total system throughput.
arXiv Detail & Related papers (2023-12-13T18:52:49Z)
- Deep Model Predictive Optimization [21.22047409735362]
A major challenge in robotics is to design robust policies which enable complex and agile behaviors in the real world.
We propose Deep Model Predictive Optimization (DMPO), which learns the inner-loop of an MPC optimization algorithm directly via experience.
DMPO can outperform the best MPC algorithm by up to 27% with fewer samples, and an end-to-end policy trained with MFRL by 19%.
arXiv Detail & Related papers (2023-10-06T21:11:52Z)
- A Multi-Head Ensemble Multi-Task Learning Approach for Dynamical Computation Offloading [62.34538208323411]
We propose a multi-head ensemble multi-task learning (MEMTL) approach with a shared backbone and multiple prediction heads (PHs).
MEMTL outperforms benchmark methods in both the inference accuracy and mean square error without requiring additional training data.
arXiv Detail & Related papers (2023-09-02T11:01:16Z)
- Maximize to Explore: One Objective Function Fusing Estimation, Planning, and Exploration [87.53543137162488]
We propose an easy-to-implement online reinforcement learning (online RL) framework called MEX.
MEX integrates estimation and planning components while automatically balancing exploration and exploitation.
It can outperform baselines by a stable margin in various MuJoCo environments with sparse rewards.
arXiv Detail & Related papers (2023-05-29T17:25:26Z)
- Combining Multi-Objective Bayesian Optimization with Reinforcement Learning for TinyML [4.2019872499238256]
We propose a novel strategy for deploying Deep Neural Networks on microcontrollers (TinyML) based on Multi-Objective Bayesian optimization (MOBOpt).
Our methodology aims at efficiently finding tradeoffs between a DNN's predictive accuracy, memory consumption on a given target system, and computational complexity.
arXiv Detail & Related papers (2023-05-23T14:31:52Z)
- Partitioning Distributed Compute Jobs with Reinforcement Learning and Graph Neural Networks [58.720142291102135]
Large-scale machine learning models are bringing advances to a broad range of fields.
Many of these models are too large to be trained on a single machine, and must be distributed across multiple devices.
We show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate.
arXiv Detail & Related papers (2023-01-31T17:41:07Z)
- Towards Efficient Post-training Quantization of Pre-trained Language Models [85.68317334241287]
We study post-training quantization (PTQ) of PLMs, and propose module-wise quantization error minimization (MREM), an efficient solution to mitigate these issues.
Experiments on GLUE and SQuAD benchmarks show that our proposed PTQ solution not only performs close to QAT, but also enjoys significant reductions in training time, memory overhead, and data consumption.
arXiv Detail & Related papers (2021-09-30T12:50:06Z)
- Memory-Based Optimization Methods for Model-Agnostic Meta-Learning and Personalized Federated Learning [56.17603785248675]
Model-agnostic meta-learning (MAML) has become a popular research area.
Existing MAML algorithms rely on the 'episode' idea by sampling a few tasks and data points to update the meta-model at each iteration.
This paper proposes memory-based algorithms for MAML that converge with vanishing error.
arXiv Detail & Related papers (2021-06-09T08:47:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.