On Extending Amdahl's law to Learn Computer Performance
- URL: http://arxiv.org/abs/2110.07822v1
- Date: Fri, 15 Oct 2021 02:37:07 GMT
- Title: On Extending Amdahl's law to Learn Computer Performance
- Authors: Chaitanya Poolla, Rahul Saxena
- Abstract summary: The problem of learning parallel computer performance is investigated in the context of multicore processors.
We propose to extend Amdahl's law to accommodate multiple resources into the overall speedup equation.
We transform the speedup equation into a multivariable regression problem suitable for machine learning.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The problem of learning parallel computer performance is investigated in the
context of multicore processors. Given a fixed workload, the effect of varying
system configuration on performance is sought. Conventionally, the performance
speedup due to a single resource enhancement is formulated using Amdahl's law.
However, in case of multiple configurable resources the conventional
formulation results in several disconnected speedup equations that cannot be
combined together to determine the overall speedup. To solve this problem, we
propose to (1) extend Amdahl's law to accommodate multiple configurable
resources into the overall speedup equation, and (2) transform the speedup
equation into a multivariable regression problem suitable for machine learning.
Using experimental data from two benchmarks (SPECCPU 2017 and PCMark 10) and
four hardware platforms (Intel Xeon 8180M, AMD EPYC 7702P, Intel CoffeeLake
8700K, and AMD Ryzen 3900X), analytical models are developed and
cross-validated. Findings indicate that in most cases, the models result in an
average cross-validated accuracy higher than 95%, thereby validating the
proposed extension of Amdahl's law. The proposed methodology enables rapid
generation of intelligent analytical models to support future industrial
development, optimization, and simulation needs.
Related papers
- Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition [95.54406667705999]
Pangu Embedded is an efficient Large Language Model (LLM) reasoner developed on Ascend Neural Processing Units (NPUs)<n>It addresses the significant computational costs and inference latency challenges prevalent in existing reasoning-optimized LLMs.<n>It delivers rapid responses and state-of-the-art reasoning quality within a single, unified model architecture.
arXiv Detail & Related papers (2025-05-28T14:03:02Z) - A 10.8mW Mixed-Signal Simulated Bifurcation Ising Solver using SRAM Compute-In-Memory with 0.6us Time-to-Solution [1.534667887016089]
This work presents a simulated bifurcation (SB) Ising solver in CMOS for NP-hard optimization problems.
Our chip achieves an order of magnitude improvement in time-to-solution and power compared to previously proposed Ising solvers in CMOS.
arXiv Detail & Related papers (2025-04-14T16:28:14Z) - Systems and Algorithms for Convolutional Multi-Hybrid Language Models at Scale [68.6602625868888]
We introduce convolutional multi-hybrid architectures, with a design grounded on two simple observations.
Operators in hybrid models can be tailored to token manipulation tasks such as in-context recall, multi-token recall, and compression.
We train end-to-end 1.2 to 2.9 times faster than optimized Transformers, and 1.1 to 1.4 times faster than previous generation hybrids.
arXiv Detail & Related papers (2025-02-25T19:47:20Z) - EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE.
Our results demonstrate an average 21% improvement in prefill throughput over existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z) - Benchmarking Edge AI Platforms for High-Performance ML Inference [0.0]
Edge computing's growing prominence, due to its ability to reduce communication latency and enable real-time processing, is promoting the rise of high-performance, heterogeneous System-on-Chip solutions.
While current approaches often involve scaling down modern hardware, the performance characteristics of neural network workloads can vary significantly.
We compare the latency and throughput of various linear algebra and neural network inference tasks across CPU-only, CPU/GPU, and CPU/NPU integrated solutions.
arXiv Detail & Related papers (2024-09-23T08:27:27Z) - Harnessing Manycore Processors with Distributed Memory for Accelerated
Training of Sparse and Recurrent Models [43.1773057439246]
Current AI training infrastructure is dominated by single instruction multiple data (SIMD) and systolic array architectures.
We explore sparse and recurrent model training on a massively parallel multiple instruction multiple data architecture with distributed local memory.
arXiv Detail & Related papers (2023-11-07T23:18:35Z) - Learning Performance-Improving Code Edits [107.21538852090208]
We introduce a framework for adapting large language models (LLMs) to high-level program optimization.
First, we curate a dataset of performance-improving edits made by human programmers of over 77,000 competitive C++ programming submission pairs.
For prompting, we propose retrieval-based few-shot prompting and chain-of-thought, and for finetuning, these include performance-conditioned generation and synthetic data augmentation based on self-play.
arXiv Detail & Related papers (2023-02-15T18:59:21Z) - Towards making the most of NLP-based device mapping optimization for
OpenCL kernels [5.6596607119831575]
We extend the work of Cummins et al., namely Deeptune, that tackles the problem of optimal device selection ( CPU or GPU) for accelerated OpenCL kernels.
We propose four different models that provide enhanced contextual information of source codes.
Experimental results show that our proposed methodology surpasses that of Cummins et al. work, providing up to 4% improvement in prediction accuracy.
arXiv Detail & Related papers (2022-08-30T10:20:55Z) - Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous
Multi-GPU Servers [65.60007071024629]
We show that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z) - Large Batch Simulation for Deep Reinforcement Learning [101.01408262583378]
We accelerate deep reinforcement learning-based training in visually complex 3D environments by two orders of magnitude over prior work.
We realize end-to-end training speeds of over 19,000 frames of experience per second on a single and up to 72,000 frames per second on a single eight- GPU machine.
By combining batch simulation and performance optimizations, we demonstrate that Point navigation agents can be trained in complex 3D environments on a single GPU in 1.5 days to 97% of the accuracy of agents trained on a prior state-of-the-art system.
arXiv Detail & Related papers (2021-03-12T00:22:50Z) - Scaling Distributed Deep Learning Workloads beyond the Memory Capacity
with KARMA [58.040931661693925]
We propose a strategy that combines redundant recomputing and out-of-core methods.
We achieve an average of 1.52x speedup in six different models over the state-of-the-art out-of-core methods.
Our data parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g. Megatron-LM and Turning-NLG.
arXiv Detail & Related papers (2020-08-26T07:24:34Z) - Heterogeneous CPU+GPU Stochastic Gradient Descent Algorithms [1.3249453757295084]
We study training algorithms for deep learning on heterogeneous CPU+GPU architectures.
Our two-fold objective -- maximize convergence rate and resource utilization simultaneously -- makes the problem challenging.
We show that the implementation of these algorithms achieves both faster convergence and higher resource utilization than on several real datasets.
arXiv Detail & Related papers (2020-04-19T05:21:20Z) - Optimizing Streaming Parallelism on Heterogeneous Many-Core
Architectures: A Machine Learning Based Approach [16.702537371391053]
This article presents an automatic approach to derive a good solution for hardware resource partition and task granularity for task-based parallel applications on heterogeneous many-core architectures.
Our approach employs a performance model to estimate the resulting performance of the target application under a given resource partition and task granularity configuration.
Compared to the single-stream version, our approach achieves a 1.6x and 1.1x speedup on the XeonPhi and the GPU platform, respectively.
arXiv Detail & Related papers (2020-03-05T21:18:21Z) - Accelerating Feedforward Computation via Parallel Nonlinear Equation
Solving [106.63673243937492]
Feedforward computation, such as evaluating a neural network or sampling from an autoregressive model, is ubiquitous in machine learning.
We frame the task of feedforward computation as solving a system of nonlinear equations. We then propose to find the solution using a Jacobi or Gauss-Seidel fixed-point method, as well as hybrid methods of both.
Our method is guaranteed to give exactly the same values as the original feedforward computation with a reduced (or equal) number of parallelizable iterations, and hence reduced time given sufficient parallel computing power.
arXiv Detail & Related papers (2020-02-10T10:11:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.