Benchmarking and In-depth Performance Study of Large Language Models on
Habana Gaudi Processors
- URL: http://arxiv.org/abs/2309.16976v1
- Date: Fri, 29 Sep 2023 04:49:35 GMT
- Title: Benchmarking and In-depth Performance Study of Large Language Models on
Habana Gaudi Processors
- Authors: Chengming Zhang, Baixi Sun, Xiaodong Yu, Zhen Xie, Weijian Zheng,
Kamil Iskra, Pete Beckman, Dingwen Tao
- Abstract summary: Transformer models have achieved remarkable success in various machine learning tasks but suffer from high computational complexity and resource requirements.
Specialized AI hardware accelerators, such as the Habana GAUDI architecture, offer a promising solution to tackle these issues.
This paper explores the untapped potential of using GAUDI processors to accelerate Transformer-based models, addressing key challenges in the process.
- Score: 5.432613942292548
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer models have achieved remarkable success in various machine
learning tasks but suffer from high computational complexity and resource
requirements. The quadratic complexity of the self-attention mechanism further
exacerbates these challenges when dealing with long sequences and large
datasets. Specialized AI hardware accelerators, such as the Habana GAUDI
architecture, offer a promising solution to tackle these issues. GAUDI features
a Matrix Multiplication Engine (MME) and a cluster of fully programmable Tensor
Processing Cores (TPC). This paper explores the untapped potential of using
GAUDI processors to accelerate Transformer-based models, addressing key
challenges in the process. Firstly, we provide a comprehensive performance
comparison between the MME and TPC components, illuminating their relative
strengths and weaknesses. Secondly, we explore strategies to optimize MME and
TPC utilization, offering practical insights to enhance computational
efficiency. Thirdly, we evaluate the performance of Transformers on GAUDI,
particularly in handling long sequences and uncovering performance bottlenecks.
Lastly, we evaluate the end-to-end performance of two Transformer-based large
language models (LLMs) on GAUDI. The contributions of this work encompass
practical insights for practitioners and researchers alike. We delve into
GAUDI's capabilities for Transformers through systematic profiling, analysis,
and optimization exploration. Our study bridges a research gap and offers a
roadmap for optimizing Transformer-based model training on the GAUDI
architecture.
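As a rough illustration of the abstract's points about the quadratic cost of self-attention and the split between matrix-multiplication work (handled by the MME) and elementwise/vector work (handled by the TPC), the sketch below times a naive scaled-dot-product attention layer in PyTorch at increasing sequence lengths. It is a minimal sketch, not the paper's benchmarking harness; the "hpu" device string and the habana_frameworks.torch.core import assume Habana's SynapseAI PyTorch bridge is installed, and the helper names (pick_device, self_attention) are illustrative. Without the bridge the sketch falls back to CPU.

```python
# Minimal sketch (not the paper's harness): time naive self-attention at
# growing sequence lengths to illustrate its O(L^2) cost in the sequence
# length L. Running on Gaudi assumes Habana's PyTorch bridge is installed so
# that torch.device("hpu") is available; otherwise we fall back to CPU.
import time
import torch

def self_attention(q, k, v):
    # q, k, v: (batch, heads, L, d). The score matrix is (L, L), i.e. quadratic in L.
    d = q.shape[-1]
    scores = torch.matmul(q, k.transpose(-2, -1)) / d ** 0.5  # matmul work (MME-eligible)
    weights = torch.softmax(scores, dim=-1)                   # elementwise/vector work (TPC-style)
    return torch.matmul(weights, v)

def pick_device():
    try:
        import habana_frameworks.torch.core  # noqa: F401  -- assumption: Habana bridge present
        return torch.device("hpu")
    except ImportError:
        return torch.device("cpu")

if __name__ == "__main__":
    device = pick_device()
    batch, heads, d_head = 1, 12, 64
    for seq_len in (512, 1024, 2048, 4096):
        q = torch.randn(batch, heads, seq_len, d_head, device=device)
        k = torch.randn_like(q)
        v = torch.randn_like(q)
        start = time.perf_counter()
        out = self_attention(q, k, v)
        out.cpu()  # pull the result to host to force execution on lazy/async backends
        print(f"L={seq_len:5d}  attention time: {time.perf_counter() - start:.4f}s")
```

Doubling the sequence length roughly quadruples the attention-score computation, which is the long-sequence bottleneck the study profiles on GAUDI.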
Related papers
- Inference Optimization of Foundation Models on AI Accelerators [68.24450520773688]
Powerful foundation models, including large language models (LLMs), with Transformer architectures have ushered in a new era of Generative AI.
As the number of model parameters reaches hundreds of billions, their deployment incurs prohibitive inference costs and high latency in real-world scenarios.
This tutorial offers a comprehensive discussion on complementary inference optimization techniques using AI accelerators.
arXiv Detail & Related papers (2024-07-12T09:24:34Z) - Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient numerical training and inference algorithms, such as low-rank computation, achieve impressive performance for learning Transformer-based adaptation.
We analyze how magnitude-based pruning affects generalization while improving adaptation.
We conclude that proper magnitude-based pruning has only a slight effect on testing performance.
arXiv Detail & Related papers (2024-06-24T23:00:58Z) - How Do Nonlinear Transformers Learn and Generalize in In-Context Learning? [82.51626700527837]
Transformer-based large language models have displayed impressive in-context learning (ICL) capabilities, where a pre-trained model can handle new tasks without fine-tuning.
We analyze the mechanics by which Transformers achieve ICL and how they contribute to the technical challenges of analyzing Transformer training.
arXiv Detail & Related papers (2024-02-23T21:07:20Z) - A Comprehensive Performance Study of Large Language Models on Novel AI
Accelerators [2.88634411143577]
Large language models (LLMs) are being considered as a promising approach to addressing a range of challenging problems.
Specialized AI accelerator hardware systems have recently become available for accelerating AI applications.
arXiv Detail & Related papers (2023-10-06T21:55:57Z) - A survey on efficient vision transformers: algorithms, techniques, and
performance benchmarking [19.65897437342896]
Vision Transformer (ViT) architectures are becoming increasingly popular and widely employed to tackle computer vision applications.
This paper mathematically defines the strategies used to make Vision Transformer efficient, describes and discusses state-of-the-art methodologies, and analyzes their performances over different application scenarios.
arXiv Detail & Related papers (2023-09-05T08:21:16Z) - End-to-End Meta-Bayesian Optimisation with Transformer Neural Processes [52.818579746354665]
This paper proposes the first end-to-end differentiable meta-BO framework that generalises neural processes to learn acquisition functions via transformer architectures.
We enable this end-to-end framework with reinforcement learning (RL) to tackle the lack of labelled acquisition data.
arXiv Detail & Related papers (2023-05-25T10:58:46Z) - Full Stack Optimization of Transformer Inference: a Survey [58.55475772110702]
Transformer models achieve superior accuracy across a wide range of applications.
The amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate.
There has been an increased focus on making Transformer models more efficient.
arXiv Detail & Related papers (2023-02-27T18:18:13Z) - Efficient pre-training objectives for Transformers [84.64393460397471]
We study several efficient pre-training objectives for Transformers-based models.
We prove that eliminating the MASK token and considering the whole output during the loss computation are essential choices for improving performance.
arXiv Detail & Related papers (2021-04-20T00:09:37Z) - Optimizing Inference Performance of Transformers on CPUs [0.0]
Transformers-based models (e.g., BERT) power many important Web services, such as search, translation, question-answering, etc.
This paper presents an empirical analysis of the scalability and performance of Transformer-based model inference on CPUs.
arXiv Detail & Related papers (2021-02-12T17:01:35Z) - A Learned Performance Model for Tensor Processing Units [5.733911161090224]
We demonstrate a method of learning performance models from a corpus of graph programs for Tensor Processing Unit (TPU) instances.
We show that our learned model outperforms a heavily-optimized analytical performance model on two tasks.
It helps an autotuner discover faster programs in a setting where access to TPUs is limited or expensive.
arXiv Detail & Related papers (2020-08-03T17:24:52Z)