SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts
- URL: http://arxiv.org/abs/2405.07518v2
- Date: Tue, 05 Nov 2024 02:53:00 GMT
- Title: SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts
- Authors: Raghu Prabhakar, Ram Sivaramakrishnan, Darshan Gandhi, Yun Du, Mingran Wang, Xiangyu Song, Kejie Zhang, Tianren Gao, Angela Wang, Karen Li, Yongning Sheng, Joshua Brot, Denis Sokolov, Apurv Vivek, Calvin Leung, Arjun Sabnis, Jiayu Bai, Tuowen Zhao, Mark Gottscho, David Jackson, Mark Luttrell, Manish K. Shah, Edison Chen, Kaizhao Liang, Swayambhoo Jain, Urmish Thakker, Dawei Huang, Sumti Jairath, Kevin J. Brown, Kunle Olukotun
- Abstract summary: Monolithic large language models (LLMs) like GPT-4 have paved the way for modern generative AI applications.
The disproportionate increase in the compute-to-memory ratio of modern AI accelerators has created a memory wall.
Composition of Experts (CoE) is an alternative modular approach that lowers the cost and complexity of training and serving.
- Score: 9.94373711477696
- License:
- Abstract: Monolithic large language models (LLMs) like GPT-4 have paved the way for modern generative AI applications. Training, serving, and maintaining monolithic LLMs at scale, however, remains prohibitively expensive and challenging. The disproportionate increase in the compute-to-memory ratio of modern AI accelerators has created a memory wall, necessitating new methods to deploy AI. Composition of Experts (CoE) is an alternative modular approach that lowers the cost and complexity of training and serving. However, this approach presents two key challenges when using conventional hardware: (1) without fused operations, smaller models have lower operational intensity, which makes high utilization more challenging to achieve; and (2) hosting a large number of models can be either prohibitively expensive or slow when dynamically switching between them. In this paper, we describe how combining CoE, streaming dataflow, and a three-tier memory system scales the AI memory wall. We describe Samba-CoE, a CoE system with 150 experts and a trillion total parameters. We deploy Samba-CoE on the SambaNova SN40L Reconfigurable Dataflow Unit (RDU) - a commercial dataflow accelerator architecture that has been co-designed for enterprise inference and training applications. The chip introduces a new three-tier memory system with on-chip distributed SRAM, on-package HBM, and off-package DDR DRAM. A dedicated inter-RDU network enables scaling up and out over multiple sockets. We demonstrate speedups ranging from 2$\times$ to 13$\times$ on various benchmarks running on eight RDU sockets compared with an unfused baseline. We show that for CoE inference deployments, the 8-socket RDU Node reduces machine footprint by up to 19$\times$, speeds up model switching time by 15$\times$ to 31$\times$, and achieves an overall speedup of 3.7$\times$ over a DGX H100 and 6.6$\times$ over a DGX A100.
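To make the serving pattern concrete, here is a minimal Python sketch of the Composition-of-Experts idea described in the abstract. The router, expert names, and the assumption that fast memory holds one expert at a time are illustrative, not details of Samba-CoE or the SN40L runtime.

```python
# Minimal sketch of Composition-of-Experts serving (illustrative assumptions):
# many independent expert models live in the large, slower DDR tier; a router
# picks one per request, and only that expert is staged into fast memory.
experts = {f"expert_{i}": f"<weights of expert {i} in DDR>" for i in range(150)}
fast_memory = {}                       # stand-in for on-package HBM

def route(prompt):
    # Toy router; in Samba-CoE the router is itself a model.
    return f"expert_{hash(prompt) % len(experts)}"

def serve(prompt):
    name = route(prompt)
    if name not in fast_memory:        # expert switch: copy DDR -> HBM
        fast_memory.clear()            # assume HBM holds one expert at a time
        fast_memory[name] = experts[name]
    return f"{name} handled: {prompt!r}"

print(serve("Summarize the SN40L memory hierarchy."))
```

In this picture the aggregate trillion parameters only need DDR capacity, while each request touches a single expert's worth of weights in fast memory; switching experts becomes a memory-tier copy, which is the kind of model-switching cost the abstract reports improving by 15$\times$ to 31$\times$.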
Related papers
- Harnessing Your DRAM and SSD for Sustainable and Accessible LLM Inference with Mixed-Precision and Multi-level Caching [35.83447642182576]
Large Language Models (LLMs) have demonstrated remarkable capabilities.
LLM deployment now accounts for the main part of the carbon emissions of today's AI applications.
This paper proposes a model modularization algorithm to enable LLM inference on outdated hardware.
arXiv Detail & Related papers (2024-10-17T08:33:39Z)
- Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning [49.997801914237094]
We introduce the Fire-Flyer AI-HPC architecture, a synergistic hardware-software co-design framework and its best practices.
For Deep Learning (DL) training, we deployed the Fire-Flyer 2 with 10,000 PCIe A100 GPUs, achieving performance approximating that of the DGX-A100 while reducing costs by half and energy consumption by 40%.
Through our software stack, including HaiScale, 3FS, and HAI-Platform, we achieved substantial scalability by overlapping computation and communication.
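As a generic illustration of the compute/communication overlap mentioned above (not the HaiScale implementation), the sketch below starts each layer's gradient "allreduce" on a background thread so it proceeds while the next layer's backward pass runs; the layer count and timings are placeholders.

```python
# Sketch of overlapping backward computation with gradient communication,
# the generic idea behind hiding allreduce latency during DL training.
import threading, time

def backward_layer(i):
    time.sleep(0.01)            # stand-in for computing layer i's gradients
    return f"grad_{i}"

def allreduce(grad):
    time.sleep(0.01)            # stand-in for network communication

pending = []
for layer in reversed(range(8)):            # backward pass, last layer first
    grad = backward_layer(layer)            # compute this layer's gradients
    t = threading.Thread(target=allreduce, args=(grad,))
    t.start()                               # communicate while the next
    pending.append(t)                       # layer's backward runs
for t in pending:
    t.join()                                # all gradients synchronized
```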
arXiv Detail & Related papers (2024-08-26T10:11:56Z)
- AI and Memory Wall [81.06494558184049]
We show how memory bandwidth can become the dominant bottleneck for decoder models.
We argue for a redesign in model architecture, training, and deployment strategies to overcome this memory limitation.
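A back-of-the-envelope calculation makes the bandwidth bound concrete; the parameter count, precision, and bandwidth below are illustrative assumptions rather than numbers from the paper.

```python
# Sketch of the decode-time memory wall: during autoregressive generation
# with batch size 1, every weight is read from memory once per token, so
# throughput is bounded by memory bandwidth, not compute.
params = 70e9                 # assumed 70B-parameter decoder
bytes_per_param = 2           # fp16/bf16 weights
hbm_bandwidth = 3.35e12       # ~3.35 TB/s (H100-class HBM, assumed)

bytes_per_token = params * bytes_per_param
max_tokens_per_s = hbm_bandwidth / bytes_per_token
print(f"bandwidth-bound decode: ~{max_tokens_per_s:.0f} tokens/s")  # ~24
```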
arXiv Detail & Related papers (2024-03-21T04:31:59Z)
- HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference [68.59839755875252]
HiRE comprises two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate top-$k$ operator.
We demonstrate that on a one-billion-parameter model, HiRE, applied to both the softmax and the feedforward layers, achieves almost matching pretraining and downstream accuracy, and speeds up inference latency by $1.47\times$ on a single TPUv5e device.
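A minimal numpy sketch of the first component, with assumed shapes and a low-rank factorization as a stand-in for the compression scheme: cheap approximate scores select an oversampled candidate set, and the exact matrix product is restricted to those rows.

```python
# Sketch of high-recall approximate top-k (assumed sizes, low-rank predictor).
import numpy as np

rng = np.random.default_rng(0)
d, n, r, k = 256, 8000, 32, 64           # hidden dim, rows, low rank, top-k
W = rng.standard_normal((n, d))          # full weight matrix (e.g. LM head)
U, S, Vt = np.linalg.svd(W, full_matrices=False)

x = rng.standard_normal(d)
approx = (U[:, :r] * S[:r]) @ (Vt[:r] @ x)        # cheap rank-r scores, O(nr + rd)
cand = np.argpartition(approx, -4 * k)[-4 * k:]   # oversampled candidate set
exact = W[cand] @ x                               # exact scores on the small subset
topk = cand[np.argpartition(exact, -k)[-k:]]      # high-recall top-k indices
```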
arXiv Detail & Related papers (2024-02-14T18:04:36Z)
- Harnessing Manycore Processors with Distributed Memory for Accelerated Training of Sparse and Recurrent Models [43.1773057439246]
Current AI training infrastructure is dominated by single instruction multiple data (SIMD) and systolic array architectures.
We explore sparse and recurrent model training on a massively parallel multiple instruction multiple data architecture with distributed local memory.
arXiv Detail & Related papers (2023-11-07T23:18:35Z)
- Fully $1\times1$ Convolutional Network for Lightweight Image Super-Resolution [79.04007257606862]
Deep models have made significant progress on single image super-resolution (SISR) tasks, in particular large models with large kernels ($3\times3$ or more).
$1\times1$ convolutions bring substantial computational efficiency, but struggle with aggregating local spatial representations.
We propose a simple yet effective fully $1\times1$ convolutional network, named Shift-Conv-based Network (SCNet).
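A minimal numpy sketch of the shift-plus-$1\times1$ idea, with assumed channel grouping and shift directions (not the exact SCNet operator): a parameter-free spatial shift lets the following $1\times1$ convolution aggregate information from neighboring pixels.

```python
# Sketch: parameter-free spatial shift followed by a 1x1 convolution.
import numpy as np

def shift(x):
    """x: (C, H, W). Shift four channel groups by one pixel in four directions."""
    out = np.zeros_like(x)
    g = x.shape[0] // 4
    out[0*g:1*g, 1:, :] = x[0*g:1*g, :-1, :]   # shift down
    out[1*g:2*g, :-1, :] = x[1*g:2*g, 1:, :]   # shift up
    out[2*g:3*g, :, 1:] = x[2*g:3*g, :, :-1]   # shift right
    out[3*g:,    :, :-1] = x[3*g:,    :, 1:]   # shift left
    return out

def conv1x1(x, w):
    """1x1 convolution = per-pixel channel mixing. w: (C_out, C_in)."""
    C, H, W = x.shape
    return (w @ x.reshape(C, H * W)).reshape(w.shape[0], H, W)

x = np.random.randn(16, 8, 8)
w = np.random.randn(16, 16) * 0.1
y = conv1x1(shift(x), w)        # spatial mixing without any 3x3 kernels
```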
arXiv Detail & Related papers (2023-07-30T06:24:03Z)
- Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-level Sparsity via Mixture-of-Experts [60.1586169973792]
M$^3$ViT is the latest multi-task ViT model that introduces mixture-of-experts (MoE).
MoE achieves better accuracy and an over 80% reduction in computation, but leaves challenges for efficient deployment on FPGA.
Our work, dubbed Edge-MoE, solves these challenges by introducing the first end-to-end FPGA accelerator for multi-task ViT with a collection of architectural innovations.
arXiv Detail & Related papers (2023-05-30T02:24:03Z)
- Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference [7.743308058511418]
We provide a characterization of two MoE workloads, namely Language Modeling (LM) and Machine Translation (MT).
We propose three optimization techniques to mitigate sources of inefficiency, namely (1) dynamic gating, (2) expert buffering, and (3) expert load balancing.
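Of the three, expert buffering is the easiest to sketch; the LRU policy and capacity below are assumptions for illustration, not the paper's exact mechanism.

```python
# Sketch of expert buffering: keep only the most recently used experts in
# fast device memory and fetch the rest from host memory on demand.
from collections import OrderedDict

class ExpertBuffer:
    def __init__(self, host_experts, capacity=4):
        self.host = host_experts            # expert_id -> weights (host copy)
        self.device = OrderedDict()         # experts resident in fast memory
        self.capacity = capacity

    def get(self, expert_id):
        if expert_id in self.device:
            self.device.move_to_end(expert_id)     # mark as recently used
            return self.device[expert_id]
        if len(self.device) >= self.capacity:
            self.device.popitem(last=False)        # evict least recently used
        weights = self.host[expert_id]             # stand-in for host-to-device copy
        self.device[expert_id] = weights
        return weights

buf = ExpertBuffer({i: f"weights_{i}" for i in range(64)}, capacity=4)
for eid in [3, 7, 3, 11, 19, 3]:                   # gating decisions
    _ = buf.get(eid)                               # only misses trigger copies
```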
arXiv Detail & Related papers (2023-03-10T19:30:15Z)
- RAMP: A Flat Nanosecond Optical Network and MPI Operations for Distributed Deep Learning Systems [68.8204255655161]
We introduce a near-exascale, full-bisection bandwidth, all-to-all, single-hop, all-optical network architecture with nanosecond reconfiguration called RAMP.
RAMP supports large-scale distributed and parallel computing systems (12.8 Tbps per node for up to 65,536 nodes).
arXiv Detail & Related papers (2022-11-28T11:24:51Z)
- Vau da muntanialas: Energy-efficient multi-die scalable acceleration of RNN inference [18.50014427283814]
We present Muntaniala, an RNN accelerator architecture for LSTM inference with a silicon-measured energy efficiency of 3.25 TOP/s/W.
The scalable design of Muntaniala allows running large RNN models by combining multiple tiles in a systolic array.
We show a phoneme error rate (PER) drop of approximately 3% with respect to floating-point (FP) on a 3L-384NH-123NI LSTM network.
arXiv Detail & Related papers (2022-02-14T09:21:16Z)
- Training Large Neural Networks with Constant Memory using a New Execution Algorithm [0.5424799109837065]
We introduce a new relay-style execution technique called L2L (layer-to-layer).
L2L is able to fit models of up to 50 billion parameters on a machine with a single 16GB V100 and 512GB of CPU memory.
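A minimal PyTorch sketch of the relay-style idea under assumed details (this is not the paper's L2L implementation): weights stay in host memory and only the active layer is staged on the accelerator for its forward pass.

```python
# Sketch of layer-to-layer execution: device memory holds one layer plus
# activations at a time, while the full model stays in host memory.
import torch, torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
layers = [nn.Linear(1024, 1024) for _ in range(12)]   # resident on CPU

x = torch.randn(8, 1024).to(device)
with torch.no_grad():
    for layer in layers:
        layer.to(device)          # stage this layer's weights on the device
        x = layer(x)              # compute with only one layer resident
        layer.to("cpu")           # release device memory for the next layer
print(x.shape)
```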
arXiv Detail & Related papers (2020-02-13T17:29:47Z)