NeuronMM: High-Performance Matrix Multiplication for LLM Inference on AWS Trainium
- URL: http://arxiv.org/abs/2510.25977v2
- Date: Fri, 31 Oct 2025 01:52:13 GMT
- Title: NeuronMM: High-Performance Matrix Multiplication for LLM Inference on AWS Trainium
- Authors: Dinghong Song, Jierui Xu, Weichu Yang, Pengfei Su, Dong Li
- Abstract summary: We design high-performance matmul, a critical compute kernel, for LLM inference on Trainium. We show that our system largely outperforms the state-of-the-art matmul implemented by AWS on Trainium.
- Score: 4.7520621855466425
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: AI accelerators, customized to AI workloads, provide cost-effective and high-performance solutions for training and inference. Trainium, an AI accelerator recently developed by Amazon Web Services (AWS), provides an attractive option for LLM training and inference through its heterogeneous architecture. However, leveraging the Trainium architecture for high performance can be challenging because of its systolic array architecture and special requirements on data layout. In this paper, we design high-performance matrix multiplication (matmul), a critical compute kernel, for LLM inference on Trainium. We introduce a series of techniques customized to Trainium based on kernel fusion and novel caching strategies to reduce data movement across the software-managed memory hierarchy, maximize SRAM bandwidth, and avoid expensive matrix transpose. Evaluating with nine datasets and four recent LLMs, we show that our system largely outperforms the state-of-the-art matmul implemented by AWS on Trainium: at the level of the matmul kernel, it achieves an average 1.35x speedup (up to 2.22x), which translates to an average 1.66x speedup (up to 2.49x) for end-to-end LLM inference.
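The abstract names its techniques only at a high level. As a rough illustration of two of the ideas it combines, tiling against a software-managed memory hierarchy with a fused epilogue, and consuming one operand in an already-transposed layout so a full transpose is never materialized, here is a minimal NumPy sketch. The tile size, the bias-plus-ReLU epilogue, and the layout convention are illustrative assumptions; this is not the authors' Trainium kernel, which would instead keep operand tiles resident in on-chip SRAM and fuse the epilogue where results leave the accumulator.

```python
import numpy as np

TILE = 128  # illustrative tile size (Trainium's tensor engine is a 128x128
            # systolic array, but this sketch does not model the hardware)

def fused_tiled_matmul(a, b_t, bias):
    """Compute relu(A @ B + bias) with B supplied pre-transposed (b_t = B.T).

    Reading tiles of b_t directly stands in for "transpose avoidance" (the
    full B is never materialized); applying the bias+ReLU epilogue while an
    output tile is still "hot" stands in for "kernel fusion". Pure NumPy toy.
    """
    m, k = a.shape
    n = b_t.shape[0]
    c = np.empty((m, n), dtype=a.dtype)
    for i0 in range(0, m, TILE):
        for j0 in range(0, n, TILE):
            rows, cols = min(TILE, m - i0), min(TILE, n - j0)
            acc = np.zeros((rows, cols), dtype=a.dtype)
            for k0 in range(0, k, TILE):  # accumulate over the contraction dim
                a_tile = a[i0:i0 + rows, k0:k0 + TILE]     # "cached" LHS tile
                bt_tile = b_t[j0:j0 + cols, k0:k0 + TILE]  # RHS tile in B.T layout
                acc += a_tile @ bt_tile.T                  # tile-local .T is a view
            # fused epilogue: bias add + ReLU before the tile is written back
            c[i0:i0 + rows, j0:j0 + cols] = np.maximum(acc + bias[j0:j0 + cols], 0)
    return c

# correctness check against a dense reference
a = np.random.randn(200, 300).astype(np.float32)
b = np.random.randn(300, 250).astype(np.float32)
bias = np.random.randn(250).astype(np.float32)
assert np.allclose(fused_tiled_matmul(a, b.T, bias), np.maximum(a @ b + bias, 0), atol=1e-3)
```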
Related papers
- To 2:4 Sparsity and Beyond: Neuron-level Activation Function to Accelerate LLM Pre-Training [17.090117647151708]
We show that we can leverage hardware-accelerated sparsity to accelerate all matrix multiplications in the Feed Forward Network (FFN). Our recipe relies on sparse training steps to accelerate a large part of the pretraining, combined with regular dense training steps towards the end. Models trained with this approach exhibit the same performance on our quality benchmarks, and can speed up training end-to-end by 1.4 to 1.7x.
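For context, "2:4" sparsity keeps at most two nonzeros in every contiguous group of four weights, the pattern that hardware sparse tensor cores accelerate. A minimal NumPy sketch of the standard magnitude-based 2:4 mask follows; the paper's neuron-level activation function and its training recipe are not reproduced here.

```python
import numpy as np

def prune_2_to_4(w):
    """Zero the 2 smallest-magnitude entries in each group of 4 along the
    last axis, producing a 2:4 structured-sparse weight matrix."""
    assert w.shape[-1] % 4 == 0
    groups = w.reshape(-1, 4)
    # indices of the 2 smallest |w| within each group of 4
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return (groups * mask).reshape(w.shape)

w = np.random.randn(8, 16)
w_sparse = prune_2_to_4(w)
# every group of 4 now has at most 2 nonzeros
assert (np.count_nonzero(w_sparse.reshape(-1, 4), axis=1) <= 2).all()
```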
arXiv Detail & Related papers (2026-02-05T20:43:24Z)
- RollArt: Scaling Agentic RL Training via Disaggregated Infrastructure [49.88201789074532]
Agentic Reinforcement Learning (RL) enables Large Language Models (LLMs) to perform autonomous decision-making and long-term planning. We present RollArc, a distributed system designed to maximize throughput for multi-task agentic RL on disaggregated infrastructure.
arXiv Detail & Related papers (2025-12-27T11:14:23Z)
- AIE4ML: An End-to-End Framework for Compiling Neural Networks for the Next Generation of AMD AI Engines [3.4381029715186844]
AIE4ML is a framework for automatically converting AI models into optimized firmware targeting AIE-ML generation devices. We achieve up to 98.6% efficiency relative to the single-kernel baseline. With evaluations across real-world model topologies, we demonstrate that AIE4ML delivers GPU-class throughput under microsecond latency constraints.
arXiv Detail & Related papers (2025-12-17T20:13:05Z)
- How to Train Your LLM Web Agent: A Statistical Diagnosis [96.86317871461834]
We present the first statistically grounded study on compute allocation for LLM web-agent post-training. Our approach uses a two-stage pipeline, training a Llama 3.1 8B student to imitate a Llama 3.3 70B teacher via supervised fine-tuning (SFT) and on-policy reinforcement learning. Our results show that combining SFT with on-policy RL consistently outperforms either approach alone on both WorkArena and MiniWob++.
arXiv Detail & Related papers (2025-07-05T17:12:33Z)
- Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition [95.54406667705999]
Pangu Embedded is an efficient Large Language Model (LLM) reasoner developed on Ascend Neural Processing Units (NPUs). It addresses the significant computational costs and inference latency challenges prevalent in existing reasoning-optimized LLMs. It delivers rapid responses and state-of-the-art reasoning quality within a single, unified model architecture.
arXiv Detail & Related papers (2025-05-28T14:03:02Z)
- AutoHete: An Automatic and Efficient Heterogeneous Training System for LLMs [68.99086112477565]
Transformer-based large language models (LLMs) have demonstrated exceptional capabilities in sequence modeling and text generation. Existing heterogeneous training methods significantly expand the scale of trainable models but introduce substantial communication overheads and CPU workloads. We propose AutoHete, an automatic and efficient heterogeneous training system compatible with both single-GPU and multi-GPU environments.
arXiv Detail & Related papers (2025-02-27T14:46:22Z)
- Tackling the Dynamicity in a Production LLM Serving System with SOTA Optimizations via Hybrid Prefill/Decode/Verify Scheduling on Efficient Meta-kernels [12.77187564450236]
We introduce XY-Serve, a versatile, Ascend-native, end-to-end production large language model (LLM) serving system. The core idea is an abstraction mechanism that smooths out workload variability by decomposing computations into fine-grained meta primitives. For GEMM, we introduce a virtual padding scheme that adapts to dynamic shape changes while using highly efficient GEMM primitives with assorted fixed tile sizes.
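The virtual-padding idea can be sketched generically: round dynamic GEMM dimensions up to a multiple of a supported tile size, run the fixed-tile primitive, and slice the padding off the result. A NumPy toy version under that assumption follows; the production system presumably avoids materializing padded buffers, which this sketch does not.

```python
import numpy as np

TILE = 64  # illustrative fixed tile size the efficient GEMM primitive expects

def padded_gemm(a, b):
    """GEMM for dynamic shapes via padding to fixed tile multiples.

    Pads (M, K, N) up to multiples of TILE, multiplies the padded operands
    (standing in for a fixed-tile GEMM primitive), and slices the result
    back to the true shape. Zero padding leaves the valid region unchanged.
    """
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    pm, pk, pn = -m % TILE, -k % TILE, -n % TILE
    a_pad = np.pad(a, ((0, pm), (0, pk)))
    b_pad = np.pad(b, ((0, pk), (0, pn)))
    c_pad = a_pad @ b_pad  # placeholder for the fixed-tile primitive
    return c_pad[:m, :n]

a = np.random.randn(100, 70)
b = np.random.randn(70, 130)
assert np.allclose(padded_gemm(a, b), a @ b)
```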
arXiv Detail & Related papers (2024-12-24T02:27:44Z)
- EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE that surpasses existing parallelism schemes. Our results demonstrate up to a 52.4% improvement in prefill throughput compared to existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z)
- Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers [16.253898272659242]
State-of-the-art results in large language models (LLMs) often rely on scale, which becomes computationally expensive.
Our study focuses on transformer-based LLMs, specifically targeting the computationally intensive feedforward networks (FFNs).
We show that wide and structured networks can utilize training FLOPs more efficiently, with fewer parameters and lower loss than dense models at their optimal trade-off.
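One concrete instance of a structured feedforward layer is a low-rank factorization of each FFN projection; the sketch below shows the parameter saving for hypothetical sizes. The paper evaluates several structured families, so treat this as a generic example rather than its specific design.

```python
import numpy as np

def ffn_dense(x, w1, w2):
    """Standard two-matrix FFN block: up-project, nonlinearity, down-project."""
    return np.maximum(x @ w1, 0) @ w2  # ReLU in place of GELU for brevity

def ffn_low_rank(x, u1, v1, u2, v2):
    """Same block with each weight replaced by a rank-r factorization W ~ U @ V."""
    h = np.maximum((x @ u1) @ v1, 0)
    return (h @ u2) @ v2

d, d_ff, r = 512, 2048, 128                # hypothetical sizes
dense_params = 2 * d * d_ff                # w1 and w2
low_rank_params = 2 * (d * r + r * d_ff)   # (u1, v1) and (u2, v2)
print(dense_params, low_rank_params)       # 2097152 vs 655360, about 3.2x fewer

x = np.random.randn(4, d)
u1, v1 = np.random.randn(d, r), np.random.randn(r, d_ff)
u2, v2 = np.random.randn(d_ff, r), np.random.randn(r, d)
print(ffn_low_rank(x, u1, v1, u2, v2).shape)  # (4, 512)
```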
arXiv Detail & Related papers (2024-06-24T08:43:21Z)
- HLAT: High-quality Large Language Model Pre-trained on AWS Trainium [21.183733616898365]
This paper showcases a family of 7B and 70B decoder-only LLMs pre-trained using 4096 AWS Trainium accelerators over 1.8 trillion tokens.
We show that HLAT achieves model quality on par with the baselines of similar model size.
arXiv Detail & Related papers (2024-04-16T15:02:46Z)
- R2GenGPT: Radiology Report Generation with Frozen LLMs [47.72270349660438]
R2GenGPT is a novel solution that aligns visual features with the word embedding space of LLMs.
R2GenGPT attains state-of-the-art (SOTA) performance by training only the lightweight visual alignment module.
Our model trains only 5M parameters to achieve performance close to SOTA levels.
arXiv Detail & Related papers (2023-09-18T14:35:35Z)
- In Situ Framework for Coupling Simulation and Machine Learning with Application to CFD [51.04126395480625]
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamic computations.
As simulations grow, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks.
This work offers a solution by simplifying this coupling and enabling in situ training and inference on heterogeneous clusters.
arXiv Detail & Related papers (2023-06-22T14:07:54Z)
- High-performance, Distributed Training of Large-scale Deep Learning Recommendation Models [18.63017668881868]
Deep learning recommendation models (DLRMs) are used across many business-critical services at Facebook.
In this paper we discuss the SW/HW co-designed solution for high-performance distributed training of large-scale DLRMs.
We demonstrate the capability to train very large DLRMs with up to 12 trillion parameters and show that we can attain a 40x speedup in time to solution over previous systems.
arXiv Detail & Related papers (2021-04-12T02:15:55Z)
- One-step regression and classification with crosspoint resistive memory arrays [62.997667081978825]
High-speed, low-energy computing machines are in demand to enable real-time artificial intelligence at the edge.
One-step learning is supported by simulations of Boston house-price prediction and of training a 2-layer neural network for MNIST digit recognition.
Results are all obtained in one computational step, thanks to the physical, parallel, and analog computing within the crosspoint array.
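Mathematically, the one-step operation the crosspoint array performs is a closed-form least-squares solve; the feedback circuit settles on the pseudoinverse solution in a single analog step. Below is a NumPy sketch of the digital equivalent on a toy 13-feature regression (13 features matching the Boston housing task; the analog physics is not modeled).

```python
import numpy as np

# Toy stand-in for the house-price task: solve w = argmin ||X w - y||^2.
# The crosspoint array reaches this solution in one analog step through
# circuit feedback; the digital equivalent is the pseudoinverse solution.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 13))    # 13 features, as in the Boston dataset
w_true = rng.normal(size=13)
y = X @ w_true + 0.01 * rng.normal(size=100)

w = np.linalg.pinv(X) @ y         # "one-step" closed-form least squares
print(np.max(np.abs(w - w_true))) # small error: weights recovered
```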
arXiv Detail & Related papers (2020-05-05T08:00:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.