Related papers: Hardware Co-Design Scaling Laws via Roofline Modelling for On-Device LLMs

URL: http://arxiv.org/abs/2602.10377v1
Date: Tue, 10 Feb 2026 23:51:00 GMT
Title: Hardware Co-Design Scaling Laws via Roofline Modelling for On-Device LLMs
Authors: Luoyang Sun, Jiwen Jiang, Yifeng Ding, Fengfa Li, Yan Song, Haifeng Zhang, Jian Ying, Lei Ren, Kun Zhan, Wei Chen, Yan Xie, Cheng Deng
Abstract summary: We propose a hardware co-design law that captures model accuracy and inference performance. We empirically evaluate 1,942 candidate architectures on NVIDIA Jetson Orin. Our architecture achieves 19.42% lower perplexity on WikiText-2.
Score: 49.99513618431772
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-Language-Action Models (VLAs) have emerged as a key paradigm of Physical AI and are increasingly deployed in autonomous vehicles, robots, and smart spaces. In these resource-constrained on-device settings, selecting an appropriate large language model (LLM) backbone is a critical challenge: models must balance accuracy with strict inference latency and hardware efficiency constraints. This makes hardware-software co-design a game-changing requirement for on-device LLM deployment, where each hardware platform demands a tailored architectural solution. We propose a hardware co-design law that jointly captures model accuracy and inference performance. Specifically, we model training loss as an explicit function of architectural hyperparameters and characterise inference latency via roofline modelling. We empirically evaluate 1,942 candidate architectures on NVIDIA Jetson Orin, training 170 selected models for 10B tokens each to fit a scaling law relating architecture to training loss. By coupling this scaling law with latency modelling, we establish a direct accuracy-latency correspondence and identify the Pareto frontier for hardware co-designed LLMs. We further formulate architecture search as a joint optimisation over precision and performance, deriving feasible design regions under industrial hardware and application budgets. Our approach reduces architecture selection from months to days. At the same latency as Qwen2.5-0.5B on the target hardware, our co-designed architecture achieves 19.42% lower perplexity on WikiText-2. To our knowledge, this is the first principled and operational framework for hardware co-design scaling laws in on-device LLM deployment. We will make the code and related checkpoints publicly available.
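The coupling the abstract describes, a roofline latency estimate paired with a predicted loss to pick out Pareto-optimal architectures, can be illustrated with a minimal sketch. This is not the authors' code: the `Candidate` fields, the peak throughput and bandwidth figures, and all per-candidate numbers below are hypothetical placeholders chosen only to make the example run, and the loss values stand in for whatever a fitted scaling law would predict.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    flops: float        # floating-point operations per decoded token (placeholder)
    bytes_moved: float  # bytes read/written per decoded token (placeholder)
    loss: float         # predicted training loss (placeholder)


def roofline_latency(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Roofline lower bound in seconds: the slower of compute time and memory time."""
    return max(flops / peak_flops, bytes_moved / peak_bandwidth)


def pareto_frontier(points):
    """Keep (latency, loss, name) tuples that no other point beats on both axes."""
    frontier = []
    best_loss = float("inf")
    # Sorting by latency first means a point survives only if it improves the loss.
    for latency, loss, name in sorted(points):
        if loss < best_loss:
            frontier.append((latency, loss, name))
            best_loss = loss
    return frontier


# Hypothetical device limits, not measured figures for any real hardware.
PEAK_FLOPS = 1e12       # FLOP/s
PEAK_BANDWIDTH = 1e11   # bytes/s

candidates = [
    Candidate("a", flops=1e9, bytes_moved=1e9, loss=3.0),
    Candidate("b", flops=4e9, bytes_moved=2e9, loss=2.8),
    Candidate("c", flops=5e10, bytes_moved=1e9, loss=2.9),
    Candidate("d", flops=1e11, bytes_moved=3e9, loss=2.5),
]

points = [
    (roofline_latency(c.flops, c.bytes_moved, PEAK_FLOPS, PEAK_BANDWIDTH), c.loss, c.name)
    for c in candidates
]
frontier = pareto_frontier(points)
```

In this toy setup, candidate `c` is compute-bound and slower than `b` while also having a higher loss, so it drops off the frontier. A real study would replace the placeholder numbers with per-architecture operation and traffic counts and with losses from a fitted scaling law.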

Related papers

Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models [78.73992315826035]
We introduce Youtu-LLM, a lightweight language model that harmonizes high computational efficiency with native agentic intelligence. Youtu-LLM is pre-trained from scratch to systematically cultivate reasoning and planning capabilities.
arXiv Detail & Related papers (2025-12-31T04:25:11Z)
MOBIUS: Big-to-Mobile Universal Instance Segmentation via Multi-modal Bottleneck Fusion and Calibrated Decoder Pruning [91.90342432541138]
Scaling up model size and training data has advanced foundation models for instance-level perception. High computational cost limits adoption on resource-constrained platforms. We introduce a new benchmark for efficient segmentation on both high-performance computing platforms and mobile devices.
arXiv Detail & Related papers (2025-10-16T18:00:00Z)
Advancing AI-assisted Hardware Design with Hierarchical Decentralized Training and Personalized Inference-Time Optimization [3.29494205026308]
Large Language Models (LLMs) have sparked significant interest in AI-assisted hardware design generation. We identify three critical challenges hindering the development of LLM-assisted hardware design generation. This paper introduces a two-stage framework for AI-assisted hardware design by exploring decentralized training and personalized inference.
arXiv Detail & Related papers (2025-04-21T15:41:28Z)
Neural Architecture Codesign for Fast Physics Applications [0.8692847090818803]
We develop a pipeline to streamline neural architecture codesign for physics applications.<n>We employ neural architecture search and network compression in a two-stage approach to discover hardware efficient models.
arXiv Detail & Related papers (2025-01-09T19:00:03Z)
Task-Oriented Real-time Visual Inference for IoVT Systems: A Co-design Framework of Neural Networks and Edge Deployment [61.20689382879937]
Task-oriented edge computing addresses this by shifting data analysis to the edge. Existing methods struggle to balance high model performance with low resource consumption. We propose a novel co-design framework to optimize neural network architecture.
arXiv Detail & Related papers (2024-10-29T19:02:54Z)
Quasar-ViT: Hardware-Oriented Quantization-Aware Architecture Search for Vision Transformers [56.37495946212932]
Vision transformers (ViTs) have demonstrated their superior accuracy for computer vision tasks compared to convolutional neural networks (CNNs). This work proposes Quasar-ViT, a hardware-oriented quantization-aware architecture search framework for ViTs.
arXiv Detail & Related papers (2024-07-25T16:35:46Z)
Demystifying AI Platform Design for Distributed Inference of Next-Generation LLM models [8.02264001053969]
Large language models (LLMs) have shown remarkable performance across a wide range of applications, often outperforming human experts. With constant innovation in LLM serving optimizations and model architecture evolving at breakneck speed, the hardware requirements to meet Service Level Objectives (SLOs) remain an open research question. We present an analytical tool, GenZ, to efficiently navigate the relationship between diverse LLM model architectures and AI platform design parameters.
arXiv Detail & Related papers (2024-06-03T18:00:50Z)
Mechanistic Design and Scaling of Hybrid Architectures [114.3129802943915]
We identify and test new hybrid architectures constructed from a variety of computational primitives. We experimentally validate the resulting architectures via an extensive compute-optimal and a new state-optimal scaling law analysis. We find MAD synthetics to correlate with compute-optimal perplexity, enabling accurate evaluation of new architectures.
arXiv Detail & Related papers (2024-03-26T16:33:12Z)
Low-Precision Hardware Architectures Meet Recommendation Model Inference at Scale [11.121380180647769]
We share in this paper our search strategies to adapt reference recommendation models to low-precision hardware. We also discuss the design and development of tool chain so as to maintain our models' accuracy throughout their lifespan. We believe these lessons from the trenches promote better co-design between hardware architecture and software engineering.
arXiv Detail & Related papers (2021-05-26T16:42:33Z)
Hardware-Centric AutoML for Mixed-Precision Quantization [34.39845532939529]
Conventional quantization algorithms ignore the different hardware architectures and quantize all the layers in a uniform way. In this paper, we introduce the Hardware-Aware Automated Quantization (HAQ) framework, which leverages reinforcement learning to automatically determine the quantization policy. Our framework effectively reduced the latency by 1.4-1.95x and the energy consumption by 1.9x with negligible loss of accuracy compared with the fixed bitwidth (8 bits) quantization.
arXiv Detail & Related papers (2020-08-11T17:30:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.