RooflineBench: A Benchmarking Framework for On-Device LLMs via Roofline Analysis
- URL: http://arxiv.org/abs/2602.11506v1
- Date: Thu, 12 Feb 2026 03:02:22 GMT
- Title: RooflineBench: A Benchmarking Framework for On-Device LLMs via Roofline Analysis
- Authors: Zhen Bi, Xueshu Chen, Luoyang Sun, Yuhang Yao, Qing Shen, Jungang Lou, Cheng Deng
- Abstract summary: The transition toward localized intelligence through Small Language Models (SLMs) has intensified the need for rigorous performance characterization on resource-constrained edge hardware. We propose a systematic framework that unifies architectural primitives and hardware constraints through the lens of operational intensity (OI). By defining an inference-potential region, we introduce the Relative Inference Potential as a novel metric to compare efficiency differences between Large Language Models (LLMs) on the same hardware substrate.
- Score: 53.90240071275054
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The transition toward localized intelligence through Small Language Models (SLMs) has intensified the need for rigorous performance characterization on resource-constrained edge hardware. However, objectively measuring the theoretical performance ceilings of diverse architectures across heterogeneous platforms remains a formidable challenge. In this work, we propose a systematic framework based on the Roofline model that unifies architectural primitives and hardware constraints through the lens of operational intensity (OI). By defining an inference-potential region, we introduce the Relative Inference Potential as a novel metric to compare efficiency differences between Large Language Models (LLMs) on the same hardware substrate. Extensive empirical analysis across diverse compute tiers reveals that variations in performance and OI are significantly influenced by sequence length. We further identify a critical regression in OI as model depth increases. Additionally, our findings highlight an efficiency trap induced by hardware heterogeneity and demonstrate how structural refinements, such as Multi-head Latent Attention (MLA), can effectively unlock latent inference potential across various hardware substrates. These insights provide actionable directions for hardware-software co-design to align neural structures with physical constraints in on-device intelligence. The released code is available in Appendix C.
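The roofline analysis the abstract describes can be sketched numerically: operational intensity (OI) is FLOPs per byte of memory traffic, and attainable performance is bounded by the minimum of peak compute and bandwidth times OI. The hardware figures and workload numbers below are hypothetical placeholders for illustration, not values taken from the RooflineBench paper.

```python
# Minimal roofline-model sketch for LLM inference on edge hardware.
# All hardware and workload numbers below are hypothetical examples.

def operational_intensity(flops: float, bytes_moved: float) -> float:
    """Operational intensity (OI): FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved

def roofline_performance(oi: float, peak_flops: float, peak_bandwidth: float) -> float:
    """Attainable performance under the roofline bound:
    min(peak compute, memory bandwidth * OI)."""
    return min(peak_flops, peak_bandwidth * oi)

# Hypothetical edge accelerator: 10 TFLOP/s peak compute, 100 GB/s bandwidth.
PEAK_FLOPS = 10e12
PEAK_BW = 100e9

# Ridge point: the OI at which the workload shifts from memory- to compute-bound.
ridge_point = PEAK_FLOPS / PEAK_BW  # 100 FLOPs/byte

# Example decode step: 4 GFLOPs of work while moving 2 GB of weights/KV cache.
oi = operational_intensity(4e9, 2e9)  # 2 FLOPs/byte -> far below the ridge
attainable = roofline_performance(oi, PEAK_FLOPS, PEAK_BW)

print(f"ridge point: {ridge_point:.0f} FLOPs/byte")
print(f"OI: {oi:.1f} FLOPs/byte -> attainable {attainable / 1e9:.0f} GFLOP/s (memory-bound)")
```

In this sketch a low-OI decode step is capped at 200 GFLOP/s by bandwidth, far below the 10 TFLOP/s compute peak, which is the kind of gap the paper's inference-potential region is meant to quantify.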
Related papers
- Architecture-Aware Multi-Design Generation for Repository-Level Feature Addition [53.50448142467294]
RAIM is a multi-design and architecture-aware framework for repository-level feature addition. It shifts away from linear patching by generating multiple diverse implementation designs. Experiments on the NoCode-bench Verified dataset demonstrate that RAIM establishes a new state-of-the-art performance.
arXiv Detail & Related papers (2026-03-02T12:50:40Z) - Towards Worst-Case Guarantees with Scale-Aware Interpretability [58.519943565092724]
Neural networks organize information according to the hierarchical, multi-scale structure of natural data. We propose a unifying research agenda -- scale-aware interpretability -- to develop formal machinery and interpretability tools.
arXiv Detail & Related papers (2026-02-05T01:22:31Z) - AR-MOT: Autoregressive Multi-object Tracking [56.09738000988466]
We propose a novel autoregressive paradigm that formulates MOT as a sequence generation task within a large language model (LLM) framework. This design enables the model to output structured results through flexible sequence construction, without requiring any task-specific heads. To enhance region-level visual perception, we introduce an Object Tokenizer based on a pretrained detector.
arXiv Detail & Related papers (2026-01-05T09:17:28Z) - Quantum-Aware Generative AI for Materials Discovery: A Framework for Robust Exploration Beyond DFT Biases [0.0]
We introduce a quantum-aware generative AI framework for materials discovery. We implement a robust active learning loop that quantifies and targets the divergence between low- and high-fidelity predictions. Our results demonstrate a 3-5x improvement in successfully identifying potentially stable candidates in high-divergence regions.
arXiv Detail & Related papers (2025-12-13T11:17:21Z) - Speed Always Wins: A Survey on Efficient Architectures for Large Language Models [51.817121227562964]
Large Language Models (LLMs) have delivered impressive results in language understanding, generation, and reasoning, and have pushed the capability boundary of multimodal models. Transformer models, as the foundation of modern LLMs, offer a strong baseline with excellent scaling properties. However, the traditional transformer architecture requires substantial computation and poses significant obstacles for large-scale training and practical deployment.
arXiv Detail & Related papers (2025-08-13T14:13:46Z) - ESM: A Framework for Building Effective Surrogate Models for Hardware-Aware Neural Architecture Search [4.9276746621153285]
Hardware-aware Neural Architecture Search (NAS) is one of the most promising techniques for designing efficient Deep Neural Networks (DNNs) for resource-constrained devices. We study different types of surrogate models and highlight their strengths and weaknesses. We present a holistic framework that enables reliable dataset generation and efficient model generation, considering the overall costs of different stages of the model generation pipeline.
arXiv Detail & Related papers (2025-08-02T22:06:39Z) - Scaling Intelligence: Designing Data Centers for Next-Gen Language Models [0.6168147650666682]
Large Language Models (LLMs), such as GPT-4 with 1.8 trillion parameters, demand a fundamental rethinking of data center architecture. Our work provides a comprehensive co-design framework that jointly explores FLOPS, bandwidth, capacity, and multiple network topologies. We quantify the benefits of overlapping compute and communication, leveraging hardware-accelerated collectives, widening the scale-out domain, and increasing memory capacity.
arXiv Detail & Related papers (2025-06-17T22:29:37Z) - Multi-Scale Manifold Alignment for Interpreting Large Language Models: A Unified Information-Geometric Framework [4.935224714809964]
We present Multi-Scale Manifold Alignment (MSMA), an information-geometric framework that decomposes LLM representations into local, intermediate, and global manifolds. We observe consistent hierarchical patterns and find that MSMA improves alignment metrics under multiple estimators. Controlled interventions at different scales yield distinct and architecture-dependent effects on lexical diversity, sentence structure, and discourse coherence.
arXiv Detail & Related papers (2025-05-24T10:25:58Z) - Model Hemorrhage and the Robustness Limits of Large Language Models [119.46442117681147]
Large language models (LLMs) demonstrate strong performance across natural language processing tasks, yet undergo significant performance degradation when modified for deployment. We define this phenomenon as model hemorrhage - performance decline caused by parameter alterations and architectural changes.
arXiv Detail & Related papers (2025-03-31T10:16:03Z) - Semantic Layered Embedding Diffusion in Large Language Models for Multi-Contextual Consistency [0.0]
The Semantic Layered Embedding Diffusion (SLED) mechanism redefines the representation of hierarchical semantics within transformer-based architectures. By introducing a multi-layered diffusion process grounded in spectral analysis, it achieves a complex balance between global and local semantic coherence. Experimental results demonstrate significant improvements in perplexity and BLEU scores, emphasizing the mechanism's ability to adapt effectively across diverse domains.
arXiv Detail & Related papers (2025-01-26T05:17:04Z) - Task-Oriented Real-time Visual Inference for IoVT Systems: A Co-design Framework of Neural Networks and Edge Deployment [61.20689382879937]
Task-oriented edge computing addresses this by shifting data analysis to the edge.
Existing methods struggle to balance high model performance with low resource consumption.
We propose a novel co-design framework to optimize neural network architecture.
arXiv Detail & Related papers (2024-10-29T19:02:54Z) - Learning Interpretable Models Through Multi-Objective Neural Architecture Search [0.9990687944474739]
We propose a framework to optimize for both task performance and "introspectability," a surrogate metric for aspects of interpretability.
We demonstrate that jointly optimizing for task error and introspectability leads to more disentangled and debuggable architectures that perform within error.
arXiv Detail & Related papers (2021-12-16T05:50:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.