Related papers: Intelligence per Watt: Measuring Intelligence Efficiency of Local AI

Intelligence per Watt: Measuring Intelligence Efficiency of Local AI

URL: http://arxiv.org/abs/2511.07885v2
Date: Fri, 14 Nov 2025 00:53:12 GMT
Title: Intelligence per Watt: Measuring Intelligence Efficiency of Local AI
Authors: Jon Saad-Falcon, Avanika Narayan, Hakki Orhun Akengin, J. Wes Griffin, Herumb Shandilya, Adrian Gamarra Lafuente, Medhya Goel, Rebecca Joseph, Shlok Natarajan, Etash Kumar Guha, Shang Zhu, Ben Athiwaratkun, John Hennessy, Azalia Mirhoseini, Christopher Ré,
Abstract summary: Large language model (LLM) queries are predominantly processed by frontier models in centralized cloud infrastructure.<n>Two advances enable us to rethink this paradigm: small LMs now achieve competitive performance to frontier models on many tasks.<n>This raises the question: can local inference via redistribute demand from centralized infrastructure?<n>We propose intelligence per watt (IPW), task accuracy divided by unit of power, as a metric for assessing capability and efficiency of local inference.
Score: 39.049258055931524
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language model (LLM) queries are predominantly processed by frontier models in centralized cloud infrastructure. Rapidly growing demand strains this paradigm, and cloud providers struggle to scale infrastructure at pace. Two advances enable us to rethink this paradigm: small LMs (<=20B active parameters) now achieve competitive performance to frontier models on many tasks, and local accelerators (e.g., Apple M4 Max) run these models at interactive latencies. This raises the question: can local inference viably redistribute demand from centralized infrastructure? Answering this requires measuring whether local LMs can accurately answer real-world queries and whether they can do so efficiently enough to be practical on power-constrained devices (i.e., laptops). We propose intelligence per watt (IPW), task accuracy divided by unit of power, as a metric for assessing capability and efficiency of local inference across model-accelerator pairs. We conduct a large-scale empirical study across 20+ state-of-the-art local LMs, 8 accelerators, and a representative subset of LLM traffic: 1M real-world single-turn chat and reasoning queries. For each query, we measure accuracy, energy, latency, and power. Our analysis reveals $3$ findings. First, local LMs can accurately answer 88.7% of single-turn chat and reasoning queries with accuracy varying by domain. Second, from 2023-2025, IPW improved 5.3x and local query coverage rose from 23.2% to 71.3%. Third, local accelerators achieve at least 1.4x lower IPW than cloud accelerators running identical models, revealing significant headroom for optimization. These findings demonstrate that local inference can meaningfully redistribute demand from centralized infrastructure, with IPW serving as the critical metric for tracking this transition. We release our IPW profiling harness for systematic intelligence-per-watt benchmarking.

Related papers

Sustainable LLM Inference using Context-Aware Model Switching [0.9455980760111498]
A key limitation in current AI deployments is the reliance on a one-size-fits-all inference strategy.<n>We propose a context-aware model switching approach that dynamically selects an appropriate language model based on query complexity.<n> Experimental results show that the model switching approach can reduce energy consumption by up to 67.5% compared to always using the largest model.
arXiv Detail & Related papers (2026-02-25T03:42:12Z)
Nemotron-Flash: Towards Latency-Optimal Hybrid Small Language Models [97.55009021098554]
This work aims to identify the key determinants of SLMs' real-device latency and offer generalizable principles and methodologies for SLM design and training.<n>We introduce a new family of hybrid SLMs, called Nemotron-Flash, which significantly advances the accuracy-efficiency frontier of state-of-the-art SLMs.
arXiv Detail & Related papers (2025-11-24T08:46:36Z)
Agentic World Modeling for 6G: Near-Real-Time Generative State-Space Reasoning [70.56067503630486]
We argue that sixth-generation (6G) intelligence is not fluent token prediction but calibrated the capacity to imagine and choose.<n>We show that WM-MS3M cuts mean absolute error (MAE) by 1.69% versus MS3M with 32% fewer parameters and similar latency, and achieves 35-80% lower root mean squared error (RMSE) than attention/hybrid baselines with 2.3-4.1x faster inference.
arXiv Detail & Related papers (2025-11-04T17:22:22Z)
DOVA-PATBM: An Intelligent, Adaptive, and Scalable Framework for Optimizing Large-Scale EV Charging Infrastructure [3.74242093516574]
We present DOVA-PATBM (Deployment optimisation with Voronoi-oriented, Adaptive, POI-Aware Temporal Behaviour Model), a geo-computational framework that unifies contexts in a single pipeline.<n>The methodises heterogeneous data ( centrality, population, night lights, POIs, and feeder lines) onto a hierarchical H3 grid.<n>It infers intersection importance with a zone-normalised graph neural network model, and overlays a Voronoi tessellation that guarantees at least one five-port DC fast charger within every 30 km radius.
arXiv Detail & Related papers (2025-06-18T09:15:18Z)
Energy-Efficient Deep Learning for Traffic Classification on Microcontrollers [1.3124513975412255]
We present a practical deep learning (DL) approach for energy-efficient traffic classification on resource-limited microcontrollers.<n>We develop a lightweight 1D-CNN, optimized via hardware-aware neural architecture search (HW-NAS), which achieves 96.59% accuracy on the ISCX VPN-Non-VPN dataset.<n>We evaluate real-world inference performance on two microcontrollers.
arXiv Detail & Related papers (2025-06-12T16:10:22Z)
EfficientLLM: Efficiency in Large Language Models [64.3537131208038]
Large Language Models (LLMs) have driven significant progress, yet their growing counts and context windows incur prohibitive compute, energy, and monetary costs.<n>We introduce EfficientLLM, a novel benchmark and the first comprehensive empirical study evaluating efficiency techniques for LLMs at scale.
arXiv Detail & Related papers (2025-05-20T02:27:08Z)
Distributed Inference and Fine-tuning of Large Language Models Over The Internet [91.00270820533272]
Large language models (LLMs) are useful in many NLP tasks and become more capable with size. These models require high-end hardware, making them inaccessible to most researchers. We develop fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices to maximize the total system throughput.
arXiv Detail & Related papers (2023-12-13T18:52:49Z)
Edge Computing-Enabled Road Condition Monitoring: System Development and Evaluation [5.296678854362804]
Real-time pavement condition monitoring provides highway agencies with timely and accurate information. Existing technologies rely heavily on manual data processing, are expensive and therefore, difficult to scale for frequent, networklevel pavement condition monitoring. This study proposes a solution that capitalizes on the widespread availability of affordable Micro Electro-Mechanical System (MEMS) sensors, edge computing and internet connection capabilities of microcontrollers.
arXiv Detail & Related papers (2023-10-09T00:55:41Z)
Federated Learning for Energy-limited Wireless Networks: A Partial Model Aggregation Approach [79.59560136273917]
limited communication resources, bandwidth and energy, and data heterogeneity across devices are main bottlenecks for federated learning (FL) We first devise a novel FL framework with partial model aggregation (PMA) The proposed PMA-FL improves 2.72% and 11.6% accuracy on two typical heterogeneous datasets.
arXiv Detail & Related papers (2022-04-20T19:09:52Z)
Multi-Agent Meta-Reinforcement Learning for Self-Powered and Sustainable Edge Computing Systems [87.4519172058185]
An effective energy dispatch mechanism for self-powered wireless networks with edge computing capabilities is studied. A novel multi-agent meta-reinforcement learning (MAMRL) framework is proposed to solve the formulated problem. Experimental results show that the proposed MAMRL model can reduce up to 11% non-renewable energy usage and by 22.4% the energy cost.
arXiv Detail & Related papers (2020-02-20T04:58:07Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.