Green MLOps: Closed-Loop, Energy-Aware Inference with NVIDIA Triton, FastAPI, and Bio-Inspired Thresholding
- URL: http://arxiv.org/abs/2601.04250v1
- Date: Tue, 06 Jan 2026 15:50:11 GMT
- Title: Green MLOps: Closed-Loop, Energy-Aware Inference with NVIDIA Triton, FastAPI, and Bio-Inspired Thresholding
- Authors: Mustapha Hamdi, Mourad Jabou
- Abstract summary: Bio-inspired framework maps protein-folding energy basins to inference cost landscapes. A request is admitted only when the expected utility-to-energy trade-off is favorable. Results connect biophysical energy models to Green MLOps and offer a practical, auditable basis for closed-loop energy-aware inference in production.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Energy efficiency is a first-order concern in AI deployment, as long-running inference can exceed training in cumulative carbon impact. We propose a bio-inspired framework that maps protein-folding energy basins to inference cost landscapes and controls execution via a decaying, closed-loop threshold. A request is admitted only when the expected utility-to-energy trade-off is favorable (high confidence/utility at low marginal energy and congestion), biasing operation toward the first acceptable local basin rather than pursuing costly global minima. We evaluate DistilBERT and ResNet-18 served through FastAPI with ONNX Runtime and NVIDIA Triton on an RTX 4000 Ada GPU. Our ablation study reveals that the bio-controller reduces processing time by 42% compared to standard open-loop execution (0.29s vs 0.50s on A100 test set), with minimal accuracy degradation (<0.5%). Furthermore, we establish the efficiency boundaries between lightweight local serving (ORT) and managed batching (Triton). The results connect biophysical energy models to Green MLOps and offer a practical, auditable basis for closed-loop energy-aware inference in production.
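The admission rule described in the abstract can be illustrated with a minimal sketch. The abstract does not give the controller's equations, so everything below is an assumption made for illustration: the class name `DecayingThresholdController`, the exponential decay schedule, the score `confidence / (normalized energy * (1 + congestion))`, and the proportional feedback on energy overshoot are hypothetical and are not the authors' formulation.

```python
import math
from dataclasses import dataclass


@dataclass
class DecayingThresholdController:
    """Admit a request when its utility-to-energy score clears a decaying threshold.

    Hypothetical sketch: parameter names, defaults, and update rules are
    assumptions for illustration, not values taken from the paper.
    """

    tau0: float = 1.0             # initial (strict) threshold
    tau_min: float = 0.2          # floor that the threshold decays toward
    decay: float = 0.1            # exponential decay rate per observed request
    gain: float = 0.5             # closed-loop gain applied to energy overshoot
    energy_budget_j: float = 5.0  # target energy per request, in joules
    step: int = 0                 # number of completed requests observed
    feedback: float = 0.0         # current closed-loop correction

    def threshold(self) -> float:
        # Decaying base term plus a correction driven by measured energy.
        base = self.tau_min + (self.tau0 - self.tau_min) * math.exp(-self.decay * self.step)
        return base + self.feedback

    def admit(self, confidence: float, est_energy_j: float, congestion: float) -> bool:
        # Score rises with confidence and falls with marginal energy and congestion.
        normalized_energy = est_energy_j / self.energy_budget_j
        score = confidence / (normalized_energy * (1.0 + congestion))
        return score >= self.threshold()

    def observe(self, measured_energy_j: float) -> None:
        # Close the loop: raise the threshold when measured energy exceeds the budget.
        self.step += 1
        overshoot = (measured_energy_j - self.energy_budget_j) / self.energy_budget_j
        self.feedback = max(0.0, self.gain * overshoot)


ctrl = DecayingThresholdController()
print(ctrl.admit(confidence=0.9, est_energy_j=2.5, congestion=0.0))  # cheap request
print(ctrl.admit(confidence=0.9, est_energy_j=5.0, congestion=0.5))  # costly, congested request
```

In this sketch the threshold starts strict and relaxes as requests are observed, while the feedback term tightens it again whenever measured energy exceeds the budget. In a real deployment the energy figures would have to come from a power measurement source, and the score would need calibration against the served models.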
Related papers
- Towards Green AI: Decoding the Energy of LLM Inference in Software Development [46.879983975894135]
AI-assisted tools are increasingly integrated into software development, but their reliance on large language models (LLMs) introduces substantial computational and energy costs. We conduct a phase-level analysis of LLM inference energy consumption, distinguishing between the (1) prefill, where the model processes the input and builds internal representations, and (2) decoding, where output tokens are generated using the stored state.
arXiv Detail & Related papers (2026-02-05T14:38:19Z)
- Energy Use of AI Inference: Efficiency Pathways and Test-Time Compute [4.8312457834136175]
Non-production estimates and assumptions can overstate energy use by 4-20x. We quantify achievable efficiency gains at the model, serving platform, and hardware levels. We estimate the baseline daily energy use of a deployment serving 1 billion queries to be 0.8 GWh/day.
arXiv Detail & Related papers (2025-09-24T15:32:01Z)
- EfficientLLM: Efficiency in Large Language Models [64.3537131208038]
Large Language Models (LLMs) have driven significant progress, yet their growing counts and context windows incur prohibitive compute, energy, and monetary costs. We introduce EfficientLLM, a novel benchmark and the first comprehensive empirical study evaluating efficiency techniques for LLMs at scale.
arXiv Detail & Related papers (2025-05-20T02:27:08Z)
- How Hungry is AI? Benchmarking Energy, Water, and Carbon Footprint of LLM Inference [0.0]
This paper introduces a novel infrastructure-aware benchmarking framework for quantifying the environmental footprint of AI inference across 30 state-of-the-art models as deployed in commercial data centers. Our results show that o3 and DeepSeek-R1 emerge as the most energy-intensive models, consuming over 33 Wh per long prompt, more than 70 times the consumption of GPT-4.1 nano, and that Claude-3.7 Sonnet ranks highest in eco-efficiency. These findings illustrate a growing paradox: Although AI is becoming cheaper and faster, its global adoption drives disproportionate resource consumption.
arXiv Detail & Related papers (2025-05-14T17:47:00Z)
- Low-cost Embedded Breathing Rate Determination Using 802.15.4z IR-UWB Hardware for Remote Healthcare [2.6066253940276347]
We propose a convolutional neural network (CNN) specifically adapted to predict breathing rates from ultra-wideband (UWB) channel impulse response (CIR) data. We show it is feasible to deploy the algorithm on an nRF52840 system-on-chip requiring only 46 KB of memory and operating with an inference time of only 192 ms.
arXiv Detail & Related papers (2025-04-03T07:54:25Z)
- Ecomap: Sustainability-Driven Optimization of Multi-Tenant DNN Execution on Edge Servers [0.44784055850794474]
This paper introduces Ecomap, a framework that adjusts the maximum power threshold of edge devices based on real-time carbon intensity. Experimental results using NVIDIA Jetson AGX Xavier demonstrate that Ecomap reduces carbon emissions by an average of 30%.
arXiv Detail & Related papers (2025-03-06T06:56:51Z)
- A Safe Genetic Algorithm Approach for Energy Efficient Federated Learning in Wireless Communication Networks [53.561797148529664]
Federated Learning (FL) has emerged as a decentralized technique, where contrary to traditional centralized approaches, devices perform a model training in a collaborative manner.
Despite the existing efforts made in FL, its environmental impact is still under investigation, since several critical challenges regarding its applicability to wireless networks have been identified.
The current work proposes a Genetic Algorithm (GA) approach, targeting the minimization of both the overall energy consumption of an FL process and any unnecessary resource utilization.
arXiv Detail & Related papers (2023-06-25T13:10:38Z)
- Ultra-low Power Deep Learning-based Monocular Relative Localization Onboard Nano-quadrotors [64.68349896377629]
This work presents a novel autonomous end-to-end system that addresses the monocular relative localization, through deep neural networks (DNNs), of two peer nano-drones.
To cope with the ultra-constrained nano-drone platform, we propose a vertically-integrated framework, including dataset augmentation, quantization, and system optimizations.
Experimental results show that our DNN can precisely localize a 10cm-size target nano-drone by employing only low-resolution monochrome images, up to 2m distance.
arXiv Detail & Related papers (2023-03-03T14:14:08Z)
- BottleFit: Learning Compressed Representations in Deep Neural Networks for Effective and Efficient Split Computing [48.11023234245863]
We propose a new framework called BottleFit, which includes a novel training strategy to achieve high accuracy even with strong compression rates.
BottleFit achieves 77.1% data compression with up to 0.6% accuracy loss on ImageNet dataset.
We show that BottleFit decreases power consumption and latency respectively by up to 49% and 89% with respect to (w.r.t.) local computing and by 37% and 55% w.r.t. edge offloading.
arXiv Detail & Related papers (2022-01-07T22:08:07Z)
- Energy-Efficient Model Compression and Splitting for Collaborative Inference Over Time-Varying Channels [52.60092598312894]
We propose a technique to reduce the total energy bill at the edge device by utilizing model compression and time-varying model split between the edge and remote nodes.
Our proposed solution results in minimal energy consumption and $CO_2$ emission compared to the considered baselines.
arXiv Detail & Related papers (2021-06-02T07:36:27Z)
- ECO: Enabling Energy-Neutral IoT Devices through Runtime Allocation of Harvested Energy [0.8774604259603302]
We present a runtime-based energy-allocation framework to optimize the utility of the target device under energy constraints.
The proposed framework uses an efficient iterative algorithm to compute initial energy allocations at the beginning of a day.
We evaluate this framework using solar and motion energy harvesting modalities and American Time Use Survey data from 4772 different users.
arXiv Detail & Related papers (2021-02-26T17:21:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.