Sustainable LLM Inference using Context-Aware Model Switching
- URL: http://arxiv.org/abs/2602.22261v1
- Date: Wed, 25 Feb 2026 03:42:12 GMT
- Title: Sustainable LLM Inference using Context-Aware Model Switching
- Authors: Yuvarani, Akashdeep Singh, Zahra Fathanah, Salsabila Harlen, Syeikha Syafura Al-Zahra binti Zahari, Hema Subramaniam
- Abstract summary: A key limitation in current AI deployments is the reliance on a one-size-fits-all inference strategy. We propose a context-aware model switching approach that dynamically selects an appropriate language model based on query complexity. Experimental results show that the model switching approach can reduce energy consumption by up to 67.5% compared to always using the largest model.
- Score: 0.9455980760111498
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models have become central to many AI applications, but their growing energy consumption raises serious sustainability concerns. A key limitation in current AI deployments is the reliance on a one-size-fits-all inference strategy: most systems route every request to the same large model regardless of task complexity, leading to substantial and unnecessary energy waste. To address this issue, we propose a context-aware model switching approach that dynamically selects an appropriate language model based on query complexity. The proposed system, Context-Aware Model Switching for Energy-Efficient LLM Inference, combines caching for repeated queries, rule-based complexity scoring for fast and explainable decisions, machine learning classification to capture semantic intent, and a user-adaptive component that learns from interaction patterns over time. The architecture was evaluated on real conversation workloads with three open-source language models of different computational cost (Gemma3 1B, Gemma3 4B, and Qwen3 4B), measuring energy consumption (via NVML GPU power telemetry), response latency, routing accuracy, and output quality (BERTScore F1) to reflect real-world usage conditions. Experimental results show that the model switching approach can reduce energy consumption by up to 67.5% compared to always using the largest model while maintaining a response quality of 93.6%. Response time for simple queries also improved by approximately 68%. These results show that model-switching inference offers a practical and scalable path toward more energy-efficient and sustainable AI systems, demonstrating that significant efficiency gains can be achieved without major sacrifices in response quality.
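To make the routing cascade concrete, here is a minimal Python sketch of the decision flow described in the abstract: a cache lookup for repeated queries, a rule-based complexity score for fast decisions, a semantic-classifier fallback for the ambiguous middle band, and an NVML power read for energy telemetry. All function names, thresholds, and model tags are illustrative assumptions; the paper does not publish its implementation, and the learned classifier is stubbed out.

```python
# Minimal sketch of the cache -> rules -> classifier routing cascade.
# All names, thresholds, and model tags below are illustrative
# assumptions; the paper does not publish this exact implementation.
import pynvml

MODEL_TIERS = ["gemma3:1b", "gemma3:4b", "qwen3:4b"]  # small -> large
_cache: dict[str, str] = {}  # populated with responses after generation

def rule_complexity(query: str) -> float:
    """Cheap, explainable heuristic score in [0, 1] (assumed features)."""
    score = min(len(query.split()) / 100.0, 0.5)  # length contribution
    hard_cues = ("prove", "derive", "step by step", "explain why", "code")
    score += 0.5 if any(cue in query.lower() for cue in hard_cues) else 0.0
    return min(score, 1.0)

def classify_intent(query: str) -> float:
    """Stub for the learned semantic classifier; a real system might run
    a small encoder here. Falls back to the heuristic for illustration."""
    return rule_complexity(query)

def route(query: str) -> str:
    key = query.strip().lower()
    if key in _cache:            # 1) repeated query: answer without inference
        return "cache"
    c = rule_complexity(query)   # 2) fast, explainable rule-based score
    if c < 0.25:
        return MODEL_TIERS[0]
    if c > 0.75:
        return MODEL_TIERS[2]
    # 3) ambiguous middle band: defer to the semantic classifier
    return MODEL_TIERS[2] if classify_intent(query) > 0.5 else MODEL_TIERS[1]

def gpu_power_watts(index: int = 0) -> float:
    """Instantaneous GPU board power via NVML (the paper's telemetry source)."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(index)
    return pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
```

Per-response energy would then be approximated by sampling `gpu_power_watts` during generation and integrating over time; the user-adaptive component, which would adjust these thresholds from interaction history, is omitted here.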
Related papers
- Understanding Efficiency: Quantization, Batching, and Serving Strategies in LLM Energy Use [4.513690948889834]
Large Language Models (LLMs) are increasingly deployed in production, shifting the burden of computational resources and energy demand from training to inference. We show how system-level design choices can lead to orders-of-magnitude differences in energy consumption for the same model. Our findings motivate phase-aware energy profiling and system-level optimizations for greener AI services.
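As a rough illustration of what phase-aware profiling might look like, the sketch below samples NVML board power in a background thread and integrates it over a single inference phase. The `model.prefill`/`model.decode` calls in the usage comment are hypothetical, and the paper's actual methodology may differ.

```python
# Phase-aware energy sketch: sample NVML power in a background thread
# and integrate it separately over the prefill and decode phases.
# The model calls in the usage comment are hypothetical placeholders.
import threading
import time
import pynvml

def measure_energy_joules(fn, *args, interval_s: float = 0.05):
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    samples, stop = [], threading.Event()

    def sampler():
        while not stop.is_set():
            samples.append(pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0)
            time.sleep(interval_s)

    t = threading.Thread(target=sampler, daemon=True)
    t.start()
    result = fn(*args)   # run one phase of inference under the sampler
    stop.set()
    t.join()
    # Riemann-sum approximation of the energy integral: sum of power
    # samples (W) times the sampling interval (s) gives joules.
    return result, sum(samples) * interval_s

# Usage (hypothetical model interface): profile each phase separately.
# _, e_prefill = measure_energy_joules(model.prefill, prompt)
# _, e_decode  = measure_energy_joules(model.decode, max_tokens)
```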
arXiv Detail & Related papers (2026-01-29T22:16:25Z) - RE-LLM: Integrating Large Language Models into Renewable Energy Systems [0.7466390172678973]
We propose the Renewable Energy Large Language Model (RE-LLM), a hybrid framework that integrates Large Language Models (LLMs) directly into the energy system modeling workflow. RE-LLM combines three core elements: (i) optimization-based scenario exploration, (ii) machine learning surrogates that accelerate computationally intensive simulations, and (iii) LLM-powered natural language generation that translates complex results into clear, stakeholder-oriented explanations. It enables interactive, multilingual, and accessible engagement with future energy pathways, ultimately bridging the final gap between data-driven analysis and actionable decision-making for sustainable transitions.
arXiv Detail & Related papers (2025-12-01T08:10:39Z) - Comparing energy consumption and accuracy in text classification inference [0.9208007322096533]
This study systematically evaluates the trade-offs between model accuracy and energy consumption in text classification inference. The best-performing model in terms of accuracy can also be energy-efficient, while larger LLMs tend to consume significantly more energy with lower classification accuracy.
arXiv Detail & Related papers (2025-08-19T18:00:08Z) - Lightweight Task-Oriented Semantic Communication Empowered by Large-Scale AI Models [66.57755931421285]
Large-scale artificial intelligence (LAI) models pose significant challenges for real-time communication scenarios. This paper proposes utilizing knowledge distillation (KD) techniques to extract and condense knowledge from LAI models. We propose a fast distillation method featuring a pre-stored compression mechanism that eliminates the need for repetitive inference.
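For context, a standard temperature-scaled knowledge-distillation loss looks like the sketch below. This is the generic KD formulation, not the paper's specific fast-distillation or pre-stored compression mechanism.

```python
# Generic temperature-scaled knowledge-distillation loss (the standard
# KD formulation, shown for context; not the paper's specific method).
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft-target term: KL divergence between temperature-softened
    # distributions, scaled by T^2 to keep gradient magnitudes stable.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)  # ground-truth term
    return alpha * soft + (1.0 - alpha) * hard
```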
arXiv Detail & Related papers (2025-06-16T08:42:16Z) - ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding [71.654781631463]
ReAgent-V is a novel agentic video understanding framework. It integrates efficient frame selection with real-time reward generation during inference. Extensive experiments on 12 datasets demonstrate significant gains in generalization and reasoning.
arXiv Detail & Related papers (2025-06-02T04:23:21Z) - Self-Route: Automatic Mode Switching via Capability Estimation for Efficient Reasoning [36.470695895695044]
Self-Route is a dynamic reasoning framework that automatically selects between general and reasoning modes. We show that Self-Route achieves comparable accuracy to reasoning models while reducing token consumption by 30-55%.
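A minimal sketch of such a mode-switching gate follows; the capability estimator here (mean token log-probability of a short draft) and the `model.generate` interface are assumptions for illustration, not Self-Route's actual method.

```python
# Illustrative mode-switching gate: estimate whether the cheap "general"
# mode suffices before paying for the long "reasoning" mode. The
# confidence proxy and model interface are hypothetical assumptions.
def route_mode(query: str, model, threshold: float = -0.9) -> str:
    # Generate a short draft and use its mean token log-probability as
    # a crude capability estimate for this query.
    draft = model.generate(query, max_tokens=32, return_logprobs=True)
    confidence = sum(draft.logprobs) / max(len(draft.logprobs), 1)
    return "general" if confidence >= threshold else "reasoning"
```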
arXiv Detail & Related papers (2025-05-27T03:18:31Z) - Learning to Rank Chain-of-Thought: Using a Small Model [77.75522308463667]
This paper introduces the Energy Outcome Reward Model (EORM), a highly efficient, lightweight verifier designed to address this challenge. EORM uses an energy-based framework to rank Chain-of-Thought (CoT) solutions, learning to distinguish correct from incorrect reasoning using only simple outcome labels. With only 55M parameters, over 127 times smaller than typical reward models, EORM boosts the accuracy of Llama 3 8B to 90.7% on GSM8k and 63.7% on MATH.
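Conceptually, an outcome-reward verifier reranks sampled chains of thought as in the sketch below; the `verifier` callable is a hypothetical stand-in for the trained 55M-parameter EORM model.

```python
# Best-of-N reranking with an energy-based verifier, in the spirit of
# EORM: the verifier assigns each chain-of-thought a scalar energy, and
# the lowest-energy candidate is returned. `verifier` is a hypothetical
# stand-in for the trained reward model.
def pick_best(candidates: list[str], verifier) -> str:
    # candidates: N sampled CoT solutions for the same problem
    return min(candidates, key=verifier)  # lower energy = better reasoning
```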
arXiv Detail & Related papers (2025-05-21T01:06:29Z) - Task-Oriented Semantic Communication in Large Multimodal Models-based Vehicle Networks [55.32199894495722]
We investigate an LMM-based vehicle AI assistant using a Large Language and Vision Assistant (LLaVA). To reduce computational demands and shorten response time, we optimize LLaVA's image slicing to selectively focus on areas of utmost interest to users. We construct a Visual Question Answering (VQA) dataset for traffic scenarios to evaluate effectiveness.
arXiv Detail & Related papers (2025-05-05T07:18:47Z) - Energy-Aware LLMs: A step towards sustainable AI for downstream applications [0.9012198585960441]
Advanced Large Language Models (LLMs) have revolutionized various fields, including communication networks. LLMs typically require huge computational resources, resulting in very high energy consumption. This study proposes an end-to-end pipeline that investigates the trade-off between energy efficiency and model performance.
arXiv Detail & Related papers (2025-03-22T14:28:29Z) - EfficientLLaVA:Generalizable Auto-Pruning for Large Vision-language Models [64.18350535770357]
We propose an automatic pruning method for large vision-language models to enhance the efficiency of multimodal reasoning. Our approach leverages only a small number of samples to search for the desired pruning policy. We conduct extensive experiments on the ScienceQA, Vizwiz, MM-vet, and LLaVA-Bench datasets for the task of visual question answering.
arXiv Detail & Related papers (2025-03-19T16:07:04Z) - Switchable Decision: Dynamic Neural Generation Networks [98.61113699324429]
We propose a switchable decision to accelerate inference by dynamically assigning resources for each data instance.
Our method incurs less cost during inference while maintaining the same accuracy.
arXiv Detail & Related papers (2024-05-07T17:44:54Z) - Speaker Representation Learning using Global Context Guided Channel and Time-Frequency Transformations [67.18006078950337]
We use the global context information to enhance important channels and recalibrate salient time-frequency locations.
The proposed modules, together with a popular ResNet based model, are evaluated on the VoxCeleb1 dataset.
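The recalibration idea resembles squeeze-and-excitation-style channel gating; a generic PyTorch sketch follows, though the paper's actual global-context modules differ in detail.

```python
# A generic squeeze-and-excitation-style channel recalibration block;
# shown as a familiar analogue, not the paper's exact module.
import torch
import torch.nn as nn

class ChannelRecalibration(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, freq); pool to a global descriptor,
        # map it to per-channel gates in (0, 1), and re-weight channels.
        w = x.mean(dim=(2, 3))            # global average pool -> (B, C)
        w = self.fc(w)[:, :, None, None]  # gates broadcast to (B, C, 1, 1)
        return x * w
```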
arXiv Detail & Related papers (2020-09-02T01:07:29Z)