Related papers: CALM: A Self-Adaptive Orchestration Approach for QoS-Aware Routing in Small Language Model based Systems

CALM: A Self-Adaptive Orchestration Approach for QoS-Aware Routing in Small Language Model based Systems

URL: http://arxiv.org/abs/2602.03632v1
Date: Tue, 03 Feb 2026 15:20:14 GMT
Title: CALM: A Self-Adaptive Orchestration Approach for QoS-Aware Routing in Small Language Model based Systems
Authors: Hemang Jain, Divyansh Pandey, Karthik Vaidhyanathan,
Abstract summary: CALM is a self-adaptive orchestration mechanism based on MAPE-K.<n>It reduces latency by approximately 40% and energy consumption by 50%.<n>Our evaluation shows that CALM reduces latency by approximately 40% and energy consumption by 50%.
Score: 0.6999740786886536
License: http://creativecommons.org/licenses/by/4.0/
Abstract: AI-enabled systems are subjected to various types of runtime uncertainties, ranging from dynamic workloads, resource requirements, model drift, etc. These uncertainties have a big impact on the overall Quality of Service (QoS). This is particularly true in the case of Language Model (LM) enabled systems where the autoregressive nature of token generation introduces variability in latency, energy usage and response quality. These systems, powered by LLMs, are either resource-intensive (if run on-prem) or raise privacy/cost concerns (if leveraged using APIs). While deploying a Small Language Model (SLM) can be resource-efficient, it often falls short in addressing the diversity and scale of real-world requirements. To this, we argue that, rather than relying on any one SLM, leveraging a coordinated fleet of SLMs, each with specialized strengths can enable systems to dynamically adapt to shifting contexts and workload patterns. However, realizing the full potential of such an approach demands intelligent orchestration and continuous adaptation. To this end, we introduce CALM , a self-adaptive orchestration mechanism based on MAPE-K. Our approach continuously monitors user queries, analyzes the QoS metrics of the SLMs, identifies the optimal SLM to be used, routes the query to the identified SLM and further to enhance the effectiveness and efficiency, leverages caching and scheduling to decide the SLMs to be kept in memory. Our evaluation shows that CALM reduces latency by approximately 40% and energy consumption by 50%, while preserving domain-specific task performance when compared to single-LLM baselines.

Related papers

SLMQuant:Benchmarking Small Language Model Quantization for Practical Deployment [45.23402877397396]
This paper introduces SLMQuant, the first systematic benchmark for evaluating compression techniques when applied to Small Language Models (SLMs)<n>We analyze how state-of-the-art quantization methods perform on SLMs.<n>We identify key factors governing effective SLM quantization and propose actionable design principles for SLM-tailored compression.
arXiv Detail & Related papers (2025-11-17T06:20:33Z)
Quality-of-Service Aware LLM Routing for Edge Computing with Multiple Experts [18.479200918676575]
Large Language Models (LLMs) have demonstrated remarkable capabilities, leading to a significant increase in user demand for LLM services.<n>However, cloud-based LLM services often suffer from high latency, unstable responsiveness, and privacy concerns.<n>This paper proposes a novel deep reinforcement learning-based routing framework for sustained high-quality LLM services.
arXiv Detail & Related papers (2025-08-01T00:45:15Z)
Discrete Tokenization for Multimodal LLMs: A Comprehensive Survey [69.45421620616486]
This work presents the first structured taxonomy and analysis of discrete tokenization methods designed for large language models (LLMs)<n>We categorize 8 representative VQ variants that span classical and modern paradigms and analyze their algorithmic principles, training dynamics, and integration challenges with LLM pipelines.<n>We identify key challenges including codebook collapse, unstable gradient estimation, and modality-specific encoding constraints.
arXiv Detail & Related papers (2025-07-21T10:52:14Z)
Graft: Integrating the Domain Knowledge via Efficient Parameter Synergy for MLLMs [56.76586846269894]
Multimodal Large Language Models (MLLMs) have achieved success across various domains.<n>Despite its importance, the study of knowledge sharing among domain-specific MLLMs remains largely underexplored.<n>We propose a unified parameter integration framework that enables modular composition of expert capabilities.
arXiv Detail & Related papers (2025-06-30T15:07:41Z)
Federated Learning-Enabled Hybrid Language Models for Communication-Efficient Token Transmission [87.68447072141402]
Hybrid Language Models (HLMs) combine the low-latency efficiency of Small Language Models (SLMs) on edge devices with the high accuracy of Large Language Models (LLMs) on centralized servers.<n>We propose FedHLM, a communication-efficient HLM framework that integrates uncertainty-aware inference with Federated Learning (FL)
arXiv Detail & Related papers (2025-06-30T02:56:11Z)
MAS-ZERO: Designing Multi-Agent Systems with Zero Supervision [76.42361936804313]
We introduce MAS-ZERO, the first self-evolved, inference-time framework for automatic MAS design.<n> MAS-ZERO employs meta-level design to iteratively generate, evaluate, and refine MAS configurations tailored to each problem instance.
arXiv Detail & Related papers (2025-05-21T00:56:09Z)
Scaling Autonomous Agents via Automatic Reward Modeling And Planning [52.39395405893965]
Large language models (LLMs) have demonstrated remarkable capabilities across a range of tasks.<n>However, they still struggle with problems requiring multi-step decision-making and environmental feedback.<n>We propose a framework that can automatically learn a reward model from the environment without human annotations.
arXiv Detail & Related papers (2025-02-17T18:49:25Z)
Confident or Seek Stronger: Exploring Uncertainty-Based On-device LLM Routing From Benchmarking to Generalization [61.02719787737867]
Large language models (LLMs) are increasingly deployed and democratized on edge devices.<n>One promising solution is uncertainty-based SLM routing, offloading high-stakes queries to stronger LLMs when resulting in low-confidence responses on SLM.<n>We conduct a comprehensive investigation into benchmarking and generalization of uncertainty-driven routing strategies from SLMs to LLMs over 1500+ settings.
arXiv Detail & Related papers (2025-02-06T18:59:11Z)
LSAQ: Layer-Specific Adaptive Quantization for Large Language Model Deployment [12.80921403367322]
Large Language Models (LLMs) demonstrate exceptional performance across various domains.<n> Quantization techniques, which reduce the size and memory requirements of LLMs, are effective for deploying LLMs on resource-limited edge devices.<n>We propose Layer-Specific Adaptive Quantization (LSAQ), a system for adaptive quantization and dynamic deployment of LLMs based on layer importance.
arXiv Detail & Related papers (2024-12-24T03:43:15Z)
AmoebaLLM: Constructing Any-Shape Large Language Models for Efficient and Instant Deployment [13.977849745488339]
AmoebaLLM is a novel framework designed to enable the instant derivation of large language models of arbitrary shapes. AmoebaLLM significantly facilitates rapid deployment tailored to various platforms and applications.
arXiv Detail & Related papers (2024-11-15T22:02:28Z)
Large Language Model as a Catalyst: A Paradigm Shift in Base Station Siting Optimization [62.16747639440893]
Large language models (LLMs) and their associated technologies advance, particularly in the realms of prompt engineering and agent engineering.<n>Our proposed framework incorporates retrieval-augmented generation (RAG) to enhance the system's ability to acquire domain-specific knowledge and generate solutions.
arXiv Detail & Related papers (2024-08-07T08:43:32Z)
Towards Self-Adaptive Machine Learning-Enabled Systems Through QoS-Aware Model Switching [1.2277343096128712]
We propose the concept of a Machine Learning Model Balancer, focusing on managing uncertainties related to ML models by using multiple models. AdaMLS is a novel self-adaptation approach that leverages this concept and extends the traditional MAPE-K loop for continuous MLS adaptation. Preliminary results suggest AdaMLS surpasses naive and single state-of-the-art models in guarantees.
arXiv Detail & Related papers (2023-08-19T09:33:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.