Related papers: SAIR: Cost-Efficient Multi-Stage ML Pipeline Autoscaling via In-Context Reinforcement Learning

SAIR: Cost-Efficient Multi-Stage ML Pipeline Autoscaling via In-Context Reinforcement Learning

URL: http://arxiv.org/abs/2601.22397v1
Date: Thu, 29 Jan 2026 23:08:15 GMT
Title: SAIR: Cost-Efficient Multi-Stage ML Pipeline Autoscaling via In-Context Reinforcement Learning
Authors: Jianchang Su, Yifan Zhang, Shengkai Lin, Shizhen Zhao, Yusheng Zheng, Yiwei Yang, Wei Zhang,
Abstract summary: Multi-stage ML inference pipelines are difficult to autoscale due to heterogeneous resources, cross-stage coupling, and dynamic bottleneck migration.<n>We present SAIR, an autoscaling framework that uses an LLM as an in-context reinforcement learning controller.<n>SAIR achieves the best or tied-best P99 latency and effective resource cost among deployed baselines, improving P99 by up to 50% and reducing effective cost by up to 97%.
Score: 13.174004826305255
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multi-stage ML inference pipelines are difficult to autoscale due to heterogeneous resources, cross-stage coupling, and dynamic bottleneck migration. We present SAIR, an autoscaling framework that uses an LLM as an in-context reinforcement learning controller, improving its policy online from reward-labeled interaction histories without gradient updates. SAIR combines Pareto-dominance reward shaping with a provable separation margin, surprisal-guided experience retrieval for context efficiency, and fine-grained GPU rate control via user-space CUDA interception. We provide regret analysis decomposing error into retrieval coverage and LLM selection components. On four ML serving pipelines under three workload patterns, SAIR achieves the best or tied-best P99 latency and effective resource cost among deployed baselines, improving P99 by up to 50% and reducing effective cost by up to 97% (under GPU rate-control assumptions), with 86% bottleneck detection accuracy and no offline training.

Related papers

Replacing Multi-Step Assembly of Data Preparation Pipelines with One-Step LLM Pipeline Generation for Table QA [16.758340727602793]
Table Question Answering (TQA) aims to answer natural language questions over structured tables.<n>Large Language Models (LLMs) enable promising solutions to this problem, with operator-centric solutions that generate table manipulation pipelines in a multi-step manner offering state-of-the-art performance.<n>We propose Operation-R1, the first framework that trains lightweight LLMs via a novel variant of reinforcement learning with verifiable rewards to produce high-quality data-preparation pipelines for TQA in a single inference step.
arXiv Detail & Related papers (2026-02-26T07:49:50Z)
Dr.LLM: Dynamic Layer Routing in LLMs [55.11953638340419]
Dr.LLM is a retrofittable framework that equips pretrained models with lightweight per-layer routers deciding to skip, execute, or repeat a block.<n>On ARC (logic) and DART (math), Dr.LLM improves accuracy by up to +3.4%p while saving 5 layers per example on average.
arXiv Detail & Related papers (2025-10-14T17:51:26Z)
CollaPipe: Adaptive Segment-Optimized Pipeline Parallelism for Collaborative LLM Training in Heterogeneous Edge Networks [57.95170323315603]
We introduce CollaPipe, a distributed learning framework that integrates collaborative pipeline parallelism with federated aggregation to support self-evolving networks.<n>In CollaPipe, the encoder part is adaptively partitioned into variable-sized segments and deployed across mobile devices for pipeline-parallel training, while the decoder is deployed on edge servers to handle generative tasks.<n>To enhance training efficiency, we formulate a joint optimization problem that adaptively allocates model segments, micro-batches, bandwidth, and transmission power.
arXiv Detail & Related papers (2025-09-24T07:54:01Z)
How to Train Your LLM Web Agent: A Statistical Diagnosis [96.86317871461834]
We present the first statistically grounded study on compute allocation for LLM web-agent post-training.<n>Our approach uses a two-stage pipeline, training a Llama 3.1 8B student to imitate a Llama 3.3 70B teacher via supervised fine-tuning (SFT) and on-policy reinforcement learning.<n>Our results show that combining SFT with on-policy RL consistently outperforms either approach alone on both WorkArena and MiniWob++.
arXiv Detail & Related papers (2025-07-05T17:12:33Z)
Cache-Efficient Posterior Sampling for Reinforcement Learning with LLM-Derived Priors Across Discrete and Continuous Domains [2.1797343876622097]
Large language models (LLMs) as priors in reinforcement learning (RL) offers significant advantages but comes with substantial computational costs.<n>We present a principled cache-efficient framework for posterior sampling with LLM-derived priors that dramatically reduces these costs while maintaining high performance.
arXiv Detail & Related papers (2025-05-12T06:53:24Z)
R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference [77.47238561728459]
R-Sparse is a training-free activation sparsity approach capable of achieving high sparsity levels in advanced LLMs.<n> Experiments on Llama-2/3 and Mistral models across ten diverse tasks demonstrate that R-Sparse achieves comparable performance at 50% model-level sparsity.
arXiv Detail & Related papers (2025-04-28T03:30:32Z)
Confident or Seek Stronger: Exploring Uncertainty-Based On-device LLM Routing From Benchmarking to Generalization [61.02719787737867]
Large language models (LLMs) are increasingly deployed and democratized on edge devices.<n>One promising solution is uncertainty-based SLM routing, offloading high-stakes queries to stronger LLMs when resulting in low-confidence responses on SLM.<n>We conduct a comprehensive investigation into benchmarking and generalization of uncertainty-driven routing strategies from SLMs to LLMs over 1500+ settings.
arXiv Detail & Related papers (2025-02-06T18:59:11Z)
Efficiently Deploying LLMs with Controlled Risk [0.9208007322096532]
We present hierarchical chains with multi-level abstention (HCMA), which use model-intrinsic uncertainty to delegate queries. Our framework presents novel trade-offs between efficiency and risk.
arXiv Detail & Related papers (2024-10-03T03:25:56Z)
Parallel Split Learning with Global Sampling [9.57839529462706]
We introduce a server-driven sampling strategy that maintains a fixed global batch size by dynamically adjusting client-side batch sizes.<n>This decouples the effective batch size from the number of participating devices and ensures that global batches better reflect the overall data distribution.<n> Empirical results on a benchmark dataset confirm that the proposed method improves model accuracy, training efficiency, and convergence stability.
arXiv Detail & Related papers (2024-07-22T15:41:23Z)
Efficient Preference-based Reinforcement Learning via Aligned Experience Estimation [37.36913210031282]
Preference-based reinforcement learning (PbRL) has shown impressive capabilities in training agents without reward engineering. We propose SEER, an efficient PbRL method that integrates label smoothing and policy regularization techniques.
arXiv Detail & Related papers (2024-05-29T01:49:20Z)
Efficient Parallel Split Learning over Resource-constrained Wireless Edge Networks [44.37047471448793]
In this paper, we advocate the integration of edge computing paradigm and parallel split learning (PSL) We propose an innovative PSL framework, namely, efficient parallel split learning (EPSL) to accelerate model training. We show that the proposed EPSL framework significantly decreases the training latency needed to achieve a target accuracy.
arXiv Detail & Related papers (2023-03-26T16:09:48Z)
Online Convolutional Re-parameterization [51.97831675242173]
We present online convolutional re- parameterization (OREPA), a two-stage pipeline, aiming to reduce the huge training overhead by squeezing the complex training-time block into a single convolution. Compared with the state-of-the-art re-param models, OREPA is able to save the training-time memory cost by about 70% and accelerate the training speed by around 2x. We also conduct experiments on object detection and semantic segmentation and show consistent improvements on the downstream tasks.
arXiv Detail & Related papers (2022-04-02T09:50:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.