FlexLLM: Token-Level Co-Serving of LLM Inference and Finetuning with SLO Guarantees
- URL: http://arxiv.org/abs/2402.18789v3
- Date: Thu, 23 Oct 2025 16:07:13 GMT
- Title: FlexLLM: Token-Level Co-Serving of LLM Inference and Finetuning with SLO Guarantees
- Authors: Gabriele Oliaro, Xupeng Miao, Xinhao Cheng, Vineeth Kada, Mengdi Wu, Ruohan Gao, Yingyi Huang, Remi Delacourt, April Yang, Yingcheng Wang, Colin Unger, Zhihao Jia,
- Abstract summary: Finetuning large language models (LLMs) is essential for task adaptation, yet today's serving stacks isolate inference and finetuning on separate GPU clusters.<n>We introduce FlexLLM, the first system to co-serve LLM inference and PEFT-based finetuning on shared GPUs by fusing at the token level.<n>At runtime, a novel token-level finetuning mechanism paired with a hybrid token scheduler dynamically interleaves inference and training tokens within each co-serving iteration.
- Score: 19.58773369944074
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Finetuning large language models (LLMs) is essential for task adaptation, yet today's serving stacks isolate inference and finetuning on separate GPU clusters -- wasting resources and under-utilizing hardware. We introduce FlexLLM, the first system to co-serve LLM inference and PEFT-based finetuning on shared GPUs by fusing computation at the token level. FlexLLM's static compilation optimizations -- dependent parallelization and graph pruning significantly shrink activation memory, leading to end-to-end GPU memory savings by up to 80%. At runtime, a novel token-level finetuning mechanism paired with a hybrid token scheduler dynamically interleaves inference and training tokens within each co-serving iteration, meeting strict latency SLOs while maximizing utilization. In end-to-end benchmarks on LLaMA-3.1-8B, Qwen-2.5-14B, and Qwen-2.5-32B, FlexLLM maintains inference SLO compliance at up to 20 req/s, and improves finetuning throughput by $1.9-4.8\times$ under heavy inference workloads and $2.5-6.8\times$ under light loads, preserving over 76% of peak finetuning progress even at peak demand. FlexLLM is publicly available at https://flexllm.github.io.
Related papers
- Out of the Memory Barrier: A Highly Memory Efficient Training System for LLMs with Million-Token Contexts [68.79341332280062]
Training Large Language Models (LLMs) on long contexts is severely constrained by prohibitive GPU memory overhead, not training time.<n>We introduce OOMB, a highly memory-efficient training system that directly confronts this barrier.<n>Our approach employs a chunk-recurrent training framework with on-the-fly activation recomputation, which maintains a constant activation memory footprint.
arXiv Detail & Related papers (2026-02-02T13:52:40Z) - ZeroDVFS: Zero-Shot LLM-Guided Core and Frequency Allocation for Embedded Platforms [7.633618497843279]
We propose a model-based hierarchical multi-agent reinforcement learning (MARL) framework for thermal- and energy-aware scheduling on multi-core platforms.<n>First-decision latency is 8,300x faster than table-based profiling, enabling practical deployment in dynamic embedded systems.
arXiv Detail & Related papers (2026-01-13T02:56:06Z) - DuetServe: Harmonizing Prefill and Decode for LLM Serving via Adaptive GPU Multiplexing [15.376910065679994]
DuetServe is a unified LLM serving framework that achieves disaggregation-level isolation within a single GPU.<n>DuetServe improves total throughput by up to 1.3x while maintaining low generation latency compared to state-of-the-art frameworks.
arXiv Detail & Related papers (2025-11-06T20:18:34Z) - xLLM Technical Report [57.13120905321185]
We introduce xLLM, an intelligent and efficient Large Language Model (LLM) inference framework.<n>xLLM builds a novel decoupled service-engine architecture.<n>xLLM-Engine co-optimizes system and algorithm designs to fully saturate computing resources.
arXiv Detail & Related papers (2025-10-16T13:53:47Z) - dInfer: An Efficient Inference Framework for Diffusion Language Models [54.80918957287927]
Diffusion-based large language models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs.<n>We present dInfer, an efficient and framework for dLLM inference.
arXiv Detail & Related papers (2025-10-09T16:19:42Z) - FlexQ: Efficient Post-training INT6 Quantization for LLM Serving via Algorithm-System Co-Design [18.37843481770631]
Large Language Models (LLMs) demonstrate exceptional performance but entail significant memory and computational costs.<n>Existing INT4/INT8 quantization reduces these costs, but they often degrade accuracy or lack optimal efficiency.<n>We propose FlexQ, a novel framework combining algorithmic innovation with system-level evaluations.
arXiv Detail & Related papers (2025-08-06T12:47:05Z) - BucketServe: Bucket-Based Dynamic Batching for Smart and Efficient LLM Inference Serving [3.620158146761518]
BucketServe is a bucket-based dynamic framework designed to optimize inference performance.<n>It can handle 1.93x more request load load attainment of 80% compared with UELLM and demonstrates 1.975x higher system load capacity compared to the UELLM.
arXiv Detail & Related papers (2025-07-23T01:51:48Z) - Nexus:Proactive Intra-GPU Disaggregation of Prefill and Decode in LLM Serving [4.309392302169281]
Engine-level prefill-decode (PD) disaggregation avoids interference but incurs higher hardware and coordination overhead.<n>PD achieves up to 2.2x higher throughput, 20x lower TTFT, and 2.5x lower TBT than vLLM; outperforms SG by up to 2x; and matches or exceeds disaggregated vLLM.
arXiv Detail & Related papers (2025-07-09T07:27:18Z) - Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition [95.54406667705999]
Pangu Embedded is an efficient Large Language Model (LLM) reasoner developed on Ascend Neural Processing Units (NPUs)<n>It addresses the significant computational costs and inference latency challenges prevalent in existing reasoning-optimized LLMs.<n>It delivers rapid responses and state-of-the-art reasoning quality within a single, unified model architecture.
arXiv Detail & Related papers (2025-05-28T14:03:02Z) - TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices [36.714057078457195]
We present TPI-LLM, a compute- and memory-efficient tensor parallel inference system for 70B-scale models.
TPI-LLM keeps sensitive raw data local in the users' devices and introduces a sliding window memory scheduler.
We show that TPI-LLM demonstrated over 80% less time-to-first-token and token latency compared to Accelerate.
arXiv Detail & Related papers (2024-10-01T09:18:56Z) - MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models [58.3342517278868]
This paper describes the design of Mixed-precision AutoRegressive LINear kernels.
It shows that batchsizes up to 16-32 can be supported with close to maximum ($4times$) quantization speedup.
MarLIN accomplishes this via a combination of techniques, such as asynchronous memory access, complex task scheduling and pipelining.
arXiv Detail & Related papers (2024-08-21T16:10:41Z) - vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z) - Flextron: Many-in-One Flexible Large Language Model [85.93260172698398]
We introduce Flextron, a network architecture and post-training model optimization framework supporting flexible model deployment.
We present a sample-efficient training method and associated routing algorithms for transforming an existing trained LLM into a Flextron model.
We demonstrate superior performance over multiple end-to-end trained variants and other state-of-the-art elastic networks, all with a single pretraining run that consumes a mere 7.63% tokens compared to original pretraining.
arXiv Detail & Related papers (2024-06-11T01:16:10Z) - MEFT: Memory-Efficient Fine-Tuning through Sparse Adapter [40.616849959987555]
We introduce a novel mechanism that fine-tunes Large Language Models (LLMs) with adapters of larger size yet memory-efficient.
This is achieved by leveraging the inherent activation sparsity in the Feed-Forward Networks (FFNs) of LLMs.
We employ a Mixture of Experts (MoE)-like architecture to mitigate unnecessary CPU computations and reduce the communication volume between the GPU and CPU.
arXiv Detail & Related papers (2024-06-07T14:49:22Z) - LoongServe: Efficiently Serving Long-Context Large Language Models with Elastic Sequence Parallelism [12.521026493432181]
Existing large language models (LLMs) cannot efficiently serve variable-length requests in different phases.
We propose a new parallelism paradigm, elastic sequence parallelism (ESP), to adapt to the variance between different requests and phases.
LoongServe improves the maximum throughput by up to 3.85$times$ compared to the chunked prefill and 5.81$times$ compared to the prefill-decoding disaggregation.
arXiv Detail & Related papers (2024-04-15T07:45:04Z) - JORA: JAX Tensor-Parallel LoRA Library for Retrieval Augmented Fine-Tuning [16.86356520836045]
We introduce a novel framework for PEFT-compatible fine-tuning of Llama-2 models, leveraging distributed training.
Our framework uniquely utilizes JAX's just-in-time (JIT) compilation and tensor-sharding for efficient resource management.
Our experiments show more than 12x improvement in runtime compared to Hugging Face/DeepSpeed implementation with four GPU while consuming less than half the VRAM per GPU.
arXiv Detail & Related papers (2024-03-17T23:02:04Z) - Green AI: A Preliminary Empirical Study on Energy Consumption in DL
Models Across Different Runtime Infrastructures [56.200335252600354]
It is common practice to deploy pre-trained models on environments distinct from their native development settings.
This led to the introduction of interchange formats such as ONNX, which includes its infrastructure, and ONNX, which work as standard formats.
arXiv Detail & Related papers (2024-02-21T09:18:44Z) - FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency
Trade-off in Language Model Inference [57.119047493787185]
This paper shows how to reduce model size by 43.1% and bring $1.25sim1.56times$ wall clock time speedup on different hardware with negligible accuracy drop.
In practice, our method can reduce model size by 43.1% and bring $1.25sim1.56times$ wall clock time speedup on different hardware with negligible accuracy drop.
arXiv Detail & Related papers (2024-01-08T17:29:16Z) - In Situ Framework for Coupling Simulation and Machine Learning with
Application to CFD [51.04126395480625]
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamic computations.
As simulations grow, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks.
This work offers a solution by simplifying this coupling and enabling in situ training and inference on heterogeneous clusters.
arXiv Detail & Related papers (2023-06-22T14:07:54Z) - Energy-efficient Task Adaptation for NLP Edge Inference Leveraging
Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z) - FlexGen: High-Throughput Generative Inference of Large Language Models
with a Single GPU [89.2451963569343]
FlexGen is a generation engine for running large language model (LLM) inference on a single commodity GPU.
When running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems.
On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours.
arXiv Detail & Related papers (2023-03-13T05:19:28Z) - Tutel: Adaptive Mixture-of-Experts at Scale [20.036168971435306]
Sparsely-gated mixture-of-experts (MoE) has been widely adopted to scale deep learning models to trillion-plus parameters with fixed computational cost.
We present Flex, a highly scalable stack design and implementation for MoE with dynamically adaptive parallelism and pipelining.
Our evaluation shows that Flex efficiently and effectively runs a real-world MoE-based model named SwinV2-MoE, built upon Swin Transformer V2, a state-of-the-art computer vision architecture.
arXiv Detail & Related papers (2022-06-07T15:20:20Z) - Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of
Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models.
This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z) - Dynamic Parameter Allocation in Parameter Servers [74.250687861348]
We propose to integrate dynamic parameter allocation into parameter servers, describe an efficient implementation of such a parameter server called Lapse.
We found that Lapse provides near-linear scaling and can be orders of magnitude faster than existing parameter servers.
arXiv Detail & Related papers (2020-02-03T11:37:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.