Efficient Multi-Model Orchestration for Self-Hosted Large Language Models
- URL: http://arxiv.org/abs/2512.22402v1
- Date: Fri, 26 Dec 2025 22:42:40 GMT
- Title: Efficient Multi-Model Orchestration for Self-Hosted Large Language Models
- Authors: Bhanu Prakash Vangala, Tanu Malik
- Abstract summary: Pick and Spin is a framework that makes self-hosted LLM orchestration scalable and economical. It integrates a unified Helm-based deployment system, adaptive scale-to-zero automation, and a hybrid routing module. It achieves up to 21.6% higher success rates, 30% lower latency, and 33% lower cost per query compared with static deployments of the same models.
- Score: 2.3275796286410677
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Self-hosting large language models (LLMs) is increasingly appealing for organizations seeking privacy, cost control, and customization. Yet deploying and maintaining in-house models poses challenges in GPU utilization, workload routing, and reliability. We introduce Pick and Spin, a practical framework that makes self-hosted LLM orchestration scalable and economical. Built on Kubernetes, it integrates a unified Helm-based deployment system, adaptive scale-to-zero automation, and a hybrid routing module that balances cost, latency, and accuracy using both keyword heuristics and a lightweight DistilBERT classifier. We evaluate four models, Llama-3 (90B), Gemma-3 (27B), Qwen-3 (235B), and DeepSeek-R1 (685B) across eight public benchmark datasets, with five inference strategies, and two routing variants encompassing 31,019 prompts and 163,720 inference runs. Pick and Spin achieves up to 21.6% higher success rates, 30% lower latency, and 33% lower GPU cost per query compared with static deployments of the same models.
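The abstract describes a hybrid routing module that combines keyword heuristics with a lightweight classifier. A minimal sketch of that idea is below; the model names, trigger keywords, and classifier stub are illustrative assumptions, not the paper's actual configuration (the paper uses a DistilBERT classifier, represented here only as an optional callable).

```python
# Hybrid routing sketch in the spirit of Pick and Spin: cheap keyword
# heuristics fire first; a learned classifier is consulted only when no
# rule matches. All names and keywords below are hypothetical.

from typing import Callable, Optional

# Keyword heuristics: trigger words mapped to a preferred model tier.
KEYWORD_RULES = {
    "prove": "deepseek-r1-685b",      # heavy reasoning
    "derive": "deepseek-r1-685b",
    "summarize": "gemma-3-27b",       # light tasks go to a small model
    "translate": "gemma-3-27b",
}

DEFAULT_MODEL = "llama-3-90b"

def route(prompt: str,
          classify: Optional[Callable[[str], str]] = None) -> str:
    """Pick a model for the prompt.

    1. Keyword heuristics (near-zero cost).
    2. Optional learned classifier (e.g. a small text classifier).
    3. Fallback to a default mid-size model.
    """
    tokens = prompt.lower().split()
    for word, model in KEYWORD_RULES.items():
        if word in tokens:
            return model
    if classify is not None:
        return classify(prompt)
    return DEFAULT_MODEL

print(route("Prove that the sum of two even numbers is even"))
# → deepseek-r1-685b
```

In a real deployment the classifier fallback would carry most of the routing decisions; the heuristic layer exists only to skip classifier inference on obviously easy or obviously hard prompts.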
Related papers
- Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference [0.0]
"Pyramid MoA" is a hierarchical Mixture-of-Agents architecture that uses a lightweight Router to dynamically escalate queries only when necessary. We demonstrate that the system introduces negligible latency overhead (+0.82s) and allows for a tunable trade-off between performance and budget.
arXiv Detail & Related papers (2026-02-23T04:47:47Z)
- 6G-Bench: An Open Benchmark for Semantic Communication and Network-Level Reasoning with Foundation Models in AI-Native 6G Networks [3.099103925863002]
6G-Bench is an open benchmark for evaluating semantic communication and network-level reasoning in AI-native 6G networks. We generate a balanced pool of 10,000 very-hard multiple-choice questions using task-conditioned prompts. We evaluate 22 foundation models spanning dense and mixture-of-experts architectures, short-context and long-context designs.
arXiv Detail & Related papers (2026-02-09T13:57:37Z)
- RouteMoA: Dynamic Routing without Pre-Inference Boosts Efficient Mixture-of-Agents [91.0187958746262]
RouteMoA is an efficient mixture-of-agents framework with dynamic routing. It employs a lightweight scorer to perform initial screening by predicting coarse-grained performance from the query. It refines these scores through lightweight self- and cross-assessment based on existing model outputs, providing posterior correction without additional inference.
arXiv Detail & Related papers (2026-01-26T04:22:22Z)
- Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs [80.72350166388601]
Nemotron Elastic is a framework for building reasoning-oriented LLMs. It embeds nested submodels within a single parent model. Each of these submodels shares weights with the parent model and can be extracted zero-shot during deployment.
arXiv Detail & Related papers (2025-11-20T18:59:21Z)
- GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models [194.64264251080454]
We present GLM-4.5, an open-source Mixture-of-Experts (MoE) large language model with 355B total parameters and 32B activated parameters. Through multi-stage training on 23T tokens and comprehensive post-training with expert model iteration and reinforcement learning, GLM-4.5 achieves strong performance across agentic, reasoning, and coding tasks. We release both GLM-4.5 (355B parameters) and a compact version, GLM-4.5-Air (106B parameters), to advance research in reasoning and agentic AI systems.
arXiv Detail & Related papers (2025-08-08T17:21:06Z)
- Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction [95.91743732150233]
Goedel-Prover-V2, a series of open-source language models, sets a new state of the art in automated theorem proving. We generate synthetic tasks of increasing difficulty to train the model to master increasingly complex theorems. Goedel-Prover-V2-32B achieves 88.1% on MiniF2F at pass@32 in standard mode and 90.4% in self-correction mode.
arXiv Detail & Related papers (2025-08-05T16:28:22Z) - Hunyuan-TurboS: Advancing Large Language Models through Mamba-Transformer Synergy and Adaptive Chain-of-Thought [196.74837065805488]
Hunyuan-TurboS is a large hybrid Transformer-Mamba Mixture of Experts model. It balances high performance and efficiency, offering substantial capabilities at lower inference costs.
arXiv Detail & Related papers (2025-05-21T12:11:53Z) - Empowering Smaller Models: Tuning LLaMA and Gemma with Chain-of-Thought for Ukrainian Exam Tasks [0.0]
Small or compact models, though more efficient, often lack sufficient support for underrepresented languages. This work explores the potential of parameter-efficient fine-tuning of compact open-weight language models to handle reasoning-intensive tasks. A tuning method with joint task-topic and step-by-step solution generation outperforms standard chain-of-thought tuning on matching tasks.
arXiv Detail & Related papers (2025-03-18T07:44:49Z) - Optimizing Distributed Deployment of Mixture-of-Experts Model Inference in Serverless Computing [9.217991144854851]
Mixture-of-Experts (MoE) models are a dominant model architecture today. We study optimized MoE model deployment and distributed inference serving on a serverless platform. Our designs reduce the billed cost of all MoE layers by at least 75.67% compared to CPU clusters.
arXiv Detail & Related papers (2025-01-09T15:29:33Z) - Puzzle: Distillation-Based NAS for Inference-Optimized LLMs [17.72841008597783]
Large language models (LLMs) offer remarkable capabilities, yet their high inference costs restrict wider adoption. We present Puzzle, a hardware-aware framework that accelerates the inference of LLMs while preserving their capabilities. We showcase our framework's impact via Llama-3.1-Nemotron-51B-Instruct (Nemotron-51B) and Llama-3.3-Nemotron-49B, two publicly available models.
arXiv Detail & Related papers (2024-11-28T13:45:42Z) - Mixture of Experts with Mixture of Precisions for Tuning Quality of Service [0.0]
This paper presents an adaptive serving approach for the efficient deployment of MoE models.
By dynamically determining the number of quantized experts, we offer a fine-grained range of configurations for tuning throughput and model quality.
Results highlight the practical applicability of our approach in dynamic and accuracy-sensitive applications.
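The core mechanism above is a quality/throughput dial: quantizing more experts raises throughput at some cost in model quality. A minimal sketch of that selection logic follows; the linear throughput model and all numbers are made-up assumptions for illustration, not taken from the paper.

```python
# Illustrative sketch: pick the smallest number of experts to quantize
# that meets a throughput target, assuming (hypothetically) a fixed
# throughput gain per quantized expert. Fewer quantized experts means
# less quality degradation, so we prefer the smallest feasible k.

def pick_quantized_experts(base_tps: float,
                           gain_per_expert: float,
                           target_tps: float,
                           num_experts: int) -> int:
    """Return how many experts to quantize (0..num_experts)."""
    for k in range(num_experts + 1):
        if base_tps + k * gain_per_expert >= target_tps:
            return k
    # Even quantizing every expert cannot reach the target.
    return num_experts

# Example: 100 tok/s baseline, +5 tok/s per quantized expert, need 130 tok/s.
print(pick_quantized_experts(100.0, 5.0, 130.0, 8))  # → 6
```

A production system would replace the linear model with measured per-configuration throughput and a quality constraint, but the greedy "smallest feasible k" structure is the same.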
arXiv Detail & Related papers (2024-07-19T15:42:49Z)
- On Optimal Caching and Model Multiplexing for Large Model Inference [66.50550915522551]
Large Language Models (LLMs) and other large foundation models have achieved noteworthy success, but their size exacerbates existing resource consumption and latency challenges.
We study two approaches for mitigating these challenges: employing a cache to store previous queries and learning a model multiplexer to choose from an ensemble of models for query processing.
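The two approaches above compose naturally: check a cache first, and only on a miss consult a multiplexer that picks a model from the ensemble. A minimal sketch, assuming an exact-match LRU cache and a hypothetical difficulty score in [0, 1] (the paper studies learned multiplexers; the threshold rule here is a stand-in):

```python
# Sketch of query caching plus model multiplexing: cached answers are
# returned directly; uncached queries go to a small or large model based
# on a (hypothetical) predicted difficulty score.

from collections import OrderedDict

class CachedMultiplexer:
    def __init__(self, small, large, difficulty,
                 capacity: int = 128, threshold: float = 0.5):
        self.small, self.large = small, large   # callables: str -> str
        self.difficulty = difficulty            # str -> float in [0, 1]
        self.cache = OrderedDict()              # query -> answer (LRU order)
        self.capacity, self.threshold = capacity, threshold

    def answer(self, query: str) -> str:
        if query in self.cache:                 # cache hit: no inference cost
            self.cache.move_to_end(query)
            return self.cache[query]
        model = self.large if self.difficulty(query) > self.threshold else self.small
        result = model(query)
        self.cache[query] = result
        if len(self.cache) > self.capacity:     # evict least recently used
            self.cache.popitem(last=False)
        return result

# Toy stand-ins for the two models and the difficulty predictor.
mux = CachedMultiplexer(
    small=lambda q: f"small:{q}",
    large=lambda q: f"large:{q}",
    difficulty=lambda q: 1.0 if "prove" in q.lower() else 0.1,
)
print(mux.answer("What is 2+2?"))   # small model answers
print(mux.answer("What is 2+2?"))   # second call is served from the cache
```

The paper's contribution is the optimal joint policy for these two components; the sketch only shows how they fit together in a serving loop.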
arXiv Detail & Related papers (2023-06-03T05:01:51Z)
- ANNETTE: Accurate Neural Network Execution Time Estimation with Stacked Models [56.21470608621633]
We propose a time estimation framework to decouple the architectural search from the target hardware.
The proposed methodology extracts a set of models from micro-kernel and multi-layer benchmarks and generates a stacked model for mapping and network execution time estimation.
For evaluation, we compare the estimation accuracy and fidelity of the generated mixed models and statistical models against the roofline model and a refined roofline model.
arXiv Detail & Related papers (2021-05-07T11:39:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.