Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference
- URL: http://arxiv.org/abs/2602.19509v1
- Date: Mon, 23 Feb 2026 04:47:47 GMT
- Title: Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference
- Authors: Arindam Khaled
- Abstract summary: "Pyramid MoA" is a hierarchical Mixture-of-Agents architecture that uses a lightweight Router to dynamically escalate queries only when necessary. We demonstrate that the system introduces negligible latency overhead (+0.82s) and allows for a tunable trade-off between performance and budget.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) face a persistent trade-off between inference cost and reasoning capability. While "Oracle" models (e.g., Llama-3-70B) achieve state-of-the-art accuracy, they are prohibitively expensive for high-volume deployment. Smaller models (e.g., 8B parameters) are cost-effective but struggle with complex tasks. In this work, we propose "Pyramid MoA", a hierarchical Mixture-of-Agents architecture that uses a lightweight Router to dynamically escalate queries only when necessary. By leveraging semantic agreement and confidence calibration among an ensemble of small models, our Router identifies "hard" problems with high precision. On the GSM8K benchmark, our system achieves 93.0% accuracy, effectively matching the Oracle baseline (98.0%) while reducing compute costs by 61%. We demonstrate that the system introduces negligible latency overhead (+0.82s) and allows for a tunable trade-off between performance and budget.
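The escalation idea in the abstract can be illustrated with a minimal sketch. This is not the paper's Router: the names `Answer`, `should_escalate`, and `route`, the thresholds, and the use of exact string match as a stand-in for semantic agreement are all assumptions made for illustration only.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Answer:
    text: str          # final answer string from one small model
    confidence: float  # assumed calibrated confidence in [0, 1]


def should_escalate(answers, min_agreement=0.6, min_confidence=0.7):
    """Return True when the small-model ensemble looks unreliable."""
    if not answers:
        return True
    # Exact match after normalization stands in for semantic agreement.
    normalized = [a.text.strip().lower() for a in answers]
    top_answer, votes = Counter(normalized).most_common(1)[0]
    agreement = votes / len(answers)
    # Mean confidence of the models that voted for the majority answer.
    supporters = [a.confidence for a, n in zip(answers, normalized) if n == top_answer]
    mean_conf = sum(supporters) / len(supporters)
    return agreement < min_agreement or mean_conf < min_confidence


def route(query, small_models, oracle, **thresholds):
    """Answer with the small ensemble when it agrees, else call the oracle."""
    answers = [model(query) for model in small_models]
    if should_escalate(answers, **thresholds):
        return oracle(query), "oracle"
    normalized = [a.text.strip().lower() for a in answers]
    return Counter(normalized).most_common(1)[0][0], "ensemble"
```

In this sketch, raising `min_agreement` or `min_confidence` sends more queries to the expensive model, which is one simple way to expose the performance-versus-budget trade-off the abstract describes.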
Related papers
- RouteMoA: Dynamic Routing without Pre-Inference Boosts Efficient Mixture-of-Agents [91.0187958746262]
RouteMoA is an efficient mixture-of-agents framework with dynamic routing. It employs a lightweight scorer to perform initial screening by predicting coarse-grained performance from the query. It refines these scores through lightweight self- and cross-assessment based on existing model outputs, providing posterior correction without additional inference.
arXiv Detail & Related papers (2026-01-26T04:22:22Z)
- EvoRoute: Experience-Driven Self-Routing LLM Agent Systems [100.64399490164959]
EvoRoute is a self-evolving model routing paradigm that transcends static, pre-defined model assignments. Experiments on challenging agentic benchmarks demonstrate that EvoRoute, when integrated into off-the-shelf agentic systems, not only sustains or enhances system performance but also reduces execution cost by up to 80% and latency by over 70%.
arXiv Detail & Related papers (2026-01-06T04:06:46Z)
- Efficient Multi-Model Orchestration for Self-Hosted Large Language Models [2.3275796286410677]
Pick and Spin is a framework that makes self-hosted orchestration economical. It integrates a unified Helm-based deployment system, adaptive scale-to-zero automation, and a hybrid routing module. It achieves up to 21.6% higher success rates, 30% lower latency, and 33% lower cost per query compared with static deployments of the same models.
arXiv Detail & Related papers (2025-12-26T22:42:40Z)
- Think Right: Learning to Mitigate Under-Over Thinking via Adaptive, Attentive Compression [68.69801176669843]
We propose an online post-training RL method that prunes redundant steps and estimates difficulty. TRAAC (Think Right with Adaptive, Attentive Compression) achieves an average absolute accuracy gain of 8.4%. Although our models are trained on math datasets, they show accuracy and efficiency gains on out-of-distribution non-math datasets.
arXiv Detail & Related papers (2025-10-02T02:00:20Z)
- Nano Bio-Agents (NBA): Small Language Model Agents for Genomics [0.1790445868185437]
We investigate the application of Small Language Models (10 billion parameters) for genomics question answering via an agentic framework. Results show that SLMs combined with such an agentic framework can achieve comparable and in many cases superior performance. This demonstrates promising potential for efficiency gains, cost savings, and democratization of ML-powered genomics tools.
arXiv Detail & Related papers (2025-09-23T20:44:31Z)
- SynapseRoute: An Auto-Route Switching Framework on Dual-State Large Language Model [12.929385845055137]
We show that approximately 58% of medical questions can be accurately answered by the non-thinking mode alone. We propose SynapseRoute, a machine learning-based dynamic routing framework that intelligently assigns input queries to either thinking or non-thinking modes.
arXiv Detail & Related papers (2025-07-03T17:33:58Z)
- Efficient LLM Collaboration via Planning [56.081879390960204]
Small and large models take turns acting as planner and executor, exchanging plans in a multi-stage cascade to collaboratively solve tasks. We demonstrate that COPE achieves performance comparable to large proprietary models, while drastically reducing the inference API cost.
arXiv Detail & Related papers (2025-06-13T08:35:50Z)
- Plan and Budget: Effective and Efficient Test-Time Scaling on Large Language Model Reasoning [19.258292534503887]
Plan-and-Budget is a model-agnostic, test-time framework that decomposes complex queries into sub-questions and allocates token budgets based on estimated complexity using adaptive scheduling. Plan-and-Budget improves reasoning efficiency across a range of tasks and models, achieving up to +70% accuracy gains, tangential -39% token reduction, and +187.5% improvement in $E^3$.
arXiv Detail & Related papers (2025-05-22T01:56:29Z)
- Every FLOP Counts: Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUs [96.68469559192846]
We present two differently sized MoE large language models (LLMs). Ling-Lite contains 16.8 billion parameters with 2.75 billion activated parameters, while Ling-Plus boasts 290 billion parameters with 28.8 billion activated parameters. We propose innovative methods for (1) optimization of model architecture and training processes, (2) refinement of training anomaly handling, and (3) enhancement of model evaluation efficiency.
arXiv Detail & Related papers (2025-03-07T04:43:39Z)
- Improving Large Models with Small models: Lower Costs and Better Performance [81.55672406002715]
We propose Data Shunt+ (DS+), a general paradigm for collaboration of small and large models.
For instance, ChatGPT achieves an accuracy of 94.43% on Amazon Product sentiment analysis, and DS+ achieves an accuracy of 95.64%, while the cost has been reduced to only 31.18%.
arXiv Detail & Related papers (2024-06-15T14:44:43Z)
- Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing [53.748685766139715]
Large language models (LLMs) excel in most NLP tasks but also require expensive cloud servers for deployment due to their size.
We propose a hybrid inference approach which combines their respective strengths to save cost and maintain quality.
In experiments our approach allows us to make up to 40% fewer calls to the large model, with no drop in response quality.
arXiv Detail & Related papers (2024-04-22T23:06:42Z)
- Exploring Sparse Expert Models and Beyond [51.90860155810848]
Mixture-of-Experts (MoE) models can achieve promising results with an outrageously large number of parameters but constant computation cost.
We propose a simple method called expert prototyping that splits experts into different prototypes and applies $k$ top-$1$ routing.
This strategy improves the model quality but maintains constant computational costs, and our further exploration on extremely large-scale models reflects that it is more effective in training larger models.
arXiv Detail & Related papers (2021-05-31T16:12:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.