In-Context Distillation with Self-Consistency Cascades: A Simple, Training-Free Way to Reduce LLM Agent Costs
- URL: http://arxiv.org/abs/2512.02543v1
- Date: Tue, 02 Dec 2025 09:11:05 GMT
- Title: In-Context Distillation with Self-Consistency Cascades: A Simple, Training-Free Way to Reduce LLM Agent Costs
- Authors: Vishnu Sarukkai, Asanshay Gupta, James Hong, Michaël Gharbi, Kayvon Fatahalian
- Abstract summary: We propose a simple method for reducing LLM agent inference costs without incurring the development friction costs associated with fine-tuning. Most importantly, we introduce $\textit{in-context distillation}$, which adapts the idea of knowledge distillation to an in-context learning setting. Our approach retrieves relevant teacher demonstrations at each agent step and provides them to the student as in-context examples, enabling the student to imitate teacher behavior on the fly.
- Score: 15.204355975284658
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The world currently has an abundance of ideas for how to use new LLM agents, and developers seek to rapidly prototype and test new agentic designs. However, executing agents at scale using high-capacity LLMs incurs high inference costs. We propose a simple method for reducing LLM agent inference costs without incurring the development friction costs associated with LLM fine-tuning (long training cycles, optimization hyperparameter tweaking loops) or manual prompt engineering (laborious trial and error). Most importantly, we introduce $\textit{in-context distillation}$, which adapts the idea of knowledge distillation (training a low-cost student model to mimic a high-cost teacher) to an in-context learning setting. Our approach retrieves relevant teacher demonstrations at each agent step and provides them to the student as in-context examples, enabling the student to imitate teacher behavior on-the-fly. We combine in-context distillation with the established idea of $\textit{self-consistency cascades}$ to know when to trust the student. This adaptive strategy realizes the cost benefits of model specialization while preserving the productivity of working with frozen models. On the multi-step embodied reasoning benchmark ALFWorld, our method matches teacher-level accuracy at $\textbf{2.5$\times$ lower cost}$, reducing per-episode costs from \$0.059 to \$0.024. The upfront demonstration cost amortizes after just 843 episodes, yielding cumulative savings exceeding \$34,900 at deployment scale (1M episodes). On AppWorld, a complex agent benchmark requiring multi-step API workflows, we shift the Pareto frontier by achieving a $\textbf{2$\times$ cost reduction}$ at iso-accuracy. By reducing operational costs while maintaining rapid experimentation cycles with frozen models, our approach makes advanced agentic systems economically viable for a broader range of applications.
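The abstract describes two pieces: a retriever that surfaces relevant teacher demonstrations as in-context examples for the cheap student, and a self-consistency cascade that escalates to the teacher when the student's sampled answers disagree. A minimal sketch of one agent step, assuming hypothetical stand-ins `student_llm`, `teacher_llm`, a stored `demo_bank` of teacher trajectories, a toy word-overlap retriever, and illustrative `n_samples`/`agree_threshold` values (none of these names or settings come from the paper):

```python
# Hedged sketch of in-context distillation with a self-consistency cascade.
# All function names, the retriever, and the thresholds are illustrative
# assumptions, not the authors' implementation.
from collections import Counter

def retrieve_demos(observation, demo_bank, k=4):
    """Toy retriever: rank stored teacher demos by word overlap with the
    current observation and return the top-k as in-context examples."""
    def overlap(demo):
        return len(set(demo["obs"].split()) & set(observation.split()))
    return sorted(demo_bank, key=overlap, reverse=True)[:k]

def cascade_step(observation, demo_bank, student_llm, teacher_llm,
                 n_samples=5, agree_threshold=0.6):
    """One agent step: try the cheap student with retrieved teacher demos
    in-context; escalate to the teacher when the student's samples disagree."""
    demos = retrieve_demos(observation, demo_bank)
    prompt = "".join(f"Obs: {d['obs']}\nAct: {d['act']}\n" for d in demos)
    prompt += f"Obs: {observation}\nAct:"
    samples = [student_llm(prompt) for _ in range(n_samples)]
    action, count = Counter(samples).most_common(1)[0]
    if count / n_samples >= agree_threshold:
        return action, "student"   # self-consistent: trust the cheap student
    return teacher_llm(observation), "teacher"  # disagreement: escalate
```

The cascade keeps the teacher in the loop only for steps where the student is unreliable, which is what lets per-episode cost drop while accuracy is preserved; the reported ALFWorld numbers are consistent with this arithmetic (a \$0.035 saving per episode amortizing an upfront demonstration cost over roughly 843 episodes).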
Related papers
- $\nabla$-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space [71.23672814629448]
$\nabla$-Reasoner is an iterative generation framework that integrates differentiable optimization over token logits into the decoding loop. $\nabla$-Reasoner achieves over 20% accuracy improvement on a challenging mathematical reasoning benchmark.
arXiv Detail & Related papers (2026-03-05T08:42:54Z) - Just-In-Time Reinforcement Learning: Continual Learning in LLM Agents Without Gradient Updates [53.3717573880076]
We introduce Just-In-Time Reinforcement Learning (JitRL), a training-free framework that enables test-time policy optimization without any gradient updates. JitRL maintains a dynamic, non-parametric memory of experiences and retrieves relevant trajectories to estimate action advantages on the fly. Experiments on WebArena and Jericho demonstrate that JitRL establishes a new state of the art among training-free methods.
arXiv Detail & Related papers (2026-01-26T14:16:51Z) - EvoRoute: Experience-Driven Self-Routing LLM Agent Systems [100.64399490164959]
EvoRoute is a self-evolving model routing paradigm that transcends static, pre-defined model assignments. Experiments on challenging agentic benchmarks demonstrate that EvoRoute, when integrated into off-the-shelf agentic systems, not only sustains or enhances system performance but also reduces execution cost by up to 80% and latency by over 70%.
arXiv Detail & Related papers (2026-01-06T04:06:46Z) - xRouter: Training Cost-Aware LLMs Orchestration System via Reinforcement Learning [104.63494870852894]
We present xRouter, a tool-calling-based routing system in which a learned router can either answer directly or invoke one or more external models. Our implementation encompasses the full reinforcement learning framework, including reward and cost accounting. Across diverse benchmarks, xRouter achieves strong cost-performance trade-offs.
arXiv Detail & Related papers (2025-10-09T16:52:01Z) - LLM on a Budget: Active Knowledge Distillation for Efficient Classification of Large Text Corpora [0.1625256372381793]
Large Language Models (LLMs) are highly accurate in classification tasks. Knowledge Distillation (KD), where an LLM "teacher" trains a smaller and more efficient "student" model, offers a promising solution to this problem. We introduce M-RARU (Multi-class Randomized Accept/Reject Uncertainty Sampling), a novel AL algorithm that significantly reduces training costs.
arXiv Detail & Related papers (2025-09-17T18:38:56Z) - Memento: Fine-tuning LLM Agents without Fine-tuning LLMs [36.3424780932712]
We introduce a novel learning paradigm for Adaptive Large Language Model (LLM) agents. Our method enables low-cost continual adaptation via memory-based online reinforcement learning. We instantiate our agent model in the deep research setting, namely $\textit{Memento}$, which attains top-1 on GAIA validation.
arXiv Detail & Related papers (2025-08-22T07:25:30Z) - Economic Evaluation of LLMs [0.9208007322096532]
We show that reasoning models offer better accuracy-cost tradeoffs as soon as the economic cost of a mistake exceeds \$0.01. We find that single large LLMs often outperform cascades when the cost of making a mistake is as low as \$0.1.
arXiv Detail & Related papers (2025-07-04T23:16:02Z) - Runaway is Ashamed, But Helpful: On the Early-Exit Behavior of Large Language Model-based Agents in Embodied Environments [54.67512489842682]
Large language models (LLMs) have demonstrated strong planning and decision-making capabilities in complex embodied environments. We take a first step toward exploring the early-exit behavior of LLM-based agents.
arXiv Detail & Related papers (2025-05-23T08:23:36Z) - Distilling LLM Agent into Small Models with Retrieval and Code Tools [65.73762766854192]
Agent Distillation is a framework for transferring the reasoning capability and task-solving behavior of large language models into small language models. Our results show that sLMs as small as 0.5B, 1.5B, and 3B parameters can achieve performance competitive with next-tier larger 1.5B, 3B, and 7B models.
arXiv Detail & Related papers (2025-05-23T08:20:15Z) - From Large to Super-Tiny: End-to-End Optimization for Cost-Efficient LLMs [23.253571170594455]
Large Language Models (LLMs) have significantly advanced artificial intelligence. This paper introduces a three-stage, cost-efficient, end-to-end LLM deployment pipeline. It produces super-tiny online models with enhanced performance and reduced costs.
arXiv Detail & Related papers (2025-04-18T05:25:22Z) - Towards Efficient Automatic Self-Pruning of Large Language Models [55.90119819642064]
Post-training structured pruning is a promising solution that prunes Large Language Models without the need for retraining. We argue that the key to mitigating this issue lies in accurately determining the pruning rate for each layer. We introduce $\textbf{Self-Pruner}$, an end-to-end automatic self-pruning framework for LLMs, which efficiently searches layer-wise pruning rates.
arXiv Detail & Related papers (2025-02-20T09:59:50Z) - Leveraging Zero-Shot Prompting for Efficient Language Model Distillation [3.4205390087622582]
This paper introduces a novel approach for efficiently distilling LLMs into smaller, application-specific models.
It utilizes LLMs' reasoning capabilities to generate labels and natural language rationales for unlabeled data.
Key contributions include the employment of zero-shot prompting to elicit teacher model rationales.
arXiv Detail & Related papers (2024-03-23T16:51:52Z) - SMART: Automatically Scaling Down Language Models with Accuracy Guarantees for Reduced Processing Fees [21.801053526411415]
Large Language Models (LLMs) have significantly boosted performance in natural language processing (NLP) tasks.
The deployment of high-performance LLMs incurs substantial costs, primarily due to the increased number of parameters aimed at enhancing model performance.
We introduce SMART, a novel framework designed to minimize the inference costs of NLP tasks while ensuring sufficient result quality.
arXiv Detail & Related papers (2024-03-11T17:45:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.