Related papers: RelayLLM: Efficient Reasoning via Collaborative Decoding

RelayLLM: Efficient Reasoning via Collaborative Decoding

URL: http://arxiv.org/abs/2601.05167v1
Date: Thu, 08 Jan 2026 17:56:16 GMT
Title: RelayLLM: Efficient Reasoning via Collaborative Decoding
Authors: Chengsong Huang, Tong Zheng, Langlin Huang, Jinyuan Li, Haolin Liu, Jiaxin Huang,
Abstract summary: RelayLLM is a novel framework for efficient reasoning via token-level collaborative decoding.<n>We show that RelayLLM achieves an average accuracy of 49.52%, effectively bridging the performance gap between the two models.
Score: 23.351598429979024
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) for complex reasoning is often hindered by high computational costs and latency, while resource-efficient Small Language Models (SLMs) typically lack the necessary reasoning capacity. Existing collaborative approaches, such as cascading or routing, operate at a coarse granularity by offloading entire queries to LLMs, resulting in significant computational waste when the SLM is capable of handling the majority of reasoning steps. To address this, we propose RelayLLM, a novel framework for efficient reasoning via token-level collaborative decoding. Unlike routers, RelayLLM empowers the SLM to act as an active controller that dynamically invokes the LLM only for critical tokens via a special command, effectively "relaying" the generation process. We introduce a two-stage training framework, including warm-up and Group Relative Policy Optimization (GRPO) to teach the model to balance independence with strategic help-seeking. Empirical results across six benchmarks demonstrate that RelayLLM achieves an average accuracy of 49.52%, effectively bridging the performance gap between the two models. Notably, this is achieved by invoking the LLM for only 1.07% of the total generated tokens, offering a 98.2% cost reduction compared to performance-matched random routers.

Related papers

Leveraging the Power of Large Language Models in Entity Linking via Adaptive Routing and Targeted Reasoning [4.338036373287262]
ARTER presents a structured pipeline that achieves high performance without deep fine-tuning.<n>It strategically combines candidate generation, context-based scoring, adaptive routing, and selective reasoning.<n>On standard benchmarks, ARTER outperforms ReFinED by up to +4.47%, with an average gain of +2.53% on 5 out of 6 datasets.
arXiv Detail & Related papers (2025-10-23T00:50:14Z)
How to Train Your LLM Web Agent: A Statistical Diagnosis [96.86317871461834]
We present the first statistically grounded study on compute allocation for LLM web-agent post-training.<n>Our approach uses a two-stage pipeline, training a Llama 3.1 8B student to imitate a Llama 3.3 70B teacher via supervised fine-tuning (SFT) and on-policy reinforcement learning.<n>Our results show that combining SFT with on-policy RL consistently outperforms either approach alone on both WorkArena and MiniWob++.
arXiv Detail & Related papers (2025-07-05T17:12:33Z)
Route-and-Reason: Scaling Large Language Model Reasoning with Reinforced Model Router [9.580226379350737]
Multi-step reasoning has proven essential for enhancing the problem-solving capabilities of Large Language Models.<n>Yet, many reasoning steps are relatively simple and can be handled by more efficient smaller-scale language models.<n>We propose R2-Reasoner, a novel framework that enables collaborative reasoning across heterogeneous LLMs.
arXiv Detail & Related papers (2025-06-06T09:18:56Z)
CoLA: Collaborative Low-Rank Adaptation [3.421904493396495]
Fine-tuning a pre-trained model for specific tasks achieves strong performance; however, it is computationally expensive and inefficient.<n>LoRA, in particular, has proven effective, but its application to multi-task scenarios is limited by interference between tasks.<n>We propose CoLA, a more flexible LoRA architecture and three collaborative strategies to enhance performance by better utilizing the quantitative relationships between $A$ and $B$.
arXiv Detail & Related papers (2025-05-21T12:46:42Z)
Reward-Guided Speculative Decoding for Efficient LLM Reasoning [80.55186052123196]
We introduce Reward-Guided Speculative Decoding (RSD), a novel framework aimed at improving the efficiency of inference in large language models (LLMs)<n>RSD incorporates a controlled bias to prioritize high-reward outputs, in contrast to existing speculative decoding methods that enforce strict unbiasedness.<n>RSD delivers significant efficiency gains against decoding with the target model only, while achieving significant better accuracy than parallel decoding method on average.
arXiv Detail & Related papers (2025-01-31T17:19:57Z)
LLM2: Let Large Language Models Harness System 2 Reasoning [65.89293674479907]
Large language models (LLMs) have exhibited impressive capabilities across a myriad of tasks, yet they occasionally yield undesirable outputs.<n>We introduce LLM2, a novel framework that combines an LLM with a process-based verifier.<n>LLMs2 is responsible for generating plausible candidates, while the verifier provides timely process-based feedback to distinguish desirable and undesirable outputs.
arXiv Detail & Related papers (2024-12-29T06:32:36Z)
FactorLLM: Factorizing Knowledge via Mixture of Experts for Large Language Models [50.331708897857574]
We introduce FactorLLM, a novel approach that decomposes well-trained dense FFNs into sparse sub-networks without requiring any further modifications. FactorLLM achieves comparable performance to the source model securing up to 85% model performance while obtaining over a 30% increase in inference speed.
arXiv Detail & Related papers (2024-08-15T16:45:16Z)
Towards Efficient LLM Grounding for Embodied Multi-Agent Collaboration [68.29746557968107]
We propose a novel framework for multi-agent collaboration that introduces Reinforced Advantage feedback (ReAd) for efficient self-refinement of plans.<n> Experiments on Over-AI and a difficult variant of RoCoBench show that ReAd surpasses baselines in success rate, and also significantly decreases the interaction steps of agents.
arXiv Detail & Related papers (2024-05-23T08:33:19Z)
MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT [87.4910758026772]
"Bigger the better" has been the predominant trend in recent Large Language Models (LLMs) development. This paper explores the "less is more" paradigm by addressing the challenge of designing accurate yet efficient Small Language Models (SLMs) for resource constrained devices.
arXiv Detail & Related papers (2024-02-26T18:59:03Z)
OrchestraLLM: Efficient Orchestration of Language Models for Dialogue State Tracking [16.057622631156164]
Large language models (LLMs) have revolutionized the landscape of Natural Language Processing systems, but are computationally expensive. Previous studies have explored various approaches to harness the potential of Small Language Models (SLMs) as cost-effective alternatives to their larger counterparts. This work presents a novel SLM/LLM routing framework designed to improve computational efficiency and enhance task performance.
arXiv Detail & Related papers (2023-11-16T10:30:55Z)
Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models [15.471290825100075]
We introduce a new paradigm for structurally pruning Large Language Models, called Compresso. Our approach, through the collaboration of the proposed resource-efficient pruning algorithm and the LLM itself, learns optimal pruning decisions during the training process. In experiments, Compresso significantly outperforms one-shot pruning baselines across various sparsity ratios, achieving up to 2.21%, 11.43%, 7.04%, and 4.81% higher scores on the commonsense reasoning, reading comprehension, MMLU, and BBH benchmarks, respectively.
arXiv Detail & Related papers (2023-10-08T05:16:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.