Staircase Streaming for Low-Latency Multi-Agent Inference
- URL: http://arxiv.org/abs/2510.05059v1
- Date: Mon, 06 Oct 2025 17:37:35 GMT
- Title: Staircase Streaming for Low-Latency Multi-Agent Inference
- Authors: Junlin Wang, Jue Wang, Zhen Xu, Ben Athiwaratkun, Bhuwan Dhingra, Ce Zhang, James Zou
- Abstract summary: We propose staircase streaming for low-latency multi-agent inference.
We show that staircase streaming reduces TTFT by up to 93% while maintaining response quality.
- Score: 43.669722983497856
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in large language models (LLMs) have opened up new directions for leveraging the collective expertise of multiple LLMs. These methods, such as Mixture-of-Agents, typically employ additional inference steps to generate intermediate outputs, which are then used to produce the final response. While multi-agent inference can enhance response quality, it can significantly increase the time to first token (TTFT), posing a challenge for latency-sensitive applications and hurting user experience. To address this issue, we propose staircase streaming for low-latency multi-agent inference. Instead of waiting for the complete intermediate outputs from previous steps, we begin generating the final response as soon as we receive partial outputs from these steps. Experimental results demonstrate that staircase streaming reduces TTFT by up to 93% while maintaining response quality.
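The core idea admits a compact sketch. The following asyncio toy is an illustration of the mechanism described in the abstract, not the paper's implementation: the `proposer` stub, chunk counts, and delays are all assumed. Proposer agents stream their intermediate outputs chunk by chunk into a shared queue, and the aggregator begins emitting final-response tokens after the very first partial chunk arrives, instead of waiting for complete intermediate outputs.

```python
import asyncio

async def proposer(name: str, prompt: str, out: asyncio.Queue) -> None:
    # Stub for a streaming proposer LLM: in a real system this would wrap
    # a streaming completion call and forward chunks as they decode.
    for i in range(3):
        await asyncio.sleep(0.05)            # simulated decoding latency
        await out.put(f"<{name} chunk {i}>")
    await out.put(None)                      # end-of-stream sentinel

async def staircase_stream(prompt: str, agents=("agent_a", "agent_b")):
    queue: asyncio.Queue = asyncio.Queue()
    tasks = [asyncio.create_task(proposer(a, prompt, queue)) for a in agents]
    partials, finished = [], 0
    while finished < len(agents):
        chunk = await queue.get()
        if chunk is None:
            finished += 1
            continue
        partials.append(chunk)
        # Key idea: the aggregator conditions on whatever partial
        # intermediate outputs exist so far, so the first final token
        # appears after one chunk, not after all proposers complete.
        yield f"[final token conditioned on {len(partials)} partial chunk(s)]"
    await asyncio.gather(*tasks)

async def main() -> None:
    async for token in staircase_stream("Explain staircase streaming."):
        print(token)

asyncio.run(main())
```

In this toy, time to first token is one chunk's latency rather than the full length of the slowest proposer, which is the effect the TTFT numbers above quantify.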
Related papers
- A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models [85.30893355216486]
We study how visual token redundancy evolves with different dMLLM architectures and tasks.
Our study reveals that visual redundancy emerges only in from-scratch dMLLMs while handling long-answer tasks.
Layer-skipping is promising for accelerating AR-to-diffusion dMLLMs, whereas progressive or late-step pruning is more effective for from-scratch dMLLMs.
arXiv Detail & Related papers (2025-11-19T04:13:36Z)
- Solving a Million-Step LLM Task with Zero Errors [13.911986576836568]
This paper describes MAKER, the first system to successfully solve a task requiring over one million LLM steps with zero errors.
The results suggest that instead of relying on continual improvement of current LLMs, massively decomposed agentic processes (MDAPs) may provide a way to efficiently solve problems at the level of organizations and societies.
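The decomposition-plus-voting flavor of an MDAP can be sketched in a few lines. This is a toy under strong assumptions, not MAKER's actual scheme (whose error-correction machinery is more elaborate): each tiny step is solved by several independent samples, and the majority answer is kept, driving the per-step error rate toward zero.

```python
from collections import Counter
from typing import Callable, List

def solve_decomposed(subtasks: List[str],
                     sample_solver: Callable[[str, List[str]], str],
                     votes: int = 5) -> List[str]:
    # Toy massively-decomposed process: solve each microstep with several
    # independent LLM samples and keep the majority answer, so per-step
    # error rates shrink enough for very long chains to survive.
    results: List[str] = []
    for task in subtasks:
        ballot = Counter(sample_solver(task, results) for _ in range(votes))
        results.append(ballot.most_common(1)[0][0])
    return results

# Usage with a trivially deterministic stand-in solver:
print(solve_decomposed(["step1", "step2"], lambda t, ctx: t.upper()))
```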
arXiv Detail & Related papers (2025-11-12T06:27:55Z)
- Direct Multi-Token Decoding [24.347862297812977]
We introduce Direct Multi-Token Decoding (DMTD) as an inference paradigm for large language models (LLMs).
Unlike speculative decoding, our method introduces no additional parameters, auxiliary routines, or post-generation verification.
A fine-tuned DMTD Qwen3-4B model has already demonstrated promising results, achieving up to a 2x speedup with only minor performance loss.
arXiv Detail & Related papers (2025-10-13T21:42:37Z)
- Beyond Majority Voting: LLM Aggregation by Leveraging Higher-Order Information [57.397381631496906]
We develop two new aggregation algorithms called Optimal Weight (OW) and Inverse Surprising Popularity (ISP).
Our theoretical analysis shows these methods provably mitigate inherent limitations of majority voting under mild assumptions.
We empirically validate our algorithms on synthetic datasets, popular LLM fine-tuning benchmarks such as UltraFeedback and MMLU, and a real-world healthcare setting, ARMMAN.
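As a baseline for what these aggregators improve on, here is plain weighted voting in a few lines. The weights are simply given here, whereas OW derives them, and ISP exploits higher-order information that this sketch does not model at all.

```python
from collections import defaultdict
from typing import Dict, List

def weighted_vote(answers: List[str], weights: List[float]) -> str:
    # Baseline weighted aggregation: each model's answer counts with a
    # fixed reliability weight; the highest-scoring answer wins.
    scores: Dict[str, float] = defaultdict(float)
    for answer, weight in zip(answers, weights):
        scores[answer] += weight
    return max(scores, key=scores.get)

# The single high-weight model overrides the raw two-vote majority:
print(weighted_vote(["A", "B", "A"], [0.3, 0.9, 0.3]))  # -> "B"
```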
arXiv Detail & Related papers (2025-10-01T22:21:50Z)
- Diffusion Language Models Know the Answer Before Decoding [56.96815863705218]
Diffusion language models (DLMs) have emerged as an alternative to autoregressive approaches.
Our work highlights and leverages an overlooked property of DLMs: early answer convergence.
We introduce Prophet, a training-free fast decoding paradigm that enables early-commit decoding.
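The early-commit idea reduces to: run iterative refinement, watch the decoded answer, and stop once it has converged. Below is a toy loop where `step_fn`, `decode`, and the stability-based stopping rule are all illustrative stand-ins; Prophet's actual commit criterion is not reproduced here.

```python
from typing import Any, Callable, Tuple

def early_commit_decode(step_fn: Callable, decode: Callable, state: Any,
                        max_steps: int = 50,
                        patience: int = 3) -> Tuple[Any, int]:
    # Toy early-commit loop: run iterative (diffusion-style) refinement,
    # but stop once the decoded answer is unchanged for `patience`
    # consecutive steps instead of always paying for all steps.
    prev, stable = None, 0
    for t in range(max_steps):
        state = step_fn(state, t)
        answer = decode(state)
        stable = stable + 1 if answer == prev else 0
        prev = answer
        if stable >= patience:
            break                        # commit early: answer converged
    return prev, t + 1

# Toy refinement that saturates at 5: decoding commits long before step 50.
print(early_commit_decode(lambda s, t: min(s + 1, 5), lambda s: s, 0))
```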
arXiv Detail & Related papers (2025-08-27T15:40:25Z)
- Scaling Textual Gradients via Sampling-Based Momentum [59.94928977345951]
The Textual Gradient Descent (TGD) framework has emerged as a promising data-driven approach.
Scaling the number of training examples improves results at first but later degrades TGD's performance.
We propose Textual Gradient Descent with Momentum (TSGD-M), a method that facilitates scalable in-context learning by reweighting prompt sampling.
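A guess at the reweighting flavor, heavily hedged: keep an exponential moving average of per-prompt scores and sample proportionally to it. TSGD-M's actual estimator is the paper's contribution and is not reproduced here; everything below is an assumed stand-in.

```python
import random
from typing import List, Tuple

def momentum_reweighted_sample(prompts: List[str], scores: List[float],
                               velocity: List[float], beta: float = 0.9,
                               k: int = 2) -> Tuple[List[str], List[float]]:
    # Illustrative momentum-style reweighting: blend fresh scores into an
    # exponential moving average, then sample prompts proportionally.
    velocity = [beta * v + (1 - beta) * s for v, s in zip(velocity, scores)]
    total = sum(velocity)
    weights = [v / total for v in velocity]
    return random.choices(prompts, weights=weights, k=k), velocity

picked, vel = momentum_reweighted_sample(
    ["p1", "p2", "p3"], [0.2, 0.9, 0.4], [0.5, 0.5, 0.5])
print(picked)
```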
arXiv Detail & Related papers (2025-05-31T05:35:45Z)
- Step-by-Step Reasoning for Math Problems via Twisted Sequential Monte Carlo [55.452453947359736]
We introduce a novel verification method based on Twisted Sequential Monte Carlo (TSMC).
We apply TSMC to Large Language Models by estimating the expected future rewards at partial solutions.
This approach results in a more straightforward training target that eliminates the need for step-wise human annotations.
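The core SMC move can be sketched with plain importance resampling; the "twisted" proposal and the learned value function are the paper's contribution, and `value_estimate` below is an assumed stand-in. Each partial solution is weighted by its estimated expected future reward, and resampling lets promising prefixes survive and be extended.

```python
import random
from typing import Callable, List

def resample_partials(partials: List[str],
                      value_estimate: Callable[[str], float],
                      n_particles: int = 8) -> List[str]:
    # SMC-style step: weight each partial solution by estimated expected
    # future reward, then resample so promising prefixes are extended
    # and weak ones die out. Repeating this per reasoning step gives the
    # verification behavior described above.
    weights = [value_estimate(p) for p in partials]
    total = sum(weights)
    probs = [w / total for w in weights]
    return random.choices(partials, weights=probs, k=n_particles)

# Longer partial derivations score higher under this stand-in estimator.
print(resample_partials(["2+2=", "2+2=4, so"], lambda p: float(len(p))))
```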
arXiv Detail & Related papers (2024-10-02T18:17:54Z)
- LiveMind: Low-latency Large Language Models with Simultaneous Inference [9.795240210326346]
We introduce LiveMind, a novel low-latency inference framework for large language models (LLMs).
By reallocating computation to the input phase, the framework achieves a substantial reduction in latency.
The framework adeptly manages the visibility of the streaming input to the model, allowing it to infer from incomplete user input or await additional content.
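A minimal sketch of that interleaving, with an assumed `llm` callable; the framework's actual policies for when to infer versus wait are its contribution and are only caricatured by the prompt below. The model reasons over the incomplete prompt as chunks arrive, caches intermediate inferences, and answers from the cache once input completes.

```python
from typing import Callable, Iterable

def simultaneous_inference(chunks: Iterable[str],
                           llm: Callable[[str], str]) -> str:
    # Reason while the prompt is still arriving: on each new chunk the
    # model may emit an intermediate inference or decide to wait, so most
    # of the work is already done when the user finishes typing.
    prefix, notes = "", []
    for chunk in chunks:
        prefix += chunk
        step = llm(f"Partial prompt: {prefix!r}\nNotes so far: {notes}\n"
                   "Make a useful intermediate inference, or reply WAIT.")
        if step.strip() != "WAIT":
            notes.append(step)
    # The final answer reuses the cached notes, so residual latency is small.
    return llm(f"Prompt: {prefix!r}\nNotes: {notes}\nFinal answer:")

# Echo stub standing in for a real model, just to show the control flow.
print(simultaneous_inference(["What is", " 2+2?"],
                             lambda p: p.splitlines()[0]))
```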
arXiv Detail & Related papers (2024-06-20T13:52:30Z)
- RePrompt: Planning by Automatic Prompt Engineering for Large Language Models Agents [27.807695570974644]
We propose a novel method, RePrompt, which takes a "gradient descent"-like approach to optimizing the step-by-step instructions in the prompts given to LLM agents.
By leveraging intermediate feedback, RePrompt can optimize the prompt without the need for a final-solution checker.
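The loop itself is a few lines once you assume three callables (agent rollout, critic, and prompt reviser); all of them are illustrative stand-ins here, not the paper's interfaces.

```python
from typing import Callable

def reprompt(prompt: str,
             run_agent: Callable[[str], str],
             critic: Callable[[str], str],
             revise: Callable[[str, str], str],
             iterations: int = 3) -> str:
    # "Gradient descent"-like prompt optimization: roll out the agent,
    # collect feedback on its intermediate steps, and revise the
    # step-by-step instructions; no final-solution checker is required.
    for _ in range(iterations):
        trace = run_agent(prompt)            # agent run under current prompt
        feedback = critic(trace)             # feedback on intermediate steps
        prompt = revise(prompt, feedback)    # one "descent step" on the prompt
    return prompt

# Trivial stand-ins to show the control flow:
print(reprompt("Plan step by step.", lambda p: p, lambda t: "ok",
               lambda p, f: p + " (revised)"))
```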
arXiv Detail & Related papers (2024-06-17T01:23:11Z)
- Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving [10.926767319124547]
We present Apparate, a system that automatically applies and manages early exits in machine learning models.
To cope with the time-varying overhead and accuracy challenges that early exits (EEs) bring, Apparate repurposes exits to provide continual feedback.
Apparate lowers median response latencies by 40.5-91.5% and 10.0-24.2% for diverse CV and NLP classification workloads.
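Early exits themselves are easy to illustrate; here is a hand-rolled PyTorch toy (Apparate's contribution is attaching, tuning, and monitoring such exits automatically, none of which this sketch does). A classifier head sits after every block, and inference returns at the first head whose confidence clears a threshold.

```python
import torch
import torch.nn as nn

class EarlyExitNet(nn.Module):
    # Toy network with an exit head after every block; real systems attach
    # these to existing models and retune thresholds as traffic drifts.
    def __init__(self, dim: int = 16, classes: int = 3, blocks: int = 3):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
            for _ in range(blocks))
        self.exits = nn.ModuleList(
            nn.Linear(dim, classes) for _ in range(blocks))

    @torch.no_grad()
    def forward(self, x: torch.Tensor, threshold: float = 0.9):
        for depth, (block, head) in enumerate(zip(self.blocks, self.exits)):
            x = block(x)
            confidence, prediction = head(x).softmax(-1).max(-1)
            if confidence.item() >= threshold:  # confident: skip later layers
                return prediction.item(), depth
        return prediction.item(), depth         # fell through to the last exit

model = EarlyExitNet()
print(model(torch.randn(1, 16)))                # (predicted class, exit depth)
```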
arXiv Detail & Related papers (2023-12-08T21:49:09Z)