StreamingThinker: Large Language Models Can Think While Reading
- URL: http://arxiv.org/abs/2510.17238v1
- Date: Mon, 20 Oct 2025 07:27:37 GMT
- Title: StreamingThinker: Large Language Models Can Think While Reading
- Authors: Junlong Tong, Yingqi Fan, Anhao Zhao, Yunpu Ma, Xiaoyu Shen
- Abstract summary: Large language models (LLMs) have demonstrated remarkable capabilities in chain of thought (CoT) reasoning. Inspired by human cognition of thinking while reading, we first design a streaming thinking paradigm for LLMs. We instantiate this paradigm with StreamingThinker, a framework that enables LLMs to think while reading.
- Score: 14.54868327561777
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in chain of thought (CoT) reasoning. However, the current LLM reasoning paradigm initiates thinking only after the entire input is available, which introduces unnecessary latency and weakens attention to earlier information in dynamic scenarios. Inspired by human cognition of thinking while reading, we first design a streaming thinking paradigm for LLMs, where reasoning unfolds in the order of input and further adjusts its depth once reading is complete. We instantiate this paradigm with StreamingThinker, a framework that enables LLMs to think while reading through the integration of streaming CoT generation, streaming-constraint training, and streaming parallel inference. Specifically, StreamingThinker employs streaming reasoning units with quality control for CoT generation, enforces order-preserving reasoning through streaming attention masks and position encoding, and leverages parallel KV caches that decouple input encoding from reasoning generation, thereby ensuring alignment and enabling true concurrency. We evaluate StreamingThinker on the Qwen3 model family across math reasoning, logical reasoning, and context-based QA reasoning tasks. Experimental results show that StreamingThinker preserves performance comparable to batch thinking, while yielding an 80% reduction in token waiting before the onset of reasoning and a more than 60% reduction in time-level latency for producing the final answer, demonstrating the effectiveness of the streaming paradigm for LLM reasoning. Code will be released at https://github.com/EIT-NLP/StreamingLLM/tree/main/StreamingThinker.
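The abstract's "order-preserving reasoning through streaming attention masks" can be illustrated with a toy mask: reasoning tokens attend to all input read so far plus earlier reasoning, while input encoding stays independent of reasoning tokens (mirroring the decoupled KV caches). This is a minimal sketch of the general idea, not the paper's actual mask construction; the chunk/unit layout and function name are assumptions.

```python
import numpy as np

def streaming_mask(chunk_len: int, unit_len: int, n_chunks: int) -> np.ndarray:
    """Toy order-preserving attention mask (illustrative only).

    Token layout: [input chunk 0 | reasoning unit 0 | input chunk 1 | ...].
    mask[q, k] is True if query token q may attend to key token k.
    """
    total = n_chunks * (chunk_len + unit_len)
    # Label each position with (segment index, is_reasoning_token).
    kinds = []
    for c in range(n_chunks):
        kinds += [(c, False)] * chunk_len + [(c, True)] * unit_len
    mask = np.zeros((total, total), dtype=bool)
    for q in range(total):
        for k in range(q + 1):  # causal: no attention to future tokens
            _, q_reason = kinds[q]
            _, k_reason = kinds[k]
            if q_reason:
                # Reasoning sees all input read so far and earlier reasoning.
                mask[q, k] = True
            else:
                # Input encoding is decoupled: it skips reasoning tokens.
                mask[q, k] = not k_reason
    return mask
```

With `chunk_len=2, unit_len=1, n_chunks=2`, reasoning unit 0 (position 2) sees only chunk 0, and input chunk 1 never attends to reasoning tokens, so input encoding can proceed in parallel with reasoning generation.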
Related papers
- Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models [14.21980212001207]
Motivated by the streaming nature of video data, we investigate two streaming reasoning paradigms for LVLMs. To better match streaming inputs, we propose Think-as-You-See (TaYS), a unified framework enabling true concurrent reasoning.
arXiv Detail & Related papers (2026-03-03T11:24:55Z) - LaSER: Internalizing Explicit Reasoning into Latent Space for Dense Retrieval [74.72139580745511]
LaSER is a novel self-distillation framework that internalizes explicit reasoning into the latent space of retrievers. Our method successfully combines the reasoning depth of explicit CoT pipelines with the inference efficiency of standard dense retrievers.
arXiv Detail & Related papers (2026-03-02T04:11:18Z) - Latent Reasoning with Supervised Thinking States [60.09942890192309]
Reasoning with a chain-of-thought (CoT) enables Large Language Models (LLMs) to solve complex tasks but incurs significant inference costs. We propose Thinking States, a method that performs reasoning while the input is being processed. We show Thinking States leads to stronger reasoning behavior than CoT, successfully extrapolating to longer sequences than seen during training.
arXiv Detail & Related papers (2026-02-09T07:12:41Z) - Rethinking Chain-of-Thought Reasoning for Videos [19.579424881079447]
Chain-of-thought (CoT) reasoning has been highly successful in solving complex tasks in natural language processing. Recent multimodal large language models (MLLMs) have extended this paradigm to video reasoning. Motivated by empirical observations, we hypothesize that concise reasoning combined with a reduced set of visual tokens can be sufficient for effective video reasoning.
arXiv Detail & Related papers (2025-12-10T13:05:55Z) - Diffuse Thinking: Exploring Diffusion Language Models as Efficient Thought Proposers for Reasoning [11.437063355666593]
We propose an efficient collaborative reasoning framework, leveraging DLMs to generate candidate thoughts and LLMs to evaluate their quality. Our framework achieves strong performance in complex reasoning tasks, offering a promising direction for future research.
arXiv Detail & Related papers (2025-10-31T13:41:30Z) - MARCOS: Deep Thinking by Markov Chain of Continuous Thoughts [82.46857666702924]
We present a new paradigm for reasoning in large language models (LLMs). Instead of autoregressively generating tokens, we model reasoning as a hidden Markov chain of continuous, high-dimensional "thoughts". For the first time, MARCOS achieves performance comparable to token-based CoT, even surpassing it by 4.7% on GSM8K with up to 15.7x speedup in inference.
arXiv Detail & Related papers (2025-09-29T16:44:22Z) - Overclocking LLM Reasoning: Monitoring and Controlling Thinking Path Lengths in LLMs [52.663816303997194]
A key factor influencing answer quality is the length of the thinking stage. This paper explores and exploits the mechanisms by which LLMs understand and regulate the length of their reasoning. Our results demonstrate that this "overclocking" method mitigates overthinking, improves answer accuracy, and reduces inference latency.
arXiv Detail & Related papers (2025-06-08T17:54:33Z) - LLM as Effective Streaming Processor: Bridging Streaming-Batch Mismatches with Group Position Encoding [29.586274567275012]
It is commonly assumed that the latter two mismatches require frequent re-encoding; we find that re-encoding outputs is largely unnecessary. We introduce a group position encoding paradigm built on batch architectures to enhance consistency between streaming and batch modes. Our method requires no architectural modifications and exhibits strong generalization in both streaming and batch modes.
arXiv Detail & Related papers (2025-05-22T17:53:28Z) - Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space [62.54887038032942]
We introduce Soft Thinking, a training-free method that emulates human-like "soft" reasoning by generating soft, abstract concept tokens. These concept tokens are created by the probability-weighted mixture of token embeddings, which form the continuous concept space. In essence, each generated concept token encapsulates multiple meanings from related discrete tokens, implicitly exploring various reasoning paths to converge.
arXiv Detail & Related papers (2025-05-21T17:29:15Z) - Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching [64.74765550805024]
Chain-of-Thought prompting elicits step-by-step problem solving, but often at the cost of excessive verbosity in intermediate outputs. We propose Sketch-of-Thought (SoT), a prompting framework that integrates cognitively inspired reasoning paradigms with linguistic constraints. SoT achieves token reductions of up to 84% with minimal accuracy loss across 18 reasoning datasets.
arXiv Detail & Related papers (2025-03-07T06:57:17Z) - SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs [48.28847964704554]
Chain-of-Thought (CoT) reasoning enables Large Language Models (LLMs) to solve complex reasoning tasks. We propose a novel approach for continuous-space reasoning that does not require modifying the LLM.
arXiv Detail & Related papers (2025-02-17T18:52:29Z)
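The Soft Thinking entry above describes concept tokens as probability-weighted mixtures of token embeddings. A minimal sketch of that mixture, assuming only a logits vector and an embedding matrix (the function name and shapes are illustrative, not the paper's actual code):

```python
import numpy as np

def concept_token(logits: np.ndarray, embeddings: np.ndarray) -> np.ndarray:
    """Form a continuous concept token as the softmax-weighted mixture
    of token embeddings (illustrating the Soft Thinking summary above).

    logits:     (vocab,) next-token logits
    embeddings: (vocab, dim) token embedding matrix
    Returns a (dim,) vector blending the meanings of likely tokens.
    """
    p = np.exp(logits - logits.max())  # numerically stable softmax
    p /= p.sum()
    return p @ embeddings              # probability-weighted mixture
```

When the logits are near-uniform the result averages the candidate embeddings, so a single continuous token can carry several plausible discrete continuations at once.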
This list is automatically generated from the titles and abstracts of the papers in this site.