Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning
- URL: http://arxiv.org/abs/2507.16784v1
- Date: Tue, 22 Jul 2025 17:30:04 GMT
- Title: Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning
- Authors: Hongyin Luo, Nathaniel Morgan, Tina Li, Derek Zhao, Ai Vy Ngo, Philip Schroeder, Lijie Yang, Assaf Ben-Kish, Jack O'Brien, James Glass
- Abstract summary: We propose the Thread Inference Model (TIM), a family of large language models, and TIMRUN, an inference runtime. Together, TIM hosted on TIMRUN supports virtually unlimited working memory and multi-hop tool calls within a single language model inference. We show that our system sustains high inference throughput, even when manipulating up to 90% of the KV cache in GPU memory.
- Score: 11.810780820457927
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To break the context limits of large language models (LLMs) that bottleneck reasoning accuracy and efficiency, we propose the Thread Inference Model (TIM), a family of LLMs trained for recursive and decompositional problem solving, and TIMRUN, an inference runtime enabling long-horizon structured reasoning beyond context limits. Together, TIM hosted on TIMRUN supports virtually unlimited working memory and multi-hop tool calls within a single language model inference, overcoming output limits, positional-embedding constraints, and GPU-memory bottlenecks. This performance is achieved by modeling natural language as reasoning trees, measured by both length and depth, instead of linear sequences. The reasoning trees consist of tasks with thoughts, recursive subtasks, and conclusions, based on the concept we proposed in Schroeder et al., 2025. During generation, we maintain a working memory that retains only the key-value states of the most relevant context tokens, selected by a rule-based subtask-pruning mechanism, enabling reuse of positional embeddings and GPU memory pages throughout reasoning. Experimental results show that our system sustains high inference throughput, even when manipulating up to 90% of the KV cache in GPU memory. It also delivers accurate reasoning on mathematical tasks and handles information retrieval challenges that require long-horizon reasoning and multi-hop tool use.
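The reasoning-tree and subtask-pruning idea in the abstract can be illustrated with a small Python sketch. This is not the authors' implementation: the `Task` class, the `full_trace` and `working_memory` functions, and the specific pruning rule (once a task has a conclusion, its subtasks are dropped and only its own thought and conclusion are kept) are hypothetical simplifications chosen for illustration. The real system operates on key-value states in GPU memory, not on strings.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Task:
    """Hypothetical reasoning-tree node: a thought, subtasks, and a conclusion."""
    thought: str
    subtasks: List["Task"] = field(default_factory=list)
    conclusion: Optional[str] = None  # None means the task is still open


def full_trace(task: Task) -> List[str]:
    """Everything generated so far, as a plain linear sequence."""
    items = [task.thought]
    for sub in task.subtasks:
        items.extend(full_trace(sub))
    if task.conclusion is not None:
        items.append(task.conclusion)
    return items


def working_memory(task: Task) -> List[str]:
    """Assumed pruning rule: a concluded task keeps only its thought and
    conclusion; an open task keeps its thought and recurses into subtasks."""
    items = [task.thought]
    if task.conclusion is not None:
        items.append(task.conclusion)  # subtasks are pruned
    else:
        for sub in task.subtasks:
            items.extend(working_memory(sub))
    return items


# A root task with one finished subtask (which itself had a subtask)
# and one subtask that is still open.
root = Task(
    thought="Compute (2 + 3) * (4 + 1)",
    subtasks=[
        Task(
            thought="Add 2 and 3",
            subtasks=[Task(thought="Count up from 2 three times", conclusion="5")],
            conclusion="2 + 3 = 5",
        ),
        Task(thought="Add 4 and 1"),
    ],
)

print(len(full_trace(root)), "items in the full trace")
print(len(working_memory(root)), "items retained in working memory")
```

In this toy example the finished subtask collapses to its thought and conclusion, so the retained context grows with the depth of the open path rather than with the total amount of text generated.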
Related papers
- PixelThink: Towards Efficient Chain-of-Pixel Reasoning [70.32510083790069]
PixelThink is a simple yet effective scheme that integrates externally estimated task difficulty and internally measured model uncertainty. It learns to compress reasoning length in accordance with scene complexity and predictive confidence. Experimental results demonstrate that the proposed approach improves both reasoning efficiency and overall segmentation performance.
arXiv Detail & Related papers (2025-05-29T17:55:49Z)
- Reasoning on Multiple Needles In A Haystack [9.765859280987053]
We tackle the memory-based answering problem by filtering out direct-answer questions. We build on this insight to introduce a reflection mechanism for multi-round extension. We train a model using the generated iterative thinking process, which helps mitigate the performance degradation.
arXiv Detail & Related papers (2025-04-05T11:58:08Z)
- Can Memory-Augmented Language Models Generalize on Reasoning-in-a-Haystack Tasks? [31.192647141427233]
We propose MemReasoner, a new and simple memory-augmented LLM architecture. MemReasoner is trained end-to-end, with optional supporting fact supervision. We show strong generalization of MemReasoner on both single- and two-hop tasks.
arXiv Detail & Related papers (2025-03-10T22:48:53Z)
- Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching [60.04718679054704]
Chain-of-Thought prompting elicits step-by-step problem solving, but often at the cost of excessive verbosity in intermediate outputs. We propose Sketch-of-Thought (SoT), a prompting framework that integrates cognitively inspired reasoning paradigms with linguistic constraints. SoT achieves token reductions of up to 78% with minimal accuracy loss across 15 reasoning datasets.
arXiv Detail & Related papers (2025-03-07T06:57:17Z)
- Efficient Reasoning with Hidden Thinking [48.96945580741641]
Chain-of-Thought (CoT) reasoning has become a powerful framework for improving complex problem-solving capabilities. We propose Heima (as hidden llama), an efficient reasoning framework that leverages reasoning CoTs in a hidden latent space. The Heima model achieves higher generation efficiency while maintaining or even improving zero-shot task accuracy.
arXiv Detail & Related papers (2025-01-31T15:10:29Z)
- Supervised Chain of Thought [5.389461633686935]
Chain of Thought (CoT) prompting offers a promising approach to solving complex reasoning tasks.
A one-prompt-for-all approach poses significant challenges for models to generate the correct reasoning steps.
We show how task-specific supervision is essential for navigating the prompt space accurately and achieving optimal performance.
arXiv Detail & Related papers (2024-10-18T06:25:27Z)
- Reason for Future, Act for Now: A Principled Framework for Autonomous LLM Agents with Provable Sample Efficiency [53.8779374188643]
We propose a principled framework with provable regret guarantees to orchestrate reasoning and acting.
Specifically, we design a prompt template for reasoning that learns from the memory buffer and plans a future trajectory over a long horizon.
At each step, the LLM agent takes the initial action of the planned trajectory ("act for now"), stores the collected feedback in the memory buffer, and reinvokes the reasoning routine to replan the future trajectory from the new state.
arXiv Detail & Related papers (2023-09-29T16:36:39Z)
- RET-LLM: Towards a General Read-Write Memory for Large Language Models [53.288356721954514]
RET-LLM is a novel framework that equips large language models with a general write-read memory unit.
Inspired by Davidsonian semantics theory, we extract and save knowledge in the form of triplets.
Our framework exhibits robust performance in handling temporal-based question answering tasks.
arXiv Detail & Related papers (2023-05-23T17:53:38Z)
- Recognizing Long Grammatical Sequences Using Recurrent Networks Augmented With An External Differentiable Stack [73.48927855855219]
Recurrent neural networks (RNNs) are a widely used deep architecture for sequence modeling, generation, and prediction.
RNNs generalize poorly over very long sequences, which limits their applicability to many important temporal processing and time series forecasting problems.
One way to address these shortcomings is to couple an RNN with an external, differentiable memory structure, such as a stack.
In this paper, we improve the memory-augmented RNN with important architectural and state updating mechanisms.
arXiv Detail & Related papers (2020-04-04T14:19:15Z)