ImgCoT: Compressing Long Chain of Thought into Compact Visual Tokens for Efficient Reasoning of Large Language Model
- URL: http://arxiv.org/abs/2601.22730v1
- Date: Fri, 30 Jan 2026 09:06:45 GMT
- Title: ImgCoT: Compressing Long Chain of Thought into Compact Visual Tokens for Efficient Reasoning of Large Language Model
- Authors: Xiaoshu Chen, Sihang Zhou, Ke Liang, Taichun Zhou, Xinwang Liu,
- Abstract summary: Compressing long chains of thought (CoT) into compact latent tokens is crucial for efficient reasoning with large language models (LLMs). We propose ImgCoT, which replaces the textual CoT reconstruction target with a visual CoT obtained by rendering the CoT into images. This substitutes linguistic bias with spatial inductive bias, enabling latent tokens to better capture global reasoning structure.
- Score: 34.90582960625524
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Compressing long chains of thought (CoT) into compact latent tokens is crucial for efficient reasoning with large language models (LLMs). Recent studies employ autoencoders to achieve this by reconstructing the textual CoT from latent tokens, thereby encoding CoT semantics. However, treating the textual CoT as the reconstruction target forces latent tokens to preserve surface-level linguistic features (e.g., word choice and syntax), introducing a strong linguistic inductive bias that prioritizes linguistic form over reasoning structure and limits logical abstraction. We therefore propose ImgCoT, which replaces the textual CoT reconstruction target with a visual CoT obtained by rendering the CoT into images. This substitutes the linguistic bias with a spatial inductive bias, i.e., a tendency to model the spatial layout of the reasoning steps in the visual CoT, enabling latent tokens to better capture global reasoning structure. Moreover, although visual latent tokens encode abstract reasoning structure, they may blur reasoning details. We thus propose loose ImgCoT, a hybrid reasoning scheme that augments visual latent tokens with a few key textual reasoning steps, selected based on low token log-likelihood. This design allows LLMs to retain both the global reasoning structure and fine-grained reasoning details with fewer tokens than the complete CoT. Extensive experiments across multiple datasets and LLMs demonstrate the effectiveness of both versions of ImgCoT.
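The abstract describes two concrete ingredients: rendering a textual CoT into an image that serves as the reconstruction target, and (for the loose variant) keeping only the textual steps with low token log-likelihood. The sketch below is a rough illustration of both ideas, not the authors' implementation: the rendering parameters, the step-splitting convention, the selection budget `k`, and the per-step log-probability inputs are all assumptions.

```python
# Minimal sketch (not the paper's code) of the two mechanisms in the abstract:
# (1) render a textual CoT into an image to use as a visual reconstruction target,
# (2) pick "key" textual steps by low average token log-likelihood (loose ImgCoT).
from PIL import Image, ImageDraw, ImageFont


def render_cot_to_image(cot_text: str, width: int = 512, height: int = 512) -> Image.Image:
    """Render the chain of thought on a white background, one step per line."""
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()
    # Keeping one reasoning step per line preserves the spatial layout of steps.
    draw.multiline_text((10, 10), cot_text, fill="black", font=font, spacing=6)
    return img


def select_key_steps(steps: list[str], step_logprobs: list[float], k: int = 2) -> list[str]:
    """Keep the k steps whose average token log-likelihood is lowest.

    step_logprobs[i] is assumed to be the mean log-probability a scoring LLM
    assigns to the tokens of steps[i]; low values mark steps the model finds
    hard to predict, i.e. the ones worth retaining as explicit text.
    """
    ranked = sorted(range(len(steps)), key=lambda i: step_logprobs[i])
    keep = sorted(ranked[:k])  # restore the original order of the kept steps
    return [steps[i] for i in keep]


if __name__ == "__main__":
    cot = "Step 1: 3 apples + 5 apples = 8 apples.\nStep 2: 8 apples * 2 = 16 apples.\nStep 3: Answer: 16."
    render_cot_to_image(cot).save("visual_cot.png")

    steps = cot.split("\n")
    logprobs = [-0.4, -1.9, -0.7]  # illustrative scores; in practice from a scoring LLM
    print(select_key_steps(steps, logprobs, k=1))  # -> ['Step 2: 8 apples * 2 = 16 apples.']
```

In the paper the latent tokens would be produced by an autoencoder trained to reconstruct such rendered images; the helper above only shows what the reconstruction target and the step-selection criterion might look like.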
Related papers
- Tokenization, Fusion and Decoupling: Bridging the Granularity Mismatch Between Large Language Models and Knowledge Graphs [20.946228883628013]
We propose KGT, a novel framework that uses dedicated entity tokens to enable efficient, full-space prediction. We first introduce specialized tokenization to construct feature representations at the level of dedicated entity tokens. We then fuse pre-trained structural and textual features into these unified embeddings via a relation-guided gating mechanism.
arXiv Detail & Related papers (2026-02-26T07:20:40Z) - CoLT: Reasoning with Chain of Latent Tool Calls [31.228763375347608]
Chain-of-Thought (CoT) is a critical technique for enhancing the reasoning ability of Large Language Models (LLMs). We propose CoLT, a novel framework that implements latent reasoning as "tool calls".
arXiv Detail & Related papers (2026-02-04T06:12:53Z) - CoT-Seg: Rethinking Segmentation with Chain-of-Thought Reasoning and Self-Correction [50.67483317563736]
This paper aims to explore a system that can think step-by-step, look up information if needed, generate results, self-evaluate its own results, and refine them. We introduce CoT-Seg, a training-free framework that rethinks reasoning segmentation by combining chain-of-thought reasoning with self-correction.
arXiv Detail & Related papers (2026-01-24T11:41:54Z) - Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning [23.364264811510598]
Chain-of-Thought (CoT) prompting has achieved remarkable success in unlocking the reasoning capabilities of Large Language Models (LLMs). We introduce Render-of-Thought (RoT), the first framework to reify the reasoning chain by rendering textual steps into images. Our method achieves 3-4x token compression and substantial inference acceleration compared to explicit CoT.
arXiv Detail & Related papers (2026-01-21T08:09:25Z) - Reflection Pretraining Enables Token-Level Self-Correction in Biological Sequence Models [82.79223371188756]
Chain-of-Thought (CoT) prompting has advanced task-solving capabilities in natural language processing with large language models. Applying CoT to non-natural-language domains, such as protein and RNA language models, is not yet possible. We introduce reflection pretraining, for the first time in a biological sequence model, which enables the model to engage in intermediate reasoning.
arXiv Detail & Related papers (2025-12-24T05:25:17Z) - Understanding Hardness of Vision-Language Compositionality from A Token-level Causal Lens [12.946160260124378]
Contrastive Language-Image Pre-training delivers strong cross-modal generalization, yet it persistently fails at compositional reasoning over objects, attributes, and relations. We show the existence of pseudo-optimal text encoders that achieve perfect modal-invariant alignment.
arXiv Detail & Related papers (2025-10-30T09:41:21Z) - Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching [64.74765550805024]
Chain-of-Thought prompting elicits step-by-step problem solving, but often at the cost of excessive verbosity in intermediate outputs. We propose Sketch-of-Thought (SoT), a prompting framework that integrates cognitively inspired reasoning paradigms with linguistic constraints. SoT achieves token reductions of up to 84% with minimal accuracy loss across 18 reasoning datasets.
arXiv Detail & Related papers (2025-03-07T06:57:17Z) - Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning [53.57895922042783]
Large Language Models (LLMs) excel at reasoning and planning when trained on chain-of-thought (CoT) data. We propose a hybrid representation of the reasoning process, where we partially abstract away the initial reasoning steps using latent discrete tokens.
arXiv Detail & Related papers (2025-02-05T15:33:00Z) - Efficient Reasoning with Hidden Thinking [48.96945580741641]
Chain-of-Thought (CoT) reasoning has become a powerful framework for improving complex problem-solving capabilities. We propose Heima (as in hidden llama), an efficient reasoning framework that leverages reasoning CoTs in hidden latent space. The Heima model achieves higher generation efficiency while maintaining, or even improving, zero-shot task accuracy.
arXiv Detail & Related papers (2025-01-31T15:10:29Z) - Training Large Language Models to Reason in a Continuous Latent Space [71.0274000348354]
We introduce a new paradigm called Coconut (Chain of Continuous Thought) to explore the potential of reasoning beyond language. Instead of decoding the last hidden state into words, we feed it back to the model as the next input embedding directly in continuous space. This latent reasoning paradigm enables an advanced reasoning pattern, where continuous thoughts can encode multiple alternative next steps; a rough sketch of this feedback loop follows the list below.
arXiv Detail & Related papers (2024-12-09T18:55:56Z)
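The continuous-latent loop summarized in the Coconut entry above can be pictured roughly as follows. This is an illustrative sketch under assumptions, not the authors' code: the choice of `gpt2` as the backbone, the prompt, and the four latent steps are arbitrary, and a real system would eventually switch back to token decoding to produce the answer.

```python
# Rough illustration of a Coconut-style latent loop (not the authors' implementation):
# instead of decoding a token at each step, the last hidden state is appended to the
# input embeddings and fed back into the model as the next "continuous thought".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Question: If x + 3 = 7, what is x? Let's think."
input_ids = tok(prompt, return_tensors="pt").input_ids
embeds = model.get_input_embeddings()(input_ids)  # (1, seq_len, d_model)

with torch.no_grad():
    for _ in range(4):  # four continuous "thoughts"; the count is an assumption
        out = model(inputs_embeds=embeds, output_hidden_states=True)
        last_hidden = out.hidden_states[-1][:, -1:, :]   # last layer, last position
        embeds = torch.cat([embeds, last_hidden], dim=1)  # feed it back as the next input embedding

print(embeds.shape)  # prompt length + 4 latent thought positions
```

This works for GPT-2-style models because the hidden-state width matches the input-embedding width, so the hidden state can be reused directly as an input embedding.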