How Well do LLMs Compress Their Own Chain-of-Thought? A Token Complexity Approach
- URL: http://arxiv.org/abs/2503.01141v2
- Date: Tue, 01 Apr 2025 00:41:36 GMT
- Title: How Well do LLMs Compress Their Own Chain-of-Thought? A Token Complexity Approach
- Authors: Ayeong Lee, Ethan Che, Tianyi Peng,
- Abstract summary: We conduct the first systematic study of the relationship between reasoning length and model performance.<n>We show that this tradeoff persists across even very distinct reasoning chains.<n>We show that prompt-based compression strategies operate far from theoretical limits.
- Score: 4.055489363682199
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Chain-of-thought prompting has emerged as a powerful technique for enabling large language models (LLMs) to solve complex reasoning tasks. However, these reasoning chains can be verbose, raising concerns about efficiency. In response, recent works have sought to decrease response lengths through simple prompting strategies (e.g. 'be concise'). In this work, we conduct the first systematic study of the relationship between reasoning length and model performance across a diverse range of compression instructions (e.g. 'use 10 words or less' or 'remove all punctuation'). In doing so, we discover a universal tradeoff between reasoning length and accuracy that persists across even very distinct reasoning chains. We demonstrate that this tradeoff emerges from a sharp threshold behavior at the question level: each task has an intrinsic 'token complexity' - a minimal number of tokens required for successful problem-solving. We show how token complexity enables us to compute information-theoretic limits on the accuracy-compression tradeoff, and find that prompt-based compression strategies operate far from these theoretical limits. This suggests there may be significant room for improvement and our framework provides a benchmark to help researchers evaluate progress in reasoning efficiency. Our work also highlights the importance of adaptive compression -- giving shorter responses for easier questions -- and we show that token complexity is a useful tool for measuring this capability.
Related papers
- Efficient Reasoning Models: A Survey [52.96232442322824]
This survey aims to provide a comprehensive overview of recent advances in efficient reasoning.
It categorizes existing works into three key directions: (1) shorter - compressing lengthy CoTs into concise yet effective reasoning chains; (2) smaller - developing compact language models with strong reasoning capabilities; and (3) faster.
arXiv Detail & Related papers (2025-04-15T06:28:00Z) - Critical Thinking: Which Kinds of Complexity Govern Optimal Reasoning Length? [72.70486097967124]
We formalize a framework using deterministic finite automata (DFAs)
We show that there exists an optimal amount of reasoning tokens such that the probability of producing a correct solution is maximized.
We then demonstrate an implication of these findings: being able to predict the optimal number of reasoning tokens for new problems and filtering out non-optimal length answers results in consistent accuracy improvements.
arXiv Detail & Related papers (2025-04-02T17:45:58Z) - Chain of Draft: Thinking Faster by Writing Less [37.492654173517046]
Chain of Draft (CoD) is a novel paradigm inspired by human cognitive processes.<n>CoD generates minimalistic yet informative intermediate reasoning outputs while solving tasks.
arXiv Detail & Related papers (2025-02-25T19:36:06Z) - LightThinker: Thinking Step-by-Step Compression [53.8069487638972]
We propose LightThinker, a method that enables large language models to dynamically compress intermediate thoughts during reasoning.<n>Inspired by human cognitive processes, LightThinker compresses thought steps into compact representations and discards the original reasoning chains.<n>Experiments show that LightThinker reduces peak memory usage and inference time, while maintaining competitive accuracy.
arXiv Detail & Related papers (2025-02-21T16:57:22Z) - A Silver Bullet or a Compromise for Full Attention? A Comprehensive Study of Gist Token-based Context Compression [41.71994217868039]
We show that gist-based compression can achieve near-lossless performance on tasks like retrieval-augmented generation and long-document QA.<n>We identify three key failure patterns: lost by the boundary, lost if surprise, and lost along the way.<n>We propose two effective strategies: fine-grained autoencoding, which enhances the reconstruction of original token information, and segment-wise token importance estimation.
arXiv Detail & Related papers (2024-12-23T11:24:04Z) - Markov Chain of Thought for Efficient Mathematical Reasoning [10.678633785012691]
Chain of Thought (CoT) of multi-step benefits from the logical structure of the reasoning steps and task-specific actions.
We conceptualize the standard multi-step CoT as a novel Markov Chain of Thought (MCoT)
arXiv Detail & Related papers (2024-10-23T07:53:29Z) - To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning [55.52872152909785]
Chain-of-thought (CoT) via prompting is the de facto method for eliciting reasoning capabilities from large language models (LLMs)
We show that CoT gives strong performance benefits primarily on tasks involving math or logic, with much smaller gains on other types of tasks.
arXiv Detail & Related papers (2024-09-18T17:55:00Z) - Let's Think Dot by Dot: Hidden Computation in Transformer Language Models [30.972412126012884]
Chain-of-thought responses from language models improve performance across most benchmarks.
We show that transformers can use meaningless filler tokens in place of a chain of thought to solve two hard algorithmic tasks.
We find that learning to use filler tokens is difficult and requires specific, dense supervision to converge.
arXiv Detail & Related papers (2024-04-24T09:30:00Z) - DCR: Divide-and-Conquer Reasoning for Multi-choice Question Answering with LLMs [9.561022942046279]
We propose Divide and Conquer Reasoning (DCR) to enhance the reasoning capability of large language models (LLMs)
We first categorize questions into two subsets based on confidence score ($mathcalCS$), which is estimated by statistical frequency of generated answers.
In particular, we first categorize questions into two subsets based on confidence score ($mathcalCS$), which is estimated by statistical frequency of generated answers.
arXiv Detail & Related papers (2024-01-10T14:38:46Z) - The Impact of Reasoning Step Length on Large Language Models [40.546685248243534]
Chain of Thought (CoT) is significant in improving the reasoning abilities of large language models.
We investigate the correlation between the effectiveness of CoT and the length of reasoning steps in prompts.
arXiv Detail & Related papers (2024-01-10T04:37:38Z) - Complexity-Based Prompting for Multi-Step Reasoning [72.0057198610614]
We study the task of prompting large-scale language models to perform multi-step reasoning.
A central question is which reasoning examples make the most effective prompts.
We propose complexity-based prompting, a simple and effective example selection scheme for multi-step reasoning.
arXiv Detail & Related papers (2022-10-03T05:33:27Z) - Least-to-Most Prompting Enables Complex Reasoning in Large Language
Models [52.59923418570378]
We propose a novel prompting strategy, least-to-most prompting, to overcome the challenge of easy-to-hard generalization.
We show that least-to-most prompting is capable of generalizing to more difficult problems than those seen in prompts.
neural-symbolic models in the literature that specialize in solving SCAN are trained on the entire training set containing over 15,000 examples.
arXiv Detail & Related papers (2022-05-21T15:34:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.