Not All Tokens Are What You Need In Thinking
- URL: http://arxiv.org/abs/2505.17827v2
- Date: Sat, 02 Aug 2025 10:54:47 GMT
- Title: Not All Tokens Are What You Need In Thinking
- Authors: Hang Yuan, Bin Yu, Haotian Li, Shijun Yang, Christina Dan Wang, Zhou Yu, Xueyin Xu, Weizhen Qi, Kai Chen
- Abstract summary: Conditional Token Selection (CTS) identifies and preserves only the most essential tokens in chains of thought. CTS effectively compresses long CoT while maintaining strong reasoning performance. Further reducing training tokens by 42% incurs only a marginal 5% accuracy drop while yielding a 75.8% reduction in reasoning tokens.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern reasoning models, such as OpenAI's o1 and DeepSeek-R1, exhibit impressive problem-solving capabilities but suffer from critical inefficiencies: high inference latency, excessive computational resource consumption, and a tendency toward overthinking -- generating verbose chains of thought (CoT) laden with redundant tokens that contribute minimally to the final answer. To address these issues, we propose Conditional Token Selection (CTS), a token-level compression framework with a flexible and variable compression ratio that identifies and preserves only the most essential tokens in CoT. CTS evaluates each token's contribution to deriving correct answers using conditional importance scoring, then trains models on compressed CoT. Extensive experiments demonstrate that CTS effectively compresses long CoT while maintaining strong reasoning performance. Notably, on the GPQA benchmark, Qwen2.5-14B-Instruct trained with CTS achieves a 9.1% accuracy improvement with 13.2% fewer reasoning tokens (13% training token reduction). Further reducing training tokens by 42% incurs only a marginal 5% accuracy drop while yielding a 75.8% reduction in reasoning tokens, highlighting the prevalence of redundancy in existing CoT.
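The core of CTS is a conditional importance score: each CoT token is rated by how much it contributes to reaching the correct answer, low-scoring tokens are dropped at a chosen compression ratio, and the model is then trained on the compressed traces. Below is a minimal sketch of what such a scorer could look like; the gpt2 reference model, the answer-conditioned log-probability gain used as the importance proxy, and the `keep_ratio` parameter are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of conditional token-importance scoring for CoT compression (CTS-style).
# Assumptions: gpt2 as a stand-in scorer LM; importance proxy = answer-conditioned
# minus unconditioned log-probability of each CoT token; keep_ratio chosen freely.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder reference model
tok = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def token_logprobs(prefix: str, target: str) -> torch.Tensor:
    """Per-token log P(target_j | prefix, target_<j) under the scorer LM."""
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    target_ids = tok(target, return_tensors="pt").input_ids
    ids = torch.cat([prefix_ids, target_ids], dim=1)
    logits = lm(ids).logits[:, :-1, :]               # position i predicts token i+1
    logprobs = torch.log_softmax(logits, dim=-1)
    start = prefix_ids.shape[1] - 1                  # predictor of the first target token
    idx = target_ids[0]
    return logprobs[0, start:start + idx.shape[0]].gather(-1, idx[:, None]).squeeze(-1)

def compress_cot(question: str, cot: str, answer: str, keep_ratio: float = 0.6) -> str:
    """Keep the CoT tokens that gain the most probability once the answer is known."""
    uncond = token_logprobs(question + "\n", cot)
    cond = token_logprobs(question + "\nAnswer: " + answer + "\nReasoning: ", cot)
    gain = cond - uncond                             # conditional importance proxy
    k = max(1, int(keep_ratio * gain.shape[0]))
    keep = set(torch.topk(gain, k).indices.tolist())
    cot_ids = tok(cot, return_tensors="pt").input_ids[0].tolist()
    return tok.decode([t for i, t in enumerate(cot_ids) if i in keep])
```

In a full pipeline the surviving tokens would form the compressed CoT used as supervised fine-tuning data, with `keep_ratio` playing the role of the flexible compression ratio described in the abstract.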
Related papers
- Block-based Symmetric Pruning and Fusion for Efficient Vision Transformers [11.916258576313776]
Vision Transformer (ViT) has achieved impressive results across various vision tasks. Recent methods have aimed to reduce ViT's $O(n^2)$ complexity by pruning unimportant tokens. We introduce a novel Block-based Symmetric Pruning and Fusion method for efficient ViTs.
arXiv Detail & Related papers (2025-07-16T10:48:56Z) - VeriThinker: Learning to Verify Makes Reasoning Model Efficient [52.74493506816969]
Large Reasoning Models excel at complex tasks using Chain-of-Thought (CoT) reasoning. Their tendency to overthink leads to unnecessarily lengthy reasoning chains. We introduce VeriThinker, a novel approach for CoT compression.
arXiv Detail & Related papers (2025-05-23T14:17:56Z) - R1-Compress: Long Chain-of-Thought Compression via Chunk Compression and Search [61.4807238517108]
Chain-of-Thought (CoT) reasoning enhances large language models (LLMs) by enabling step-by-step problem-solving. CoT's extension to Long-CoT introduces substantial computational overhead due to increased token length. We propose R1-Compress, a two-stage chunk-level compression framework that preserves both local information and coherence.
arXiv Detail & Related papers (2025-05-22T16:06:59Z) - Accelerating Chain-of-Thought Reasoning: When Goal-Gradient Importance Meets Dynamic Skipping [3.521097198612099]
Adaptive GoGI-Skip is a novel framework that learns dynamic CoT compression via supervised fine-tuning. It achieves substantial efficiency gains, reducing CoT token counts by over 45% on average and delivering 1.6-2.0 times inference speedups. Notably, it significantly outperforms existing baselines by preserving accuracy even at high effective compression rates.
arXiv Detail & Related papers (2025-05-13T09:39:18Z) - Hawkeye: Efficient Reasoning with Model Collaboration [7.26791045376255]
Chain-of-Thought (CoT) reasoning has demonstrated remarkable effectiveness in enhancing the reasoning abilities of large language models (LLMs). Most CoT tokens are unnecessary, and retaining only a small portion of them is sufficient for producing high-quality responses. We propose HAWKEYE, a novel post-training and inference framework where a large model produces concise CoT instructions to guide a smaller model in response generation.
arXiv Detail & Related papers (2025-04-01T05:09:04Z) - TokenSkip: Controllable Chain-of-Thought Compression in LLMs [11.583847083770031]
Chain-of-Thought (CoT) has been proven effective in enhancing the reasoning capabilities of large language models (LLMs). We propose TokenSkip, a simple yet effective approach that enables LLMs to selectively skip less important tokens, allowing for controllable CoT compression.
arXiv Detail & Related papers (2025-02-17T17:37:26Z) - CoT-Valve: Length-Compressible Chain-of-Thought Tuning [50.196317781229496]
We introduce a new tuning and inference strategy named CoT-Valve, designed to allow models to generate reasoning chains of varying lengths. We show that CoT-Valve successfully enables controllability and compressibility of the chain and outperforms prompt-based length control.
arXiv Detail & Related papers (2025-02-13T18:52:36Z) - Is Less More? Exploring Token Condensation as Training-free Test-time Adaptation [43.09801987385207]
Contrastive Language-Image Pretraining (CLIP) excels at learning generalizable image representations but often falls short in zero-shot inference on certain datasets. Test-time adaptation (TTA) mitigates this issue by adjusting components like normalization layers or context prompts, yet it typically requires large batch sizes and extensive augmentations. We propose Token Condensation as Adaptation (TCA), a training-free adaptation method that goes a step beyond standard token condensation (TC).
arXiv Detail & Related papers (2024-10-16T07:13:35Z) - Token Compensator: Altering Inference Cost of Vision Transformer without Re-Tuning [63.43972993473501]
Token compression expedites the training and inference of Vision Transformers (ViTs).
However, when applied to downstream tasks, compression degrees are mismatched between training and inference stages.
We propose a model arithmetic framework to decouple the compression degrees between the two stages.
arXiv Detail & Related papers (2024-08-13T10:36:43Z) - Peeling the Onion: Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training [110.79400526706081]
Vision transformers (ViTs) have recently achieved success in many applications, but their intensive computation and heavy memory usage limit their generalization.
Previous compression algorithms usually start from pre-trained dense models and focus only on efficient inference.
This paper proposes an end-to-end efficient training framework from three sparse perspectives, dubbed Tri-Level E-ViT.
arXiv Detail & Related papers (2022-11-19T21:15:47Z)