Accelerating Chain-of-Thought Reasoning: When Goal-Gradient Importance Meets Dynamic Skipping
- URL: http://arxiv.org/abs/2505.08392v2
- Date: Sat, 17 May 2025 14:59:34 GMT
- Title: Accelerating Chain-of-Thought Reasoning: When Goal-Gradient Importance Meets Dynamic Skipping
- Authors: Ren Zhuang, Ben Wang, Shuifa Sun
- Abstract summary: Adaptive GoGI-Skip is a novel framework that learns dynamic CoT compression via supervised fine-tuning. It achieves substantial efficiency gains, reducing CoT token counts by over 45% on average and delivering 1.6-2.0x inference speedups. Notably, it significantly outperforms existing baselines by preserving accuracy even at high effective compression rates.
- Score: 3.521097198612099
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models leverage Chain-of-Thought (CoT) prompting for complex tasks, but their reasoning traces are often excessively verbose and inefficient, leading to significant computational costs and latency. Current CoT compression techniques typically rely on generic importance metrics and static compression rates, which may inadvertently remove functionally critical tokens or fail to adapt to varying reasoning complexity. To overcome these limitations, we propose Adaptive GoGI-Skip, a novel framework that learns dynamic CoT compression via supervised fine-tuning. This approach introduces two synergistic innovations: (1) Goal-Gradient Importance (GoGI), a novel metric accurately identifying functionally relevant tokens by measuring the gradient influence of their intermediate representations on the final answer loss, and (2) Adaptive Dynamic Skipping (ADS), a mechanism dynamically regulating the compression rate based on runtime model uncertainty while ensuring local coherence through an adaptive N-token constraint. To our knowledge, this is the first work unifying a goal-oriented, gradient-based importance metric with dynamic, uncertainty-aware skipping for CoT compression. Trained on compressed MATH data, Adaptive GoGI-Skip demonstrates strong cross-domain generalization across diverse reasoning benchmarks including AIME, GPQA, and GSM8K. It achieves substantial efficiency gains - reducing CoT token counts by over 45% on average and delivering 1.6-2.0 times inference speedups - while maintaining high reasoning accuracy. Notably, it significantly outperforms existing baselines by preserving accuracy even at high effective compression rates, advancing the state of the art in the CoT reasoning efficiency-accuracy trade-off.
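The two mechanisms described above are concrete enough to sketch. Below is a minimal, illustrative Python/PyTorch rendering, assuming a Hugging Face-style causal LM: a GoGI-style score computed from the gradient of the answer-only loss with respect to each token's intermediate hidden state, and an ADS-style gate that maps predictive entropy to a keep ratio under a local N-token coherence constraint. The function names, layer choice, entropy normalization, and all bounds are assumptions for illustration, not the paper's published configuration.

```python
# Illustrative sketch only: a GoGI-style token score and an ADS-style gate.
# Layer index, entropy normalization, and all bounds are assumed values.
import torch

def gogi_scores(model, input_ids, answer_mask, layer=-1):
    """Per-token influence: gradient norm of the answer-only loss w.r.t.
    each token's intermediate hidden state (batch size 1 assumed)."""
    labels = input_ids.clone()
    labels[~answer_mask] = -100                   # loss only on the answer span
    out = model(input_ids, labels=labels, output_hidden_states=True)
    h = out.hidden_states[layer]                  # (1, seq, dim), in the graph
    h.retain_grad()                               # keep grad on a non-leaf tensor
    out.loss.backward()                           # call model.zero_grad() between traces
    return h.grad.norm(dim=-1).squeeze(0)         # (seq,) importance scores

def adaptive_keep_ratio(logits, lo=0.3, hi=0.9):
    """Entropy-gated compression: more model uncertainty -> keep more tokens.
    The [lo, hi] bounds and max-normalization are assumptions."""
    p = logits.softmax(-1)
    ent = -(p * p.clamp_min(1e-9).log()).sum(-1)  # per-position predictive entropy
    return lo + (hi - lo) * (ent / ent.max()).mean().item()

def select_tokens(scores, keep_ratio, max_gap=4):
    """Top-k selection by score, then an N-token coherence constraint:
    never drop more than max_gap consecutive tokens."""
    k = max(1, int(scores.numel() * keep_ratio))
    keep = torch.zeros_like(scores, dtype=torch.bool)
    keep[scores.topk(k).indices] = True
    gap = 0
    for i in range(scores.numel()):               # local-coherence repair pass
        gap = 0 if keep[i] else gap + 1
        if gap > max_gap:
            keep[i], gap = True, 0
    return keep                                   # mask over CoT token positions
```

A plausible offline recipe in the spirit of the abstract, not a confirmed pipeline: score teacher CoT traces with gogi_scores, drop low-scoring tokens subject to select_tokens at the ratio given by adaptive_keep_ratio, and fine-tune the model on the compressed traces.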
Related papers
- R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning [60.37610817226533]
Chain-of-thought (CoT) reasoning encourages step-by-step intermediate reasoning during inference, but it introduces substantial computational overhead due to its reliance on autoregressive decoding over long token sequences. We present R-Stitch, a token-level, confidence-based hybrid decoding framework that accelerates CoT inference; a generic sketch of this confidence-routing pattern appears after this list.
arXiv Detail & Related papers (2025-07-23T08:14:36Z) - EKPC: Elastic Knowledge Preservation and Compensation for Class-Incremental Learning [53.88000987041739]
Class-Incremental Learning (CIL) aims to enable AI models to continuously learn from sequentially arriving data of different classes over time. We propose the Elastic Knowledge Preservation and Compensation (EKPC) method, integrating Importance-aware Parameter Regularization (IPR) and Trainable Semantic Drift Compensation (TSDC) for CIL.
arXiv Detail & Related papers (2025-06-14T05:19:58Z) - Token Signature: Predicting Chain-of-Thought Gains with Token Decoding Feature in Large Language Models [9.282278040339138]
The Chain-of-Thought (CoT) technique has proven effective in improving the performance of large language models (LLMs) on complex reasoning tasks. We make a preliminary observation that the monotonicity of token probability distributions may be correlated with the gains achieved through CoT reasoning, and we propose two indicators based on the token probability distribution to assess CoT effectiveness across different tasks.
arXiv Detail & Related papers (2025-06-06T11:53:27Z) - Mechanistic Insights into Grokking from the Embedding Layer [15.676058752772287]
Grokking, a delayed generalization in neural networks, has been observed in Transformers and MLPs, but the components driving it remain underexplored. We show that embeddings are central to grokking: introducing them into MLPs induces delayed generalization in modular arithmetic tasks. Our methods not only improve grokking dynamics but also extend to broader challenges in Transformer optimization, where bilinear interactions hinder efficient training.
arXiv Detail & Related papers (2025-05-21T15:12:34Z) - Outlier-aware Tensor Robust Principal Component Analysis with Self-guided Data Augmentation [21.981038455329013]
We propose a self-guided data augmentation approach that employs adaptive weighting to suppress outlier influence. We show improvements in both accuracy and computational efficiency compared to state-of-the-art methods.
arXiv Detail & Related papers (2025-04-25T13:03:35Z) - SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning [14.020244011380063]
SpecReason is a system that accelerates LRM inference by using a lightweight model to (speculatively) carry out simpler intermediate reasoning steps. It achieves a 1.5-2.5x speedup over vanilla LRM inference while improving accuracy by 1.0-9.9%. Compared to speculative decoding without SpecReason, combining the two yields an additional 19.4-44.2% latency reduction; the same confidence-routing pattern is sketched after this list.
arXiv Detail & Related papers (2025-04-10T16:05:19Z) - Efficient Token Compression for Vision Transformer with Spatial Information Preserved [59.79302182800274]
Token compression is essential for reducing the computational and memory requirements of transformer models. We propose an efficient and hardware-compatible token compression method called Prune and Merge.
arXiv Detail & Related papers (2025-03-30T14:23:18Z) - Mixture Compressor for Mixture-of-Experts LLMs Gains More [71.0473038084673]
We propose a training-free Mixture-Compressor for Mixture-of-Experts large language models (MoE-LLMs). Our MC integrates static quantization and dynamic pruning to collaboratively achieve extreme compression for MoE-LLMs with less accuracy loss. For instance, at 2.54 bits, MC compresses 76.6% of the model, with only a 3.8% average accuracy loss.
arXiv Detail & Related papers (2024-10-08T18:09:38Z) - Causal Context Adjustment Loss for Learned Image Compression [72.7300229848778]
In recent years, learned image compression (LIC) technologies have surpassed conventional methods notably in terms of rate-distortion (RD) performance.
Most present techniques are VAE-based with an autoregressive entropy model, which improves RD performance by utilizing the decoded causal context.
In this paper, we make the first attempt to explicitly adjust the causal context via our proposed Causal Context Adjustment loss.
arXiv Detail & Related papers (2024-10-07T09:08:32Z) - Parameter-Efficient Fine-Tuning for Continual Learning: A Neural Tangent Kernel Perspective [125.00228936051657]
We introduce NTK-CL, a novel framework that eliminates task-specific parameter storage while adaptively generating task-relevant features.
By fine-tuning optimizable parameters with appropriate regularization, NTK-CL achieves state-of-the-art performance on established PEFT-CL benchmarks.
arXiv Detail & Related papers (2024-07-24T09:30:04Z) - Effective Interplay between Sparsity and Quantization: From Theory to Practice [33.697590845745815]
We show how sparsity and quantization interact when combined. Even when applied in the correct order, the compounded errors from sparsity and quantization can significantly harm accuracy; a toy demonstration of this ordering effect appears after this list. Our findings extend to the efficient deployment of large models on resource-constrained compute platforms.
arXiv Detail & Related papers (2024-05-31T15:34:13Z) - Towards Continual Learning Desiderata via HSIC-Bottleneck Orthogonalization and Equiangular Embedding [55.107555305760954]
We propose a conceptually simple yet effective method that attributes forgetting to layer-wise parameter overwriting and the resulting decision boundary distortion.
Our method achieves competitive accuracy while requiring no exemplar buffer and only 1.02x the size of the base model.
arXiv Detail & Related papers (2024-01-17T09:01:29Z) - Achieving Constraints in Neural Networks: A Stochastic Augmented Lagrangian Approach [49.1574468325115]
Regularizing Deep Neural Networks (DNNs) is essential for improving generalizability and preventing overfitting.
We propose a novel approach to DNN regularization by framing the training process as a constrained optimization problem.
We employ the Stochastic Augmented Lagrangian (SAL) method to achieve a more flexible and efficient regularization mechanism.
arXiv Detail & Related papers (2023-10-25T13:55:35Z) - Optimal Rate Adaption in Federated Learning with Compressed Communications [28.16239232265479]
Federated Learning incurs high communication overhead, which can be greatly alleviated by compressing model updates.
However, the tradeoff between compression and model accuracy in the networked environment remains unclear.
We present a framework to maximize the final model accuracy by strategically adjusting the compression ratio in each iteration.
arXiv Detail & Related papers (2021-12-13T14:26:15Z) - Compressing gradients by exploiting temporal correlation in momentum-SGD [17.995905582226463]
We analyze compression methods that exploit temporal correlation in systems with and without error-feedback.
Experiments with the ImageNet dataset demonstrate that our proposed methods offer significant reduction in the rate of communication.
We prove the convergence of SGD under an expected error assumption by establishing a bound for the minimum gradient norm.
arXiv Detail & Related papers (2021-08-17T18:04:06Z)
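Two of the entries above, R-Stitch and SpecReason, rely on routing between a small and a large model based on the small model's token-level confidence. The sketch below shows only that generic pattern, assuming Hugging Face-style causal LMs: the threshold tau, greedy decoding, and the checkpoint names in the usage comment are illustrative assumptions, and neither paper's actual acceptance rule is reproduced.

```python
# Minimal sketch of confidence-gated hybrid decoding: a small model decodes
# greedily and defers to a large model whenever its confidence drops.
import torch

@torch.no_grad()
def hybrid_decode(small, large, input_ids, max_new_tokens=64, tau=0.8):
    for _ in range(max_new_tokens):
        probs = small(input_ids).logits[:, -1, :].softmax(-1)
        conf, tok = probs.max(dim=-1)             # small model's proposal
        if conf.item() < tau:                     # low confidence: defer
            tok = large(input_ids).logits[:, -1, :].argmax(dim=-1)
        input_ids = torch.cat([input_ids, tok[:, None]], dim=-1)
    return input_ids

# Hypothetical usage:
#   small = AutoModelForCausalLM.from_pretrained("small-reasoner")  # assumed name
#   large = AutoModelForCausalLM.from_pretrained("large-reasoner")  # assumed name
#   out = hybrid_decode(small, large, tokenizer("...", return_tensors="pt").input_ids)
```

The design point both papers share is that easy tokens dominate a CoT trace, so the large model is only consulted on the hard minority of steps.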
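The sparsity-quantization entry above claims that compounded errors harm accuracy even in the better ordering. A toy demonstration under assumed settings (50% magnitude pruning, 4-bit symmetric quantization, random Gaussian weights, none of which are the paper's setup) makes the ordering effect measurable:

```python
# Toy demo: compare compounded error of prune->quantize vs. quantize->prune.
import torch

torch.manual_seed(0)
w = torch.randn(4096)

def prune(x, sparsity=0.5):
    # Magnitude pruning: zero out the smallest-|x| entries.
    k = int(x.numel() * sparsity)
    thresh = x.abs().kthvalue(k).values
    return torch.where(x.abs() > thresh, x, torch.zeros_like(x))

def quantize(x, bits=4):
    # Symmetric uniform quantization with 2**(bits-1) - 1 positive levels.
    scale = x.abs().max() / (2 ** (bits - 1) - 1)
    return torch.round(x / scale) * scale

for name, fn in [("prune->quantize", lambda t: quantize(prune(t))),
                 ("quantize->prune", lambda t: prune(quantize(t)))]:
    rel_err = ((fn(w) - w).norm() / w.norm()).item()
    print(f"{name}: relative error {rel_err:.4f}")
```

Running it prints the relative reconstruction error for each ordering, making the compounding, and any gap between the two orders, directly visible.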
This list is automatically generated from the titles and abstracts of the papers on this site.