Expediting and Elevating Large Language Model Reasoning via Hidden Chain-of-Thought Decoding
- URL: http://arxiv.org/abs/2409.08561v1
- Date: Fri, 13 Sep 2024 06:29:20 GMT
- Title: Expediting and Elevating Large Language Model Reasoning via Hidden Chain-of-Thought Decoding
- Authors: Tianqiao Liu, Zui Chen, Zitao Liu, Mi Tian, Weiqi Luo
- Abstract summary: Large language models (LLMs) have demonstrated remarkable capabilities in tasks requiring chain-of-thought (CoT) prompting.
However, generating the full CoT process results in significantly longer output sequences, leading to increased computational costs and latency during inference.
We propose a novel approach that compresses the CoT process through semantic alignment, enabling more efficient decoding while preserving the benefits of CoT reasoning.
- Score: 14.175444025026508
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in tasks requiring reasoning and multi-step problem-solving through the use of chain-of-thought (CoT) prompting. However, generating the full CoT process results in significantly longer output sequences, leading to increased computational costs and latency during inference. To address this challenge, we propose a novel approach to compress the CoT process through semantic alignment, enabling more efficient decoding while preserving the benefits of CoT reasoning. Our method introduces an auxiliary CoT model that learns to generate and compress the full thought process into a compact special token representation semantically aligned with the original CoT output. This compressed representation is then integrated into the input of the Hidden Chain-of-Thought (HCoT) model. The training process follows a two-stage procedure: First, the CoT model is optimized to generate the compressed token representations aligned with the ground-truth CoT outputs using a contrastive loss. Subsequently, with the CoT model parameters frozen, the HCoT model is fine-tuned to generate accurate subsequent predictions conditioned on the prefix instruction and the compressed CoT representations from the CoT model. Extensive experiments across three challenging domains - mathematical reasoning, agent invocation, and question answering - demonstrate that our semantic compression approach achieves competitive or improved performance compared to the full CoT baseline, while providing significant speedups of at least 1.5x in decoding time. Moreover, incorporating contrastive learning objectives further enhances the quality of the compressed representations, leading to better CoT prompting and improved task accuracy. Our work paves the way for more efficient exploitation of multi-step reasoning capabilities in LLMs across a wide range of applications.
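The abstract describes the two-stage procedure only in prose. The following is a minimal PyTorch sketch of that procedure under loudly stated assumptions: the GRU encoder, the linear LM head, the InfoNCE-style contrastive loss, the hidden size, and the number of compressed special tokens are all illustrative stand-ins, not the authors' released implementation.

```python
# Minimal sketch of the two-stage HCoT training loop from the abstract.
# All module choices and dimensions below are illustrative assumptions.
import torch
import torch.nn.functional as F

d_model, n_compress, vocab = 768, 8, 32000   # assumed sizes

cot_model = torch.nn.GRU(d_model, d_model, batch_first=True)  # stand-in CoT encoder
hcot_head = torch.nn.Linear(d_model, vocab)                   # stand-in HCoT LM head

def compress_cot(prefix_emb):
    """CoT model: map the instruction prefix to n_compress special-token
    embeddings meant to be semantically aligned with the full CoT."""
    out, _ = cot_model(prefix_emb)
    return out[:, -n_compress:, :]            # compact CoT representation

def contrastive_loss(z, z_pos, temperature=0.07):
    """InfoNCE-style loss: pull compressed tokens toward the ground-truth CoT
    embedding, push away from other examples in the batch (assumed objective)."""
    z = F.normalize(z.mean(dim=1), dim=-1)          # (B, d)
    z_pos = F.normalize(z_pos.mean(dim=1), dim=-1)  # (B, d)
    logits = z @ z_pos.T / temperature              # (B, B) similarity matrix
    return F.cross_entropy(logits, torch.arange(z.size(0)))

# Stage 1: optimize the CoT model against ground-truth CoT embeddings.
prefix = torch.randn(4, 16, d_model)          # toy prefix embeddings
gold_cot = torch.randn(4, 64, d_model)        # toy ground-truth CoT embeddings
contrastive_loss(compress_cot(prefix), gold_cot).backward()

# Stage 2: freeze the CoT model, fine-tune the HCoT model conditioned on
# the prefix plus the compressed CoT representation.
for p in cot_model.parameters():
    p.requires_grad_(False)
with torch.no_grad():
    compressed = compress_cot(prefix)
logits = hcot_head(torch.cat([prefix, compressed], dim=1))   # (B, T, vocab)
targets = torch.randint(0, vocab, (4, 16 + n_compress))      # toy targets
F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1)).backward()
```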
Related papers
- A Theoretical Perspective for Speculative Decoding Algorithm [60.79447486066416]
One effective way to accelerate inference is Speculative Decoding, which employs a small model to sample a sequence of draft tokens and a large model to validate them.
This paper tackles this gap by conceptualizing the decoding problem via a Markov chain abstraction and studying two key properties, output quality and inference acceleration, from a theoretical perspective (a toy draft-and-verify loop is sketched below).
arXiv Detail & Related papers (2024-10-30T01:53:04Z)
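As a concrete illustration of the draft-and-verify loop this entry describes, here is a minimal sketch. The acceptance rule min(1, p_target / p_draft) follows the standard speculative sampling recipe; draft_model and target_model are hypothetical toy callables, not a real LLM API.

```python
# Toy draft-and-verify loop behind speculative decoding: a cheap draft model
# proposes k tokens, the large target model scores them, and each drafted
# token is accepted with probability min(1, p_target / p_draft).
import random

def speculative_step(draft_model, target_model, context, k=4):
    """Propose k draft tokens, then accept a prefix of them."""
    drafted, ctx = [], list(context)
    for _ in range(k):
        tok, p_draft = draft_model(ctx)       # sample a token and its draft prob
        drafted.append((tok, p_draft))
        ctx.append(tok)

    accepted, ctx = [], list(context)
    for tok, p_draft in drafted:
        p_target = target_model(ctx, tok)     # target prob of the drafted token
        if random.random() < min(1.0, p_target / p_draft):
            accepted.append(tok)              # keep the cheap draft token
            ctx.append(tok)
        else:
            break   # reject; a full implementation resamples from the target here
    return accepted

# Toy stand-ins: a uniform draft model and a target that favors token 0.
draft = lambda ctx: (random.randrange(4), 0.25)
target = lambda ctx, tok: 0.7 if tok == 0 else 0.1
print(speculative_step(draft, target, context=[1, 2, 3]))
```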
- From Sparse Dependence to Sparse Attention: Unveiling How Chain-of-Thought Enhances Transformer Sample Efficiency [17.612497960364916]
Chain-of-thought (CoT) significantly enhances the reasoning performance of large language models (LLMs).
We demonstrate that CoT can substantially improve sample efficiency even when representation power is sufficient.
We show that CoT simplifies the learning process by introducing sparse dependencies among input tokens, leading to sparse and interpretable attention patterns.
arXiv Detail & Related papers (2024-10-07T19:45:09Z)
- Training Nonlinear Transformers for Chain-of-Thought Inference: A Theoretical Generalization Analysis [82.51626700527837]
Chain-of-thought (CoT) is an efficient method that enables the reasoning ability of large language models by augmenting the query with examples containing multiple intermediate steps.
We show that models trained without intermediate reasoning steps can fail to generalize accurately in settings where CoT-trained models succeed.
arXiv Detail & Related papers (2024-10-03T03:12:51Z)
- To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning [55.52872152909785]
Chain-of-thought (CoT) via prompting is the de facto method for eliciting reasoning capabilities from large language models (LLMs).
We show that CoT gives strong performance benefits primarily on tasks involving math or logic, with much smaller gains on other types of tasks.
arXiv Detail & Related papers (2024-09-18T17:55:00Z)
- Unveiling the Statistical Foundations of Chain-of-Thought Prompting Methods [59.779795063072655]
Chain-of-Thought (CoT) prompting and its variants have gained popularity as effective methods for solving multi-step reasoning problems.
We analyze CoT prompting from a statistical estimation perspective, providing a comprehensive characterization of its sample complexity.
arXiv Detail & Related papers (2024-08-25T04:07:18Z)
- Markovian Transformers for Informative Language Modeling [0.9642500063568188]
Chain-of-Thought (CoT) reasoning holds great promise for explaining the outputs of language models.
Recent studies have highlighted significant challenges in its practical application for interpretability.
We propose a technique to factor next-token prediction through intermediate CoT text, ensuring the CoT is causally load-bearing (a minimal two-pass sketch follows this entry).
arXiv Detail & Related papers (2024-04-29T17:36:58Z)
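To make "factoring next-token prediction through intermediate CoT text" concrete, here is a minimal two-pass sketch: the answer pass deliberately sees only the generated CoT, so the CoT must carry the information needed to answer. The generate function is a hypothetical LM interface, and this is one reading of the summary, not the paper's actual training objective.

```python
# Two-pass "Markovian" prediction sketch: the answer is conditioned only on
# the generated CoT, making the CoT causally load-bearing (assumed reading).
from typing import Callable

def markovian_predict(generate: Callable[[str], str], question: str) -> str:
    cot = generate(f"Question: {question}\nReasoning:")   # pass 1: produce CoT
    # Pass 2: the question is dropped; only the CoT is visible, so any useful
    # information must have been written into the CoT text itself.
    return generate(f"Reasoning: {cot}\nAnswer:")
```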
- ChainLM: Empowering Large Language Models with Improved Chain-of-Thought Prompting [124.69672273754144]
Chain-of-Thought (CoT) prompting can enhance the reasoning capabilities of large language models (LLMs).
Existing CoT approaches usually focus on simpler reasoning tasks and thus result in low-quality and inconsistent CoT prompts.
We introduce CoTGenius, a novel framework designed for the automatic generation of superior CoT prompts.
arXiv Detail & Related papers (2024-03-21T11:34:26Z)
- CoTFormer: A Chain-of-Thought Driven Architecture with Budget-Adaptive Computation Cost at Inference [36.753384415107774]
Scaling language models to larger and deeper sizes has led to significant boosts in performance.
We propose CoTFormer, a novel architecture which closely mimics Chain-of-Thought (CoT) at the token level.
We show that it is possible to reduce the computation cost significantly without any reduction in accuracy.
arXiv Detail & Related papers (2023-10-16T21:37:34Z)
- Stress Testing Chain-of-Thought Prompting for Large Language Models [0.16317061277456998]
This report examines the effectiveness of Chain-of-Thought (CoT) prompting in improving the multi-step reasoning abilities of large language models (LLMs).
We analyze the impact of three types of CoT prompt perturbations, namely CoT order, CoT values, and CoT operators, on the performance of GPT-3 on various tasks (toy versions of these perturbations are sketched below).
arXiv Detail & Related papers (2023-09-28T17:21:33Z)
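For concreteness, here are toy versions of the three perturbation types the summary names. The example CoT steps are invented for illustration, and the paper's actual perturbation protocol may differ.

```python
# Toy illustrations of the three CoT perturbation types: shuffling step
# order, altering numeric values, and swapping arithmetic operators.
import random
import re

cot = ["step1: 3 + 4 = 7", "step2: 7 * 2 = 14", "step3: 14 - 5 = 9"]

def perturb_order(steps):        # CoT order: shuffle the reasoning steps
    return random.sample(steps, len(steps))

def perturb_values(steps):       # CoT values: shift every number by one
    return [re.sub(r"\d+", lambda m: str(int(m.group()) + 1), s) for s in steps]

def perturb_operators(steps):    # CoT operators: swap each operator's inverse
    swap = {"+": "-", "-": "+", "*": "/", "/": "*"}
    return ["".join(swap.get(c, c) for c in s) for s in steps]
```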
- Decoder Tuning: Efficient Language Understanding as Decoding [84.68266271483022]
We present Decoder Tuning (DecT), which instead optimizes task-specific decoder networks on the output side.
By gradient-based optimization, DecT can be trained within several seconds and requires only one PTM query per sample.
We conduct extensive natural language understanding experiments and show that DecT significantly outperforms state-of-the-art algorithms with a 200x speed-up (a minimal output-side tuning sketch follows this entry).
arXiv Detail & Related papers (2022-12-16T11:15:39Z)
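A minimal sketch of output-side decoder tuning in the spirit of DecT, under the assumption that the frozen pre-trained model (PTM) is queried once per sample to cache output features, after which only a small decoder is trained. The linear head below is an illustrative stand-in; the paper's actual decoder design may differ.

```python
# Output-side decoder tuning sketch: one frozen PTM query per sample,
# then a tiny task-specific decoder is fit on the cached features.
import torch
import torch.nn.functional as F

def cache_features(ptm, samples):
    """Query the frozen PTM once per sample; features are reused thereafter."""
    with torch.no_grad():
        return torch.stack([ptm(s) for s in samples])

def tune_decoder(features, labels, n_classes, steps=100, lr=0.1):
    decoder = torch.nn.Linear(features.size(-1), n_classes)  # small output head
    opt = torch.optim.Adam(decoder.parameters(), lr=lr)
    for _ in range(steps):        # fast, since the PTM is never touched again
        loss = F.cross_entropy(decoder(features), labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return decoder

# Toy usage with a random stand-in "PTM" embedding function.
ptm = lambda s: torch.randn(768)
feats = cache_features(ptm, ["ex1", "ex2", "ex3", "ex4"])
decoder = tune_decoder(feats, labels=torch.tensor([0, 1, 0, 1]), n_classes=2)
```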
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.