Related papers: Change of Thought: Adaptive Test-Time Computation

URL: http://arxiv.org/abs/2507.13569v1
Date: Thu, 17 Jul 2025 23:12:57 GMT
Title: Change of Thought: Adaptive Test-Time Computation
Authors: Mrinal Mathur, Mike Doan, Barak Pearlmutter, Sergey Plis
Abstract summary: Self-Transformers recover much of the expressive power of iterative reasoning. Self-Transformers iteratively refine their attention weights to a fixed point.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Transformers evaluated in a single, fixed-depth pass are provably limited in expressive power to the constant-depth circuit class TC0. Running a Transformer autoregressively removes that ceiling -- first in next-token prediction and, more recently, in chain-of-thought reasoning. Both regimes rely on feedback loops that decode internal states into tokens only to re-encode them in subsequent steps. While this "thinking aloud" mirrors human reasoning, biological brains iterate without externalising intermediate states as language. To boost the expressive power of encoder Transformers without resorting to token-level autoregression, we introduce the SELF-Transformer: an encoder layer that iteratively refines its own attention weights to a fixed point. Instead of producing -- in one pass -- the alignment matrix that remixes the input sequence, the SELF-Transformer iteratively updates that matrix internally, scaling test-time computation with input difficulty. This adaptivity yields up to 20% accuracy gains on encoder-style benchmarks without increasing parameter count, demonstrating that input-adaptive alignment at test time offers substantial benefits for only a modest extra compute budget. Self-Transformers thus recover much of the expressive power of iterative reasoning while preserving the simplicity of pure encoder architectures.
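
Illustration (not from the paper): the abstract describes refining an attention matrix until it reaches a fixed point, with the number of refinement steps depending on the input. A minimal sketch of that general idea in plain Python follows. The update rule used here (recomputing queries from the sequence remixed by the current attention matrix), the function name `fixed_point_attention`, and the parameters `tol` and `max_iter` are all assumptions made for illustration; the abstract does not specify the authors' actual update, and convergence of this toy iteration is not guaranteed, so the loop is capped.

```python
import math


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax_rows(S):
    """Apply a numerically stable softmax to each row of S."""
    out = []
    for row in S:
        m = max(row)
        e = [math.exp(v - m) for v in row]
        z = sum(e)
        out.append([v / z for v in e])
    return out


def fixed_point_attention(X, Wq, Wk, Wv, tol=1e-6, max_iter=50):
    """Refine an attention matrix A until it stops changing (hypothetical rule).

    Assumed update: queries are computed from the remixed sequence A @ X,
    keys from the raw input X, and A is replaced by the resulting softmax.
    Returns (output, A, number_of_steps_used).
    """
    n = len(X)
    A = [[1.0 / n] * n for _ in range(n)]      # start from uniform attention
    K = matmul(X, Wk)
    V = matmul(X, Wv)
    Kt = [list(col) for col in zip(*K)]        # transpose of K
    scale = 1.0 / math.sqrt(len(Wq[0]))
    steps = 0
    for steps in range(1, max_iter + 1):
        Q = matmul(matmul(A, X), Wq)           # queries from the remixed input
        scores = [[scale * s for s in row] for row in matmul(Q, Kt)]
        A_new = softmax_rows(scores)
        delta = max(abs(a - b) for ra, rb in zip(A_new, A) for a, b in zip(ra, rb))
        A = A_new
        if delta < tol:                        # attention weights have settled
            break
    return matmul(A, V), A, steps


# Toy input: three tokens with two features each, identity projections.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
out, A, steps = fixed_point_attention(X, I2, I2, I2)
print(steps, [round(v, 3) for v in A[0]])
```

In this sketch the step count returned by the loop plays the role of input-adaptive test-time computation: inputs whose attention settles quickly exit early, while others use more iterations up to the cap.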

Related papers

Pooling Attention: Evaluating Pretrained Transformer Embeddings for Deception Classification [0.0]
BERT embeddings combined with logistic regression outperform neural baselines on LIAR dataset splits. This work positions attention-based token encoders as robust, architecture-centric foundations for veracity tasks.
arXiv Detail & Related papers (2025-11-28T08:32:49Z)
Inner Thinking Transformer: Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking [51.154226183713405]
We propose Inner Thinking Transformer, which reimagines layer computations as implicit thinking steps. ITT achieves 96.5% performance of a 466M Transformer using only 162M parameters, reduces training data by 43.2%, and outperforms Transformer/Loop variants in 11 benchmarks.
arXiv Detail & Related papers (2025-02-19T16:02:23Z)
The Expressive Power of Transformers with Chain of Thought [29.839710738657203]
In practice, transformers can be improved by allowing them to use a "chain of thought" or "scratchpad". Asking whether such intermediate generation extends their computational power, we show that the answer is yes, but the amount of increase depends crucially on the amount of intermediate generation. Our results also imply that linear steps keep transformer decoders within context-sensitive languages.
arXiv Detail & Related papers (2023-10-11T22:35:18Z)
TAPIR: Learning Adaptive Revision for Incremental Natural Language Understanding with a Two-Pass Model [14.846377138993645]
Recent neural network-based approaches for incremental processing mainly use RNNs or Transformers. A restart-incremental interface that repeatedly passes longer input prefixes can be used to obtain partial outputs, while providing the ability to revise. We propose the Two-pass model for AdaPtIve Revision (TAPIR) and introduce a method to obtain an incremental supervision signal for learning an adaptive revision policy.
arXiv Detail & Related papers (2023-05-18T09:58:19Z)
Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition [62.83832841523525]
We propose a fast and accurate parallel transformer, termed Paraformer. It accurately predicts the number of output tokens and extracts hidden variables. It can attain comparable performance to the state-of-the-art AR transformer, with more than 10x speedup.
arXiv Detail & Related papers (2022-06-16T17:24:14Z)
Towards More Efficient Insertion Transformer with Fractional Positional Encoding [44.45401243989363]
Auto-regressive neural sequence models have been shown to be effective across text generation tasks. Their left-to-right decoding order prevents generation from being parallelized. Insertion Transformer is an attractive alternative that allows outputting multiple tokens in a single generation step.
arXiv Detail & Related papers (2021-12-12T18:38:27Z)
Finetuning Pretrained Transformers into Variational Autoencoders [0.0]
Text variational autoencoders (VAEs) are notorious for posterior collapse. Transformers have seen limited adoption as components of text VAEs. We present a simple two-phase training scheme to convert a sequence-to-sequence Transformer into a VAE with just finetuning.
arXiv Detail & Related papers (2021-08-05T08:27:26Z)
Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE). Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using Fast Fourier Transform (FFT).
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation. A linear-complexity recurrent variant has proven well suited for autoregressive generation. This work aims to convert a pretrained transformer into its efficient recurrent counterpart.
arXiv Detail & Related papers (2021-03-24T10:50:43Z)
Non-Autoregressive Transformer ASR with CTC-Enhanced Decoder Input [54.82369261350497]
We propose a CTC-enhanced NAR transformer, which generates the target sequence by refining predictions of the CTC module. Experimental results show that our method outperforms all previous NAR counterparts and achieves 50x faster decoding speed than a strong AR baseline with only 0.0 to 0.3 absolute CER degradation on Aishell-1 and Aishell-2 datasets.
arXiv Detail & Related papers (2020-10-28T15:00:09Z)
Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing [112.2208052057002]
We propose Funnel-Transformer, which gradually compresses the sequence of hidden states to a shorter one. With comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on a wide variety of sequence-level prediction tasks.
arXiv Detail & Related papers (2020-06-05T05:16:23Z)
