AdaPonderLM: Gated Pondering Language Models with Token-Wise Adaptive Depth
- URL: http://arxiv.org/abs/2603.01914v1
- Date: Mon, 02 Mar 2026 14:28:16 GMT
- Title: AdaPonderLM: Gated Pondering Language Models with Token-Wise Adaptive Depth
- Authors: Shixiang Song, He Li, Zitong Wang, Boyi Zeng, Feichen Song, Yixuan Wang, Zhiqin John Xu, Ziwei He, Zhouhan Lin
- Abstract summary: AdaPonderLM is a self-supervised recurrent language model that learns token-wise early exiting during pretraining. AdaPonderLM reduces inference compute by about 10% while maintaining comparable language modeling perplexity and competitive downstream accuracy.
- Score: 23.442686851761298
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Test-time scaling via recurrent/iterative Transformers enables large language models to spend more computation at inference, but most pretrained recurrent LMs run a fixed number of iterations, wasting compute on easy tokens and lacking token-wise adaptivity. Following the core ideas of Adaptive Computation Time (ACT) and Early Exit (EE), we propose AdaPonderLM, a self-supervised recurrent language model that learns token-wise early exiting during pretraining without manually tuned per-token/per-layer pruning ratios. AdaPonderLM uses iteration-specific MLP gates with a monotonic halting mask to decide when each token stops recurring, and introduces a KV reuse mechanism that reuses cached key/value states for halted tokens, ensuring train--test consistency and practical acceleration. Across Pythia backbones from 70M to 410M (pretraining) and up to 2.8B (continued pretraining), AdaPonderLM reduces inference compute by about 10% while maintaining comparable language modeling perplexity and competitive downstream accuracy. Our analysis shows the learned gates allocate more computation to high-NLL (hard) tokens, exhibiting adaptive computation time behavior in a fully self-supervised setting. Meanwhile, under iso-FLOPs, the learned halting policy consistently outperforms fixed pruning, showing AdaPonderLM allocates compute to the right tokens rather than just reducing average depth.
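The halting mechanism described in the abstract (iteration-specific gates combined with a monotonic halting mask) can be sketched in plain Python. This is a hypothetical illustration of the general ACT-style idea, not the authors' implementation; the function names, the sigmoid gate, and the 0.5 threshold are all assumptions.

```python
# Hypothetical sketch of token-wise monotonic halting, ACT/early-exit style.
# Not the AdaPonderLM codebase; gate logits here stand in for the outputs
# of the paper's iteration-specific MLP gates.
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def run_with_halting(gate_logits, max_iters, threshold=0.5):
    """Decide each token's exit iteration.

    gate_logits: per-token list of per-iteration gate logits.
    Returns the iteration at which each token halts. The mask is
    monotonic: once a token's gate fires, it stops recurring and
    (in the real model) its cached key/value states are reused.
    """
    exits = []
    for logits in gate_logits:
        exit_at = max_iters  # default: run the full recurrence
        for t, logit in enumerate(logits, start=1):
            if sigmoid(logit) > threshold:
                exit_at = t
                break  # monotonic halting: halted tokens never resume
        exits.append(exit_at)
    return exits


# An "easy" token halts early; a "hard" token runs all iterations.
print(run_with_halting([[-2.0, 3.0, 0.0], [-2.0, -2.0, -2.0]], max_iters=3))
```

Under this toy rule, the first token exits at iteration 2 and the second runs all 3 iterations, mirroring the paper's observation that more compute goes to hard (high-NLL) tokens.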
Related papers
- Pretraining with Token-Level Adaptive Latent Chain-of-Thought [44.19871205975474]
Scaling large language models by increasing parameters and training data is increasingly constrained by limited high-quality corpora and rising communication costs. This work explores an alternative axis: increasing per-token computation without expanding parameters, by internalizing latent Chain-of-Thought (CoT) into pretraining. We propose Pretraining with Token-Level Adaptive Latent CoT (adaptive latent CoT), where the model generates a variable-length latent CoT trajectory before emitting each token. Experiments with Llama architectures show that adaptive latent CoT consistently improves language modeling perplexity and broad downstream accuracy, even with fewer training FLOPs.
arXiv Detail & Related papers (2026-02-09T02:49:15Z) - ConceptMoE: Adaptive Token-to-Concept Compression for Implicit Compute Allocation [12.503747711792679]
ConceptMoE dynamically merges semantically similar tokens into concept representations. A learnable chunk module identifies optimal boundaries by measuring inter-token similarity. ConceptMoE consistently outperforms standard MoE across language and vision-language tasks.
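The boundary rule ConceptMoE describes (merge adjacent tokens while inter-token similarity stays high) can be sketched with a simple cosine-similarity threshold. This is an assumed, simplified stand-in for the paper's learnable chunk module; the function name and the 0.9 threshold are invented for illustration.

```python
# Hypothetical sketch: chunk a token sequence by adjacent cosine similarity.
# ConceptMoE learns its boundaries; this fixed threshold is only illustrative.
import math


def chunk_by_similarity(vectors, threshold=0.9):
    """Group consecutive token indices whose adjacent vectors are similar.

    A new chunk starts whenever cosine similarity between neighboring
    token vectors drops below the threshold.
    """
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    chunks = [[0]]
    for i in range(1, len(vectors)):
        if cos(vectors[i - 1], vectors[i]) >= threshold:
            chunks[-1].append(i)  # similar: merge into the current concept
        else:
            chunks.append([i])    # dissimilar: open a new concept boundary
    return chunks


# Two near-identical tokens merge; an orthogonal one starts a new chunk.
print(chunk_by_similarity([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]))
```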
arXiv Detail & Related papers (2026-01-29T08:58:22Z) - Continuous Autoregressive Language Models [56.49239051750678]
We introduce Continuous Autoregressive Language Models (CALM). CALM uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector. We develop a comprehensive likelihood-free framework that enables robust training, evaluation, and controllable sampling.
arXiv Detail & Related papers (2025-10-31T17:58:11Z) - LaSeR: Reinforcement Learning with Last-Token Self-Rewarding [54.72617309922891]
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a core paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). Previous practice requires the LLM to sequentially generate solutions and self-verifications using two separate prompt templates, which significantly reduces efficiency. We propose LaSeR (Reinforcement Learning with Last-Token Self-Rewarding), an algorithm that simply augments the original RLVR loss with an MSE loss.
arXiv Detail & Related papers (2025-10-16T17:55:11Z) - Thoughtbubbles: an Unsupervised Method for Parallel Thinking in Latent Space [38.50132130644233]
Current approaches for scaling inference-time compute in transformers rely on training them to emit explicit chain-of-thought tokens before producing an answer. Thoughtbubbles is a transformer variant that performs parallel adaptive computation in latent space by learning to fork or delete residual streams. Thoughtbubbles outperforms both standard decoder LMs and non-adaptive parallel computation approaches on OpenWebText and peS2o perplexity and in zero-shot evaluations such as HellaSwag and LAMBADA.
arXiv Detail & Related papers (2025-09-30T19:49:15Z) - ReGATE: Learning Faster and Better with Fewer Tokens in MLLMs [1.1834200163382398]
ReGATE (Reference-Guided Adaptive Token Elision) is an adaptive token pruning method for accelerating MLLM training. It matches the peak accuracy of standard training on MVBench up to 2× faster, using only 35% of the tokens.
arXiv Detail & Related papers (2025-07-29T01:07:09Z) - R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning [80.104336426172]
Chain-of-thought (CoT) enhances the problem-solving ability of large language models. However, CoT incurs substantial inference cost due to long autoregressive trajectories. We introduce R-Stitch, a training-free hybrid decoding framework.
arXiv Detail & Related papers (2025-07-23T08:14:36Z) - ESLM: Risk-Averse Selective Language Modeling for Efficient Pretraining [53.893792844055106]
Large language model pretraining is compute-intensive, yet many tokens contribute marginally to learning, resulting in inefficiency. We introduce Selective Efficient Language Modeling, a risk-aware algorithm that improves training efficiency and distributional robustness by performing online token-level batch selection. Experiments on GPT-2 pretraining show that ESLM significantly reduces training FLOPs while maintaining or improving both perplexity and downstream performance compared to baselines.
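Online token-level batch selection of the kind ESLM describes can be sketched as keeping only the highest-loss tokens in each batch. This is an assumed simplification for illustration: the real algorithm is risk-averse rather than a plain top-k rule, and the function name and keep ratio are invented.

```python
# Hypothetical sketch of token-level batch selection by per-token loss.
# ESLM's actual selection rule is risk-averse; top-k here is a simplification.
def select_tokens(losses, keep_ratio=0.5):
    """Return the (sorted) indices of the highest-loss tokens.

    Tokens with small loss contribute little gradient signal, so
    dropping them reduces training FLOPs per step.
    """
    k = max(1, int(len(losses) * keep_ratio))
    ranked = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return sorted(ranked[:k])


# Keep the two hardest tokens out of four.
print(select_tokens([0.1, 2.0, 0.5, 1.5], keep_ratio=0.5))
```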
arXiv Detail & Related papers (2025-05-26T12:23:26Z) - Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling [90.86991492288487]
Evaluating constraints on every token can be prohibitively expensive. LCD can distort the global distribution over strings, sampling tokens based only on local information. We show that our approach is superior to state-of-the-art baselines.
arXiv Detail & Related papers (2025-04-07T18:30:18Z) - Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep.
We demonstrate the efficacy of our framework in reducing compute -- potential speedup of up to 3× -- while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.