AdaPonderLM: Gated Pondering Language Models with Token-Wise Adaptive Depth
- URL: http://arxiv.org/abs/2603.01914v1
- Date: Mon, 02 Mar 2026 14:28:16 GMT
- Title: AdaPonderLM: Gated Pondering Language Models with Token-Wise Adaptive Depth
- Authors: Shixiang Song, He Li, Zitong Wang, Boyi Zeng, Feichen Song, Yixuan Wang, Zhiqin John Xu, Ziwei He, Zhouhan Lin
- Abstract summary: AdaPonderLM is a self-supervised recurrent language model that learns token-wise early exiting during pretraining. AdaPonderLM reduces inference compute by about 10% while maintaining comparable language modeling perplexity and competitive downstream accuracy.
- Score: 23.442686851761298
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Test-time scaling via recurrent/iterative Transformers enables large language models to spend more computation at inference, but most pretrained recurrent LMs run a fixed number of iterations, wasting compute on easy tokens and lacking token-wise adaptivity. Following the core ideas of Adaptive Computation Time (ACT) and Early Exit (EE), we propose AdaPonderLM, a self-supervised recurrent language model that learns token-wise early exiting during pretraining without manually tuned per-token/per-layer pruning ratios. AdaPonderLM uses iteration-specific MLP gates with a monotonic halting mask to decide when each token stops recurring, and introduces a KV reuse mechanism that reuses cached key/value states for halted tokens, ensuring train--test consistency and practical acceleration. Across Pythia backbones from 70M to 410M (pretraining) and up to 2.8B (continued pretraining), AdaPonderLM reduces inference compute by about 10% while maintaining comparable language modeling perplexity and competitive downstream accuracy. Our analysis shows the learned gates allocate more computation to high-NLL (hard) tokens, exhibiting adaptive computation time behavior in a fully self-supervised setting. Meanwhile, under iso-FLOPs, the learned halting policy consistently outperforms fixed pruning, showing AdaPonderLM allocates compute to the right tokens rather than just reducing average depth.
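The halting mechanism described in the abstract (iteration-specific gates combined with a monotonic halting mask) can be sketched in plain Python. This is a hypothetical illustration of the general ACT-style idea, not the authors' implementation; the function names, the sigmoid gate, and the 0.5 threshold are all assumptions.

```python
# Hypothetical sketch of token-wise monotonic halting, ACT/early-exit style.
# Not the AdaPonderLM codebase; gate logits here stand in for the outputs
# of the paper's iteration-specific MLP gates.
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def run_with_halting(gate_logits, max_iters, threshold=0.5):
    """Decide each token's exit iteration.

    gate_logits: per-token list of per-iteration gate logits.
    Returns the iteration at which each token halts. The mask is
    monotonic: once a token's gate fires, it stops recurring and
    (in the real model) its cached key/value states are reused.
    """
    exits = []
    for logits in gate_logits:
        exit_at = max_iters  # default: run the full recurrence
        for t, logit in enumerate(logits, start=1):
            if sigmoid(logit) > threshold:
                exit_at = t
                break  # monotonic halting: halted tokens never resume
        exits.append(exit_at)
    return exits


# An "easy" token halts early; a "hard" token runs all iterations.
print(run_with_halting([[-2.0, 3.0, 0.0], [-2.0, -2.0, -2.0]], max_iters=3))
```

Under this toy rule, the first token exits at iteration 2 and the second runs all 3 iterations, mirroring the paper's observation that more compute goes to hard (high-NLL) tokens.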
Related papers
- Pretraining with Token-Level Adaptive Latent Chain-of-Thought [44.19871205975474]
Scaling large language models by increasing parameters and training data is increasingly constrained by limited high-quality corpora and rising communication costs. This work explores an alternative axis: increasing per-token computation without expanding parameters, by internalizing latent Chain-of-Thought (CoT) into pretraining. We propose Pretraining with Token-Level Adaptive Latent CoT (adaptive latent CoT), where the model generates a variable-length latent CoT trajectory before emitting each token. Experiments with Llama architectures show that adaptive latent CoT consistently improves language modeling perplexity and broad downstream accuracy, even with fewer training FLOPs.
arXiv Detail & Related papers (2026-02-09T02:49:15Z) - ConceptMoE: Adaptive Token-to-Concept Compression for Implicit Compute Allocation [12.503747711792679]
ConceptMoE dynamically merges semantically similar tokens into concept representations. A learnable chunk module identifies optimal boundaries by measuring inter-token similarity. ConceptMoE consistently outperforms standard MoE across language and vision-language tasks.
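The boundary rule ConceptMoE describes (merge adjacent tokens while inter-token similarity stays high) can be sketched with a simple cosine-similarity threshold. This is an assumed, simplified stand-in for the paper's learnable chunk module; the function name and the 0.9 threshold are invented for illustration.

```python
# Hypothetical sketch: chunk a token sequence by adjacent cosine similarity.
# ConceptMoE learns its boundaries; this fixed threshold is only illustrative.
import math


def chunk_by_similarity(vectors, threshold=0.9):
    """Group consecutive token indices whose adjacent vectors are similar.

    A new chunk starts whenever cosine similarity between neighboring
    token vectors drops below the threshold.
    """
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    chunks = [[0]]
    for i in range(1, len(vectors)):
        if cos(vectors[i - 1], vectors[i]) >= threshold:
            chunks[-1].append(i)  # similar: merge into the current concept
        else:
            chunks.append([i])    # dissimilar: open a new concept boundary
    return chunks


# Two near-identical tokens merge; an orthogonal one starts a new chunk.
print(chunk_by_similarity([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]))
```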
arXiv Detail & Related papers (2026-01-29T08:58:22Z) - Continuous Autoregressive Language Models [56.49239051750678]
We introduce Continuous Autoregressive Language Models (CALM). CALM uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector. We develop a comprehensive likelihood-free framework that enables robust training, evaluation, and controllable sampling.
arXiv Detail & Related papers (2025-10-31T17:58:11Z) - LaSeR: Reinforcement Learning with Last-Token Self-Rewarding [54.72617309922891]
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a core paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). Previous practice requires the LLM to sequentially generate solutions and self-verifications using two separate prompt templates, which significantly reduces efficiency. We propose LaSeR (Reinforcement Learning with Last-Token Self-Rewarding), an algorithm that simply augments the original RLVR loss with an MSE loss.
arXiv Detail & Related papers (2025-10-16T17:55:11Z) - Thoughtbubbles: an Unsupervised Method for Parallel Thinking in Latent Space [38.50132130644233]
Current approaches for scaling inference-time compute in transformers rely on training them to emit explicit chain-of-thought tokens before producing an answer. Thoughtbubbles is a transformer variant that performs parallel adaptive computation in latent space by learning to fork or delete residual streams. Thoughtbubbles outperforms both standard decoder LMs and non-adaptive parallel computation approaches on OpenWebText and peS2o perplexity and in zero-shot evaluations such as HellaSwag and LAMBADA.
arXiv Detail & Related papers (2025-09-30T19:49:15Z) - ReGATE: Learning Faster and Better with Fewer Tokens in MLLMs [1.1834200163382398]
ReGATE (Reference-Guided Adaptive Token Elision) is an adaptive token pruning method for accelerating MLLM training. It matches the peak accuracy of standard training on MVBench up to 2× faster, using only 35% of the tokens.
arXiv Detail & Related papers (2025-07-29T01:07:09Z) - R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning [80.104336426172]
Chain-of-thought (CoT) enhances the problem-solving ability of large language models. However, CoT incurs substantial inference cost due to long autoregressive trajectories. We introduce R-Stitch, a training-free hybrid decoding framework.
arXiv Detail & Related papers (2025-07-23T08:14:36Z) - ESLM: Risk-Averse Selective Language Modeling for Efficient Pretraining [53.893792844055106]
Large language model pretraining is compute-intensive, yet many tokens contribute marginally to learning, resulting in inefficiency. We introduce Selective Efficient Language Modeling, a risk-aware algorithm that improves training efficiency and distributional robustness by performing online token-level batch selection. Experiments on GPT-2 pretraining show that ESLM significantly reduces training FLOPs while maintaining or improving both perplexity and downstream performance compared to baselines.
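Online token-level batch selection of the kind ESLM describes can be sketched as keeping only the highest-loss tokens in each batch. This is an assumed simplification for illustration: the real algorithm is risk-averse rather than a plain top-k rule, and the function name and keep ratio are invented.

```python
# Hypothetical sketch of token-level batch selection by per-token loss.
# ESLM's actual selection rule is risk-averse; top-k here is a simplification.
def select_tokens(losses, keep_ratio=0.5):
    """Return the (sorted) indices of the highest-loss tokens.

    Tokens with small loss contribute little gradient signal, so
    dropping them reduces training FLOPs per step.
    """
    k = max(1, int(len(losses) * keep_ratio))
    ranked = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return sorted(ranked[:k])


# Keep the two hardest tokens out of four.
print(select_tokens([0.1, 2.0, 0.5, 1.5], keep_ratio=0.5))
```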
arXiv Detail & Related papers (2025-05-26T12:23:26Z) - Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling [90.86991492288487]
Evaluating constraints on every token can be prohibitively expensive. LCD can distort the global distribution over strings, sampling tokens based only on local information. We show that our approach is superior to state-of-the-art baselines.
arXiv Detail & Related papers (2025-04-07T18:30:18Z) - Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep.
We demonstrate the efficacy of our framework in reducing compute -- potential speedup of up to 3× -- while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.