Temporal Guidance for Large Language Models
- URL: http://arxiv.org/abs/2601.21744v1
- Date: Thu, 29 Jan 2026 14:01:00 GMT
- Title: Temporal Guidance for Large Language Models
- Authors: Hong-Kai Zheng, Piji Li
- Abstract summary: We propose a novel contrastive guidance strategy along the temporal dimension, namely Temporal Guidance (TeGu). Our method ingeniously leverages Multi-Token Prediction (MTP) to construct weaker amateur predictions for model self-contrast. Across various model series and benchmarks, TeGu achieves significant performance improvements while maintaining low additional memory consumption and computational overhead.
- Score: 22.7968403903992
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive Decoding (CD) enhances the generation quality of large language models (LLMs) but incurs significant additional computational overhead due to the need for an auxiliary model. Existing internal self-contrastive decoding methods, such as Decoding by Contrasting Layers (DoLa), focus on discrepancies across different layers, which are notably unstable on small-scale models. In this work, based on the observation that LLMs exhibit local preferences, we propose a novel contrastive guidance strategy along the temporal dimension, namely Temporal Guidance (TeGu). Our method ingeniously leverages Multi-Token Prediction (MTP) to construct weaker amateur predictions for model self-contrast. To standardize the implementation of this mechanism, we further introduce a lightweight Conditional MTP Projector (cMTPP), which avoids maintaining multiple independent networks as required by other MTP modules. Across various model series and benchmarks, TeGu achieves significant performance improvements while maintaining low additional memory consumption and computational overhead.
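The abstract outlines the core mechanism: a Multi-Token Prediction head supplies a weaker "amateur" prediction for the current position, which is contrasted against the main head's "expert" prediction along the temporal dimension. Below is a minimal sketch of that idea; it assumes the standard contrastive-decoding combination rule with an adaptive-plausibility mask, and the function name and the `alpha` and `plausibility` values are illustrative placeholders rather than the paper's exact TeGu formulation or cMTPP design.

```python
import torch
import torch.nn.functional as F

def temporal_contrastive_step(expert_logits: torch.Tensor,
                              amateur_logits: torch.Tensor,
                              alpha: float = 0.5,
                              plausibility: float = 0.1) -> torch.Tensor:
    """Contrast strong (expert) and weak (amateur) next-token logits.

    expert_logits:  logits from the main LM head at the current step.
    amateur_logits: weaker logits for the same position, e.g. what an MTP
                    head predicted for this position one step earlier
                    (the "temporal" self-contrast).
    alpha, plausibility: illustrative hyperparameters, not the paper's values.
    """
    expert_logprobs = F.log_softmax(expert_logits, dim=-1)
    amateur_logprobs = F.log_softmax(amateur_logits, dim=-1)

    # Adaptive plausibility mask, as in standard contrastive decoding:
    # only tokens the expert itself finds reasonably likely are kept.
    cutoff = expert_logprobs.max(dim=-1, keepdim=True).values + \
        torch.log(torch.tensor(plausibility, device=expert_logits.device))
    keep = expert_logprobs >= cutoff

    # Push probability mass away from what the weak prediction favors.
    guided = (1.0 + alpha) * expert_logprobs - alpha * amateur_logprobs
    return guided.masked_fill(~keep, float("-inf"))

# Usage sketch: at step t, reuse the MTP head's look-ahead prediction made
# at step t-1 as the amateur distribution for the token at step t.
# logits_t = temporal_contrastive_step(main_head_logits_t, mtp_logits_from_t_minus_1)
# next_token = torch.argmax(logits_t, dim=-1)
```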
Related papers
- Beyond Real Weights: Hypercomplex Representations for Stable Quantization [6.708338010963415]
Multimodal language models (MLLMs) require large parameter capacity to align high-dimensional visual features with linguistic representations. We introduce a progressive reparameterization strategy that compresses these models by gradually replacing dense feed-forward network blocks. A residual schedule, together with lightweight reconstruction and knowledge distillation losses, ensures that the PHM modules inherit the functional behavior of their dense counterparts during training.
arXiv Detail & Related papers (2025-12-09T12:10:57Z)
- Scaling LLM Speculative Decoding: Non-Autoregressive Forecasting in Large-Batch Scenarios [76.85739138203014]
We present SpecFormer, a novel architecture that accelerates unidirectional and attention mechanisms. We demonstrate that SpecFormer achieves lower training demands and reduced computational costs.
arXiv Detail & Related papers (2025-11-25T14:20:08Z)
- Fast and Expressive Multi-Token Prediction with Probabilistic Circuits [29.853857313543468]
Multi-token prediction (MTP) is a prominent strategy to significantly speed up generation in large language models (LLMs). We investigate the trade-off between expressiveness and latency in MTP within the framework of probabilistic circuits (PCs). Our framework, named MTPC, allows one to explore different ways to encode the joint distributions over future tokens.
arXiv Detail & Related papers (2025-11-14T14:33:14Z)
- OneCAT: Decoder-Only Auto-Regressive Model for Unified Understanding and Generation [91.45421429922506]
OneCAT is a unified multimodal model that seamlessly integrates understanding, generation, and editing. Our framework eliminates the need for external components such as Vision Transformers (ViT) or a vision tokenizer during inference.
arXiv Detail & Related papers (2025-09-03T17:29:50Z)
- Mamba Drafters for Speculative Decoding [58.080550222549064]
We introduce novel drafters based on Mamba, a state-of-the-art state space model (SSM). By leveraging the linear structure of SSMs, our approach avoids the quadratic complexity inherent in traditional Transformer-based methods. We further enhance efficiency with a novel test-time tree search algorithm for generating high-quality draft candidates.
arXiv Detail & Related papers (2025-06-01T22:52:47Z)
- LatentLLM: Attention-Aware Joint Tensor Compression [50.33925662486034]
Large language models (LLMs) and large multi-modal models (LMMs) require a massive amount of computational and memory resources. We propose a new framework to convert such LLMs/LMMs into a reduced-dimension latent structure.
arXiv Detail & Related papers (2025-05-23T22:39:54Z)
- Latent Thought Models with Variational Bayes Inference-Time Computation [52.63299874322121]
Latent Thought Models (LTMs) incorporate explicit latent thought vectors that follow an explicit prior model in latent space. LTMs demonstrate superior sample and parameter efficiency compared to autoregressive models and discrete diffusion models.
arXiv Detail & Related papers (2025-02-03T17:50:34Z)
- Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models [79.41139393080736]
Large language models (LLMs) have rapidly advanced and demonstrated impressive capabilities.
In-Context Learning (ICL) and Parameter-Efficient Fine-Tuning (PEFT) are currently two mainstream methods for augmenting LLMs to downstream tasks.
We propose Reference Trustable Decoding (RTD), a paradigm that allows models to quickly adapt to new tasks without fine-tuning.
arXiv Detail & Related papers (2024-09-30T10:48:20Z)