Language Modeling with Learned Meta-Tokens
- URL: http://arxiv.org/abs/2509.16278v1
- Date: Thu, 18 Sep 2025 17:38:48 GMT
- Title: Language Modeling with Learned Meta-Tokens
- Authors: Alok N. Shah, Khush Gupta, Keshav Ramji, Pratik Chaudhari
- Abstract summary: This work introduces a novel approach using meta-tokens, special tokens injected during pre-training, along with a dedicated meta-attention mechanism that guides LMs to use these tokens. We find that data-efficient language model pre-training on fewer than 100B tokens utilizing meta-tokens achieves strong performance on these tasks after fine-tuning.
- Score: 15.860245999620409
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While modern Transformer-based language models (LMs) have achieved major success in multi-task generalization, they often struggle to capture long-range dependencies within their context window. This work introduces a novel approach using meta-tokens, special tokens injected during pre-training, along with a dedicated meta-attention mechanism to guide LMs to use these tokens. We pre-train a language model with a modified GPT-2 architecture equipped with meta-attention in addition to causal multi-head attention, and study the impact of these tokens on a suite of synthetic tasks. We find that data-efficient language model pre-training on fewer than 100B tokens utilizing meta-tokens and our meta-attention mechanism achieves strong performance on these tasks after fine-tuning. We suggest that these gains arise due to the meta-tokens sharpening the positional encoding. This enables them to operate as trainable, content-based landmarks, implicitly compressing preceding context and "caching" it in the meta-token. At inference-time, the meta-token points to relevant context, facilitating length generalization up to 2$\times$ its context window, even after extension with YaRN. We provide further evidence of these behaviors by visualizing model internals to study the residual stream, and assessing the compression quality by information-theoretic analysis on the rate-distortion tradeoff. Our findings suggest that pre-training LMs with meta-tokens offers a simple, data-efficient method to enhance long-context language modeling performance, while introducing new insights into the nature of their behavior towards length generalization.
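The two ideas named in the abstract can be illustrated with a minimal, hypothetical sketch (not the authors' implementation): injecting special meta-tokens into the token stream at fixed intervals, and building a mask for a dedicated meta-attention path that runs alongside ordinary causal attention. The meta-token id and the injection interval below are illustrative assumptions.

```python
# Hypothetical sketch of meta-token injection and a meta-attention mask.
# META_ID and the injection interval are illustrative assumptions, not
# values taken from the paper.

META_ID = 50257  # hypothetical id, one past the GPT-2 vocabulary


def inject_meta_tokens(token_ids, interval=64, meta_id=META_ID):
    """Insert a meta-token after every `interval` ordinary tokens."""
    out = []
    for i, tok in enumerate(token_ids, start=1):
        out.append(tok)
        if i % interval == 0:
            out.append(meta_id)
    return out


def meta_attention_mask(seq, meta_id=META_ID):
    """Boolean mask M[q][k]: on the meta-attention path, query position q
    may attend key position k only if k is causal (k <= q) and k holds a
    meta-token. Ordinary causal multi-head attention runs separately."""
    meta_positions = {j for j, tok in enumerate(seq) if tok == meta_id}
    n = len(seq)
    return [[(k in meta_positions) and (k <= q) for k in range(n)]
            for q in range(n)]
```

Under this reading, each meta-token acts as a trainable landmark that "caches" the span preceding it, and the mask restricts the extra attention path to those landmark positions only.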
Related papers
- Tokenization, Fusion and Decoupling: Bridging the Granularity Mismatch Between Large Language Models and Knowledge Graphs [20.946228883628013]
We propose KGT, a novel framework that uses dedicated entity tokens to enable efficient, full-space prediction. We first introduce specialized tokenization to construct feature representations at the level of dedicated entity tokens. We then fuse pre-trained structural and textual features into these unified embeddings via a relation-guided gating mechanism.
arXiv Detail & Related papers (2026-02-26T07:20:40Z) - Finding the Translation Switch: Discovering and Exploiting the Task-Initiation Features in LLMs [69.28193153685893]
Large Language Models (LLMs) frequently exhibit strong translation abilities, even without task-specific fine-tuning. To demystify this process, we leverage Sparse Autoencoders (SAEs) and introduce a novel framework for identifying task-specific features. Our work not only decodes a core component of the translation mechanism in LLMs but also provides a blueprint for using internal model mechanisms to create more robust and efficient models.
arXiv Detail & Related papers (2026-01-16T06:29:07Z) - STELLA: Guiding Large Language Models for Time Series Forecasting with Semantic Abstractions [4.169671705130711]
We propose STELLA, a framework that mines and injects structured supplementary and complementary information. STELLA employs a dynamic semantic abstraction mechanism that decouples input series into trend, seasonality, and residual components. Experiments on eight benchmark datasets demonstrate that STELLA outperforms state-of-the-art methods in long- and short-term forecasting.
arXiv Detail & Related papers (2025-12-04T14:56:36Z) - Learning to Compress: Unlocking the Potential of Large Language Models for Text Representation [34.21806963402883]
We study the untapped potential of context compression as a pretext task for unsupervised adaptation of large language models (LLMs). Experiments demonstrate that a well-designed compression objective can significantly enhance LLM-based text representations. Further improvements through contrastive learning produce a strong representation model (LLM2Comp).
arXiv Detail & Related papers (2025-11-21T10:45:44Z) - Context-level Language Modeling by Learning Predictive Context Embeddings [79.00607069677393]
We introduce ContextLM, a framework that augments standard pretraining with an inherent next-context prediction objective. This mechanism trains the model to learn predictive representations of multi-token contexts, leveraging error signals derived from future token chunks. Experiments on the GPT-2 and Pythia model families, scaled up to 1.5B parameters, show that ContextLM delivers consistent improvements in both perplexity and downstream task performance.
arXiv Detail & Related papers (2025-10-23T07:09:45Z) - Attention Illuminates LLM Reasoning: The Preplan-and-Anchor Rhythm Enables Fine-Grained Policy Optimization [56.083511902353365]
Reinforcement learning (RL) typically applies uniform credit across an entire generation from a large language model. This work positions attention as a privileged substrate that renders the internal logic of LLMs as a mechanistic blueprint of reasoning itself. We introduce three novel RL strategies that dynamically perform targeted credit assignment to critical nodes.
arXiv Detail & Related papers (2025-10-15T13:49:51Z) - Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning [53.57895922042783]
Large Language Models (LLMs) excel at reasoning and planning when trained on chain-of-thought (CoT) data. We propose a hybrid representation of the reasoning process, where we partially abstract away the initial reasoning steps using latent discrete tokens.
arXiv Detail & Related papers (2025-02-05T15:33:00Z) - Core Context Aware Transformers for Long Context Language Modeling [50.774702091154204]
We propose a plug-and-play Core Context Aware (CCA) Attention for efficient long-context modeling. Our method automatically focuses on and strengthens the core context while diminishing redundancy during the learning process. It can replace the self-attention module in existing Large Language Models with minimal fine-tuning cost.
arXiv Detail & Related papers (2024-12-17T01:54:08Z) - Enhancing Character-Level Understanding in LLMs through Token Internal Structure Learning [20.801571525710834]
Token Internal Position Awareness (TIPA) is a method that significantly improves models' ability to capture character positions within tokens. TIPA enhances position prediction accuracy in large language models, enabling more precise identification of target characters in original text.
arXiv Detail & Related papers (2024-11-26T18:44:39Z) - Context-Aware Meta-Learning [52.09326317432577]
We propose a meta-learning algorithm that emulates Large Language Models by learning new visual concepts during inference without fine-tuning.
Our approach exceeds or matches the state-of-the-art algorithm, P>M>F, on 8 out of 11 meta-learning benchmarks.
arXiv Detail & Related papers (2023-10-17T03:35:27Z) - Induced Natural Language Rationales and Interleaved Markup Tokens Enable Extrapolation in Large Language Models [8.166629393064097]
The ability to extrapolate, i.e., to make predictions on sequences that are longer than those presented as training examples, is a challenging problem for deep learning models.
Recent work shows that this limitation persists in state-of-the-art Transformer-based models.
We demonstrate that large language models can succeed in extrapolation without modifying their architecture or training procedure.
arXiv Detail & Related papers (2022-08-24T11:25:27Z) - MetaICL: Learning to Learn In Context [87.23056864536613]
We introduce MetaICL, a new meta-training framework for few-shot learning where a pretrained language model is tuned to do in-context learning on a large set of training tasks.
We show that MetaICL approaches (and sometimes beats) the performance of models fully finetuned on the target task training data, and outperforms much bigger models with nearly 8x more parameters.
arXiv Detail & Related papers (2021-10-29T17:42:08Z) - Self-Supervised Meta-Learning for Few-Shot Natural Language Classification Tasks [40.97125791174191]
We propose a self-supervised approach to generate a large, rich, meta-learning task distribution from unlabeled text.
We show that this meta-training leads to better few-shot generalization than language-model pre-training followed by finetuning.
arXiv Detail & Related papers (2020-09-17T17:53:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.