LIME: Making LLM Data More Efficient with Linguistic Metadata Embeddings
- URL: http://arxiv.org/abs/2512.07522v1
- Date: Mon, 08 Dec 2025 12:59:24 GMT
- Title: LIME: Making LLM Data More Efficient with Linguistic Metadata Embeddings
- Authors: Sebastian Sztwiertnia, Felix Friedrich, Kristian Kersting, Patrick Schramowski, Björn Deiseroth
- Abstract summary: We propose LIME (Linguistic Metadata Embeddings), a method that enriches token embeddings with metadata capturing syntax, semantics, and contextual properties. LIME substantially improves pre-training efficiency: it adapts up to 56% faster to the training data distribution while introducing only 0.01% additional parameters at negligible compute overhead. In addition, we develop a variant with shifted metadata, LIME+1, that can guide token generation.
- Score: 44.57551925823648
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Pre-training decoder-only language models relies on vast amounts of high-quality data, yet the availability of such data is increasingly reaching its limits. While metadata is commonly used to create and curate these datasets, its potential as a direct training signal remains under-explored. We challenge this status quo and propose LIME (Linguistic Metadata Embeddings), a method that enriches token embeddings with metadata capturing syntax, semantics, and contextual properties. LIME substantially improves pre-training efficiency. Specifically, it adapts up to 56% faster to the training data distribution, while introducing only 0.01% additional parameters at negligible compute overhead. Beyond efficiency, LIME improves tokenization, leading to remarkably stronger language modeling capabilities and generative task performance. These benefits persist across model scales (500M to 2B). In addition, we develop a variant with shifted metadata, LIME+1, that can guide token generation. Given prior metadata for the next token, LIME+1 improves reasoning performance by up to 38% and arithmetic accuracy by up to 35%.
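The abstract does not spell out the architecture, but the core idea of enriching token embeddings with linguistic metadata can be sketched as follows. This is a minimal illustration, assuming one metadata tag ID per token (e.g. a part-of-speech tag) that is looked up in a small embedding table and added to the token embedding; the tag inventory, alignment scheme, and the shift handling for LIME+1 are assumptions for illustration, not the authors' actual design.

```python
# Minimal sketch of metadata-enriched token embeddings in the spirit of LIME.
# Assumption (not from the paper): each token carries one metadata tag ID,
# and the metadata embedding is simply added to the token embedding before
# the transformer stack.
import torch
import torch.nn as nn

class MetadataEnrichedEmbedding(nn.Module):
    def __init__(self, vocab_size: int, num_metadata_tags: int, d_model: int):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # The metadata table is tiny relative to the token table, consistent
        # with the reported ~0.01% parameter overhead.
        self.meta_emb = nn.Embedding(num_metadata_tags, d_model)

    def forward(self, token_ids: torch.Tensor, meta_ids: torch.Tensor) -> torch.Tensor:
        # token_ids, meta_ids: (batch, seq_len)
        return self.token_emb(token_ids) + self.meta_emb(meta_ids)

# Usage sketch. For LIME+1, meta_ids would be aligned to the *next* token
# (shifted by one position) so that prior metadata can guide generation.
emb = MetadataEnrichedEmbedding(vocab_size=32_000, num_metadata_tags=64, d_model=1024)
tokens = torch.randint(0, 32_000, (2, 16))
tags = torch.randint(0, 64, (2, 16))
hidden = emb(tokens, tags)  # shape: (2, 16, 1024)
```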
Related papers
- Beyond URLs: Metadata Diversity and Position for Efficient LLM Pretraining [45.51273144181658]
We investigate a wider range of metadata types beyond URLs, including fine-grained indicators of document quality. We introduce metadata appending as a means of improving training efficiency. We analyze latent representations to understand how metadata shapes learning.
arXiv Detail & Related papers (2025-11-26T17:36:31Z) - Reusing Pre-Training Data at Test Time is a Compute Multiplier [35.81885343245217]
We quantify how much dataset value was left behind by the process of pre-training. We demonstrate that pre-training then retrieving from standard and largely open-sourced datasets results in significant accuracy gains. These results can be further improved by leveraging additional compute at test time to parse the retrieved context.
arXiv Detail & Related papers (2025-11-06T10:10:43Z) - Thinking Augmented Pre-training [88.04395622064708]
This paper introduces a simple and scalable approach to improve the data efficiency of large language model (LLM) training by augmenting existing text data with thinking trajectories. Thinking Augmented Pre-training is a universal methodology that augments text with automatically generated thinking trajectories.
arXiv Detail & Related papers (2025-09-24T14:45:13Z) - MASS: Mathematical Data Selection via Skill Graphs for Pretraining Large Language Models [44.458342094004024]
High-quality data plays a critical role in the pretraining and fine-tuning of large language models (LLMs). We introduce MASS, a MAthematical data Selection framework using the Skill graph for pretraining LLMs. Experimental results demonstrate the efficiency and effectiveness of MASS across different model sizes.
arXiv Detail & Related papers (2025-03-19T05:50:21Z) - Metadata Conditioning Accelerates Language Model Pre-training [76.54265482251454]
We propose a new method, termed Metadata Conditioning then Cooldown (MeCo), to incorporate additional learning cues during pre-training; a minimal sketch of the idea appears after this list. MeCo significantly accelerates pre-training across different model scales (600M to 8B parameters) and training sources (C4, RefinedWeb, and DCLM). MeCo is remarkably simple, adds no computational overhead, and demonstrates promise in producing more capable and steerable language models.
arXiv Detail & Related papers (2025-01-03T18:59:23Z) - What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy.
By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z) - Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z) - Investigating Pre-trained Language Models on Cross-Domain Datasets, a Step Closer to General AI [0.8889304968879164]
We investigate the ability of pre-trained language models to generalize to different non-language tasks.
The four pre-trained models that we used, T5, BART, BERT, and GPT-2, achieve outstanding results.
arXiv Detail & Related papers (2023-06-21T11:55:17Z)
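As a companion to the MeCo entry above, here is a minimal data-preparation sketch of metadata conditioning followed by a cooldown phase. The prefix template, the URL field, and the 10% cooldown fraction are assumptions for illustration, not the paper's exact setup.

```python
# Minimal sketch of metadata conditioning then cooldown (MeCo-style), with an
# assumed prefix template. During most of pre-training each document is
# prefixed with its metadata (e.g. the source URL); in the final cooldown
# phase the prefix is dropped so the model also works without metadata.
def format_example(doc: dict, in_cooldown: bool) -> str:
    if in_cooldown:
        return doc["text"]
    # Hypothetical template; the paper's exact formatting may differ.
    return f"URL: {doc['url']}\n\n{doc['text']}"

docs = [{"url": "https://example.com/a", "text": "Some training document."}]
total_steps = 100_000
cooldown_start = 90_000  # assumed: last 10% of training without metadata

for step in (0, 95_000):  # two illustrative steps: before and during cooldown
    batch = [format_example(d, in_cooldown=step >= cooldown_start) for d in docs]
    print(step, batch)
```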