Beyond URLs: Metadata Diversity and Position for Efficient LLM Pretraining
- URL: http://arxiv.org/abs/2511.21613v1
- Date: Wed, 26 Nov 2025 17:36:31 GMT
- Title: Beyond URLs: Metadata Diversity and Position for Efficient LLM Pretraining
- Authors: Dongyang Fan, Diba Hashemi, Sai Praneeth Karimireddy, Martin Jaggi
- Abstract summary: We investigate a wider range of metadata types and find that other types, such as fine-grained indicators of document quality, can also accelerate pretraining. We introduce metadata appending as a means of improving training efficiency, and we analyze latent representations to understand how metadata shapes learning.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Incorporating metadata in Large Language Model (LLM) pretraining has recently emerged as a promising approach to accelerating training. However, prior work highlighted only one useful signal, URLs, leaving open the question of whether other forms of metadata could yield greater benefits. In this study, we investigate a wider range of metadata types and find that other types, such as fine-grained indicators of document quality, can also accelerate pretraining when prepended. We identify a common feature among effective metadata: they encode information at a finer granularity. We further introduce metadata appending as a means of improving training efficiency, where predicting appropriate metadata as an auxiliary task can help speed up pretraining. In addition, learnable meta-tokens trained with a masked loss can recover part of the speedup by inducing quality-aware latent structure. Using probing, we analyze latent representations to understand how metadata shapes learning. Together, these results yield practical guidelines for integrating metadata to improve both the efficiency and effectiveness of LLM pretraining.
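To make the prepend/append setups concrete, here is a minimal sketch of how a training example and its loss mask might be assembled. Every name (the `build_example` helper, the meta-token id) is a hypothetical illustration rather than the authors' implementation; the masked-loss idea is approximated by zeroing the loss weight on prepended metadata positions.

```python
# Minimal sketch (not the authors' code): building a pretraining example
# with metadata prepended or appended, plus a loss mask approximating the
# paper's masked-loss setup (no loss on prepended metadata tokens).
from typing import List, Tuple

def build_example(doc_ids: List[int],
                  meta_ids: List[int],
                  position: str = "prepend",
                  mask_meta_loss: bool = True) -> Tuple[List[int], List[int]]:
    """Return (input_ids, loss_mask); a mask entry of 0 drops that token's loss."""
    if position == "prepend":
        # Metadata conditions the document; its own loss can be masked out.
        input_ids = meta_ids + doc_ids
        loss_mask = [0 if mask_meta_loss else 1] * len(meta_ids) + [1] * len(doc_ids)
    elif position == "append":
        # Predicting metadata after the document acts as an auxiliary task,
        # so its tokens keep a loss weight of 1.
        input_ids = doc_ids + meta_ids
        loss_mask = [1] * (len(doc_ids) + len(meta_ids))
    else:
        raise ValueError(f"unknown position: {position!r}")
    return input_ids, loss_mask

# Hypothetical usage: id 50003 stands for a "quality bucket 3" meta-token.
doc_tokens = [11, 42, 7, 99]
meta_tokens = [50003]
print(build_example(doc_tokens, meta_tokens, "prepend"))  # loss masked on metadata
print(build_example(doc_tokens, meta_tokens, "append"))   # loss kept on metadata
```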
Related papers
- LIME: Making LLM Data More Efficient with Linguistic Metadata Embeddings
We propose LIME (Linguistic Metadata Embeddings), a method that enriches token embeddings with metadata capturing syntax, semantics, and contextual properties. LIME substantially improves pre-training efficiency. Specifically, it adapts up to 56% faster to the training data distribution, while introducing only 0.01% additional parameters at negligible compute overhead. In addition, we develop a variant with shifted metadata, LIME+1, that can guide token generation.
arXiv Detail & Related papers (2025-12-08T12:59:24Z)
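As a rough sketch of the general idea (invented names and sizes, not the LIME implementation), metadata-enriched embeddings can be formed by adding a small learned embedding, indexed by a per-token linguistic label, to the ordinary token embedding:

```python
# Rough sketch of metadata-enriched token embeddings; names, sizes, and the
# additive combination are assumptions, not the LIME implementation.
import torch
import torch.nn as nn

class MetadataEnrichedEmbedding(nn.Module):
    def __init__(self, vocab_size: int, d_model: int, num_meta_classes: int):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        # A tiny table of per-class metadata embeddings; with few classes it
        # adds a negligible number of parameters relative to the full model.
        self.meta = nn.Embedding(num_meta_classes, d_model)

    def forward(self, token_ids: torch.Tensor, meta_ids: torch.Tensor) -> torch.Tensor:
        # meta_ids carries one linguistic label per token (e.g. a POS-tag id).
        return self.tok(token_ids) + self.meta(meta_ids)

emb = MetadataEnrichedEmbedding(vocab_size=32000, d_model=256, num_meta_classes=32)
tokens = torch.randint(0, 32000, (2, 8))   # batch of 2 sequences, 8 tokens each
labels = torch.randint(0, 32, (2, 8))      # per-token metadata labels
print(emb(tokens, labels).shape)           # torch.Size([2, 8, 256])
```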
- URLs Help, Topics Guide: Understanding Metadata Utility in LLM Training
We show that only URL context speeds up training, whereas quality scores and topic/format domain information offer no clear benefit. Although topic and format metadata do not accelerate training, they are effective for steering outputs, offering human-interpretable control over generation.
arXiv Detail & Related papers (2025-05-22T12:01:20Z)
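The steering use of topic and format metadata can be illustrated with a small hypothetical inference-time helper; the tag format below is an assumption and only works if the model was pretrained with matching tags:

```python
# Illustrative inference-time helper for steering a metadata-conditioned
# model by prepending topic/format tags. The tag strings are hypothetical
# and only meaningful if the model saw the same tags during pretraining.
from typing import Optional

def steered_prompt(prompt: str,
                   topic: Optional[str] = None,
                   fmt: Optional[str] = None) -> str:
    tags = []
    if topic:
        tags.append(f"<topic:{topic}>")
    if fmt:
        tags.append(f"<format:{fmt}>")
    return " ".join(tags + [prompt])

# e.g. "<topic:science> <format:tutorial> Explain transformers."
print(steered_prompt("Explain transformers.", topic="science", fmt="tutorial"))
```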
- TL-Training: A Task-Feature-Based Framework for Training Large Language Models in Tool Use
Large language models (LLMs) achieve remarkable advancements by leveraging tools to interact with environments. The standard supervised fine-tuning approach, which relies on large-scale datasets, often overlooks task-specific characteristics in tool use. We propose TL-Training, a task-feature-based framework that mitigates the effects of suboptimal training data.
arXiv Detail & Related papers (2024-12-20T02:21:36Z)
- FREE: Faster and Better Data-Free Meta-Learning
Data-Free Meta-Learning (DFML) aims to extract knowledge from a collection of pre-trained models without requiring the original data. We introduce the Faster and Better Data-Free Meta-Learning framework, which contains: (i) a meta-generator for rapidly recovering training tasks from pre-trained models; and (ii) a meta-learner for generalizing to new unseen tasks.
arXiv Detail & Related papers (2024-05-02T03:43:19Z)
- Architecture, Dataset and Model-Scale Agnostic Data-free Meta-Learning
We propose ePisode cUrriculum inveRsion (ECI) during data-free meta training and invErsion calibRation following inner loop (ICFIL) during meta testing. ECI adaptively increases the difficulty level of pseudo episodes according to the real-time feedback of the meta model. We formulate the optimization process of meta training with ECI as an adversarial form in an end-to-end manner.
arXiv Detail & Related papers (2023-03-20T15:10:41Z)
We introduce a fast optimization-based meta-learning method for few-shot classification.
Our strategy enables important aspects of the base learner objective to be learned during meta-training.
We perform a comprehensive experimental analysis, demonstrating the speed and effectiveness of our approach.
arXiv Detail & Related papers (2020-10-01T15:59:31Z)
- Incremental Meta-Learning via Indirect Discriminant Alignment
We develop a notion of incremental learning during the meta-training phase of meta-learning.
Our approach performs favorably at test time as compared to training a model with the full meta-training set.
arXiv Detail & Related papers (2020-02-11T01:39:12Z)