URLs Help, Topics Guide: Understanding Metadata Utility in LLM Training
- URL: http://arxiv.org/abs/2505.16570v1
- Date: Thu, 22 May 2025 12:01:20 GMT
- Title: URLs Help, Topics Guide: Understanding Metadata Utility in LLM Training
- Authors: Dongyang Fan, Vinko Sabolčec, Martin Jaggi
- Abstract summary: We show that only URL context speeds up training, whereas quality scores and topic/format domain information offer no clear benefit. Although topic and format metadata do not accelerate training, they are effective for steering outputs, offering human-interpretable control over generation.
- Score: 33.68104398807581
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are commonly pretrained on vast corpora of text without utilizing contextual metadata such as source, quality, or topic, leading to a context-free learning paradigm. While recent studies suggest that adding metadata like URL information as context (i.e., auxiliary inputs not used in the loss calculation) can improve training efficiency and downstream performance, they offer limited understanding of which types of metadata are truly effective and under what conditions. In this work, we conduct a systematic evaluation and find that not all metadata types contribute equally. Only URL context speeds up training, whereas quality scores and topic/format domain information offer no clear benefit. Furthermore, the improved downstream performance of URL conditioning emerges only when longer prompts are used at inference time. In addition, we demonstrate that context-aware pretraining enables more controllable generation than context-free pretraining, in a classifier-free guidance fashion. Although topic and format metadata do not accelerate training, they are effective for steering outputs, offering human-interpretable control over generation.
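The two mechanisms the abstract describes are straightforward to sketch: metadata prepended as auxiliary context whose tokens are excluded from the loss, and classifier-free-guidance-style steering at inference. The following is a minimal illustration, not the authors' code; the `tokenizer` and `model` objects, the `URL:` prefix format, and all function names are assumptions standing in for any HuggingFace-style causal language model.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # positions with this label are skipped by cross_entropy

def build_example(tokenizer, url: str, document: str):
    """Prepend the URL as context and mask its tokens out of the loss,
    so the model conditions on the metadata without learning to predict it."""
    prefix_ids = tokenizer.encode(f"URL: {url}\n", add_special_tokens=False)
    doc_ids = tokenizer.encode(document, add_special_tokens=False)
    input_ids = torch.tensor(prefix_ids + doc_ids)
    labels = input_ids.clone()
    labels[: len(prefix_ids)] = IGNORE_INDEX  # auxiliary input only
    return input_ids, labels

def lm_loss(model, input_ids, labels):
    """Standard shifted next-token loss; metadata positions contribute nothing."""
    logits = model(input_ids.unsqueeze(0)).logits  # (1, T, vocab)
    return F.cross_entropy(logits[0, :-1], labels[1:], ignore_index=IGNORE_INDEX)

def guided_logits(model, ids_with_meta, ids_plain, guidance_scale: float):
    """Classifier-free-guidance-style steering at inference: interpolate
    between metadata-conditioned and unconditioned next-token logits."""
    cond = model(ids_with_meta.unsqueeze(0)).logits[0, -1]
    uncond = model(ids_plain.unsqueeze(0)).logits[0, -1]
    return uncond + guidance_scale * (cond - uncond)
```

With `guidance_scale` greater than 1, sampling is pushed toward the behavior the metadata implies (e.g., a topic or format tag); at exactly 1 it reduces to ordinary metadata-conditioned sampling.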
Related papers
- Beyond URLs: Metadata Diversity and Position for Efficient LLM Pretraining [45.51273144181658]
We investigate a wider range of metadata types, including fine-grained indicators of document quality. We introduce metadata appending as a means of improving training efficiency. We analyze latent representations to understand how metadata shapes learning.
arXiv Detail & Related papers (2025-11-26T17:36:31Z) - When Does Metadata Conditioning (NOT) Work for Language Model Pre-Training? A Study with Context-Free Grammars [34.80529788630565]
Latent semantics is one of the key properties that determine the performance of language models. One convenient approach to invoke this ability is to prepend metadata at the beginning of texts in the pre-training data. We show that training with metadata helps improve the model's performance when the given context is long enough to infer latent semantics.
arXiv Detail & Related papers (2025-04-24T13:56:43Z) - Organize the Web: Constructing Domains Enhances Pre-Training Data Curation [129.27104172458363]
We develop a framework for organizing web pages in terms of both their topic and format. We automatically annotate pre-training data by distilling annotations from a large language model into efficient classifiers. Our work demonstrates that constructing and mixing domains provides a valuable complement to quality-based data curation methods.
arXiv Detail & Related papers (2025-02-14T18:02:37Z) - Metadata Conditioning Accelerates Language Model Pre-training [76.54265482251454]
We propose a new method, termed Metadata Conditioning then Cooldown (MeCo), to incorporate additional learning cues during pre-training. MeCo significantly accelerates pre-training across different model scales (600M to 8B parameters) and training sources (C4, RefinedWeb, and DCLM). MeCo is remarkably simple, adds no computational overhead, and demonstrates promise in producing more capable and steerable language models; a minimal sketch of the conditioning-then-cooldown recipe appears after this list.
arXiv Detail & Related papers (2025-01-03T18:59:23Z) - On the Loss of Context-awareness in General Instruction Fine-tuning [101.03941308894191]
We investigate the loss of context awareness after supervised fine-tuning. We find that the performance decline is associated with a bias toward different roles learned during conversational instruction fine-tuning. We propose a metric to identify context-dependent examples from general instruction fine-tuning datasets.
arXiv Detail & Related papers (2024-11-05T00:16:01Z) - Maintaining Informative Coherence: Migrating Hallucinations in Large Language Models via Absorbing Markov Chains [6.920249042435973]
Large Language Models (LLMs) are powerful tools for text generation, translation, and summarization.
LLMs often suffer from hallucinations: instances where they fail to maintain the fidelity and coherence of contextual information.
We propose a novel decoding strategy that leverages absorbing Markov chains to quantify the significance of contextual information.
arXiv Detail & Related papers (2024-10-27T04:51:18Z) - CorDA: Context-Oriented Decomposition Adaptation of Large Language Models for Task-Aware Parameter-Efficient Fine-tuning [101.81127587760831]
Current fine-tuning methods build adapters widely agnostic of the context of the downstream task to learn, or of the important knowledge to maintain. We propose CorDA, a Context-oriented Decomposition Adaptation method that builds learnable task-aware adapters. Our method enables two options: knowledge-preserved adaptation and instruction-previewed adaptation.
arXiv Detail & Related papers (2024-06-07T19:10:35Z) - Generative Context-aware Fine-tuning of Self-supervised Speech Models [54.389711404209415]
We study the use of context information generated by generative large language models (LLMs).
We propose an approach to distill the generated information during fine-tuning of self-supervised speech models.
We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks: automatic speech recognition, named entity recognition, and sentiment analysis.
arXiv Detail & Related papers (2023-12-15T15:46:02Z) - ALP: Action-Aware Embodied Learning for Perception [60.64801970249279]
We introduce Action-Aware Embodied Learning for Perception (ALP)
ALP incorporates action information into representation learning through a combination of optimizing a reinforcement learning policy and an inverse dynamics prediction objective.
We show that ALP outperforms existing baselines in several downstream perception tasks.
arXiv Detail & Related papers (2023-06-16T21:51:04Z) - UNITS: Unsupervised Intermediate Training Stage for Scene Text Detection [16.925048424113463]
We propose a new training paradigm for scene text detection, which introduces an UNsupervised Intermediate Training Stage (UNITS).
UNITS builds a buffer path to real-world data and can alleviate the gap between the pre-training stage and fine-tuning stage.
Three training strategies are further explored to perceive information from real-world data in an unsupervised way.
arXiv Detail & Related papers (2022-05-10T05:34:58Z) - How does a Pre-Trained Transformer Integrate Contextual Keywords? Application to Humanitarian Computing [0.0]
This paper describes how to improve a humanitarian classification task by adding the crisis event type to each tweet to be classified.
It shows that the proposed neural network approach partially overfits the particularities of the Crisis Benchmark.
arXiv Detail & Related papers (2021-11-07T11:24:08Z) - DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks [88.62288327934499]
We propose a novel augmentation method with language models trained on the linearized labeled sentences.
Our method is applicable to both supervised and semi-supervised settings.
arXiv Detail & Related papers (2020-11-03T07:49:15Z) - Pre-training Text Representations as Meta Learning [113.3361289756749]
We introduce a learning algorithm that directly optimizes the model's ability to learn text representations for effective learning of downstream tasks.
We show that there is an intrinsic connection between multi-task pre-training and model-agnostic meta-learning with a sequence of meta-train steps.
arXiv Detail & Related papers (2020-04-12T09:05:47Z)
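As referenced in the MeCo entry above, here is a minimal sketch of a conditioning-then-cooldown data pipeline. It is an assumption-level illustration based only on the method name and summary, not the MeCo authors' code; the `URL:` tag format and the cooldown fraction are hypothetical.

```python
COOLDOWN_FRACTION = 0.1  # hypothetical: final fraction of steps trained context-free

def format_example(url: str, document: str, step: int, total_steps: int) -> str:
    """Prepend metadata for most of pretraining, then drop it during the
    final cooldown so the model also works without metadata at inference."""
    in_cooldown = step >= int((1 - COOLDOWN_FRACTION) * total_steps)
    return document if in_cooldown else f"URL: {url}\n{document}"
```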
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.