Facts in Stats: Impacts of Pretraining Diversity on Language Model Generalization
- URL: http://arxiv.org/abs/2510.16096v1
- Date: Fri, 17 Oct 2025 17:58:01 GMT
- Title: Facts in Stats: Impacts of Pretraining Diversity on Language Model Generalization
- Authors: Tina Behnia, Puneesh Deora, Christos Thrampoulidis
- Abstract summary: This paper introduces a flexible synthetic testbed that combines a statistical stream of generic tokens with an abstract factual stream of source-target token pairs. We find that while higher contextual diversity delays in-distribution (ID) factual accuracy, its impact on out-of-distribution (OOD) factual generalization depends critically on contextual structure.
- Score: 33.5861323022684
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language models are pretrained on sequences that blend statistical regularities (making text fluent) with factual associations between specific tokens (knowledge of facts). While recent work suggests that the variability of their interaction, such as paraphrases of factual associations, critically determines generalization ability, we lack a systematic analysis of these impacts. This paper introduces a flexible synthetic testbed that combines a statistical stream of generic tokens with an abstract factual stream of source-target token pairs, enabling fine-grained control over their interaction. The design enables the independent control of diversity nature by manipulating stream composition (contextual structure) and the diversity level by varying which statistical streams each fact appears in. Through controlled experiments, we find that while higher contextual diversity delays in-distribution (ID) factual accuracy, its impact on out-of-distribution (OOD) factual generalization depends critically on contextual structure. In some cases, OOD performance follows the same trend as ID, but in others, diversity becomes essential for non-trivial factual recall. Even when low diversity prohibits factual recall, optimal diversity levels depend on training duration. Beyond factual recall failures, we identify structures where statistical generalization fails independently, and others where both capabilities degrade. This shows how the interplay between contextual design and diversity level impacts different generalization aspects. Further, through a series of controlled interventions on the model components, we trace the OOD failures to distinct optimization bottlenecks, highlighting the importance of the embedding and unembedding layers. Our synthetic framework allows us to isolate effects that would be confounded in large-scale studies, offering a controlled testbed for future investigations.
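As a rough illustration of the testbed idea described in the abstract, the following minimal sketch mixes a statistical stream of generic tokens with a source-target fact pair, and controls the diversity level by limiting how many streams each fact may appear in. Everything concrete here is an assumption made for illustration and is not taken from the paper: the first-order Markov chains used as statistical streams, the token id layout, the placement of the target at the end of the sequence, and all function names (`make_streams`, `make_facts`, `assign_facts`, `sample_sequence`).

```python
import random


def make_streams(num_streams, vocab_size, seed=0):
    """Build one random first-order Markov transition table per stream (assumed form)."""
    rng = random.Random(seed)
    streams = []
    for _ in range(num_streams):
        table = []
        for _ in range(vocab_size):
            weights = [rng.random() for _ in range(vocab_size)]
            total = sum(weights)
            table.append([w / total for w in weights])  # each row sums to 1
        streams.append(table)
    return streams


def make_facts(num_facts, vocab_size):
    """Give each fact a (source, target) token pair outside the generic vocabulary."""
    return [(vocab_size + 2 * f, vocab_size + 2 * f + 1) for f in range(num_facts)]


def assign_facts(num_facts, num_streams, diversity, seed=0):
    """Diversity level: the number of distinct streams each fact may appear in."""
    rng = random.Random(seed)
    return {f: rng.sample(range(num_streams), diversity) for f in range(num_facts)}


def sample_sequence(fact, facts, streams, fact_streams, length, rng):
    """Sample generic tokens from one allowed stream, then embed the fact pair."""
    source, target = facts[fact]
    stream_id = rng.choice(fact_streams[fact])
    table = streams[stream_id]
    vocab = list(range(len(table)))
    tokens = [rng.choice(vocab)]
    while len(tokens) < length:
        tokens.append(rng.choices(vocab, weights=table[tokens[-1]])[0])
    # Hypothetical layout: source at a random earlier position, target last.
    tokens[rng.randrange(length - 1)] = source
    tokens[-1] = target
    return stream_id, tokens


if __name__ == "__main__":
    streams = make_streams(num_streams=4, vocab_size=8)
    facts = make_facts(num_facts=3, vocab_size=8)
    fact_streams = assign_facts(num_facts=3, num_streams=4, diversity=2)
    print(sample_sequence(0, facts, streams, fact_streams, 12, random.Random(0)))
```

In this sketch, an OOD probe would pair a fact with a stream it was never assigned to during training; how the paper actually constructs its ID and OOD splits is not specified in the abstract.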
Related papers
- Race, Ethnicity and Their Implication on Bias in Large Language Models [9.202525724606188]
We study how race and ethnicity are represented and operationalized within large language models (LLMs). We find that demographic information is distributed across internal units with substantial cross-model variation. Interventions suppressing such neurons reduce bias but leave substantial residual effects.
arXiv Detail & Related papers (2026-01-19T09:24:24Z) - How Quantization Shapes Bias in Large Language Models [61.40435736418359]
We focus on weight and activation quantization strategies and examine their effects across a broad range of bias types. We employ both probabilistic and generated text-based metrics across nine benchmarks and evaluate models varying in architecture family and reasoning ability.
arXiv Detail & Related papers (2025-08-25T14:48:26Z) - Quantifying Fairness in LLMs Beyond Tokens: A Semantic and Statistical Perspective [24.54292750583169]
Large Language Models (LLMs) often generate responses with inherent biases, undermining their reliability in real-world applications. We propose FiSCo (Fine-grained Semantic Comparison), a novel statistical framework to evaluate group-level fairness in LLMs. We decompose model outputs into semantically distinct claims and apply statistical hypothesis testing to compare inter- and intra-group similarities.
arXiv Detail & Related papers (2025-06-23T18:31:22Z) - Bridging the Generalisation Gap: Synthetic Data Generation for Multi-Site Clinical Model Validation [0.3362278589492841]
Existing model evaluation approaches often rely on real-world datasets, which are limited in availability, embed confounding biases, and lack the flexibility needed for systematic experimentation. We propose a novel structured synthetic data framework designed for the controlled benchmarking of model robustness, fairness, and generalisability.
arXiv Detail & Related papers (2025-04-29T11:04:28Z) - Collaborative Value Function Estimation Under Model Mismatch: A Federated Temporal Difference Analysis [55.13545823385091]
Federated reinforcement learning (FedRL) enables collaborative learning while preserving data privacy by preventing direct data exchange between agents. In real-world applications, each agent may experience slightly different transition dynamics, leading to inherent model mismatches. We show that even moderate levels of information sharing significantly mitigate environment-specific errors.
arXiv Detail & Related papers (2025-03-21T18:06:28Z) - Quantitative Assessment of Intersectional Empathetic Bias and Understanding [0.0]
A growing amount of literature critiques the current operationalizations of empathy based on loose definitions of the construct.
We propose an empathy evaluation framework that operationalizes empathy close to its psychological origins.
arXiv Detail & Related papers (2024-11-08T18:43:15Z) - Detecting and Identifying Selection Structure in Sequential Data [53.24493902162797]
We argue that the selective inclusion of data points based on latent objectives is common in practical situations, such as music sequences.
We show that selection structure is identifiable without any parametric assumptions or interventional experiments.
We also propose a provably correct algorithm to detect and identify selection structures as well as other types of dependencies.
arXiv Detail & Related papers (2024-06-29T20:56:34Z) - How to Handle Different Types of Out-of-Distribution Scenarios in Computational Argumentation? A Comprehensive and Fine-Grained Field Study [59.13867562744973]
This work systematically assesses LMs' capabilities for out-of-distribution (OOD) scenarios.
We find that the efficacy of such learning paradigms varies with the type of OOD.
Specifically, while ICL excels for domain shifts, prompt-based fine-tuning surpasses for topic shifts.
arXiv Detail & Related papers (2023-09-15T11:15:47Z) - A Causal Framework for Decomposing Spurious Variations [68.12191782657437]
We develop tools for decomposing spurious variations in Markovian and Semi-Markovian models.
We prove the first results that allow a non-parametric decomposition of spurious effects.
The described approach has several applications, ranging from explainable and fair AI to questions in epidemiology and medicine.
arXiv Detail & Related papers (2023-06-08T09:40:28Z) - Exploring Resiliency to Natural Image Corruptions in Deep Learning using Design Diversity [0.6445605125467573]
We investigate the relationship between diversity metrics, accuracy, and resiliency to natural image corruptions of Deep Learning (DL) image ensembles.
Our motivation is based on analytical studies of design diversity that have shown that a reduction of common failure modes is possible if diversity of design choices is achieved.
arXiv Detail & Related papers (2023-03-15T08:54:10Z) - Counterfactual Reasoning for Out-of-distribution Multimodal Sentiment Analysis [56.84237932819403]
This paper aims to estimate and mitigate the bad effect of textual modality for strong OOD generalization.
Inspired by this, we devise a model-agnostic counterfactual framework for multimodal sentiment analysis.
arXiv Detail & Related papers (2022-07-24T03:57:40Z) - Learning Causal Semantic Representation for Out-of-Distribution Prediction [125.38836464226092]
We propose a Causal Semantic Generative model (CSG) based on a causal reasoning so that the two factors are modeled separately.
We show that CSG can identify the semantic factor by fitting training data, and this semantic-identification guarantees the boundedness of OOD generalization error.
arXiv Detail & Related papers (2020-11-03T13:16:05Z)