What Matters More For In-Context Learning under Matched Compute Budgets: Pretraining on Natural Text or Incorporating Targeted Synthetic Examples?
- URL: http://arxiv.org/abs/2509.22947v1
- Date: Fri, 26 Sep 2025 21:27:55 GMT
- Title: What Matters More For In-Context Learning under Matched Compute Budgets: Pretraining on Natural Text or Incorporating Targeted Synthetic Examples?
- Authors: Mohammed Sabry, Anya Belz
- Abstract summary: We introduce Bi-Induct, a lightweight curriculum that injects forward-copy (Induction), backward-copy (Anti), or a balanced mix into the pretraining stream. We train models from 0.13B to 1B parameters under iso-FLOPs, evaluating (i) few-shot ICL benchmarks, (ii) head-level telemetry, and (iii) held-out language modeling perplexity.
- Score: 6.886536285117155
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Does explicitly exercising the induction circuit during pretraining improve in-context learning (ICL), or is natural text sufficient when compute is held constant (iso-FLOPs)? To test whether targeted synthetic data can accelerate induction-head emergence and enhance ICL, we introduce Bi-Induct, a lightweight curriculum that injects forward-copy (Induction), backward-copy (Anti), or a balanced mix into the pretraining stream. We train models from 0.13B to 1B parameters under iso-FLOPs, evaluating (i) few-shot ICL benchmarks, (ii) head-level telemetry, and (iii) held-out language modeling perplexity. Our findings challenge the assumption that early induction circuit activation directly improves ICL. While Bi-Induct accelerates induction-head emergence at small scales, this does not consistently yield stronger generalization. On standard LM benchmarks, Bi-Induct matches natural-only training; on function-style ICL probes, the 1B natural-only model performs best. Stress tests (e.g., label permutation, HITS@1 vs. HITS@3, 1 vs. 10 shots) preserve these trends. Telemetry shows that larger natural-only models develop broader, earlier induction heads without explicit induction patterns. Anti-induction data fails to elicit meaningful activation. Perplexity penalties from synthetic data shrink with scale, suggesting larger models can absorb non-natural patterns at minimal cost. Crucially, ablating the top 2% of induction heads degrades ICL more than random ablations, especially for natural-only models, indicating more centralized, load-bearing circuits. Bi-Induct variants exhibit more redundant induction activity, implying different circuit utilization. Overall, inducing activation is not sufficient: ICL gains depend on these circuits becoming functionally necessary. These results underscore the value of mechanism-aware pretraining diagnostics and of data mixtures that foster load-bearing, not merely present, structure.
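The forward-copy and backward-copy sequences the curriculum injects can be illustrated with a minimal sketch. This is a hypothetical reconstruction from the abstract's description, not the paper's actual data generator; the vocabulary, lengths, and formatting are assumptions.

```python
import random

def make_copy_sequence(vocab_size=100, prefix_len=20, direction="forward", seed=None):
    """Build a toy copy sequence of the kind a Bi-Induct-style curriculum injects.

    A random token prefix is followed by a copy of itself. Forward order
    exercises the classic induction pattern ([A][B] ... [A] -> predict [B]);
    reversed order gives the backward-copy ("Anti") variant.
    """
    rng = random.Random(seed)
    prefix = [rng.randrange(vocab_size) for _ in range(prefix_len)]
    copy = prefix if direction == "forward" else prefix[::-1]
    return prefix + copy

# In the forward variant the second half repeats the first verbatim,
# so an induction head can predict every token of the copy.
seq = make_copy_sequence(prefix_len=4, seed=0)
assert seq[:4] == seq[4:]
```

A pretraining stream would interleave such sequences with natural text at some mixing ratio; the abstract's "balanced mix" presumably alternates the two directions.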
Related papers
- From Shortcut to Induction Head: How Data Diversity Shapes Algorithm Selection in Transformers [67.02076505996284]
We study how the choice of pretraining data distribution steers a shallow transformer toward one behavior or the other. Our results shed light on the algorithmic biases of pretrained transformers and offer conceptual guidelines for data-driven control of their learned behaviors.
arXiv Detail & Related papers (2025-12-21T08:10:26Z) - In-Context Learning Without Copying [31.718993147344353]
We study whether transformers can still acquire in-context learning capabilities when inductive copying is suppressed. We propose Hapax, a setting where we omit the loss contribution of any token that can be correctly predicted by induction heads. Mechanistic analysis shows that models trained with Hapax develop fewer and weaker induction heads but still preserve ICL capabilities.
arXiv Detail & Related papers (2025-11-07T22:11:11Z) - Mitigating Spurious Correlations in LLMs via Causality-Aware Post-Training [57.03005244917803]
Large language models (LLMs) often fail on out-of-distribution (OOD) samples due to spurious correlations acquired during pre-training. Here, we aim to mitigate such spurious correlations through causality-aware post-training (CAPT). Experiments on the formal causal inference benchmark CLadder and the logical reasoning dataset PrOntoQA show that 3B-scale language models fine-tuned with CAPT can outperform both traditional SFT and larger LLMs on in-distribution (ID) and OOD tasks.
arXiv Detail & Related papers (2025-06-11T06:30:28Z) - Normalized Attention Guidance: Universal Negative Guidance for Diffusion Models [57.20761595019967]
We present Normalized Attention Guidance (NAG), an efficient, training-free mechanism that applies extrapolation in attention space with L1-based normalization and refinement. NAG restores effective negative guidance where CFG collapses, while maintaining fidelity. NAG generalizes across architectures (UNet, DiT), sampling regimes (few-step, multi-step), and modalities (image, video).
arXiv Detail & Related papers (2025-05-27T13:30:46Z) - Deeper Insights Without Updates: The Power of In-Context Learning Over Fine-Tuning [22.341935761925892]
Fine-tuning and in-context learning (ICL) are two prevalent methods in imbuing large language models with task-specific knowledge.
This paper presents a counterintuitive finding: For tasks with implicit patterns, ICL captures these patterns significantly better than fine-tuning.
arXiv Detail & Related papers (2024-10-07T02:12:22Z) - On the Inductive Bias of Stacking Towards Improving Reasoning [50.225873619537765]
We propose a variant of gradual stacking called MIDAS that can speed up language model training by up to 40%.
MIDAS is not only training-efficient but surprisingly also has an inductive bias towards improving downstream tasks.
We conjecture the underlying reason for this inductive bias by exploring the connection of stacking to looped models.
arXiv Detail & Related papers (2024-09-27T17:58:21Z) - Bidirectional Awareness Induction in Autoregressive Seq2Seq Models [47.82947878753809]
Bidirectional Awareness Induction (BAI) is a training method that leverages a subset of elements in the network, the Pivots, to perform bidirectional learning without breaking the autoregressive constraints.
In particular, we observed an increase of up to 2.4 CIDEr in Image-Captioning, 4.96 BLEU in Neural Machine Translation, and 1.16 ROUGE in Text Summarization compared to the respective baselines.
arXiv Detail & Related papers (2024-08-25T23:46:35Z) - Induction Heads as an Essential Mechanism for Pattern Matching in In-context Learning [12.911829891263263]
We show that even a minimal ablation of induction heads leads to ICL performance decreases of up to 32% for abstract pattern recognition tasks. For NLP tasks, this ablation substantially decreases the model's ability to benefit from examples, bringing few-shot ICL performance close to that of zero-shot prompts.
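The targeted-vs-random ablation comparison used in such stress tests can be sketched as head selection over per-head induction scores. The score input, the 2% fraction, and the flat head indexing here are assumptions for illustration; the actual experiments score and zero out heads inside a trained transformer.

```python
import random

def ablation_masks(scores, frac=0.02, seed=0):
    """Select heads to ablate: the top `frac` by induction score, plus an
    equally sized random control set.

    `scores` is a flat list of per-head induction scores (hypothetical
    telemetry). Returns (targeted, control) as sets of head indices.
    If the circuit is load-bearing, zeroing the targeted set should hurt
    ICL far more than zeroing the random control set.
    """
    n = len(scores)
    k = max(1, int(round(frac * n)))
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    targeted = set(order[:k])
    control = set(random.Random(seed).sample(range(n), k))
    return targeted, control
```

Running both ablations and comparing the ICL drop is what distinguishes "present" induction heads from functionally necessary ones, which is the distinction the Bi-Induct abstract draws.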
arXiv Detail & Related papers (2024-07-09T16:29:21Z) - Tripod: Three Complementary Inductive Biases for Disentangled Representation Learning [52.70210390424605]
In this work, we consider endowing a neural network autoencoder with three select inductive biases from the literature.
In practice, however, naively combining existing techniques instantiating these inductive biases fails to yield significant benefits.
We propose adaptations to the three techniques that simplify the learning problem, equip key regularization terms with stabilizing invariances, and quash degenerate incentives.
The resulting model, Tripod, achieves state-of-the-art results on a suite of four image disentanglement benchmarks.
arXiv Detail & Related papers (2024-04-16T04:52:41Z) - SIP: Injecting a Structural Inductive Bias into a Seq2Seq Model by Simulation [75.14793516745374]
We show how a structural inductive bias can be efficiently injected into a seq2seq model by pre-training it to simulate structural transformations on synthetic data.
Our experiments show that our method imparts the desired inductive bias, resulting in better few-shot learning for FST-like tasks.
arXiv Detail & Related papers (2023-10-01T21:19:12Z) - Self-ICL: Zero-Shot In-Context Learning with Self-Generated Demonstrations [38.4166247280112]
Self-ICL is a framework which bootstraps LMs' intrinsic capabilities to perform zero-shot ICL.
Self-ICL outperforms zero-shot baselines in both average accuracy and head-to-head comparisons.
arXiv Detail & Related papers (2023-05-24T11:22:34Z) - Bayesian Optimization of Catalysis With In-Context Learning [0.6576679530536251]
Large language models (LLMs) can perform accurate classification with zero or few examples through in-context learning. We extend this capability to regression with uncertainty estimation using frozen LLMs. We apply this to materials discovery by representing experimental catalyst and testing procedures as natural language prompts.
arXiv Detail & Related papers (2023-04-11T17:00:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.