Gaussian Match-and-Copy: A Minimalist Benchmark for Studying Transformer Induction
- URL: http://arxiv.org/abs/2602.07562v1
- Date: Sat, 07 Feb 2026 14:18:11 GMT
- Title: Gaussian Match-and-Copy: A Minimalist Benchmark for Studying Transformer Induction
- Authors: Antoine Gonon, Alexandre Cordonnier, Nicolas Boumal,
- Abstract summary: We introduce a minimalist benchmark that isolates long-range retrieval through pure second-order correlation signals. Numerical investigations show that this task retains key qualitative aspects of how Transformers develop match-and-copy circuits. We prove this max-margin alignment for GD trajectories that reach vanishing empirical loss under explicit technical conditions.
- Score: 44.83333974000826
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Match-and-copy is a core retrieval primitive used at inference time by large language models to retrieve a matching token from the context and then copy its successor. Yet, understanding how this behavior emerges on natural data is challenging because retrieval and memorization are entangled. To disentangle the two, we introduce Gaussian Match-and-Copy (GMC), a minimalist benchmark that isolates long-range retrieval through pure second-order correlation signals. Numerical investigations show that this task retains key qualitative aspects of how Transformers develop match-and-copy circuits in practice, and separates architectures by their retrieval capabilities. We also analyze the optimization dynamics in a simplified attention setting. Although many solutions are a priori possible under a regression objective, including ones that do not implement retrieval, we identify an implicit-bias regime in which gradient descent drives the parameters to diverge while their direction aligns with the max-margin separator, yielding hard match selection. We prove this max-margin alignment for GD trajectories that reach vanishing empirical loss under explicit technical conditions.
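The abstract does not spell out the GMC construction, but the mechanism it studies can be sketched. The generator and soft match-and-copy readout below are illustrative assumptions (i.i.d. Gaussian context vectors, a noisy repeat as the query, the successor as the target), not the paper's exact setup; the large-`beta` limit mimics the hard match selection that the max-margin analysis predicts.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gmc(seq_len=32, dim=16, noise=0.1):
    # Context: i.i.d. Gaussian token vectors (an assumed construction).
    ctx = rng.standard_normal((seq_len, dim)) / np.sqrt(dim)
    m = rng.integers(0, seq_len - 1)           # matched position (never last)
    query = ctx[m] + noise * rng.standard_normal(dim) / np.sqrt(dim)
    target = ctx[m + 1]                        # the successor to be copied
    return ctx, query, target

def match_and_copy(ctx, query, beta=50.0):
    # Soft attention over context positions; large beta approaches the hard
    # "argmax match" selection described in the abstract.
    scores = ctx @ query
    w = np.exp(beta * (scores - scores.max()))
    w /= w.sum()
    # Copy each position's successor, weighted by how well the position matches.
    return (w[:-1, None] * ctx[1:]).sum(axis=0)

ctx, query, target = sample_gmc()
pred = match_and_copy(ctx, query)
print(np.linalg.norm(pred - target))           # small when retrieval succeeds
```

Because the only usable signal is the dot product between the query and the matched context vector (a second-order correlation), retrieval here cannot fall back on memorized token statistics.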
Related papers
- GENSR: Symbolic Regression Based in Equation Generative Space [15.186848349610363]
GenSR is a generative latent-space-based SR framework. From a Bayesian perspective, GenSR reframes the SR task as maximizing the conditional distribution $p(\mathrm{Equ.} \mid \mathrm{Num.})$.
arXiv Detail & Related papers (2026-02-24T05:14:34Z)
- To Copy or Not to Copy: Copying Is Easier to Induce Than Recall [5.057026826740146]
Language models must arbitrate between parametric knowledge stored in their weights and contextual information in the prompt.
This work presents a mechanistic study of that choice by extracting an arbitration vector from model activations on a curated dataset; a minimal sketch of such an extraction appears below.
arXiv Detail & Related papers (2026-01-17T14:46:29Z)
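The entry above only states that an arbitration vector is extracted from activations. A common way to realize such a vector is a difference of mean activations between copy-inducing and recall-inducing prompts; the sketch below uses that difference-of-means recipe on toy data. All names, the layer choice, and the steering step are illustrative assumptions, not the paper's method.

```python
import numpy as np

# Hypothetical sketch: an "arbitration vector" as the difference between mean
# hidden activations on prompts where the model copies from context and
# prompts where it recalls parametric knowledge.
def arbitration_vector(acts_copy, acts_recall):
    # acts_*: (num_prompts, hidden_dim) activations at a fixed layer/position.
    return acts_copy.mean(axis=0) - acts_recall.mean(axis=0)

def steer(hidden, v, alpha=1.0):
    # Adding alpha*v nudges the residual stream toward copy behavior;
    # subtracting it nudges toward recall (assumed steering convention).
    return hidden + alpha * v

rng = np.random.default_rng(0)
acts_copy = rng.standard_normal((64, 512)) + 0.5    # toy stand-in activations
acts_recall = rng.standard_normal((64, 512)) - 0.5
v = arbitration_vector(acts_copy, acts_recall)
steered = steer(rng.standard_normal(512), v, alpha=2.0)
print(np.linalg.norm(v), steered.shape)
```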
- Robust Canonicalization through Bootstrapped Data Re-Alignment [5.437226012505534]
Fine-grained visual classification tasks, such as insect and bird identification, demand sensitivity to subtle visual cues.
We propose a bootstrapping algorithm that iteratively re-aligns training samples by reducing variance.
We show that our method consistently outperforms equivariant and canonicalization baselines while performing on par with augmentation; a sketch of the re-alignment loop follows below.
arXiv Detail & Related papers (2025-10-09T13:05:20Z)
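The abstract gives the loop (iteratively re-align samples to reduce variance) but not the alignment step. The sketch below fills that gap with an orthogonal-Procrustes rotation toward a running consensus, an assumption made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def align(sample, reference):
    # Orthogonal Procrustes: the map R minimizing ||sample @ R - reference||.
    u, _, vt = np.linalg.svd(sample.T @ reference)
    return sample @ (u @ vt)

def bootstrap_realign(samples, iters=5):
    samples = [s.copy() for s in samples]
    reference = samples[0]                    # seed with an arbitrary sample
    for _ in range(iters):
        samples = [align(s, reference) for s in samples]
        reference = np.mean(samples, axis=0)  # re-estimate the consensus
    return samples

base = rng.standard_normal((20, 2))           # one canonical 2-D point set
thetas = rng.uniform(0.0, 2.0 * np.pi, size=8)
samples = [base @ np.array([[np.cos(t), -np.sin(t)],
                            [np.sin(t),  np.cos(t)]]).T for t in thetas]
aligned = bootstrap_realign(samples)
print(np.var(np.stack(aligned), axis=0).mean())   # variance shrinks toward 0
```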
- Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse Representations [8.796275989527054]
We propose a novel organization of the inverted index that enables fast retrieval over learned sparse embeddings.
Our approach organizes inverted lists into geometrically cohesive blocks, each equipped with a summary vector.
Our results indicate that Seismic is one to two orders of magnitude faster than state-of-the-art inverted-index-based solutions; the block-and-summary structure is sketched below.
arXiv Detail & Related papers (2024-04-29T15:49:27Z)
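A minimal sketch of the block-and-summary organization described above. The coordinate-wise-max summary and the dot-product pruning rule are a plausible reading of the abstract (the bound is valid when queries and documents are nonnegative, as learned sparse representations typically are); block sizes and data are toy values.

```python
import numpy as np

class BlockedList:
    def __init__(self, doc_ids, doc_vecs, block_size=4):
        # Cut the list into blocks; each block's summary vector (coordinate-wise
        # max) upper-bounds the score of every document inside it.
        self.blocks = []
        for i in range(0, len(doc_ids), block_size):
            ids = doc_ids[i:i + block_size]
            vecs = doc_vecs[i:i + block_size]
            self.blocks.append((ids, vecs, vecs.max(axis=0)))

    def search(self, query, threshold):
        hits = []
        for ids, vecs, summary in self.blocks:
            if query @ summary < threshold:      # upper bound too low: skip block
                continue
            for doc_id, v in zip(ids, vecs):     # score the survivors exactly
                s = query @ v
                if s >= threshold:
                    hits.append((doc_id, float(s)))
        return hits

rng = np.random.default_rng(0)
vecs = np.abs(rng.standard_normal((16, 8)))      # nonnegative "sparse" docs
index = BlockedList(list(range(16)), vecs)
print(index.search(np.abs(rng.standard_normal(8)), threshold=6.0))
```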
- Curvature-Informed SGD via General Purpose Lie-Group Preconditioners [6.760212042305871]
We present a novel approach to accelerate stochastic gradient descent (SGD) by utilizing curvature information.
Our approach involves two preconditioners: a matrix-free preconditioner and a low-rank approximation preconditioner.
We demonstrate that Preconditioned SGD (PSGD) outperforms the state of the art on vision, NLP, and RL tasks; a toy preconditioned step is sketched below.
arXiv Detail & Related papers (2024-02-07T03:18:00Z)
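The sketch below is not the paper's Lie-group preconditioner; it substitutes a diagonal RMSProp-style stand-in to show what a preconditioner buys: the update is a preconditioned gradient rather than the raw gradient, which equalizes progress across badly scaled directions.

```python
import numpy as np

H = np.diag([100.0, 1.0])                 # a badly conditioned quadratic
x = np.array([1.0, 1.0])
p = np.ones(2)                            # running diagonal preconditioner

lr, beta = 0.02, 0.99
for _ in range(1000):
    g = H @ x                             # gradient of 0.5 * x^T H x
    # Track gradient second moments and precondition by the inverse square
    # root, roughly equalizing step sizes across directions (a diagonal
    # stand-in, NOT the paper's Lie-group construction).
    p = beta * p + (1 - beta) * g**2
    x = x - lr * g / (np.sqrt(p) + 1e-8)
print(x)                                  # near the optimum at the origin
```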
- Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation [63.180725016463974]
Cross-modal retrieval relies on well-matched large-scale datasets that are laborious to collect in practice.
We introduce a novel noisy correspondence learning framework, namely Self-Reinforcing Errors Mitigation (SREM).
arXiv Detail & Related papers (2023-12-27T09:03:43Z)
- Refine, Discriminate and Align: Stealing Encoders via Sample-Wise Prototypes and Multi-Relational Extraction [57.16121098944589]
RDA is a pioneering approach designed to address two primary deficiencies of previous attempts at stealing pre-trained encoders.
It is accomplished via a sample-wise prototype, which consolidates the target encoder's representations of a given sample's various perspectives.
For greater efficacy, we develop a multi-relational extraction loss that trains the surrogate encoder to Discriminate mismatched embedding-prototype pairs; the prototype-and-loss idea is sketched below.
arXiv Detail & Related papers (2023-12-01T15:03:29Z)
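A hedged sketch of the two ingredients named above: a sample-wise prototype (mean embedding of a sample's augmented views under the target encoder) and a contrastive stand-in for the multi-relational extraction loss that discriminates mismatched embedding-prototype pairs. Encoders and data are toy placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def prototype(target_encoder, views):
    # Consolidate the target encoder's embeddings of a sample's views.
    embs = np.stack([target_encoder(v) for v in views])
    embs /= np.linalg.norm(embs, axis=1, keepdims=True)
    return embs.mean(axis=0)

def contrastive_loss(surrogate_embs, prototypes, tau=0.1):
    # Rows: surrogate embeddings; columns: prototypes; diagonal = matched pairs.
    s = surrogate_embs @ prototypes.T / tau
    s -= s.max(axis=1, keepdims=True)            # numerical stability
    logp = s - np.log(np.exp(s).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(logp))               # pull matched, push mismatched

target_encoder = lambda x: np.tanh(x)            # toy stand-in for the victim
protos = np.stack([prototype(target_encoder,
                             [rng.standard_normal(8) for _ in range(4)])
                   for _ in range(5)])
print(contrastive_loss(rng.standard_normal((5, 8)), protos))
```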
- Performance Embeddings: A Similarity-based Approach to Automatic Performance Optimization [71.69092462147292]
Performance embeddings enable knowledge transfer of performance tuning between applications.
We demonstrate this transfer-tuning approach on case studies in deep neural networks, dense and sparse linear algebra compositions, and numerical weather prediction stencils.
arXiv Detail & Related papers (2023-03-14T15:51:35Z)
- Mutual Exclusivity Training and Primitive Augmentation to Induce Compositionality [84.94877848357896]
Recent datasets expose the lack of systematic generalization ability in standard sequence-to-sequence models.
We analyze this behavior of seq2seq models and identify two contributing factors: a lack of mutual exclusivity bias and the tendency to memorize whole examples.
We show substantial empirical improvements using standard sequence-to-sequence models on two widely used compositionality datasets.
arXiv Detail & Related papers (2022-11-28T17:36:41Z)
- Pairwise Supervised Hashing with Bernoulli Variational Auto-Encoder and Self-Control Gradient Estimator [62.26981903551382]
Variational auto-encoders (VAEs) with binary latent variables provide state-of-the-art precision for document retrieval.
We propose a pairwise loss function with a discrete-latent VAE to reward within-class similarity and between-class dissimilarity for supervised hashing.
This new semantic hashing framework achieves superior performance compared to the state of the art; a minimal version of the pairwise objective is sketched below.
arXiv Detail & Related papers (2020-05-21T06:11:33Z)
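A minimal sketch of the pairwise objective described above, using the expected Hamming distance between Bernoulli bit probabilities; the margin form and all constants are illustrative assumptions rather than the paper's exact VAE loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def pairwise_hash_loss(probs, labels, margin=4.0):
    # probs: (n, bits) Bernoulli probabilities of each hash bit being 1.
    # Expected Hamming distance between soft codes p and q:
    #   sum_b p*(1-q) + (1-p)*q
    n = len(probs)
    loss, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            p, q = probs[i], probs[j]
            d = np.sum(p * (1 - q) + (1 - p) * q)
            if labels[i] == labels[j]:
                loss += d                        # same class: small distance
            else:
                loss += max(0.0, margin - d)     # different: at least margin
            pairs += 1
    return loss / pairs

probs = rng.uniform(size=(6, 16))
labels = np.array([0, 0, 1, 1, 2, 2])
print(pairwise_hash_loss(probs, labels))
```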
This list is automatically generated from the titles and abstracts of the papers on this site.