Grokking in the Wild: Data Augmentation for Real-World Multi-Hop Reasoning with Transformers
- URL: http://arxiv.org/abs/2504.20752v2
- Date: Wed, 07 May 2025 09:47:51 GMT
- Title: Grokking in the Wild: Data Augmentation for Real-World Multi-Hop Reasoning with Transformers
- Authors: Roman Abramov, Felix Steinbauer, Gjergji Kasneci
- Abstract summary: We extend grokking to real-world factual data and address the challenge of dataset sparsity. Surprisingly, we find that even factually incorrect synthetic data can strengthen emergent reasoning circuits. Our approach achieves up to 95-100% accuracy on multi-hop reasoning benchmarks.
- Score: 9.50669909278749
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have achieved great success in numerous NLP tasks but continue to exhibit notable gaps in multi-step factual reasoning, especially when real-world knowledge is sparse. Recent advances in grokking have demonstrated that neural networks can transition from memorizing to perfectly generalizing once they detect underlying logical patterns - yet these studies have primarily used small, synthetic tasks. In this paper, for the first time, we extend grokking to real-world factual data and address the challenge of dataset sparsity by augmenting existing knowledge graphs with carefully designed synthetic data to raise the ratio $\phi_r$ of inferred facts to atomic facts above the threshold required for grokking. Surprisingly, we find that even factually incorrect synthetic data can strengthen emergent reasoning circuits rather than degrade accuracy, as it forces the model to rely on relational structure rather than memorization. When evaluated on multi-hop reasoning benchmarks, our approach achieves up to 95-100% accuracy on 2WikiMultiHopQA - substantially improving over strong baselines and matching or exceeding current state-of-the-art results. We further provide an in-depth analysis of how increasing $\phi_r$ drives the formation of generalizing circuits inside Transformers. Our findings suggest that grokking-based data augmentation can unlock implicit multi-hop reasoning capabilities, opening the door to more robust and interpretable factual reasoning in large-scale language models.
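
The ratio of inferred facts to atomic facts mentioned in the abstract can be illustrated with a toy knowledge graph. The following is a minimal sketch, not the authors' implementation: it assumes atomic facts are (head, relation, tail) triples, that an inferred fact is any two-hop composition of two atomic facts, and that the ratio is computed globally rather than per relation as the paper's $\phi_r$ may be. The function names and the example triples are hypothetical.

```python
from collections import defaultdict


def two_hop_facts(atomic):
    """Enumerate two-hop compositions (head, r1, r2, tail) derivable from atomic triples."""
    by_head = defaultdict(list)
    for h, r, t in atomic:
        by_head[h].append((r, t))
    inferred = set()
    for h, r1, bridge in atomic:
        # Chain every fact ending in `bridge` with every fact starting at `bridge`.
        for r2, t in by_head[bridge]:
            inferred.add((h, r1, r2, t))
    return inferred


def phi(atomic):
    """Ratio of inferred (two-hop) facts to atomic facts; an assumed, simplified definition."""
    atomic = set(atomic)
    return len(two_hop_facts(atomic)) / len(atomic)


# Sparse toy graph (illustrative triples only).
atomic = {
    ("Alice", "born_in", "Paris"),
    ("Paris", "capital_of", "France"),
    ("Bob", "born_in", "Rome"),
}

# Synthetic triples added purely to densify the graph; under the abstract's
# claim they would not need to be factually correct to be useful.
synthetic = {
    ("Rome", "capital_of", "Italy"),
    ("Paris", "located_in", "Europe"),
    ("Rome", "located_in", "Europe"),
}

before = phi(atomic)
after = phi(atomic | synthetic)
print(f"ratio before augmentation: {before:.3f}")
print(f"ratio after augmentation:  {after:.3f}")
```

In this toy case the added triples give each bridge entity more outgoing edges, so the number of derivable two-hop facts grows faster than the number of atomic facts and the ratio rises.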
Related papers
- HELP: HyperNode Expansion and Logical Path-Guided Evidence Localization for Accurate and Efficient GraphRAG [53.30561659838455]
Large Language Models (LLMs) often struggle with inherent knowledge boundaries and hallucinations. Retrieval-Augmented Generation (RAG) frequently overlooks structural interdependencies essential for multi-hop reasoning. HELP achieves competitive performance across multiple simple and multi-hop QA benchmarks and up to a 28.8$\times$ speedup over leading Graph-based RAG baselines.
arXiv Detail & Related papers (2026-02-24T14:05:29Z)
- Tabula RASA: Exposing and Breaking the Relational Bottleneck in Transformers [0.0]
RASA (Relation-Aware Sparse Attention) is a minimal architectural modification that provides structural inductive bias for relational reasoning. Our results demonstrate that minimal architectural modifications, grounded in complexity-theoretic analysis, can substantially improve multi-hop reasoning.
arXiv Detail & Related papers (2026-02-02T21:35:39Z)
- Plain Transformers are Surprisingly Powerful Link Predictors [57.01966734467712]
Link prediction is a core challenge in graph machine learning, demanding models that capture rich and complex topological dependencies. While Graph Neural Networks (GNNs) are the standard solution, state-of-the-art pipelines often rely on explicit structurals or memory-intensive node embeddings. We present PENCIL, an encoder-only plain Transformer that replaces hand-crafted priors with attention over sampled local subgraphs.
arXiv Detail & Related papers (2026-02-02T02:45:52Z)
- Is Grokking Worthwhile? Functional Analysis and Transferability of Generalization Circuits in Transformers [15.965423731432422]
We conduct a study to evaluate the Generalization Circuit's role in knowledge assimilation and transfer. We argue that grokking is the process of integrating memorized atomic facts into a naturally established reasoning path.
arXiv Detail & Related papers (2026-01-14T00:40:35Z)
- Milestones over Outcome: Unlocking Geometric Reasoning with Sub-Goal Verifiable Reward [67.00373428443879]
We introduce a paradigm shift towards subgoal-level evaluation and learning. We first construct GeoGoal, a benchmark synthesized via a rigorous formal verification data engine. We propose the Sub-Goal Verifiable Reward (SGVR) framework, which replaces sparse signals with dense rewards based on the Skeleton Rate.
arXiv Detail & Related papers (2026-01-08T16:17:56Z)
- Beyond Memorization: Extending Reasoning Depth with Recurrence, Memory and Test-Time Compute Scaling [60.63703438729223]
We show how different architectures and training methods affect model multi-step reasoning capabilities. We confirm that increasing model depth plays a crucial role for sequential computations.
arXiv Detail & Related papers (2025-08-22T18:57:08Z)
- Can Test-time Computation Mitigate Memorization Bias in Neural Symbolic Regression? [32.15408441849578]
Symbolic regression aims to discover mathematical equations that fit given numerical data. Recent methods that involve Transformers pre-trained on large-scale synthetic datasets have gained attention. While these methods offer advantages such as short inference time, they suffer from low performance, particularly when the number of input variables is large.
arXiv Detail & Related papers (2025-05-28T08:01:25Z)
- Mixture of Parrots: Experts improve memorization more than reasoning [72.445819694797]
We show that as we increase the number of experts, the memorization performance consistently increases while the reasoning capabilities saturate. We find that increasing the number of experts helps solve knowledge-intensive tasks, but fails to yield the same benefits for reasoning tasks.
arXiv Detail & Related papers (2024-10-24T17:54:41Z)
- Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data [76.90128359866462]
We introduce an extended concept of memorization, distributional memorization, which measures the correlation between the output probabilities and the pretraining data frequency. This study demonstrates that memorization plays a larger role in simpler, knowledge-intensive tasks, while generalization is the key for harder, reasoning-based tasks.
arXiv Detail & Related papers (2024-07-20T21:24:40Z)
- Discovering physical laws with parallel combinatorial tree search [57.05912962368898]
Symbolic regression plays a crucial role in scientific research thanks to its capability of discovering concise and interpretable mathematical expressions from data. Existing algorithms have faced a critical bottleneck of accuracy and efficiency over a decade. We introduce a parallel combinatorial tree search (PCTS) model to efficiently distill generic mathematical expressions from limited data.
arXiv Detail & Related papers (2024-07-05T10:41:15Z)
- Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient numerical training and inference algorithms as low-rank computation have impressive performance for learning Transformer-based adaption.
We analyze how magnitude-based models affect generalization while improving adaption.
We conclude that proper magnitude-based pruning has a slight effect on the testing performance.
arXiv Detail & Related papers (2024-06-24T23:00:58Z)
- Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization [22.033370572209744]
We study whether transformers can learn to implicitly reason over parametric knowledge.
We focus on two representative reasoning types, composition and comparison.
We find that transformers can learn implicit reasoning, but only through grokking.
arXiv Detail & Related papers (2024-05-23T21:42:19Z)
- FT2Ra: A Fine-Tuning-Inspired Approach to Retrieval-Augmented Code Completion [24.964973946366335]
We develop a novel retrieval-based method, FT2Ra, which aims to mimic genuine fine-tuning.
FT2Ra achieves a 4.29% improvement in accuracy compared to the best baseline method on UniXcoder.
arXiv Detail & Related papers (2024-04-02T01:42:15Z)
- Beyond DAGs: A Latent Partial Causal Model for Multimodal Learning [80.44084021062105]
We propose a novel latent partial causal model for multimodal data, featuring two latent coupled variables, connected by an undirected edge, to represent the transfer of knowledge across modalities. Under specific statistical assumptions, we establish an identifiability result, demonstrating that representations learned by multimodal contrastive learning correspond to the latent coupled variables up to a trivial transformation. Experiments show that a pre-trained CLIP model embodies disentangled representations, enabling few-shot learning and improving domain generalization across diverse real-world datasets.
arXiv Detail & Related papers (2024-02-09T07:18:06Z)
- Deep Generative Symbolic Regression [83.04219479605801]
Symbolic regression aims to discover concise closed-form mathematical equations from data.
Existing methods, ranging from search to reinforcement learning, fail to scale with the number of input variables.
We propose an instantiation of our framework, Deep Generative Symbolic Regression.
arXiv Detail & Related papers (2023-12-30T17:05:31Z)
- EXPLAIN, EDIT, GENERATE: Rationale-Sensitive Counterfactual Data Augmentation for Multi-hop Fact Verification [28.453817513380276]
We develop a rationale-sensitive method to generate linguistically diverse and label-flipping counterfactuals.
Specifically, the diverse and fluent counterfactuals are generated via an Explain-Edit-Generate architecture.
Experimental results show that the proposed approach outperforms the SOTA baselines.
arXiv Detail & Related papers (2023-10-23T02:39:14Z)
- Interpretability at Scale: Identifying Causal Mechanisms in Alpaca [62.65877150123775]
We use Boundless DAS to efficiently search for interpretable causal structure in large language models while they follow instructions.
Our findings mark a first step toward faithfully understanding the inner-workings of our ever-growing and most widely deployed language models.
arXiv Detail & Related papers (2023-05-15T17:15:40Z)
- Pushing the Limits of Rule Reasoning in Transformers through Natural Language Satisfiability [30.01308882849197]
We propose a new methodology for creating challenging algorithmic reasoning datasets.
The key idea is to draw insights from empirical sampling of hard propositional SAT problems and from complexity-theoretic studies of language.
We find that current transformers, given sufficient training data, are surprisingly robust at solving the resulting NLSat problems.
arXiv Detail & Related papers (2021-12-16T17:47:20Z)
- On the Robustness and Generalization of Deep Learning Driven Full Waveform Inversion [2.5382095320488665]
Full Waveform Inversion (FWI) is commonly epitomized as an image-to-image translation task.
Despite being trained with synthetic data, the deep learning-driven FWI is expected to perform well when evaluated with sufficient real-world data.
We study such properties by asking: how robust are these deep neural networks and how do they generalize?
arXiv Detail & Related papers (2021-11-28T19:27:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.