Breaking the Reversal Curse in Autoregressive Language Models via Identity Bridge
- URL: http://arxiv.org/abs/2602.02470v1
- Date: Mon, 02 Feb 2026 18:50:57 GMT
- Title: Breaking the Reversal Curse in Autoregressive Language Models via Identity Bridge
- Authors: Xutao Ma, Yixiao Huang, Hanlin Zhu, Somayeh Sojoudi
- Abstract summary: We show that even a one-layer transformer can break the reversal curse by analyzing the implicit bias of gradient descent. Our work provides a novel theoretical foundation for the reversal curse and offers a principled, low-cost path to encouraging LLMs to learn higher-level rules from data.
- Score: 16.509342332774747
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Autoregressive large language models (LLMs) have achieved remarkable success in many complex tasks, yet they can still fail in very simple logical reasoning such as the "reversal curse" -- when trained on forward knowledge data of the form "$A \rightarrow B$" (e.g., Alice's husband is Bob), the model is unable to deduce the reversal knowledge "$B \leftarrow A$" (e.g., Bob's wife is Alice) during test. Extensive prior research suggests that this failure is an inherent, fundamental limit of autoregressive causal LLMs, indicating that these models tend to memorize factual-level knowledge rather than capture higher-level rules. In this paper, we challenge this view by showing that this seemingly fundamental limit can be mitigated by slightly tweaking the training data with a simple regularization data recipe called the Identity Bridge of the form "$A \to A$" (e.g., The name of Alice is Alice). Theoretically, we prove that under this recipe, even a one-layer transformer can break the reversal curse by analyzing the implicit bias of gradient descent. Empirically, we show that a 1B pretrained language model finetuned with the proposed data recipe achieves a 40% success rate on reversal tasks, in stark contrast to a near-zero success rate when trained solely on forward-knowledge data. Our work provides a novel theoretical foundation for the reversal curse and offers a principled, low-cost path to encouraging LLMs to learn higher-level rules from data.
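The data recipe described in the abstract can be illustrated with a minimal sketch. Only the two example sentences ("Alice's husband is Bob", "The name of Alice is Alice") come from the abstract. The template strings, the function name `build_training_set`, and the choice to add one identity sentence for every entity that appears in a forward fact are assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Tuple

# Hypothetical templates, modeled on the two example sentences in the abstract.
FORWARD_TEMPLATE = "{a}'s {relation} is {b}"
IDENTITY_TEMPLATE = "The name of {x} is {x}"


def build_training_set(facts: List[Tuple[str, str, str]]) -> List[str]:
    """Return forward sentences plus one identity ("A -> A") sentence per entity.

    Assumption: every entity seen in a forward fact gets an identity sentence.
    """
    forward = [FORWARD_TEMPLATE.format(a=a, relation=r, b=b) for a, r, b in facts]
    # Collect entities in first-seen order so the output is deterministic.
    entities = list(dict.fromkeys(e for a, _, b in facts for e in (a, b)))
    identity = [IDENTITY_TEMPLATE.format(x=e) for e in entities]
    return forward + identity


if __name__ == "__main__":
    for sentence in build_training_set([("Alice", "husband", "Bob")]):
        print(sentence)
```

The sketch only augments the text data; the reverse query (e.g., "Bob's wife is ...") is deliberately absent from the training set and would be used only at evaluation time.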
Related papers
- Nudging the Boundaries of LLM Reasoning [77.26972440427285]
Current online reinforcement learning algorithms cannot learn from problems that are "unsolvable" to the model. We propose NuRL, a "nudging" method that aims to push the upper bound of LLM reasoning using self-generated hints. NuRL achieves consistent improvements across 6 benchmarks and 3 models, while remaining complementary to test-time scaling.
arXiv Detail & Related papers (2025-09-30T02:01:40Z)
- Generalist Reward Models: Found Inside Large Language Models [50.7432354447554]
We show that a powerful reward model is already latently present within any Large Language Model (LLM) trained via standard next-token prediction. We prove that this endogenous reward is not a heuristic, but is theoretically equivalent to a reward function learned through offline inverse reinforcement learning. We also prove that subsequent reinforcement learning using this endogenous reward leads to a policy with a provably superior error bound compared to the base model.
arXiv Detail & Related papers (2025-06-29T13:45:54Z)
- Layered Unlearning for Adversarial Relearning [4.7066636827902]
We study how post-training methods modify language model behavior and representations. Recent results suggest that post-training induces shallow context-dependent "circuits" that suppress specific response patterns. To test this hypothesis, we design an unlearning algorithm, Layered Unlearning (LU). LU limits the ability of relearning on a subset of data to recover the full dataset.
arXiv Detail & Related papers (2025-05-14T15:50:45Z)
- S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [51.84977135926156]
We introduce S$^2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference. Our results demonstrate that Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data.
arXiv Detail & Related papers (2025-02-18T13:40:22Z)
- Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning [65.2421542320293]
Reasoning abilities are crucial components of general intelligence. Recent advances by proprietary companies, such as o-series models of OpenAI, have made remarkable progress on reasoning tasks. This paper proposes a new RL framework, termed OREAL, to pursue the performance limit that can be achieved through Outcome REwArd-based reinforcement Learning for mathematical reasoning tasks.
arXiv Detail & Related papers (2025-02-10T18:57:29Z)
- Unlearning or Obfuscating? Jogging the Memory of Unlearned LLMs via Benign Relearning [37.061187080745654]
We show that existing approaches for unlearning in LLMs are surprisingly susceptible to a simple set of benign relearning attacks. With access to only a small and potentially loosely related set of data, we find that we can "jog" the memory of unlearned models to reverse the effects of unlearning.
arXiv Detail & Related papers (2024-06-19T09:03:21Z)
- Mitigating Reversal Curse in Large Language Models via Semantic-aware Permutation Training [57.771940716189114]
We show that large language models (LLMs) suffer from the "reversal curse".
The root cause of the reversal curse lies in the different word order between the training and inference stages.
We propose Semantic-aware Permutation Training (SPT) to address this issue.
arXiv Detail & Related papers (2024-03-01T18:55:20Z)
- An Analysis and Mitigation of the Reversal Curse [70.13419502543915]
Recent research observed a noteworthy phenomenon in large language models (LLMs).
The reversal curse is that when dealing with two entities, $a$ and $b$, LLMs excel in handling sequences in the form of "$aRb$," but encounter challenges when processing "$bR^{-1}a$."
arXiv Detail & Related papers (2023-11-13T17:01:12Z)
- Physics of Language Models: Part 3.1, Knowledge Storage and Extraction [51.68385617116854]
Large language models (LLMs) can store a vast amount of world knowledge, often extractable via question-answering.
We find a strong correlation between the model's ability to extract knowledge and various diversity measures of the training data.
arXiv Detail & Related papers (2023-09-25T17:37:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.