Tabula RASA: Exposing and Breaking the Relational Bottleneck in Transformers
- URL: http://arxiv.org/abs/2602.02834v2
- Date: Wed, 04 Feb 2026 11:02:18 GMT
- Title: Tabula RASA: Exposing and Breaking the Relational Bottleneck in Transformers
- Authors: Jonas Petersen, Camilla Mazzoleni, Riccardo Maggioni,
- Abstract summary: RASA (Relation-Aware Sparse Attention) is a minimal architectural modification that provides structural inductive bias for relational reasoning.<n>Our results demonstrate that minimal architectural modifications, grounded in complexity-theoretic analysis, can substantially improve multi-hop reasoning.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers achieve remarkable performance across many domains, yet struggle with tasks requiring multi-hop relational reasoning over structured data. We analyze this limitation through circuit complexity: standard transformers are $\mathsf{TC}^0$-complete and cannot solve graph connectivity in constant depth, implying $Ω(k)$ layers are necessary for $k$-hop reasoning regardless of model size or training data. We introduce RASA (Relation-Aware Sparse Attention), a minimal architectural modification that provides structural inductive bias for relational reasoning. RASA adds: (1) sparse adjacency masking that restricts attention to graph-connected positions, reducing the attention pattern search space from $O(2^{n^2})$ to $O(2^m)$ for graphs with $m$ edges; and (2) learnable edge-type biases that encode relation-specific attention preferences. While RASA does not circumvent asymptotic depth requirements, the exponential reduction in attention pattern space provides stronger inductive bias for learning graph-structured functions. Empirically, on the MetaQA knowledge graph QA benchmark, RASA achieves 97.7% accuracy on 3-hop questions, outperforming EmbedKGQA (94.8%) by 2.9 percentage points. Notably, RASA's advantage grows with reasoning depth, validating that structural inductive bias is most beneficial for complex multi-hop queries. Our results demonstrate that minimal architectural modifications, grounded in complexity-theoretic analysis, can substantially improve multi-hop reasoning.
Related papers
- HELP: HyperNode Expansion and Logical Path-Guided Evidence Localization for Accurate and Efficient GraphRAG [53.30561659838455]
Large Language Models (LLMs) often struggle with inherent knowledge boundaries and hallucinations.<n>Retrieval-Augmented Generation (RAG) frequently overlooks structural interdependencies essential for multi-hop reasoning.<n>Help achieves competitive performance across multiple simple and multi-hop QA benchmarks and up to a 28.8$times$ speedup over leading Graph-based RAG baselines.
arXiv Detail & Related papers (2026-02-24T14:05:29Z) - Efficient Multi-Hop Question Answering over Knowledge Graphs via LLM Planning and Embedding-Guided Search [2.744879848686032]
Multi-hop question answering over knowledge graphs remains computationally challenging.<n>Recent approaches rely on expensive Large Language Model (LLM) inference for both entity linking and path ranking.<n>We present two complementary hybrid algorithms that address both efficiency and verifiability.
arXiv Detail & Related papers (2025-11-24T19:27:56Z) - TeaRAG: A Token-Efficient Agentic Retrieval-Augmented Generation Framework [62.66056331998838]
TeaRAG is a token-efficient agentic RAG framework capable of compressing both retrieval content and reasoning steps.<n>Our reward function evaluates the knowledge sufficiency by a knowledge matching mechanism, while penalizing excessive reasoning steps.
arXiv Detail & Related papers (2025-11-07T16:08:34Z) - Less is More Tokens: Efficient Math Reasoning via Difficulty-Aware Chain-of-Thought Distillation [82.2288581878096]
We present a framework for difficulty-aware reasoning that teaches models to dynamically adjust reasoning depth based on problem complexity.<n>We show that models can be endowed with such dynamic inference pathways without any architectural modifications.
arXiv Detail & Related papers (2025-09-05T16:40:13Z) - You Don't Need Pre-built Graphs for RAG: Retrieval Augmented Generation with Adaptive Reasoning Structures [16.867592142212203]
Large language models (LLMs) often suffer from hallucination, generating factually incorrect statements when handling questions beyond their knowledge.<n>Retrieval-augmented generation (RAG) addresses this by retrieving query-relevant contexts from knowledge bases to support LLM reasoning.<n>Existing Graph-based RAG methods rely on a costly process to transform the corpus into a graph, introducing overwhelming token cost and update latency.<n>We propose LogicRAG that dynamically extracts reasoning structures at inference time to guide adaptive retrieval without any pre-built graph.
arXiv Detail & Related papers (2025-08-08T08:07:40Z) - Learning Efficient and Generalizable Graph Retriever for Knowledge-Graph Question Answering [75.12322966980003]
Large Language Models (LLMs) have shown strong inductive reasoning ability across various domains.<n>Most existing RAG pipelines rely on unstructured text, limiting interpretability and structured reasoning.<n>Recent studies have explored integrating knowledge graphs with LLMs for knowledge graph question answering.<n>We propose RAPL, a novel framework for efficient and effective graph retrieval in KGQA.
arXiv Detail & Related papers (2025-06-11T12:03:52Z) - Grokking in the Wild: Data Augmentation for Real-World Multi-Hop Reasoning with Transformers [9.50669909278749]
We extend grokking to real-world factual data and address the challenge of dataset sparsity.<n>Surprisingly, we find that even factually incorrect synthetic data can strengthen emergent reasoning circuits.<n>Our approach achieves up to 95-100% accuracy on multi-hop reasoning benchmarks.
arXiv Detail & Related papers (2025-04-29T13:33:29Z) - Provable Failure of Language Models in Learning Majority Boolean Logic via Gradient Descent [15.291830857281015]
We investigate whether Transformers can truly learn simple majority functions when trained using gradient-based methods.<n>Our analysis demonstrates that even after $mathrmpoly(d)$ gradient queries, the generalization error of the Transformer model still remains substantially large.
arXiv Detail & Related papers (2025-04-07T03:08:12Z) - On the Role of Depth and Looping for In-Context Learning with Task Diversity [69.4145579827826]
We study in-context learning for linear regression with diverse tasks.
We show that multilayer Transformers are not robust to even distributional shifts as small as $O(e-L)$ in Wasserstein distance.
arXiv Detail & Related papers (2024-10-29T03:27:56Z) - Less is More: Hop-Wise Graph Attention for Scalable and Generalizable Learning on Circuits [12.468913984721018]
HOGA is a novel attention-based model for learning circuit representations in a scalable and generalizable manner.
As a result, HOGA is adaptive to various structures across different circuits and can be efficiently trained in a distributed manner.
arXiv Detail & Related papers (2024-03-02T21:33:23Z) - Chain of Thought Empowers Transformers to Solve Inherently Serial Problems [57.58801785642868]
Chain of thought (CoT) is a highly effective method to improve the accuracy of large language models (LLMs) on arithmetics and symbolic reasoning tasks.
This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of expressiveness.
arXiv Detail & Related papers (2024-02-20T10:11:03Z) - Resprompt: Residual Connection Prompting Advances Multi-Step Reasoning in Large Language Models [73.4425450752596]
Chain-of-thought (CoT) prompting has impressively unlocked the reasoning potential of large language models (LLMs)
Yet, the standard CoT is less effective in problems demanding multiple reasoning steps.
We propose RESPROMPT, a new prompting strategy that advances multi-step reasoning in LLMs.
arXiv Detail & Related papers (2023-10-07T08:56:28Z) - Transformers as Support Vector Machines [54.642793677472724]
We establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem.
We characterize the implicit bias of 1-layer transformers optimized with gradient descent.
We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.
arXiv Detail & Related papers (2023-08-31T17:57:50Z) - What Dense Graph Do You Need for Self-Attention? [73.82686008622596]
We present Hypercube Transformer, a sparse Transformer that models token interactions in a hypercube and shows comparable or even better results with vanilla Transformer.
Experiments on tasks requiring various sequence lengths lay validation for our graph function well.
arXiv Detail & Related papers (2022-05-27T14:36:55Z) - $O(n)$ Connections are Expressive Enough: Universal Approximability of
Sparse Transformers [71.31712741938837]
We show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n2$ connections.
We also present experiments comparing different patterns/levels of sparsity on standard NLP tasks.
arXiv Detail & Related papers (2020-06-08T18:30:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.