Related papers: Enhancing Transformers for Generalizable First-Order Logical Entailment

Enhancing Transformers for Generalizable First-Order Logical Entailment

URL: http://arxiv.org/abs/2501.00759v1
Date: Wed, 01 Jan 2025 07:05:32 GMT
Title: Enhancing Transformers for Generalizable First-Order Logical Entailment
Authors: Tianshi Zheng, Jiazheng Wang, Zihao Wang, Jiaxin Bai, Hang Yin, Zheye Deng, Yangqiu Song, Jianxin Li,
Abstract summary: This paper investigates the generalizable first-order logical reasoning ability of transformers with their parameterized knowledge.<n>The first-order reasoning capability of transformers is assessed through their ability to perform first-order logical entailment.<n>We propose a more sophisticated, logic-aware architecture, TEGA, to enhance the capability for generalizable first-order logical entailment in transformers.
Score: 51.04944136538266
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Transformers, as a fundamental deep learning architecture, have demonstrated remarkable capabilities in reasoning. This paper investigates the generalizable first-order logical reasoning ability of transformers with their parameterized knowledge and explores ways to improve it. The first-order reasoning capability of transformers is assessed through their ability to perform first-order logical entailment, which is quantitatively measured by their performance in answering knowledge graph queries. We establish connections between (1) two types of distribution shifts studied in out-of-distribution generalization and (2) the unseen knowledge and query settings discussed in the task of knowledge graph query answering, enabling a characterization of fine-grained generalizability. Results on our comprehensive dataset show that transformers outperform previous methods specifically designed for this task and provide detailed empirical evidence on the impact of input query syntax, token embedding, and transformer architectures on the reasoning capability of transformers. Interestingly, our findings reveal a mismatch between positional encoding and other design choices in transformer architectures employed in prior practices. This discovery motivates us to propose a more sophisticated, logic-aware architecture, TEGA, to enhance the capability for generalizable first-order logical entailment in transformers.

Related papers

Selective Induction Heads: How Transformers Select Causal Structures In Context [50.09964990342878]
We introduce a novel framework that showcases transformers' ability to handle causal structures.<n>Our framework varies the causal structure through interleaved Markov chains with different lags while keeping the transition probabilities fixed.<n>This setting unveils the formation of Selective Induction Heads, a new circuit that endows transformers with the ability to select the correct causal structure in-context.
arXiv Detail & Related papers (2025-09-09T23:13:41Z)
What One Cannot, Two Can: Two-Layer Transformers Provably Represent Induction Heads on Any-Order Markov Chains [64.31313691823088]
In-context learning (ICL) is a capability of transformers, through which trained models learn to adapt to new tasks by leveraging information from the input context.<n>We show that a two-layer transformer with one head per layer can indeed represent any conditional k-gram.
arXiv Detail & Related papers (2025-08-10T07:03:01Z)
On the Existence of Universal Simulators of Attention [17.01811978811789]
We present solutions to identically replicate attention outputs and the underlying elementary matrix and activation operations via RASP.<n>Our proofs, for the first time, show the existence of an algorithmically achievable data-agnostic solution, previously known to be approximated only by learning.
arXiv Detail & Related papers (2025-06-23T15:15:25Z)
Simplifying Graph Transformers [64.50059165186701]
We propose three simple modifications to the plain Transformer to render it applicable to graphs without introducing major architectural distortions. Specifically, we advocate for the use of (1) simplified $L$ attention to measure the magnitude of closeness tokens; (2) adaptive root-mean-square normalization to preserve token magnitude information; and (3) a relative positional encoding bias with a shared encoder.
arXiv Detail & Related papers (2025-04-17T02:06:50Z)
On the Robustness of Transformers against Context Hijacking for Linear Classification [26.1838836907147]
Transformer-based Large Language Models (LLMs) have demonstrated powerful in-context learning capabilities. They can be disrupted by factually correct context, a phenomenon known as context hijacking. We show that a well-trained deeper transformer can achieve higher robustness, which aligns with empirical observations.
arXiv Detail & Related papers (2025-02-21T17:31:00Z)
Converting Transformers into DGNNs Form [3.7468283401703797]
We introduce a synthetic unitary digraph convolution based on the digraph Fourier transform. The resulting model, which we term Converter, effectively converts a Transformer into a Directed Graph Neural Network form. We have tested Converter on Long-Range Arena benchmark, long document classification, and DNA sequence-based taxonomy classification.
arXiv Detail & Related papers (2025-02-01T22:44:46Z)
Interpreting Affine Recurrence Learning in GPT-style Transformers [54.01174470722201]
In-context learning allows GPT-style transformers to generalize during inference without modifying their weights. This paper focuses specifically on their ability to learn and predict affine recurrences as an ICL task. We analyze the model's internal operations using both empirical and theoretical approaches.
arXiv Detail & Related papers (2024-10-22T21:30:01Z)
Strengthening Structural Inductive Biases by Pre-training to Perform Syntactic Transformations [75.14793516745374]
We propose to strengthen the structural inductive bias of a Transformer by intermediate pre-training. Our experiments confirm that this helps with few-shot learning of syntactic tasks such as chunking. Our analysis shows that the intermediate pre-training leads to attention heads that keep track of which syntactic transformation needs to be applied to which token.
arXiv Detail & Related papers (2024-07-05T14:29:44Z)
Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient numerical training and inference algorithms as low-rank computation have impressive performance for learning Transformer-based adaption. We analyze how magnitude-based models affect generalization while improving adaption. We conclude that proper magnitude-based has a slight on the testing performance.
arXiv Detail & Related papers (2024-06-24T23:00:58Z)
Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization [22.033370572209744]
We study whether transformers can learn to implicitly reason over parametric knowledge. We focus on two representative reasoning types, composition and comparison. We find that transformers can learn implicit reasoning, but only through grokking.
arXiv Detail & Related papers (2024-05-23T21:42:19Z)
How Transformers Learn Causal Structure with Gradient Descent [44.31729147722701]
Self-attention allows transformers to encode causal structure. We introduce an in-context learning task that requires learning latent causal structure. We show that transformers trained on our in-context learning task are able to recover a wide variety of causal structures.
arXiv Detail & Related papers (2024-02-22T17:47:03Z)
AlgoFormer: An Efficient Transformer Framework with Algorithmic Structures [80.28359222380733]
We design a novel transformer framework, dubbed AlgoFormer, to empower transformers with algorithmic capabilities. In particular, inspired by the structure of human-designed learning algorithms, our transformer framework consists of a pre-transformer that is responsible for task preprocessing. Some theoretical and empirical results are presented to show that the designed transformer has the potential to perform algorithm representation and learning.
arXiv Detail & Related papers (2024-02-21T07:07:54Z)
What Makes for Good Tokenizers in Vision Transformer? [62.44987486771936]
transformers are capable of extracting their pairwise relationships using self-attention. What makes for a good tokenizer has not been well understood in computer vision. Modulation across Tokens (MoTo) incorporates inter-token modeling capability through normalization. Regularization objective TokenProp is embraced in the standard training regime.
arXiv Detail & Related papers (2022-12-21T15:51:43Z)
Systematic Generalization and Emergent Structures in Transformers Trained on Structured Tasks [6.525090891505941]
We show how a causal transformer can perform a set of algorithmic tasks, including copying, sorting, and hierarchical compositions. We show that two-layer transformers learn generalizable solutions to multi-level problems and develop signs of systematic task decomposition. These results provide key insights into how transformer models may be capable of decomposing complex decisions into reusable, multi-level policies.
arXiv Detail & Related papers (2022-10-02T00:46:36Z)
Mask and Reason: Pre-Training Knowledge Graph Transformers for Complex Logical Queries [36.22117601006972]
We present the Knowledge Graph Transformer (kgTransformer) with masked pre-training and fine-tuning strategies. kgTransformer can consistently outperform both KG embedding-based baselines and advanced encoders on nine in-domain and out-of-domain reasoning tasks.
arXiv Detail & Related papers (2022-08-16T09:51:26Z)
Structural Biases for Improving Transformers on Translation into Morphologically Rich Languages [120.74406230847904]
TP-Transformer augments the traditional Transformer architecture to include an additional component to represent structure. The second method imbues structure at the data level by segmenting the data with morphological tokenization. We find that each of these two approaches allows the network to achieve better performance, but this improvement is dependent on the size of the dataset.
arXiv Detail & Related papers (2022-08-11T22:42:24Z)
XAI for Transformers: Better Explanations through Conservative Propagation [60.67748036747221]
We show that the gradient in a Transformer reflects the function only locally, and thus fails to reliably identify the contribution of input features to the prediction. Our proposal can be seen as a proper extension of the well-established LRP method to Transformers.
arXiv Detail & Related papers (2022-02-15T10:47:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.