Related papers: Decoupling Knowledge and Reasoning in Transformers: A Modular Architecture with Generalized Cross-Attention

URL: http://arxiv.org/abs/2501.00823v2
Date: Mon, 06 Jan 2025 14:26:41 GMT
Title: Decoupling Knowledge and Reasoning in Transformers: A Modular Architecture with Generalized Cross-Attention
Authors: Zhenyu Guo, Wenguang Chen
Abstract summary: This paper introduces a novel modular Transformer architecture that explicitly decouples knowledge and reasoning. We provide a rigorous mathematical derivation demonstrating that the Feed-Forward Network (FFN) in a standard Transformer is a specialized case of this generalized cross-attention.
Score: 9.401360346241296
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Transformers have achieved remarkable success across diverse domains, but their monolithic architecture presents challenges in interpretability, adaptability, and scalability. This paper introduces a novel modular Transformer architecture that explicitly decouples knowledge and reasoning through a generalized cross-attention mechanism to a globally shared knowledge base with layer-specific transformations, specifically designed for effective knowledge retrieval. Critically, we provide a rigorous mathematical derivation demonstrating that the Feed-Forward Network (FFN) in a standard Transformer is a specialized case (a closure) of this generalized cross-attention, revealing its role in implicit knowledge retrieval and validating our design. This theoretical framework provides a new lens for understanding FFNs and lays the foundation for future research exploring enhanced interpretability, adaptability, and scalability, enabling richer interplay with external knowledge bases and other systems.
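
The abstract's central claim, that an FFN can be read as a form of knowledge retrieval, can be illustrated with a minimal sketch. This is not the paper's derivation and does not reproduce its generalized cross-attention, shared knowledge base, or layer-specific transformations. It only shows the commonly cited key-value reading of a two-layer FFN, assuming a ReLU activation: rows of the first weight matrix act as keys, columns of the second act as values, and ReLU stands in for softmax as an unnormalized weighting. All names here (`ffn`, `kv_lookup`, `w1`, `w2`, the dimensions) are hypothetical and chosen for illustration.

```python
import random

def relu(v):
    return [max(0.0, a) for a in v]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def ffn(x, w1, b1, w2, b2):
    """Standard two-layer FFN: W2 * relu(W1 x + b1) + b2."""
    hidden = relu([h + b for h, b in zip(matvec(w1, x), b1)])
    return [o + b for o, b in zip(matvec(w2, hidden), b2)]

def kv_lookup(x, keys, key_bias, values, out_bias):
    """The same computation written as a key-value lookup.

    Hidden unit i owns a key (row i of W1) and a value (column i of W2).
    relu(key_i . x + bias_i) acts as an unnormalized attention weight
    (an illustrative assumption: ReLU in place of softmax).
    """
    scores = relu([sum(k * a for k, a in zip(key, x)) + b
                   for key, b in zip(keys, key_bias)])
    out = list(out_bias)
    for score, value in zip(scores, values):
        for j, v in enumerate(value):
            out[j] += score * v  # weighted sum of value vectors
    return out

# Hypothetical toy sizes and random weights, seeded for repeatability.
rng = random.Random(0)
d_model, d_hidden = 4, 8
w1 = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_hidden)]
b1 = [rng.uniform(-1, 1) for _ in range(d_hidden)]
w2 = [[rng.uniform(-1, 1) for _ in range(d_hidden)] for _ in range(d_model)]
b2 = [rng.uniform(-1, 1) for _ in range(d_model)]
x = [rng.uniform(-1, 1) for _ in range(d_model)]

keys = w1                                  # one key per hidden unit
values = [list(col) for col in zip(*w2)]   # one value per hidden unit

y_ffn = ffn(x, w1, b1, w2, b2)
y_kv = kv_lookup(x, keys, b1, values, b2)
max_diff = max(abs(a - b) for a, b in zip(y_ffn, y_kv))
print("max difference:", max_diff)
```

The two functions agree up to floating-point rounding, which is the sense in which the FFN weights can be viewed as a fixed, layer-local key-value store. How the paper generalizes this to a globally shared knowledge base is described only at the level of the abstract above.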

Related papers

Importance inversion transfer identifies shared principles for cross-domain learning [0.0]
This study formalizes a framework unifying network science and explainable artificial intelligence. It prioritizes structural invariants that generalize across biological, linguistic, molecular, and social networks.
arXiv Detail & Related papers (2026-02-09T19:06:52Z)
On the Universality of Transformer Architectures; How Much Attention Is Enough? [0.0]
Transformers are crucial across many AI fields, such as large language models, computer vision, and reinforcement learning. This work examines the problem of universality in Transformers, reviews recent progress, and surveys state-of-the-art advances.
arXiv Detail & Related papers (2025-12-20T17:31:59Z)
Unlocking Out-of-Distribution Generalization in Transformers via Recursive Latent Space Reasoning [50.99796659680724]
This work investigates out-of-distribution (OOD) generalization in Transformer networks using a GSM8K-style modular arithmetic on computational graphs task as a testbed. We introduce and explore a set of four architectural mechanisms aimed at enhancing OOD generalization. We complement these empirical results with a detailed mechanistic interpretability analysis that reveals how these mechanisms give rise to robust OOD generalization abilities.
arXiv Detail & Related papers (2025-10-15T21:03:59Z)
Enhancing Transformer with GNN Structural Knowledge via Distillation: A Novel Approach [1.4582633500696451]
This paper proposes a novel knowledge distillation framework that transfers multiscale structural knowledge from GNN teacher models to Transformer student models. The framework effectively bridges the architectural gap between GNNs and Transformers through micro-macro distillation losses and multiscale feature alignment.
arXiv Detail & Related papers (2025-02-27T05:14:47Z)
Can Transformers Learn Full Bayesian Inference in Context? [13.479322264788367]
We show that transformers can perform full Bayesian inference for commonly used statistical models in context. We introduce a general framework that builds on ideas from prior fitted networks and continuous normalizing flows. Experiments on real-world datasets demonstrate that our ICL approach yields posterior samples that are similar in quality to state-of-the-art MCMC or variational inference methods.
arXiv Detail & Related papers (2025-01-28T10:04:53Z)
Enhancing Transformers for Generalizable First-Order Logical Entailment [51.04944136538266]
This paper investigates the generalizable first-order logical reasoning ability of transformers with their parameterized knowledge. The first-order reasoning capability of transformers is assessed through their ability to perform first-order logical entailment. We propose a more sophisticated, logic-aware architecture, TEGA, to enhance the capability for generalizable first-order logical entailment in transformers.
arXiv Detail & Related papers (2025-01-01T07:05:32Z)
Knowledge-enhanced Transformer for Multivariate Long Sequence Time-series Forecasting [4.645182684813973]
We introduce a novel approach that encapsulates conceptual relationships among variables within a well-defined knowledge graph. We investigate the influence of this integration into seminal architectures such as PatchTST, Autoformer, Informer, and Vanilla Transformer. This enhancement empowers transformer-based architectures to address the inherent structural relation between variables.
arXiv Detail & Related papers (2024-11-17T11:53:54Z)
MergeNet: Knowledge Migration across Heterogeneous Models, Tasks, and Modalities [72.05167902805405]
We present MergeNet, which learns to bridge the gap of parameter spaces of heterogeneous models. The core mechanism of MergeNet lies in the parameter adapter, which operates by querying the source model's low-rank parameters. MergeNet is learned alongside both models, allowing our framework to dynamically transfer and adapt knowledge relevant to the current stage.
arXiv Detail & Related papers (2024-04-20T08:34:39Z)
Foundations for Transfer in Reinforcement Learning: A Taxonomy of Knowledge Modalities [28.65224261733876]
We look at opportunities and challenges in refining the generalisation and transfer of knowledge. Within the domain of reinforcement learning (RL), the representation of knowledge manifests through various modalities. This taxonomy systematically targets these modalities and frames its discussion based on their inherent properties and alignment with different objectives and mechanisms for transfer.
arXiv Detail & Related papers (2023-12-04T14:55:58Z)
Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers [5.356051655680145]
This work presents an analysis of the effectiveness of using standard shallow feed-forward networks to mimic the behavior of the attention mechanism in the original Transformer model. We substitute key elements of the attention mechanism in the Transformer with simple feed-forward networks, trained using the original components via knowledge distillation. Our experiments, conducted on the IWSLT 2017 dataset, reveal the capacity of these "attentionless Transformers" to rival the performance of the original architecture.
arXiv Detail & Related papers (2023-11-17T16:58:52Z)
Knowledge-Infused Self Attention Transformers [11.008412414253662]
Transformer-based language models have achieved impressive success in various natural language processing tasks. This paper introduces a systematic method for infusing knowledge into different components of a transformer-based model.
arXiv Detail & Related papers (2023-06-23T13:55:01Z)
SIM-Trans: Structure Information Modeling Transformer for Fine-grained Visual Categorization [59.732036564862796]
We propose the Structure Information Modeling Transformer (SIM-Trans) to incorporate object structure information into transformers for enhancing discriminative representation learning. The two proposed modules are lightweight and can be plugged into any transformer network and trained end-to-end easily. Experiments and analyses demonstrate that the proposed SIM-Trans achieves state-of-the-art performance on fine-grained visual categorization benchmarks.
arXiv Detail & Related papers (2022-08-31T03:00:07Z)
Kformer: Knowledge Injection in Transformer Feed-Forward Layers [107.71576133833148]
We propose a novel knowledge fusion model, namely Kformer, which incorporates external knowledge through the feed-forward layer in Transformer. We empirically find that simply injecting knowledge into FFN can facilitate the pre-trained language model's ability and facilitate current knowledge fusion methods.
arXiv Detail & Related papers (2022-01-15T03:00:27Z)
KAT: A Knowledge Augmented Transformer for Vision-and-Language [56.716531169609915]
We propose a novel model - Knowledge Augmented Transformer (KAT) - which achieves a strong state-of-the-art result on the open-domain multimodal task of OK-VQA. Our approach integrates implicit and explicit knowledge in an end-to-end encoder-decoder architecture, while still jointly reasoning over both knowledge sources during answer generation. An additional benefit of explicit knowledge integration is seen in improved interpretability of model predictions in our analysis.
arXiv Detail & Related papers (2021-12-16T04:37:10Z)
Transformers with Competitive Ensembles of Independent Mechanisms [97.93090139318294]
We propose a new Transformer layer which divides the hidden representation and parameters into multiple mechanisms, which only exchange information through attention. We study TIM on a large-scale BERT model, on the Image Transformer, and on speech enhancement and find evidence for semantically meaningful specialization as well as improved performance.
arXiv Detail & Related papers (2021-02-27T21:48:46Z)

This list is automatically generated from the titles and abstracts of the papers on this site.