GPT-2 Through the Lens of Vector Symbolic Architectures
- URL: http://arxiv.org/abs/2412.07947v1
- Date: Tue, 10 Dec 2024 22:20:36 GMT
- Title: GPT-2 Through the Lens of Vector Symbolic Architectures
- Authors: Johannes Knittel, Tushaar Gangavarapu, Hendrik Strobelt, Hanspeter Pfister
- Abstract summary: This paper explores the resemblance between the decoder-only transformer architecture and vector symbolic architectures (VSA). It shows that these principles help explain a significant portion of the actual neural weights.
- Score: 36.744603771123344
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding the general principles behind transformer models remains a complex endeavor. Experiments with probing and disentangling features using sparse autoencoders (SAEs) suggest that these models might manage linear features embedded as directions in the residual stream. This paper explores the resemblance between the decoder-only transformer architecture and vector symbolic architectures (VSAs) and presents experiments indicating that GPT-2 uses mechanisms involving nearly orthogonal vector bundling and binding operations, similar to VSAs, for computation and communication between layers. It further shows that these principles help explain a significant portion of the actual neural weights.
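To make the two operations concrete, here is a minimal numpy sketch of bundling and binding in a bipolar (MAP-style) VSA. The paper argues that GPT-2's residual stream resembles such a scheme; the dimension and vectors below are illustrative assumptions, not the authors' code.

```python
# Bundling (superposition) and binding, the two VSA operations the
# abstract maps onto GPT-2's residual stream. Bipolar MAP-style VSA;
# all values here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d = 4096  # high dimension: random vectors are nearly orthogonal

a, b, c = (rng.choice([-1.0, 1.0], size=d) for _ in range(3))

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Bundling: the elementwise sum stays similar to each constituent, so
# several features can coexist in one residual-stream vector.
bundle = a + b
print(f"cos(bundle, a) = {cos(bundle, a):.2f}")  # ~0.71
print(f"cos(bundle, c) = {cos(bundle, c):.2f}")  # ~0.00 (unrelated)

# Binding: the Hadamard product is nearly orthogonal to both inputs,
# so a bound pair does not interfere with other bundled content.
bound = a * b
print(f"cos(bound, a) = {cos(bound, a):.2f}")    # ~0.00

# Unbinding: bipolar binding is self-inverse, so multiplying by one
# factor recovers the other exactly.
print(f"cos(bound * a, b) = {cos(bound * a, b):.2f}")  # 1.00
```

The near-orthogonality of random high-dimensional vectors is what lets bundled features be read out independently with simple dot products, a property the paper connects to the actual neural weights.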
Related papers
- Polyhedra Encoding Transformers: Enhancing Diffusion MRI Analysis Beyond Voxel and Volumetric Embedding [9.606654786275902]
In this paper, we propose a novel method called the Polyhedra Encoding Transformer (PE-Transformer) for dMRI, designed specifically to handle spherical signals.
Our approach involves projecting an icosahedral unit sphere to resample signals from predetermined directions. These resampled signals are then transformed into embeddings, which are processed by a transformer encoder that incorporates orientational information reflective of the icosahedral structure.
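As a hedged sketch of the resampling step just described: signals measured along arbitrary gradient directions are re-read at the 12 vertices of an icosahedron. Nearest-neighbour interpolation stands in here for whatever interpolation PE-Transformer actually uses.

```python
# Toy icosahedral resampling for spherical (dMRI-like) signals.
# Nearest-neighbour interpolation is an assumption, not the paper's
# exact scheme.
import numpy as np

phi = (1 + 5 ** 0.5) / 2  # golden ratio
ico = np.array(
    [(0, s1, s2 * phi) for s1 in (-1, 1) for s2 in (-1, 1)]
    + [(s1, s2 * phi, 0) for s1 in (-1, 1) for s2 in (-1, 1)]
    + [(s1 * phi, 0, s2) for s1 in (-1, 1) for s2 in (-1, 1)],
    dtype=float)
ico /= np.linalg.norm(ico, axis=1, keepdims=True)  # 12 unit-sphere vertices

def resample(signal, grad_dirs):
    """Re-read per-direction signal values at the icosahedral vertices."""
    # |cosine| similarity: diffusion signals are antipodally symmetric
    sims = np.abs(ico @ grad_dirs.T)       # (12, n_dirs)
    return signal[sims.argmax(axis=1)]     # value of the closest direction

rng = np.random.default_rng(0)
dirs = rng.normal(size=(64, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
print(resample(rng.random(64), dirs).shape)  # (12,)
```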
arXiv Detail & Related papers (2025-01-23T03:32:52Z)
- Unified CNNs and transformers underlying learning mechanism reveals multi-head attention modus vivendi [0.0]
Convolutional neural networks (CNNs) evaluate short-range correlations in input images, which progress along the layers.
Vision transformer (ViT) architectures, by contrast, evaluate long-range correlations, using repeated transformer encoders composed of fully connected layers.
This study demonstrates that CNNs and ViT architectures stem from a unified underlying learning mechanism.
arXiv Detail & Related papers (2025-01-22T14:19:48Z)
- Rethinking Decoders for Transformer-based Semantic Segmentation: A Compression Perspective [3.218600495900291]
We argue that there are fundamental connections between semantic segmentation and compression.
We derive a white-box, fully attentional DEcoder for PrIncipled semantiC segmenTation (DEPICT).
Experiments on the ADE20K dataset show that DEPICT consistently outperforms its black-box counterpart, Segmenter.
arXiv Detail & Related papers (2024-11-05T12:10:02Z)
- Hierarchical Transformer for Electrocardiogram Diagnosis [1.4124476944967472]
Transformers, originally prominent in NLP and computer vision, are now being adapted for ECG signal analysis.
This paper introduces a novel hierarchical transformer architecture that segments the model into multiple stages.
A classification token aggregates information across feature scales, facilitating interactions between different stages of the transformer.
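A hedged sketch of that classification-token idea, assuming strided token downsampling between stages (the paper's exact stage internals are not reproduced here): a single [CLS] vector is carried across all stages so it can aggregate features at every scale.

```python
# Minimal staged transformer with a cross-stage [CLS] token (PyTorch).
# Stage internals and the downsampling rule are assumptions.
import torch
import torch.nn as nn

class StagedECGClassifier(nn.Module):
    def __init__(self, dim=64, stages=3, classes=5):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.stages = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            for _ in range(stages))
        self.head = nn.Linear(dim, classes)

    def forward(self, x):                      # x: (batch, tokens, dim)
        cls = self.cls.expand(x.size(0), -1, -1)
        for stage in self.stages:
            out = stage(torch.cat([cls, x], dim=1))
            cls, x = out[:, :1], out[:, 1::2]  # keep [CLS]; halve tokens
        return self.head(cls.squeeze(1))

logits = StagedECGClassifier()(torch.randn(2, 128, 64))
print(logits.shape)  # torch.Size([2, 5])
```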
arXiv Detail & Related papers (2024-11-01T17:28:03Z)
- Interpreting Affine Recurrence Learning in GPT-style Transformers [54.01174470722201]
In-context learning allows GPT-style transformers to generalize during inference without modifying their weights.
This paper focuses specifically on their ability to learn and predict affine recurrences as an ICL task.
We analyze the model's internal operations using both empirical and theoretical approaches.
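For readers unfamiliar with the task: an affine recurrence is a sequence obeying x_{k+1} = a * x_k + b, and the ICL setup shows the model a prefix of such a sequence with (a, b) hidden. A minimal generator (the paper's prompt formatting is omitted here):

```python
# Generate terms of the affine recurrence x_{k+1} = a * x_k + b.
# The in-context learner sees a prefix and must predict the next term.
def affine_sequence(x0: float, a: float, b: float, n: int) -> list[float]:
    seq = [x0]
    for _ in range(n - 1):
        seq.append(a * seq[-1] + b)
    return seq

# With a=2, b=1: 1, 3, 7, 15, ... -> the model must infer (a, b) from
# the shown prefix and continue with 31.
print(affine_sequence(1.0, 2.0, 1.0, 5))  # [1.0, 3.0, 7.0, 15.0, 31.0]
```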
arXiv Detail & Related papers (2024-10-22T21:30:01Z)
- Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers [54.20763128054692]
We study how a two-attention-layer transformer is trained to perform ICL on $n$-gram Markov chain data.
We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model.
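A small sketch of the kind of training data described above, assuming one random transition table per (n-1)-token context; the vocabulary size and n below are placeholders:

```python
# Sample a sequence from a random n-gram Markov chain: the next token
# depends only on the previous (n - 1) tokens.
import numpy as np

def sample_ngram_chain(vocab: int, n: int, length: int, seed: int = 0):
    rng = np.random.default_rng(seed)
    # one categorical distribution per (n - 1)-token context
    probs = rng.dirichlet(np.ones(vocab), size=(vocab,) * (n - 1))
    seq = list(rng.integers(vocab, size=n - 1))  # random prefix
    while len(seq) < length:
        ctx = tuple(seq[-(n - 1):])
        seq.append(int(rng.choice(vocab, p=probs[ctx])))
    return seq

print(sample_ngram_chain(vocab=5, n=3, length=20))
```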
arXiv Detail & Related papers (2024-09-09T18:10:26Z)
- SimPLR: A Simple and Plain Transformer for Scaling-Efficient Object Detection and Segmentation [49.65221743520028]
We show that a transformer-based detector with scale-aware attention enables the plain detector SimPLR, whose backbone and detection head are both non-hierarchical and operate on single-scale features.
Compared to the multi-scale and single-scale state-of-the-art, our model scales much better with bigger capacity (self-supervised) models and more pre-training data.
arXiv Detail & Related papers (2023-10-09T17:59:26Z)
- Neuromorphic Visual Scene Understanding with Resonator Networks [11.701553530610973]
We propose a neuromorphic solution that exploits three key concepts.
The framework is based on Vector Symbolic Architectures (VSAs) with complex-valued vectors.
A resonator network is used to factorize the non-commutative transforms of translation and rotation in visual scenes (a toy sketch follows below).
A companion paper demonstrates the same approach in real-world application scenarios for machine vision and robotics.
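A toy version of the resonator idea referenced above, using bipolar Hadamard binding as a stand-in for the paper's complex-valued VSA: given a composite bound from one codeword per codebook, alternately unbind with the current estimate of the other factor and clean up against the codebook.

```python
# Two-factor resonator network (toy). Codebooks, sizes, and the bipolar
# binding are assumptions standing in for the complex-valued original.
import numpy as np

rng = np.random.default_rng(1)
d, k = 2048, 9
A = rng.choice([-1.0, 1.0], size=(k, d))  # e.g. a "what" codebook
B = rng.choice([-1.0, 1.0], size=(k, d))  # e.g. a "where" codebook
target = A[2] * B[5]                      # composite to factorize

# Initialize each estimate to the superposition of its codebook.
a_hat = np.sign(A.sum(axis=0))
b_hat = np.sign(B.sum(axis=0))
for _ in range(10):
    # unbind with the other estimate, then clean up via the codebook
    a_hat = np.sign(A.T @ (A @ (target * b_hat)))
    b_hat = np.sign(B.T @ (B @ (target * a_hat)))

print((A @ a_hat).argmax(), (B @ b_hat).argmax())  # 2 5
```

For this toy size the iteration settles on the correct pair of codewords within a few steps, without ever enumerating the k x k factor combinations explicitly.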
arXiv Detail & Related papers (2022-08-26T22:17:52Z)
- Residual and Attentional Architectures for Vector-Symbols [0.0]
Vector-symbolic architectures (VSAs) provide methods for computing that are highly flexible and carry unique advantages.
In this work, we combine the efficiency of the operations provided within the framework of the Fourier Holographic Reduced Representation (FHRR) VSA with the power of deep networks to construct novel VSA-based residual and attention-based neural network architectures (the FHRR primitives are illustrated below).
This demonstrates a novel application of VSAs and a potential path to implementing state-of-the-art neural models on neuromorphic hardware.
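A toy illustration of the FHRR primitives named above: symbols are vectors of unit-magnitude complex phasors, binding adds phases elementwise, and unbinding multiplies by the conjugate. This sketches the algebra only, not the paper's network code.

```python
# FHRR binding/unbinding with random complex phasor vectors.
import numpy as np

rng = np.random.default_rng(0)
d = 1024

def phasor():
    # unit-magnitude complex entries with uniformly random phases
    return np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, size=d))

a, b, c = phasor(), phasor(), phasor()

bound = a * b                   # bind: elementwise phase addition
recovered = bound * np.conj(a)  # unbind: multiply by the conjugate

sim = lambda u, v: float(np.abs(np.vdot(u, v)) / d)
print(f"recovered vs b: {sim(recovered, b):.3f}")  # 1.000 (exact)
print(f"recovered vs c: {sim(recovered, c):.3f}")  # ~0.03 (noise floor)
```

Because binding is exactly invertible and costs one elementwise multiply, FHRR pairs naturally with residual and attention blocks, which is the combination the paper explores.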
arXiv Detail & Related papers (2022-07-18T21:38:43Z)
- Exploring Structure-aware Transformer over Interaction Proposals for Human-Object Interaction Detection [119.93025368028083]
We design a novel Transformer-style Human-Object Interaction (HOI) detector, i.e., the Structure-aware Transformer over Interaction Proposals (STIP).
STIP decomposes HOI set prediction into two subsequent phases: interaction proposals are first generated and then transformed into HOI predictions by a structure-aware Transformer.
The structure-aware Transformer upgrades the vanilla Transformer by additionally encoding the holistic semantic structure among interaction proposals as well as the local spatial structure of the human/object within each proposal, so as to strengthen HOI predictions.
arXiv Detail & Related papers (2022-06-13T16:21:08Z)
- CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the detailed spatial information captured by CNNs with the global context provided by transformers for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
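For context, here is the sampling half of compressive sensing in two lines; the random Gaussian matrix below is a stand-in for CSformer's adaptive (learned) sampling, and the transposed projection is only a crude initializer a network would refine.

```python
# Compressive measurements y = Phi @ x at a 25% sampling ratio.
import numpy as np

rng = np.random.default_rng(0)
n, m = 1024, 256                            # 32x32 block, m/n = 0.25
x = rng.random(n)                           # flattened image block
Phi = rng.normal(size=(m, n)) / np.sqrt(m)  # random sampling matrix

y = Phi @ x                                 # few measurements...
x0 = Phi.T @ y                              # ...crude linear proxy of x
print(y.shape, x0.shape)                    # (256,) (1024,)
```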
arXiv Detail & Related papers (2021-12-31T04:37:11Z)
- Video Super-Resolution Transformer [85.11270760456826]
Video super-resolution (VSR), which aims to restore a high-resolution video from its corresponding low-resolution version, is a spatial-temporal sequence prediction problem.
Recently, Transformer has been gaining popularity due to its parallel computing ability for sequence-to-sequence modeling.
In this paper, we present a spatial-temporal convolutional self-attention layer with a theoretical understanding to exploit the locality information.
arXiv Detail & Related papers (2021-06-12T20:00:32Z)