Transformer Feed-Forward Layers Are Key-Value Memories
- URL: http://arxiv.org/abs/2012.14913v1
- Date: Tue, 29 Dec 2020 19:12:05 GMT
- Title: Transformer Feed-Forward Layers Are Key-Value Memories
- Authors: Mor Geva, Roei Schuster, Jonathan Berant, Omer Levy
- Abstract summary: We show that feed-forward layers in transformer-based language models operate as key-value memories.
We show that the learned patterns are human-interpretable, and that lower layers tend to capture shallow patterns, while upper layers learn more semantic ones.
- Score: 49.52087581977751
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Feed-forward layers constitute two-thirds of a transformer model's
parameters, yet their role in the network remains under-explored. We show that
feed-forward layers in transformer-based language models operate as key-value
memories, where each key correlates with textual patterns in the training
examples, and each value induces a distribution over the output vocabulary. Our
experiments show that the learned patterns are human-interpretable, and that
lower layers tend to capture shallow patterns, while upper layers learn more
semantic ones. The values complement the keys' input patterns by inducing
output distributions that concentrate probability mass on tokens likely to
appear immediately after each pattern, particularly in the upper layers.
Finally, we demonstrate that the output of a feed-forward layer is a
composition of its memories, which is subsequently refined throughout the
model's layers via residual connections to produce the final output
distribution.
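The key-value reading of the feed-forward sublayer can be made concrete with a small numerical sketch. The example below is illustrative only (toy dimensions, random weights, and a made-up output embedding matrix E are assumptions, not values from the paper): the rows of the first weight matrix act as keys matched against the input, the rows of the second act as values, and a value vector can be interpreted by projecting it onto the vocabulary.

```python
# Minimal sketch (not the authors' code) of the key-value memory view of a
# transformer feed-forward layer: FF(x) = f(x @ K.T) @ V.
# All sizes and weights here are toy, randomly initialized assumptions.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, vocab = 16, 64, 100           # toy sizes, not the paper's

K = rng.normal(size=(d_ff, d_model))         # keys: first FF weight matrix
V = rng.normal(size=(d_ff, d_model))         # values: second FF weight matrix
E = rng.normal(size=(vocab, d_model))        # toy output (unembedding) matrix

def feed_forward(x):
    """FF(x) = f(x K^T) V: memory coefficients weight a sum of value vectors."""
    m = np.maximum(x @ K.T, 0.0)             # memory coefficients, shape (d_ff,)
    return m @ V                             # composition of the layer's memories

x = rng.normal(size=(d_model,))              # a hidden state entering the layer
y = feed_forward(x)

# A value v_i can be read as inducing a distribution over the vocabulary by
# projecting it through the output embedding; here we inspect the top-scoring
# toy token ids for the most strongly activated memory.
m = np.maximum(x @ K.T, 0.0)
top_mem = int(np.argmax(m))                  # most strongly triggered key
logits = E @ V[top_mem]                      # value -> scores over the vocabulary
print("most active memory:", top_mem, "top tokens:", np.argsort(-logits)[:5])
```

In the paper, the analogous projection is applied to the trained model's value vectors, showing that, particularly in the upper layers, they concentrate probability mass on tokens likely to follow the key's triggering pattern.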
Related papers
- Uncovering Layer-Dependent Activation Sparsity Patterns in ReLU Transformers [2.1572258716881905]
We explore how token-level sparsity evolves over the course of training, and how it connects to broader sparsity patterns.
In particular, we demonstrate that the first and last layer of the network have distinctive and in many ways inverted relationships to sparsity.
We additionally explore the phenomenon of ReLU dimensions "turning off", and show evidence suggesting that "neuron death" is being driven by the dynamics of training.
arXiv Detail & Related papers (2024-07-10T17:10:10Z)
- Pyramid Hierarchical Transformer for Hyperspectral Image Classification [1.9427851979929982]
We propose a pyramid-based hierarchical transformer (PyFormer).
This innovative approach organizes input data hierarchically into segments, each representing distinct abstraction levels.
Results underscore the superiority of the proposed method over traditional approaches.
arXiv Detail & Related papers (2024-04-23T11:41:19Z)
- A Phase Transition in Diffusion Models Reveals the Hierarchical Nature of Data [55.748186000425996]
Recent advancements show that diffusion models can generate high-quality images.
We study this phenomenon in a hierarchical generative model of data.
Our analysis characterises the relationship between time and scale in diffusion models.
arXiv Detail & Related papers (2024-02-26T19:52:33Z)
- Pushdown Layers: Encoding Recursive Structure in Transformer Language Models [86.75729087623259]
Recursion is a prominent feature of human language and is fundamentally challenging for self-attention.
This work introduces Pushdown Layers, a new self-attention layer.
Transformers equipped with Pushdown Layers achieve dramatically better syntactic generalization while being 3-5x more sample-efficient.
arXiv Detail & Related papers (2023-10-29T17:27:18Z)
- With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning [47.96387857237473]
We devise a network which can perform attention over activations obtained while processing other training samples.
Our memory models the distribution of past keys and values through the definition of prototype vectors.
We demonstrate that our proposal can increase the performance of an encoder-decoder Transformer by 3.7 CIDEr points both when training in cross-entropy only and when fine-tuning with self-critical sequence training.
arXiv Detail & Related papers (2023-08-23T18:53:00Z)
- VISIT: Visualizing and Interpreting the Semantic Information Flow of Transformers [45.42482446288144]
Recent advances in interpretability suggest we can project weights and hidden states of transformer-based language models to their vocabulary.
We investigate LM attention heads and memory values, the vectors the models dynamically create and recall while processing a given input.
We create a tool to visualize a forward pass of Generative Pre-trained Transformers (GPTs) as an interactive flow graph.
arXiv Detail & Related papers (2023-05-22T19:04:56Z)
- The geometry of hidden representations of large transformer models [43.16765170255552]
Large transformers are powerful architectures used for self-supervised data analysis across various data types.
We show that the semantic structure of the dataset emerges from a sequence of transformations between one representation and the next.
We show that the semantic information of the dataset is better expressed at the end of the first peak, and this phenomenon can be observed across many models trained on diverse datasets.
arXiv Detail & Related papers (2023-02-01T07:50:26Z)
- ContraFeat: Contrasting Deep Features for Semantic Discovery [102.4163768995288]
StyleGAN has shown strong potential for disentangled semantic control.
Existing semantic discovery methods on StyleGAN rely on manual selection of modified latent layers to obtain satisfactory manipulation results.
We propose a model that automates this process and achieves state-of-the-art semantic discovery performance.
arXiv Detail & Related papers (2022-12-14T15:22:13Z)
- Stochastic Layers in Vision Transformers [85.38733795180497]
We introduce fully stochastic layers in vision transformers without causing any severe drop in performance.
The added stochasticity boosts the robustness of visual features and strengthens privacy.
We use our features for three different applications, namely, adversarial robustness, network calibration, and feature privacy.
arXiv Detail & Related papers (2021-12-30T16:07:59Z)
- Addressing Some Limitations of Transformers with Feedback Memory [51.94640029417114]
Transformers have been successfully applied to sequential, auto-regressive tasks despite being feedforward networks.
We propose the Feedback Transformer architecture that exposes all previous representations to all future representations.
We demonstrate on a variety of benchmarks in language modeling, machine translation, and reinforcement learning that the increased representation capacity can create small, shallow models with much stronger performance than comparable Transformers.
arXiv Detail & Related papers (2020-02-21T16:37:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.