Empirical Study on Updating Key-Value Memories in Transformer Feed-forward Layers
- URL: http://arxiv.org/abs/2402.12233v1
- Date: Mon, 19 Feb 2024 15:42:54 GMT
- Title: Empirical Study on Updating Key-Value Memories in Transformer Feed-forward Layers
- Authors: Zihan Qiu, Zeyu Huang, Youcheng Huang and Jie Fu
- Abstract summary: The feed-forward networks (FFNs) in transformers are recognized as a group of key-value neural memories that store abstract, high-level knowledge.
We conduct an empirical ablation study on updating keys (the first layer of an FFN) versus updating values (the second layer).
We compare the two methods on various knowledge-editing and fine-tuning tasks for large language models to gain further insight into FFNs.
- Score: 27.636372947415186
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The feed-forward networks (FFNs) in transformers are recognized as a group of key-value neural memories that store abstract, high-level knowledge. In this work, we conduct an empirical ablation study on updating keys (the first layer of an FFN) or values (the second layer of an FFN). We compare the two methods on various knowledge-editing and fine-tuning tasks for large language models to gain further insight into FFNs. Code is available at https://github.com/qiuzh20/Tuning-keys-v.s.-values.
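To make the two editing targets concrete, the following is a minimal PyTorch sketch of the setup described in the abstract, assuming a standard two-layer FFN; the class and helper names are illustrative and are not taken from the linked repository.

    import torch
    import torch.nn as nn

    class FFN(nn.Module):
        # Standard transformer feed-forward block. Under the key-value memory
        # view, the first linear layer holds the keys and the second the values.
        def __init__(self, d_model: int, d_ff: int):
            super().__init__()
            self.keys = nn.Linear(d_model, d_ff)    # 1st layer: memory keys
            self.values = nn.Linear(d_ff, d_model)  # 2nd layer: memory values
            self.act = nn.ReLU()

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # coefficients from key matching, then a weighted sum over values
            return self.values(self.act(self.keys(x)))

    def trainable_params(ffn: FFN, target: str):
        # Freeze the whole block, then unfreeze only the chosen sub-layer.
        for p in ffn.parameters():
            p.requires_grad = False
        layer = ffn.keys if target == "keys" else ffn.values
        for p in layer.parameters():
            p.requires_grad = True
        return [p for p in ffn.parameters() if p.requires_grad]

    ffn = FFN(d_model=768, d_ff=3072)
    optimizer = torch.optim.AdamW(trainable_params(ffn, "keys"), lr=1e-4)  # or "values"

Under this reading, updating keys changes which input patterns activate each memory slot, while updating values changes what an activated slot writes back; this reflects the key-value memory interpretation, not necessarily the exact training recipe used in the paper.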
Related papers
- Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding [58.364933651703524]
We show that concentrated massive values consistently emerge in specific regions of attention queries.
These massive values play a critical role in interpreting contextual knowledge.
We trace the emergence of massive values and find that such concentration is caused by Rotary Position Embedding (RoPE).
arXiv Detail & Related papers (2025-02-03T17:47:03Z)
- Reversible Decoupling Network for Single Image Reflection Removal [15.763420129991255]
High-level semantic clues tend to be compressed or discarded during layer-by-layer propagation.
We propose a novel architecture called the Reversible Decoupling Network (RDNet).
RDNet employs a reversible encoder to secure valuable information while flexibly decoupling transmission- and reflection-relevant features during the forward pass.
arXiv Detail & Related papers (2024-10-10T15:58:27Z)
- How Large Language Models Encode Context Knowledge? A Layer-Wise Probing Study [27.23388511249688]
This paper investigates the layer-wise capability of large language models to encode knowledge.
We leverage the powerful generative capability of ChatGPT to construct probing datasets.
Experiments on conflicting and newly acquired knowledge show that LLMs prefer to encode more context knowledge in the upper layers.
arXiv Detail & Related papers (2024-02-25T11:15:42Z)
- A Study on ReLU and Softmax in Transformer [51.0740713922741]
The Transformer architecture consists of self-attention and feed-forward networks (FFNs) which can be viewed as key-value memories.
We first rebuild the connections between FFN and key-value memory by conducting extensive studies on ReLU and Softmax.
In addition, ReLU outperforms Softmax on both FFN and key-value memory when the number of value slots is large.
arXiv Detail & Related papers (2023-02-13T15:41:20Z)
- Technical Report: Combining knowledge from Transfer Learning during training and Wide Resnets [2.3859169601259342]
We combine the idea of Wide ResNets and transfer learning to optimize the architecture of deep neural networks.
The first improvement of the architecture is the use of all layers as an information source for the last layer.
The second improvement is the use of deeper layers instead of deeper sequences of blocks.
arXiv Detail & Related papers (2022-06-20T10:40:59Z)
- Kformer: Knowledge Injection in Transformer Feed-Forward Layers [107.71576133833148]
We propose a novel knowledge fusion model, namely Kformer, which incorporates external knowledge through the feed-forward layer in Transformer.
We empirically find that simply injecting knowledge into the FFN can enhance the pre-trained language model's ability and facilitate current knowledge fusion methods.
arXiv Detail & Related papers (2022-01-15T03:00:27Z)
- HAT: Hierarchical Aggregation Transformers for Person Re-identification [87.02828084991062]
We take advantage of both CNNs and Transformers for image-based person Re-ID with high performance.
This work is the first to combine the strengths of CNNs and Transformers for image-based person Re-ID.
arXiv Detail & Related papers (2021-07-13T09:34:54Z)
- Train your classifier first: Cascade Neural Networks Training from upper layers to lower layers [54.47911829539919]
We develop a novel top-down training method which can be viewed as an algorithm for searching for high-quality classifiers.
We tested this method on automatic speech recognition (ASR) tasks and language modelling tasks.
The proposed method consistently improves recurrent neural network ASR models on Wall Street Journal, self-attention ASR models on Switchboard, and AWD-LSTM language models on WikiText-2.
arXiv Detail & Related papers (2021-02-09T08:19:49Z)
- Transformer Feed-Forward Layers Are Key-Value Memories [49.52087581977751]
We show that feed-forward layers in transformer-based language models operate as key-value memories.
We show that the learned patterns are human-interpretable, and that lower layers tend to capture shallow patterns, while upper layers learn more semantic ones.
arXiv Detail & Related papers (2020-12-29T19:12:05Z)
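For reference, the key-value memory view used in the entry above (and adopted in the main paper) treats a two-layer FFN, with biases omitted, roughly as $\mathrm{FFN}(x) = f(x K^{\top})\, V$, where the rows of $K$ (the first-layer weights) act as keys matched against the input $x$, $f$ is the activation (e.g. ReLU or GELU), and the rows of $V$ (the second-layer weights) are the values combined according to the resulting coefficients.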
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.