Empirical Study on Updating Key-Value Memories in Transformer Feed-forward Layers
- URL: http://arxiv.org/abs/2402.12233v1
- Date: Mon, 19 Feb 2024 15:42:54 GMT
- Title: Empirical Study on Updating Key-Value Memories in Transformer Feed-forward Layers
- Authors: Zihan Qiu, Zeyu Huang, Youcheng Huang and Jie Fu
- Abstract summary: The feed-forward networks (FFNs) in transformers are recognized as a group of key-value neural memories that store abstract, high-level knowledge.
We conduct an empirical ablation study on updating keys (the first layer of an FFN) versus updating values (the second layer).
We compare the two methods on various knowledge-editing and fine-tuning tasks for large language models to gain further insight into FFNs.
- Score: 27.636372947415186
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The feed-forward networks (FFNs) in transformers are recognized as a group of key-value neural memories that store abstract, high-level knowledge. In this work, we conduct an empirical ablation study on updating keys (the first layer of an FFN) or values (the second layer of an FFN). We compare the two methods on various knowledge-editing and fine-tuning tasks for large language models to gain further insight into FFNs. Code is available at https://github.com/qiuzh20/Tuning-keys-v.s.-values.
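To make the two editing targets concrete, the following is a minimal PyTorch sketch of the setup described in the abstract, assuming a standard two-layer FFN; the class and helper names are illustrative and are not taken from the linked repository.

    import torch
    import torch.nn as nn

    class FFN(nn.Module):
        # Standard transformer feed-forward block. Under the key-value memory
        # view, the first linear layer holds the keys and the second the values.
        def __init__(self, d_model: int, d_ff: int):
            super().__init__()
            self.keys = nn.Linear(d_model, d_ff)    # 1st layer: memory keys
            self.values = nn.Linear(d_ff, d_model)  # 2nd layer: memory values
            self.act = nn.ReLU()

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # coefficients from key matching, then a weighted sum over values
            return self.values(self.act(self.keys(x)))

    def trainable_params(ffn: FFN, target: str):
        # Freeze the whole block, then unfreeze only the chosen sub-layer.
        for p in ffn.parameters():
            p.requires_grad = False
        layer = ffn.keys if target == "keys" else ffn.values
        for p in layer.parameters():
            p.requires_grad = True
        return [p for p in ffn.parameters() if p.requires_grad]

    ffn = FFN(d_model=768, d_ff=3072)
    optimizer = torch.optim.AdamW(trainable_params(ffn, "keys"), lr=1e-4)  # or "values"

Under this reading, updating keys changes which input patterns activate each memory slot, while updating values changes what an activated slot writes back; this reflects the key-value memory interpretation, not necessarily the exact training recipe used in the paper.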
Related papers
- Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding [58.364933651703524]
We show that concentrated massive values consistently emerge in specific regions of attention queries.
These massive values play a critical role in interpreting contextual knowledge.
We trace the emergence of massive values and find that such concentration is caused by Rotary Position Embedding (RoPE).
arXiv Detail & Related papers (2025-02-03T17:47:03Z)
- Reversible Decoupling Network for Single Image Reflection Removal [15.763420129991255]
High-level semantic clues tend to be compressed or discarded during layer-by-layer propagation.
We propose a novel architecture called the Reversible Decoupling Network (RDNet).
RDNet employs a reversible encoder to secure valuable information while flexibly decoupling transmission- and reflection-relevant features during the forward pass.
arXiv Detail & Related papers (2024-10-10T15:58:27Z)
- How Large Language Models Encode Context Knowledge? A Layer-Wise Probing Study [27.23388511249688]
This paper investigates the layer-wise capability of large language models to encode knowledge.
We leverage the powerful generative capability of ChatGPT to construct probing datasets.
Experiments on conflicting and newly acquired knowledge show that LLMs prefer to encode more context knowledge in the upper layers.
arXiv Detail & Related papers (2024-02-25T11:15:42Z)
- A Study on ReLU and Softmax in Transformer [51.0740713922741]
The Transformer architecture consists of self-attention and feed-forward networks (FFNs) which can be viewed as key-value memories.
We first rebuild the connections between FFN and key-value memory by conducting extensive studies on ReLU and Softmax.
In addition, ReLU outperforms Softmax on both FFN and key-value memory when the number of value slots is large.
arXiv Detail & Related papers (2023-02-13T15:41:20Z)
- Technical Report: Combining knowledge from Transfer Learning during training and Wide Resnets [2.3859169601259342]
We combine the idea of Wide ResNets and transfer learning to optimize the architecture of deep neural networks.
The first improvement of the architecture is the use of all layers as an information source for the last layer.
The second improvement is the use of deeper layers instead of deeper sequences of blocks.
arXiv Detail & Related papers (2022-06-20T10:40:59Z)
- Kformer: Knowledge Injection in Transformer Feed-Forward Layers [107.71576133833148]
We propose a novel knowledge fusion model, namely Kformer, which incorporates external knowledge through the feed-forward layer in Transformer.
We empirically find that simply injecting knowledge into the FFN can enhance the pre-trained language model's ability and facilitate current knowledge fusion methods.
arXiv Detail & Related papers (2022-01-15T03:00:27Z)
- HAT: Hierarchical Aggregation Transformers for Person Re-identification [87.02828084991062]
We take advantage of both CNNs and Transformers for image-based person Re-ID with high performance.
This work is the first to combine the strengths of CNNs and Transformers for image-based person Re-ID.
arXiv Detail & Related papers (2021-07-13T09:34:54Z)
- Train your classifier first: Cascade Neural Networks Training from upper layers to lower layers [54.47911829539919]
We develop a novel top-down training method which can be viewed as an algorithm for searching for high-quality classifiers.
We tested this method on automatic speech recognition (ASR) tasks and language modelling tasks.
The proposed method consistently improves recurrent neural network ASR models on Wall Street Journal, self-attention ASR models on Switchboard, and AWD-LSTM language models on WikiText-2.
arXiv Detail & Related papers (2021-02-09T08:19:49Z)
- Transformer Feed-Forward Layers Are Key-Value Memories [49.52087581977751]
We show that feed-forward layers in transformer-based language models operate as key-value memories.
We show that the learned patterns are human-interpretable, and that lower layers tend to capture shallow patterns, while upper layers learn more semantic ones.
arXiv Detail & Related papers (2020-12-29T19:12:05Z)
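For reference, the key-value memory view used in the entry above (and adopted in the main paper) treats a two-layer FFN, with biases omitted, roughly as $\mathrm{FFN}(x) = f(x K^{\top})\, V$, where the rows of $K$ (the first-layer weights) act as keys matched against the input $x$, $f$ is the activation (e.g. ReLU or GELU), and the rows of $V$ (the second-layer weights) are the values combined according to the resulting coefficients.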
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.