Incorporating Residual and Normalization Layers into Analysis of Masked
Language Models
- URL: http://arxiv.org/abs/2109.07152v1
- Date: Wed, 15 Sep 2021 08:32:20 GMT
- Authors: Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, Kentaro Inui
- Abstract summary: We extend the scope of the analysis of Transformers from solely the attention patterns to the whole attention block.
Our analysis of Transformer-based masked language models shows that the token-to-token interaction performed via attention has less impact on the intermediate representations than previously assumed.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer architecture has become ubiquitous in the natural language
processing field. To interpret the Transformer-based models, their attention
patterns have been extensively analyzed. However, the Transformer architecture
is not only composed of the multi-head attention; other components can also
contribute to Transformers' progressive performance. In this study, we extended
the scope of the analysis of Transformers from solely the attention patterns to
the whole attention block, i.e., multi-head attention, residual connection, and
layer normalization. Our analysis of Transformer-based masked language models
shows that the token-to-token interaction performed via attention has less
impact on the intermediate representations than previously assumed. These
results provide new intuitive explanations of existing reports; for example,
discarding the learned attention patterns tends not to adversely affect
performance. The code for our experiments is publicly available.
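The attention block analyzed here combines three components: multi-head attention, a residual connection, and layer normalization (in the post-LN arrangement used by BERT-style masked language models). A minimal NumPy sketch of such a block, shown below, makes it easy to compare the magnitude of what attention mixes in against the identity (residual) path — the kind of comparison that motivates the paper's finding that token-to-token mixing contributes less to the intermediate representations than often assumed. All dimensions, weights, and the single-head simplification are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def self_attention(x, Wq, Wk, Wv):
    # Plain scaled dot-product self-attention (one head, no masking),
    # a simplification of the multi-head case.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    a = np.exp(scores)
    a /= a.sum(axis=-1, keepdims=True)            # attention weights
    return a @ v

d, n = 16, 8                     # hidden size and sequence length (toy values)
x = rng.normal(size=(n, d))      # stand-in for intermediate representations
Wq, Wk, Wv = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))

attn_out = self_attention(x, Wq, Wk, Wv)
block_out = layer_norm(x + attn_out)   # post-LN residual attention block

# Per-token magnitude of the attention contribution vs. the residual path.
attn_norm = np.linalg.norm(attn_out, axis=-1).mean()
resid_norm = np.linalg.norm(x, axis=-1).mean()
print(f"mean ||attention output|| = {attn_norm:.3f}")
print(f"mean ||residual input||   = {resid_norm:.3f}")
```

With random weights the residual path typically dominates in norm; the paper's contribution is to measure this decomposition systematically in trained masked language models rather than on toy inputs.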