Mask Attention Networks: Rethinking and Strengthen Transformer
- URL: http://arxiv.org/abs/2103.13597v1
- Date: Thu, 25 Mar 2021 04:07:44 GMT
- Title: Mask Attention Networks: Rethinking and Strengthen Transformer
- Authors: Zhihao Fan, Yeyun Gong, Dayiheng Liu, Zhongyu Wei, Siyuan Wang, Jian
Jiao, Nan Duan, Ruofei Zhang, Xuanjing Huang
- Abstract summary: The Transformer is an attention-based neural network consisting of two sublayers, the Self-Attention Network (SAN) and the Feed-Forward Network (FFN).
- Score: 70.95528238937861
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Transformer is an attention-based neural network that consists of two
sublayers, namely the Self-Attention Network (SAN) and the Feed-Forward Network (FFN).
Existing research has explored enhancing the two sublayers separately to improve
the Transformer's capability for text representation. In this paper, we
present a novel understanding of SAN and FFN as Mask Attention Networks (MANs)
and show that they are two special cases of MANs with static mask matrices.
However, their static mask matrices limit the capability for localness modeling
in text representation learning. We therefore introduce a new layer named
dynamic mask attention network (DMAN) with a learnable mask matrix which is
able to model localness adaptively. To incorporate advantages of DMAN, SAN, and
FFN, we propose a sequential layered structure to combine the three types of
layers. Extensive experiments on various tasks, including neural machine
translation and text summarization, demonstrate that our model outperforms the
original Transformer.
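To make the MAN view in the abstract concrete, the sketch below renders SAN and FFN as mask attention with static masks (an all-ones matrix and an identity matrix, respectively) and adds a dynamic mask attention sublayer whose mask is learned. It is a minimal illustration: the unified formula is simplified, and the DMAN parametrization used here (one learnable logit per relative distance, squashed through a sigmoid) is an assumption for illustration rather than the paper's exact implementation.

```python
# Illustrative sketch of the Mask Attention Network (MAN) view from the
# abstract. The static masks for SAN and FFN follow the paper's framing;
# the DMAN mask parametrization below is an assumption, not the paper's
# exact formulation.
import torch
import torch.nn as nn


def mask_attention(q, k, v, mask):
    """Generic MAN sublayer: attention weights modulated by a mask matrix.

    q, k, v: (batch, seq_len, d_model); mask: (seq_len, seq_len) with values
    in [0, 1] that rescale the attention distribution before renormalization.
    """
    d = q.size(-1)
    scores = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    weights = scores * mask                                    # element-wise modulation
    weights = weights / weights.sum(-1, keepdim=True).clamp(min=1e-9)
    return weights @ v


seq_len = 8
san_mask = torch.ones(seq_len, seq_len)  # SAN: static all-ones mask (global attention)
ffn_mask = torch.eye(seq_len)            # FFN: static identity mask (each token attends only to itself)


class DynamicMaskAttention(nn.Module):
    """Hypothetical DMAN sublayer: the mask depends on relative distance and is
    learned, so nearby tokens can be emphasized adaptively (localness modeling)."""

    def __init__(self, d_model, max_len=512):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # one learnable logit per relative distance (illustrative assumption)
        self.dist_logits = nn.Parameter(torch.zeros(2 * max_len - 1))
        self.max_len = max_len

    def forward(self, x):
        n = x.size(1)
        idx = torch.arange(n, device=x.device)
        rel = idx[None, :] - idx[:, None] + self.max_len - 1   # (n, n) relative-distance indices
        mask = torch.sigmoid(self.dist_logits[rel])            # learnable mask values in (0, 1)
        return mask_attention(self.q_proj(x), self.k_proj(x), self.v_proj(x), mask)
```

Under the sequential layered structure described in the abstract, a Transformer block would then stack such a DMAN sublayer together with the standard SAN and FFN sublayers (plus the usual residual connections and layer normalization).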
Related papers
- Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning [116.75939193785143]
Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones.
In 3D point cloud pretraining with ViTs, masked autoencoder (MAE) modeling remains dominant.
arXiv Detail & Related papers (2024-07-08T12:28:56Z)
- Hyper-Transformer for Amodal Completion [82.4118011026855]
Amodal object completion is a complex task that involves predicting the invisible parts of an object based on visible segments and background information.
We introduce a novel framework named the Hyper-Transformer Amodal Network (H-TAN).
This framework utilizes a hyper transformer equipped with a dynamic convolution head to directly learn shape priors and accurately predict amodal masks.
arXiv Detail & Related papers (2024-05-30T11:11:54Z)
- MoMask: Generative Masked Modeling of 3D Human Motions [25.168781728071046]
MoMask is a novel framework for text-driven 3D human motion generation.
A hierarchical quantization scheme is employed to represent human motion as discrete motion tokens.
MoMask outperforms state-of-the-art methods on the text-to-motion generation task.
arXiv Detail & Related papers (2023-11-29T19:04:10Z)
- Toward a Deeper Understanding: RetNet Viewed through Convolution [25.8904146140577]
The Vision Transformer (ViT) can learn global dependencies better than a CNN, yet a CNN's inherent locality can substitute for expensive training resources.
This paper investigates the effectiveness of RetNet from a CNN perspective and presents a variant of RetNet tailored to the visual domain.
We propose a novel Gaussian mixture mask (GMM) in which each mask has only two learnable parameters, and it can be conveniently used in any ViT variant whose attention mechanism allows the use of masks.
arXiv Detail & Related papers (2023-09-11T10:54:22Z)
- Masked Autoencoding for Scalable and Generalizable Decision Making [93.84855114717062]
MaskDP is a simple and scalable self-supervised pretraining method for reinforcement learning and behavioral cloning.
We find that a MaskDP model gains the capability of zero-shot transfer to new BC tasks, such as single and multiple goal reaching.
arXiv Detail & Related papers (2022-11-23T07:04:41Z)
- Parameter-Efficient Masking Networks [61.43995077575439]
Advanced network designs often contain a large number of repetitive structures (e.g., the Transformer).
In this study, we are the first to investigate the representative potential of fixed random weights with limited unique values by learning masks.
This leads to a new paradigm for model compression that reduces model size.
arXiv Detail & Related papers (2022-10-13T03:39:03Z)
- AFNet-M: Adaptive Fusion Network with Masks for 2D+3D Facial Expression Recognition [1.8604727699812171]
2D+3D facial expression recognition (FER) can effectively cope with illumination changes and pose variations.
Most deep learning-based approaches employ a simple fusion strategy.
We propose the adaptive fusion network with masks (AFNet-M) for 2D+3D FER.
arXiv Detail & Related papers (2022-05-24T04:56:55Z)
- UFO: A UniFied TransfOrmer for Vision-Language Representation Learning [54.82482779792115]
We propose a single UniFied transfOrmer (UFO) capable of processing either unimodal inputs (e.g., image or language) or multimodal inputs (e.g., the concatenation of the image and the question) for vision-language (VL) representation learning.
Existing approaches typically design an individual network for each modality and/or a specific fusion network for multimodal tasks.
arXiv Detail & Related papers (2021-11-19T03:23:10Z)