StableMask: Refining Causal Masking in Decoder-only Transformer
- URL: http://arxiv.org/abs/2402.04779v1
- Date: Wed, 7 Feb 2024 12:01:02 GMT
- Title: StableMask: Refining Causal Masking in Decoder-only Transformer
- Authors: Qingyu Yin, Xuzheng He, Xiang Zhuang, Yu Zhao, Jianhua Yao, Xiaoyu
Shen, Qiang Zhang
- Abstract summary: The decoder-only Transformer architecture with causal masking and relative position encoding (RPE) has become the de facto choice in language modeling.
However, it requires all attention scores to be non-zero and sum up to 1, even if the current embedding has sufficient self-contained information.
We propose StableMask: a parameter-free method to address both limitations by refining the causal mask.
- Score: 22.75632485195928
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The decoder-only Transformer architecture with causal masking and relative
position encoding (RPE) has become the de facto choice in language modeling.
Despite its exceptional performance across various tasks, we have identified
two limitations: First, it requires all attention scores to be non-zero and sum
up to 1, even if the current embedding has sufficient self-contained
information. This compels the model to assign disproportionately excessive
attention to specific tokens. Second, RPE-based Transformers are not universal
approximators due to their limited capacity at encoding absolute positional
information, which limits their application in position-critical tasks. In this
work, we propose StableMask: a parameter-free method to address both
limitations by refining the causal mask. It introduces pseudo-attention values
to balance attention distributions and encodes absolute positional information
via a progressively decreasing mask ratio. StableMask's effectiveness is
validated both theoretically and empirically, showing significant enhancements
in language models with parameter sizes ranging from 71M to 1.4B across diverse
datasets and encoding methods. We further show that it naturally supports (1)
efficient extrapolation without special tricks such as StreamingLLM and (2)
easy integration with existing attention optimization techniques.
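The abstract names two ingredients without giving the exact formulation, so the following is only a minimal, hedged sketch of how they could fit together in single-head causal attention: finite pseudo scores replace the usual -inf entries in the masked (future) region, the softmax is taken over the full row, and the pseudo-attention mass is then discarded so the retained weights need not sum to 1; letting the pseudo scores shrink with the query index makes that deficit position-dependent, which is one way to encode absolute position. The function name `stablemask_like_attention` and the constants `pseudo_base` and `decay` are illustrative assumptions, not the paper's actual parameterization.

```python
# Minimal sketch of the idea described in the abstract, NOT the authors'
# reference implementation. Constants and names are illustrative.
import torch
import torch.nn.functional as F


def stablemask_like_attention(q, k, v, pseudo_base=0.0, decay=0.05):
    """Single-head causal attention with pseudo scores in the masked region.

    q, k, v: (seq_len, d) tensors. Returns a (seq_len, d) tensor.
    """
    seq_len, d = q.shape
    scores = (q @ k.T) / d ** 0.5                      # (seq_len, seq_len)

    # Standard causal structure: query i may only use keys 0..i.
    future = torch.ones(seq_len, seq_len).triu(1).bool()

    # Instead of -inf, give masked positions finite pseudo scores that
    # decrease with the query index, so the softmax leaks a
    # position-dependent amount of probability mass into the masked region.
    rows = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    pseudo = pseudo_base - decay * rows                # (seq_len, 1)
    masked_scores = torch.where(future, pseudo.expand(seq_len, seq_len), scores)

    probs = F.softmax(masked_scores, dim=-1)

    # Discard the pseudo-attention mass: the retained weights over real
    # tokens now sum to at most 1, and the deficit varies with position.
    probs = probs.masked_fill(future, 0.0)
    return probs @ v


if __name__ == "__main__":
    torch.manual_seed(0)
    q = k = v = torch.randn(6, 8)
    out = stablemask_like_attention(q, k, v)
    print(out.shape)  # torch.Size([6, 8])
```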
Related papers
- AM-SAM: Automated Prompting and Mask Calibration for Segment Anything Model [28.343378406337077]
We propose an automated prompting and mask calibration method called AM-SAM.
Our approach automatically generates prompts for an input image, eliminating the need for human involvement while achieving good performance in early training epochs.
Our experimental results demonstrate that AM-SAM achieves highly accurate segmentation, matching or exceeding the effectiveness of human-generated and default prompts.
arXiv Detail & Related papers (2024-10-13T03:47:20Z)
- Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning [116.75939193785143]
Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones.
In 3D point cloud pretraining with ViTs, masked autoencoder (MAE) modeling remains dominant.
arXiv Detail & Related papers (2024-07-08T12:28:56Z)
- Athena: Efficient Block-Wise Post-Training Quantization for Large Language Models Using Second-Order Matrix Derivative Information [5.756323337411276]
Large Language Models (LLMs) have advanced natural language processing tasks such as machine translation, text generation, and sentiment analysis.
Their large size, often consisting of billions of parameters, poses challenges for storage, computation, and deployment.
We propose Athena, a novel algorithm for efficient block-wise post-training quantization of LLMs.
arXiv Detail & Related papers (2024-05-24T03:14:29Z)
- Progressive Token Length Scaling in Transformer Encoders for Efficient Universal Segmentation [67.85309547416155]
A powerful architecture for universal segmentation relies on transformers that encode multi-scale image features and decode object queries into mask predictions.
Mask2Former spends 50% of its compute on the transformer encoder alone.
This is due to the retention of a full-length token-level representation of all backbone feature scales at each encoder layer.
We propose PRO-SCALE to reduce computations by a large margin with minimal sacrifice in performance.
arXiv Detail & Related papers (2024-04-23T01:34:20Z)
- Towards Compact 3D Representations via Point Feature Enhancement Masked Autoencoders [52.66195794216989]
We propose Point Feature Enhancement Masked Autoencoders (Point-FEMAE) to learn compact 3D representations.
Point-FEMAE consists of a global branch and a local branch to capture latent semantic features.
Our method significantly improves the pre-training efficiency compared to cross-modal alternatives.
arXiv Detail & Related papers (2023-12-17T14:17:05Z)
- Efficient Masked Autoencoders with Self-Consistency [34.7076436760695]
Masked image modeling (MIM) has been recognized as a strong self-supervised pre-training method in computer vision.
We propose efficient masked autoencoders with self-consistency (EMAE) to improve the pre-training efficiency.
EMAE consistently obtains state-of-the-art transfer ability on a variety of downstream tasks, such as image classification, object detection, and semantic segmentation.
arXiv Detail & Related papers (2023-02-28T09:21:12Z)
- Word Order Matters when you Increase Masking [70.29624135819884]
We study the effect of removing position encodings on the pre-training objective itself, to test whether models can reconstruct position information from co-occurrences alone.
We find that the necessity of position information increases with the amount of masking, and that masked language models without position encodings are not able to reconstruct this information on the task.
arXiv Detail & Related papers (2022-11-08T18:14:04Z)
- DecBERT: Enhancing the Language Understanding of BERT with Causal Attention Masks [33.558503823505056]
In this work, we focus on improving the position encoding ability of BERT with the causal attention masks.
We propose a new pre-trained language model DecBERT and evaluate it on the GLUE benchmark.
Experimental results show that (1) the causal attention mask is effective for BERT on language understanding tasks; (2) our DecBERT model without position embeddings achieves comparable performance on the GLUE benchmark; and (3) our modification accelerates the pre-training process and DecBERT achieves better overall performance than the baseline systems.
arXiv Detail & Related papers (2022-04-19T06:12:48Z)
- Masked Autoencoders for Point Cloud Self-supervised Learning [27.894216954216716]
We propose a neat scheme of masked autoencoders for point cloud self-supervised learning.
We divide the input point cloud into irregular point patches and randomly mask them at a high ratio.
A standard Transformer based autoencoder, with an asymmetric design and a shifting mask tokens operation, learns high-level latent features from unmasked point patches.
arXiv Detail & Related papers (2022-03-13T09:23:39Z)
- Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels.
Coupling an asymmetric encoder-decoder design with a high masking ratio enables us to train large models efficiently and effectively.
arXiv Detail & Related papers (2021-11-11T18:46:40Z)
- Image Inpainting by End-to-End Cascaded Refinement with Mask Awareness [66.55719330810547]
Inpainting arbitrary missing regions is challenging because learning valid features for various masked regions is nontrivial.
We propose a novel mask-aware inpainting solution that learns multi-scale features for missing regions in the encoding phase.
Our framework is validated both quantitatively and qualitatively via extensive experiments on three public datasets.
arXiv Detail & Related papers (2021-04-28T13:17:47Z)