Segment-Based Attention Masking for GPTs
- URL: http://arxiv.org/abs/2412.18487v1
- Date: Tue, 24 Dec 2024 15:18:52 GMT
- Title: Segment-Based Attention Masking for GPTs
- Authors: Shahar Katz, Liran Ringel, Yaniv Romano, Lior Wolf
- Abstract summary: In GPTs, causal masking is applied to all input tokens step-by-step, mimicking the generation process.
In this work, attention is instead masked according to the known block structure during the prefill phase.
When integrated into models such as Llama and Qwen, the scheme consistently achieves state-of-the-art performance.
- Score: 57.69161357477644
- Abstract: Modern Language Models (LMs) owe much of their success to masked causal attention, the backbone of Generative Pre-Trained Transformer (GPT) models. Although GPTs can process the entire user prompt at once, the causal masking is applied to all input tokens step-by-step, mimicking the generation process. This imposes an unnecessary constraint during the initial "prefill" phase when the model processes the input prompt and generates the internal representations before producing any output tokens. In this work, attention is masked based on the known block structure at the prefill phase, followed by the conventional token-by-token autoregressive process after that. For example, in a typical chat prompt, the system prompt is treated as one block, and the user prompt as the next one. Each of these is treated as a unit for the purpose of masking, such that the first tokens in each block can access the subsequent tokens in a non-causal manner. Then, the model answer is generated in the conventional causal manner. This Segment-by-Segment scheme entails no additional computational overhead. When integrating it into models such as Llama and Qwen, state-of-the-art performance is consistently achieved.
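To make the prefill masking concrete, below is a minimal sketch, not the authors' implementation: the `segment_prefill_mask` helper, the integer segment-id encoding, and the "True means may attend" convention are illustrative assumptions. It builds a mask in which tokens attend bidirectionally within their own block (e.g. system prompt, user prompt) and causally across blocks, alongside the conventional causal mask used once generation starts.

```python
# Minimal sketch (not the paper's code) of a segment-based prefill mask:
# non-causal attention inside each block, block-causal attention across blocks,
# then the usual lower-triangular causal mask for generation.
import torch

def segment_prefill_mask(segment_ids: torch.Tensor) -> torch.Tensor:
    """segment_ids: (seq_len,) non-decreasing block id per prompt token,
    e.g. [0, 0, 0, 1, 1] for a 3-token system prompt and a 2-token user prompt.
    Returns a (seq_len, seq_len) boolean mask: entry [i, j] is True when
    query token i may attend to key token j (convention assumed here)."""
    q = segment_ids.unsqueeze(1)  # (seq_len, 1)
    k = segment_ids.unsqueeze(0)  # (1, seq_len)
    # A token sees every token in its own block and in all earlier blocks,
    # so the first tokens of a block can access its later tokens non-causally.
    return k <= q

def causal_mask(seq_len: int) -> torch.Tensor:
    """Conventional causal mask used once token-by-token generation starts."""
    return torch.ones(seq_len, seq_len).tril().bool()

if __name__ == "__main__":
    seg = torch.tensor([0, 0, 0, 1, 1])      # system prompt block, then user prompt block
    print(segment_prefill_mask(seg).int())   # full attention inside each block
    print(causal_mask(5).int())              # strictly causal, for comparison
```

During prefill this mask simply replaces the causal one; only the mask pattern changes, not the amount of attention computed, which is consistent with the abstract's claim of no additional computational overhead.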
Related papers
- SEP: Self-Enhanced Prompt Tuning for Visual-Language Model [93.94454894142413]
We introduce a novel approach named Self-Enhanced Prompt Tuning (SEP).
SEP explicitly incorporates discriminative prior knowledge to enhance both textual-level and visual-level embeddings.
Comprehensive evaluations across various benchmarks and tasks confirm SEP's efficacy in prompt tuning.
arXiv Detail & Related papers (2024-05-24T13:35:56Z)
- Token Alignment via Character Matching for Subword Completion [34.76794239097628]
This paper examines a technique to alleviate tokenization artifacts in text completion with generative models.
The method, termed token alignment, involves backtracking to the last complete tokens and ensuring the model's generation aligns with the prompt.
arXiv Detail & Related papers (2024-03-13T16:44:39Z)
- Tokenize Anything via Prompting [65.93061853439512]
We present a unified, promptable model capable of simultaneously segmenting, recognizing, and captioning anything.
We train a generalizable model with massive segmentation masks, e.g., SA-1B masks, and semantic priors from a pre-trained CLIP model with 5 billion parameters.
We believe this model can be a versatile region-level image tokenizer, capable of encoding general-purpose region context.
arXiv Detail & Related papers (2023-12-14T17:01:02Z)
- Segment (Almost) Nothing: Prompt-Agnostic Adversarial Attacks on Segmentation Models [61.46999584579775]
General purpose segmentation models are able to generate (semantic) segmentation masks from a variety of prompts.
In particular, input images are pre-processed by an image encoder to obtain embedding vectors which are later used for mask predictions.
We show that even imperceptible perturbations of radius $\epsilon=1/255$ are often sufficient to drastically modify the masks predicted with point, box and text prompts.
arXiv Detail & Related papers (2023-11-24T12:57:34Z)
- Position-based Prompting for Health Outcome Generation [0.0]
We explore an idea of using a position-attention mechanism to capture positional information of each word in a prompt relative to the mask to be filled.
Our approach consistently outperforms a baseline in which the default masked language model (MLM) representation is used to predict masked tokens.
arXiv Detail & Related papers (2022-03-30T16:44:04Z)
- Fast End-to-End Speech Recognition via a Non-Autoregressive Model and Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once).
The model consists of an encoder, a decoder, and a position-dependent summarizer (PDS).
arXiv Detail & Related papers (2021-02-15T15:18:59Z)
- Position Masking for Language Models [0.0]
Masked language modeling (MLM) pre-training models such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens.
We propose to expand upon this idea by masking the positions of some tokens along with the masked input token ids.
arXiv Detail & Related papers (2020-06-02T23:40:41Z)
- UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training [152.63467944568094]
We propose to pre-train a unified language model for both autoencoding and partially autoregressive language modeling tasks.
Our experiments show that the unified language models pre-trained using PMLM achieve new state-of-the-art results on a wide range of natural language understanding and generation tasks.
arXiv Detail & Related papers (2020-02-28T15:28:49Z)