FCM: Forgetful Causal Masking Makes Causal Language Models Better
  Zero-Shot Learners
- URL: http://arxiv.org/abs/2210.13432v1
- Date: Mon, 24 Oct 2022 17:46:57 GMT
- Title: FCM: Forgetful Causal Masking Makes Causal Language Models Better
  Zero-Shot Learners
- Authors: Hao Liu, Xinyang Geng, Lisa Lee, Igor Mordatch, Sergey Levine, Sharan
  Narang, Pieter Abbeel
- Abstract summary: We propose a simple technique that significantly boosts the performance of large language models without adding computational cost.
Our key observation is that, by performing the next token prediction task with randomly selected past tokens masked out, we can improve the quality of the learned representations.
 Experimental results show that our method also improves PaLM's zero and few-shot performance on a diverse suite of tasks.
- Score: 139.6321017962092
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) trained using the next-token-prediction
objective, such as GPT-3 and PaLM, have revolutionized natural language
processing in recent years by showing impressive zero-shot and few-shot
capabilities across a wide range of tasks. In this work, we propose a simple
technique that significantly boosts the performance of LLMs without adding
computational cost. Our key observation is that, by performing the next token
prediction task with randomly selected past tokens masked out, we can improve
the quality of the learned representations for downstream language
understanding tasks. We hypothesize that randomly masking past tokens prevents
over-attending to recent tokens and encourages attention to tokens in the
distant past. By randomly masking input tokens in the PaLM model, we show that
we can significantly improve 1B and 8B PaLM's zero-shot performance on the
SuperGLUE benchmark from 55.7 to 59.2 and from 61.6 to 64.0, respectively. Our
largest 8B model matches the score of PaLM with an average score of 64, despite
the fact that PaLM is trained on a much larger dataset (780B tokens) of
high-quality conversation and webpage data, while ours is trained on the
smaller C4 dataset (180B tokens). Experimental results show that our method
also improves PaLM's zero and few-shot performance on a diverse suite of tasks,
including commonsense reasoning, natural language inference and cloze
completion. Moreover, we show that our technique also helps representation
learning, significantly improving PaLM's finetuning results.
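
The masking idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `forgetful_causal_mask`, the `max_mask_ratio` default of 0.15, the per-sequence uniform sampling of the ratio, and the choice to always let a position attend to itself are all assumptions made for illustration. The sketch builds a causal attention mask in which a random subset of past positions is hidden from every later query.

```python
import random


def forgetful_causal_mask(seq_len, max_mask_ratio=0.15, rng=None):
    """Return a seq_len x seq_len boolean matrix.

    mask[i][j] is True when query position i may attend to key position j.
    Hypothetical helper: the sampling scheme below is an assumption.
    """
    if rng is None:
        rng = random.Random()
    # Assumption: one mask ratio per sequence, drawn uniformly from [0, max_mask_ratio].
    ratio = rng.uniform(0.0, max_mask_ratio)
    # Assumption: each position is independently "forgotten" with probability `ratio`.
    forgotten = [rng.random() < ratio for _ in range(seq_len)]
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            causal = j <= i  # never attend to future tokens
            # A position always sees itself, so no row of the mask is empty.
            visible = (j == i) or not forgotten[j]
            row.append(causal and visible)
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Print the mask for a short sequence: 1 = attend, 0 = blocked.
    for row in forgetful_causal_mask(6, rng=random.Random(0)):
        print("".join("1" if allowed else "0" for allowed in row))
```

With `max_mask_ratio=0.0` the function reduces to an ordinary causal mask, which is one way to see that the technique changes only which past tokens are visible, not the next-token-prediction objective itself.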
 
      
Related papers
- Not all tokens are created equal: Perplexity Attention Weighted Networks for AI generated text detection [49.15148871877941]
 Next-token distribution outputs offer a theoretically appealing approach for detection of text generated by large language models (LLMs).
We propose the Perplexity Attention Weighted Network (PAWN), which uses the last hidden states of the LLM and positions to weight the sum of a series of features based on metrics from the next-token distribution across the sequence length.
PAWN shows competitive and even better performance in-distribution than the strongest baselines with a fraction of their trainable parameters.
 arXiv  Detail & Related papers  (2025-01-07T17:00:49Z)
- Matryoshka Query Transformer for Large Vision-Language Models [103.84600181927884]
 We introduce the Matryoshka Query Transformer (MQT), capable of encoding an image into m visual tokens during inference.
We train a single model once, and flexibly and drastically reduce the number of inference-time visual tokens.
Our model, MQT-LLAVA, matches LLaVA-1.5 performance across 11 benchmarks using a maximum of 256 tokens instead of LLaVA's fixed 576.
 arXiv  Detail & Related papers  (2024-05-29T17:39:42Z)
- Understanding the Role of Input Token Characters in Language Models: How
  Does Information Loss Affect Performance? [45.53600782873268]
 We study how information loss in input token characters affects the performance of pre-training language models.
Surprisingly, we find that pre-training even under extreme settings, i.e. using only one character of each token, the performance retention in standard NLU benchmarks and probing tasks is high.
For instance, a model pre-trained only on single first characters from tokens achieves performance retention of approximately 90% and 77% of the full-token model in SuperGLUE and GLUE tasks, respectively.
 arXiv  Detail & Related papers  (2023-10-26T09:47:50Z)
- Masked and Permuted Implicit Context Learning for Scene Text Recognition [8.742571493814326]
 Scene Text Recognition (STR) is difficult because of variations in text styles, shapes, and backgrounds.
We propose a masked and permuted implicit context learning network for STR, within a single decoder.
 arXiv  Detail & Related papers  (2023-05-25T15:31:02Z)
- Alleviating Over-smoothing for Unsupervised Sentence Representation [96.19497378628594]
 We present a simple method named Self-Contrastive Learning (SSCL) to alleviate the over-smoothing issue.
Our proposed method is quite simple and can be easily extended to various state-of-the-art models for performance boosting.
 arXiv  Detail & Related papers  (2023-05-09T11:00:02Z)
- PaLM: Scaling Language Modeling with Pathways [180.69584031908113]
 We trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model (PaLM).
We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods.
We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks.
 arXiv  Detail & Related papers  (2022-04-05T16:11:45Z)
- Protum: A New Method For Prompt Tuning Based on "[MASK]" [12.057434751507552]
 We propose a new Prompt Tuning based on "[MASK]" (Protum) method in this paper.
Our Protum can achieve much better performance than fine-tuning after continuous pre-training with less time consumption.
 arXiv  Detail & Related papers  (2022-01-28T13:34:30Z)
- Frustratingly Simple Pretraining Alternatives to Masked Language
  Modeling [10.732163031244651]
 Masked language modeling (MLM) is widely used in natural language processing for learning text representations.
In this paper, we explore five simple pretraining objectives based on token-level classification tasks as replacements of MLM.
 arXiv  Detail & Related papers  (2021-09-04T08:52:37Z)
- MST: Masked Self-Supervised Transformer for Visual Representation [52.099722121603506]
 Transformer has been widely used for self-supervised pre-training in Natural Language Processing (NLP).
We present a novel Masked Self-supervised Transformer approach named MST, which can explicitly capture the local context of an image.
MST achieves Top-1 accuracy of 76.9% with DeiT-S only using 300-epoch pre-training by linear evaluation.
 arXiv  Detail & Related papers  (2021-06-10T11:05:18Z)
- COCO-LM: Correcting and Contrasting Text Sequences for Language Model
  Pretraining [59.169836983883656]
 COCO-LM is a new self-supervised learning framework that pretrains Language Models by COrrecting challenging errors and COntrasting text sequences.
 COCO-LM employs an auxiliary language model to mask-and-predict tokens in original text sequences.
Our analyses reveal that COCO-LM's advantages come from its challenging training signals, more contextualized token representations, and regularized sequence representations.
 arXiv  Detail & Related papers  (2021-02-16T22:24:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     