Masked Mixers for Language Generation and Retrieval
- URL: http://arxiv.org/abs/2409.01482v1
- Date: Mon, 2 Sep 2024 22:17:18 GMT
- Title: Masked Mixers for Language Generation and Retrieval
- Authors: Benjamin L. Badger
- Abstract summary: We observe poor input representation accuracy in transformers, but find more accurate representation in masked mixers.
Applied to TinyStories, the masked mixer learns causal language tasks more efficiently than early transformer implementations.
We introduce an efficient training approach for retrieval models based on existing generative model embeddings.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Attention mechanisms that confer selective focus on a strict subset of input elements are nearly ubiquitous in language models today. We posit there to be a downside to the use of attention: most information present in the input is necessarily lost. In support of this idea, we observe poor input representation accuracy in transformers, but find more accurate representation in what we term masked mixers, which replace self-attention with masked convolutions. Applied to TinyStories, the masked mixer learns causal language tasks more efficiently than early transformer implementations and somewhat less efficiently than optimized, current implementations. The most efficient learning algorithm observed for this dataset is a transformer-masked mixer hybrid, suggesting that these models learn in an orthogonal manner. We hypothesized that the information loss exhibited by transformers would be much more detrimental to retrieval than generation, and to test this we introduce an efficient training approach for retrieval models based on existing generative model embeddings. With this method, embeddings from masked mixers are found to result in far better summary-to-story retrieval compared to embeddings from transformers.
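To make the architectural change concrete, below is a minimal PyTorch sketch of a mixer block in which the self-attention sublayer is replaced by a masked inter-token convolution, as the abstract describes. The block layout, the kernel-size-1 convolution, the pre-norm residuals, the GELU feed-forward expansion, and all hyperparameters are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedMixerBlock(nn.Module):
    """Sketch of a mixer block: self-attention is replaced by a masked
    inter-token convolution; lower-triangular weights preserve causality."""

    def __init__(self, seq_len: int, d_model: int, expansion: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        # Token mixing: a kernel-size-1 convolution over the token dimension,
        # i.e. one learned (seq_len x seq_len) mixing matrix shared by all channels.
        self.token_conv = nn.Conv1d(seq_len, seq_len, kernel_size=1)
        # Causal mask: output token i may only draw on input tokens j <= i.
        self.register_buffer(
            "causal_mask", torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(-1)
        )
        self.norm2 = nn.LayerNorm(d_model)
        self.channel_mlp = nn.Sequential(
            nn.Linear(d_model, expansion * d_model),
            nn.GELU(),
            nn.Linear(expansion * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); Conv1d treats dim 1 (tokens) as channels.
        masked_weight = self.token_conv.weight * self.causal_mask  # zero out future positions
        mixed = F.conv1d(self.norm1(x), masked_weight, self.token_conv.bias)
        x = x + mixed                                # token-mixing residual
        x = x + self.channel_mlp(self.norm2(x))      # channel-mixing residual
        return x

# Usage on a dummy batch (shapes are illustrative only).
block = MaskedMixerBlock(seq_len=128, d_model=256)
out = block(torch.randn(2, 128, 256))  # -> (2, 128, 256)
```

The lower-triangular mask is what makes the mixer "masked": position i can only draw on positions up to i, so the block can be trained with the usual next-token objective.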
Related papers
- ConvMixFormer- A Resource-efficient Convolution Mixer for Transformer-based Dynamic Hand Gesture Recognition [5.311735227179715]
We explore and devise a novel ConvMixFormer architecture for dynamic hand gesture recognition.
The proposed method is evaluated on NVidia Dynamic Hand Gesture and Briareo datasets.
Our model has achieved state-of-the-art results on single and multimodal inputs.
arXiv Detail & Related papers (2024-11-11T16:45:18Z) - Differential Transformer [99.5117269150629]
Transformer tends to overallocate attention to irrelevant context.
We introduce Diff Transformer, which amplifies attention to relevant context while canceling noise.
It offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers.
arXiv Detail & Related papers (2024-10-07T17:57:38Z) - Fast Training of Diffusion Models with Masked Transformers [107.77340216247516]
We propose an efficient approach to train large diffusion models with masked transformers.
Specifically, we randomly mask out a high proportion of patches in diffused input images during training.
Experiments on ImageNet-256x256 and ImageNet-512x512 show that our approach achieves competitive and even better generative performance than the state-of-the-art Diffusion Transformer (DiT) model.
arXiv Detail & Related papers (2023-06-15T17:38:48Z) - AttMEMO: Accelerating Transformers with Memoization on Big Memory Systems [10.585040856070941]
We introduce a novel embedding technique that finds semantically similar inputs in order to identify computation similarity.
We enable 22% inference-latency reduction on average (up to 68%) with negligible loss in inference accuracy.
arXiv Detail & Related papers (2023-01-23T04:24:26Z) - AtMan: Understanding Transformer Predictions Through Memory Efficient Attention Manipulation [25.577132500246886]
We present AtMan, which provides explanations of generative transformer models at almost no extra cost.
AtMan is a modality-agnostic perturbation method that manipulates the attention mechanisms of transformers to produce relevance maps for the input.
Our experiments on text and image-text benchmarks demonstrate that AtMan outperforms current state-of-the-art gradient-based methods on several metrics.
arXiv Detail & Related papers (2023-01-19T15:01:00Z) - How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers [59.57128476584361]
We introduce PAPA, a new probing method that replaces the input-dependent attention matrices with constant ones; a minimal illustration of this substitution is sketched after this list.
We find that without any input-dependent attention, all models achieve competitive performance.
We show that better-performing models lose more from applying our method than weaker models, suggesting that the utilization of the input-dependent attention mechanism might be a factor in their success.
arXiv Detail & Related papers (2022-11-07T12:37:54Z) - Predicting Attention Sparsity in Transformers [0.9786690381850356]
We propose Sparsefinder, a model trained to identify the sparsity pattern of entmax attention before computing it.
Our work provides a new angle to study model efficiency by doing extensive analysis of the tradeoff between the sparsity and recall of the predicted attention graph.
arXiv Detail & Related papers (2021-09-24T20:51:21Z) - Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation.
A linear-complexity recurrent variant has proven well suited for autoregressive generation.
This work aims to convert a pretrained transformer into its efficient recurrent counterpart.
arXiv Detail & Related papers (2021-03-24T10:50:43Z) - Mixup-Transformer: Dynamic Data Augmentation for NLP Tasks [75.69896269357005]
Mixup is a data augmentation technique that linearly interpolates input examples and their corresponding labels; a minimal version of this interpolation is also sketched after this list.
In this paper, we explore how to apply mixup to natural language processing tasks.
We incorporate mixup into a transformer-based pre-trained architecture, named "mixup-transformer", for a wide range of NLP tasks.
arXiv Detail & Related papers (2020-10-05T23:37:30Z) - The Cascade Transformer: an Application for Efficient Answer Sentence Selection [116.09532365093659]
We introduce the Cascade Transformer, a technique to adapt transformer-based models into a cascade of rankers.
When compared to a state-of-the-art transformer model, our approach reduces computation by 37% with almost no impact on accuracy.
arXiv Detail & Related papers (2020-05-05T23:32:01Z)
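The PAPA entry above describes replacing input-dependent attention with constant matrices; the snippet below is a hypothetical, minimal rendering of that substitution. The function names, the uniform choice of constant matrix, and the single-head, unbatched shapes are assumptions for illustration, not the probe as implemented in that paper.

```python
import torch
import torch.nn.functional as F

def standard_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Ordinary scaled dot-product attention: mixing weights depend on the input."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

def constant_attention(v: torch.Tensor, const_attn: torch.Tensor) -> torch.Tensor:
    """PAPA-style probe: softmax(QK^T) is swapped for a fixed, input-independent
    matrix, e.g. a uniform matrix or one averaged over attention maps collected offline."""
    return const_attn @ v

seq_len, d = 8, 16
q, k, v = (torch.randn(seq_len, d) for _ in range(3))
uniform = torch.full((seq_len, seq_len), 1.0 / seq_len)  # simplest constant choice
out_input_dependent = standard_attention(q, k, v)
out_constant = constant_attention(v, uniform)            # "attention" no longer attends
```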
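Similarly, the Mixup-Transformer entry rests on mixup's linear interpolation of example pairs and their labels. A minimal version of that interpolation follows; the Beta-distributed mixing coefficient, the alpha value, and the use of embeddings rather than raw tokens are assumptions about the NLP adaptation, not details taken from that paper.

```python
import torch

def mixup(x1: torch.Tensor, y1: torch.Tensor,
          x2: torch.Tensor, y2: torch.Tensor, alpha: float = 0.2):
    """Linearly interpolate a pair of examples and their (soft) labels:
    x_mix = lam * x1 + (1 - lam) * x2, and likewise for y."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

# For text, the interpolation is applied to continuous representations
# (e.g. sentence embeddings), not raw token ids.
emb_a, emb_b = torch.randn(768), torch.randn(768)                  # two sentence embeddings
lab_a, lab_b = torch.tensor([1.0, 0.0]), torch.tensor([0.0, 1.0])  # one-hot labels
x_mix, y_mix = mixup(emb_a, lab_a, emb_b, lab_b)
```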