Zoology: Measuring and Improving Recall in Efficient Language Models
- URL: http://arxiv.org/abs/2312.04927v1
- Date: Fri, 8 Dec 2023 09:44:25 GMT
- Title: Zoology: Measuring and Improving Recall in Efficient Language Models
- Authors: Simran Arora, Sabri Eyuboglu, Aman Timalsina, Isys Johnson, Michael
Poli, James Zou, Atri Rudra, and Christopher Ré
- Abstract summary: We train a suite of 17 attention and "gated-convolution" language models.
We find that gated-convolution architectures still underperform attention by up to 2.1 perplexity points on the Pile.
We develop a new formalization of the task called multi-query associative recall (MQAR) that better reflects actual language.
- Score: 42.159338928861864
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Attention-free language models that combine gating and convolutions are
growing in popularity due to their efficiency and increasingly competitive
performance. To better understand these architectures, we pretrain a suite of
17 attention and "gated-convolution" language models, finding that SoTA
gated-convolution architectures still underperform attention by up to 2.1
perplexity points on the Pile. In fine-grained analysis, we find 82% of the gap
is explained by each model's ability to recall information previously
mentioned in-context, e.g. "Hakuna Matata means no worries Hakuna Matata it
means no" $\rightarrow$ "??". On this task, termed "associative recall", we
find that attention outperforms gated-convolutions by a large margin: a 70M
parameter attention model outperforms a 1.4 billion parameter gated-convolution
model on associative recall. This is surprising because prior work shows gated
convolutions can perfectly solve synthetic tests for AR capability. To close
the gap between synthetics and real language, we develop a new formalization of
the task called multi-query associative recall (MQAR) that better reflects
actual language. We perform an empirical and theoretical study of MQAR that
elucidates differences in the parameter-efficiency of attention and
gated-convolution recall. Informed by our analysis, we evaluate simple
convolution-attention hybrids and show that hybrids with input-dependent sparse
attention patterns can close 97.4% of the gap to attention, while maintaining
sub-quadratic scaling. Our code is accessible at:
https://github.com/HazyResearch/zoology.
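As a concrete illustration of the MQAR setup described above, here is a minimal, self-contained sketch of an MQAR-style example generator. The function name `make_mqar_example` and its parameters are illustrative only, not the API of the linked repository. Each sequence interleaves key-value pairs and then re-queries several of the keys, so the model must recall multiple associations per sequence rather than the single query of classic AR synthetics.

```python
import random

def make_mqar_example(num_pairs=4, num_queries=3, vocab_size=64, seed=0):
    """Build one multi-query associative recall (MQAR) sequence.

    The sequence interleaves key-value pairs, then re-asks several of
    the keys; the model must emit each queried key's value.
    """
    rng = random.Random(seed)
    keys = rng.sample(range(vocab_size), num_pairs)
    values = rng.sample(range(vocab_size, 2 * vocab_size), num_pairs)
    kv = dict(zip(keys, values))

    context = [tok for k in keys for tok in (k, kv[k])]  # k1 v1 k2 v2 ...
    queried = rng.sample(keys, num_queries)              # multiple queries

    inputs = context + queried
    targets = [kv[k] for k in queried]                   # expected answers
    return inputs, targets

inputs, targets = make_mqar_example()
print("inputs: ", inputs)
print("targets:", targets)
```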
Related papers
- Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding [74.31981011985681]
Large language models (LLMs) have shown impressive capabilities, but still struggle with complex reasoning tasks requiring multiple steps.
We introduce LaTent Reasoning Optimization (LaTRO), a principled framework that formulates reasoning as sampling from a latent distribution.
We validate LaTRO through experiments on GSM8K and ARC-Challenge datasets using multiple model architectures.
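As a rough sketch of the self-rewarding idea, the snippet below scores each sampled rationale by how much it raises the likelihood of the gold answer under the same model. `ToyLM`, `sample`, and `log_prob` are hypothetical stand-ins, not LaTRO's actual interface; this only illustrates the general shape of the objective.

```python
import random

class ToyLM:
    """Toy stand-in for a language model; real usage would wrap an LM."""
    def sample(self, prompt):
        # Pretend to sample a latent rationale for the prompt.
        return " reasoning path #%d." % random.randint(0, 9)
    def log_prob(self, continuation, prompt):
        # Toy score standing in for log p(continuation | prompt).
        return -1.0 / (1 + len(prompt))

def self_reward(model, question, answer, num_samples=4):
    """Reward each sampled rationale by log p(answer | question + rationale)."""
    scored = []
    for _ in range(num_samples):
        rationale = model.sample(prompt=question)
        reward = model.log_prob(answer, prompt=question + rationale)
        scored.append((rationale, reward))
    return scored  # usable as rewards in a policy-gradient-style update

for rationale, reward in self_reward(ToyLM(), "Q: 2+2? ", "4"):
    print(round(reward, 4), rationale)
```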
arXiv Detail & Related papers (2024-11-06T22:02:30Z)
- Honest AI: Fine-Tuning "Small" Language Models to Say "I Don't Know", and Reducing Hallucination in RAG [6.326488286636623]
Hallucination is a key roadblock for applications of Large Language Models (LLMs).
We propose Honest AI: a novel strategy to fine-tune "small" language models to say "I don't know" to reduce hallucination.
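A minimal sketch of how such fine-tuning data might be assembled (illustrative only, not the paper's pipeline; `build_honesty_dataset` is a hypothetical helper): answerable questions keep their answers, while unanswerable ones get an explicit refusal as the target.

```python
def build_honesty_dataset(qa_pairs, unanswerable_questions,
                          refusal="I don't know."):
    """Mix answerable QA pairs with explicit-refusal targets.

    Fine-tuning on the refusal examples teaches the model to abstain
    instead of hallucinating when it lacks grounding.
    """
    data = [{"prompt": q, "completion": a} for q, a in qa_pairs]
    data += [{"prompt": q, "completion": refusal} for q in unanswerable_questions]
    return data

dataset = build_honesty_dataset(
    qa_pairs=[("What is the capital of France?", "Paris")],
    unanswerable_questions=["What did I eat for lunch yesterday?"],
)
print(dataset)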
arXiv Detail & Related papers (2024-10-13T02:34:47Z)
- DARA: Decomposition-Alignment-Reasoning Autonomous Language Agent for Question Answering over Knowledge Graphs [70.54226917774933]
We propose the Decomposition-Alignment-Reasoning Agent (DARA) framework.
DARA effectively parses questions into formal queries through a dual mechanism.
We show that DARA attains performance comparable to state-of-the-art enumerating-and-ranking-based methods for KGQA.
arXiv Detail & Related papers (2024-06-11T09:09:37Z)
- Masked Thought: Simply Masking Partial Reasoning Steps Can Improve Mathematical Reasoning Learning of Language Models [102.72940700598055]
In reasoning tasks, even a minor error can cascade into inaccurate results.
We develop a method that avoids introducing external resources, relying instead on perturbations to the input.
Our training approach randomly masks certain tokens within the chain of thought, a technique we found to be particularly effective for reasoning tasks.
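A minimal sketch of this masking perturbation, assuming a whitespace-tokenized chain of thought; `mask_reasoning_tokens`, the mask token, and the masking rate are illustrative choices, not the paper's exact configuration.

```python
import random

MASK = "<mask>"

def mask_reasoning_tokens(cot_tokens, mask_prob=0.15, seed=0):
    """Randomly mask a fraction of chain-of-thought tokens.

    Training on partially masked reasoning steps acts as an input
    perturbation, discouraging over-reliance on any single token.
    """
    rng = random.Random(seed)
    return [MASK if rng.random() < mask_prob else tok for tok in cot_tokens]

cot = "first add 3 and 4 to get 7 then multiply by 2 to get 14".split()
print(" ".join(mask_reasoning_tokens(cot, mask_prob=0.3)))
```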
arXiv Detail & Related papers (2024-03-04T16:21:54Z)
- Simple linear attention language models balance the recall-throughput tradeoff [40.08746299497935]
We propose BASED, a simple architecture combining linear and sliding window attention.
We train language models up to 1.3b parameters and show that BASED matches the strongest sub-quadratic models in perplexity and outperforms them on real-world recall-intensive tasks by 6.22 accuracy points.
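As a rough sketch of the two components such an architecture combines, the snippet below implements causal linear attention with a running key-value state alongside exact softmax attention over a sliding window. The function names, the feature map, and the way the two outputs are mixed are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def linear_attention(q, k, v, feat=lambda x: np.maximum(x, 0) + 1e-6):
    """Causal linear attention: O(n) state via a running sum of k^T v."""
    qf, kf = feat(q), feat(k)                   # (n, d) feature-mapped
    n, d = q.shape
    out = np.zeros_like(v)
    state = np.zeros((d, v.shape[1]))           # running sum of k_f^T v
    norm = np.zeros(d)                          # running sum of k_f
    for t in range(n):
        state += np.outer(kf[t], v[t])
        norm += kf[t]
        out[t] = qf[t] @ state / (qf[t] @ norm)
    return out

def sliding_window_attention(q, k, v, window=4):
    """Exact softmax attention over the most recent `window` positions."""
    n = q.shape[0]
    out = np.zeros_like(v)
    for t in range(n):
        lo = max(0, t - window + 1)
        scores = q[t] @ k[lo:t + 1].T / np.sqrt(q.shape[1])
        w = np.exp(scores - scores.max())
        out[t] = (w / w.sum()) @ v[lo:t + 1]
    return out

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 16)) for _ in range(3))
# A hybrid in this spirit would mix both paths, e.g. by summing or gating.
y = linear_attention(q, k, v) + sliding_window_attention(q, k, v)
print(y.shape)  # (8, 16)
```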
arXiv Detail & Related papers (2024-02-28T19:28:27Z)
- CodeArt: Better Code Models by Attention Regularization When Symbols Are Lacking [12.458135956476639]
Transformer based code models have impressive performance in many software engineering tasks.
However, their effectiveness degrades when symbols are missing or not informative.
We propose a new method to pre-train general code models when symbols are lacking.
arXiv Detail & Related papers (2024-02-19T05:13:22Z)
- Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models [59.05769810380928]
Rephrase, Augment and Reason (RepARe) is a gradient-free framework that extracts salient details about the image using the underlying vision-language model.
We show that RepARe can yield a 3.85% (absolute) increase in zero-shot accuracy on VQAv2, and gains of 6.41 and 7.94 percentage points on A-OKVQA and VizWiz, respectively.
arXiv Detail & Related papers (2023-10-09T16:57:57Z)
- Towards Model-Size Agnostic, Compute-Free, Memorization-based Inference of Deep Learning [5.41530201129053]
This paper proposes a novel memorization-based inference (MBI) approach that is compute-free and only requires lookups.
Specifically, our work capitalizes on the inference mechanism of the recurrent attention model (RAM).
By leveraging the low dimensionality of glimpses, our inference procedure stores key-value pairs comprising glimpse location, patch vector, etc. in a table.
Computation is obviated during inference by reading key-value pairs out of the table, i.e., performing compute-free inference by memorization.
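A minimal sketch of the lookup idea (illustrative; the paper's glimpse pipeline is not reproduced, and `LookupInference` is a hypothetical class): store key-value entries offline, then answer queries by nearest-neighbor read-out instead of running the network.

```python
import numpy as np

class LookupInference:
    """Memorization-based inference: replace compute with table lookups.

    Keys could be low-dimensional glimpse descriptors (location + patch
    features); values are precomputed model outputs for those keys.
    """
    def __init__(self):
        self.keys, self.values = [], []

    def memorize(self, key, value):
        self.keys.append(np.asarray(key, dtype=float))
        self.values.append(value)

    def infer(self, query):
        # Nearest-neighbor read-out; no network computation at inference.
        dists = [np.linalg.norm(k - query) for k in self.keys]
        return self.values[int(np.argmin(dists))]

table = LookupInference()
table.memorize([0.1, 0.2], "class A")
table.memorize([0.9, 0.8], "class B")
print(table.infer(np.array([0.15, 0.25])))  # -> "class A"
```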
arXiv Detail & Related papers (2023-07-14T21:01:59Z)
- Understanding Augmentation-based Self-Supervised Representation Learning via RKHS Approximation and Regression [53.15502562048627]
Recent work has built the connection between self-supervised learning and the approximation of the top eigenspace of a graph Laplacian operator.
This work delves into a statistical analysis of augmentation-based pretraining.
arXiv Detail & Related papers (2023-06-01T15:18:55Z)
- Accelerating Attention through Gradient-Based Learned Runtime Pruning [9.109136535767478]
Self-attention is a key enabler of state-of-the-art accuracy for transformer-based Natural Language Processing models.
This paper formulates the search for pruning thresholds through a soft differentiable regularizer integrated into the training loss.
We devise a bit-serial architecture, dubbed LeOPArd, for transformer language models with a bit-level early-termination microarchitectural mechanism.
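A minimal sketch of a soft, differentiable pruning gate on attention scores (illustrative; the bit-serial hardware mechanism is not shown, and `soft_pruned_attention` with its temperature parameter is a hypothetical rendering of the idea): scores below a learnable threshold are smoothly suppressed, so the threshold can be trained jointly with the model.

```python
import numpy as np

def soft_pruned_attention(scores, threshold, temperature=10.0):
    """Softly gate attention scores below a (learnable) threshold.

    The sigmoid gate is differentiable, so the threshold can be trained
    with the rest of the model; at runtime, gated-off positions can
    simply be skipped.
    """
    gate = 1.0 / (1.0 + np.exp(-temperature * (scores - threshold)))
    w = np.exp(scores - scores.max(axis=-1, keepdims=True)) * gate
    return w / w.sum(axis=-1, keepdims=True)

scores = np.array([[2.0, 0.1, -1.5, 1.2]])
print(soft_pruned_attention(scores, threshold=0.5).round(3))
```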
arXiv Detail & Related papers (2022-04-07T05:31:13Z)