Related papers: RecurrentGemma: Moving Past Transformers for Efficient Open Language Models

URL: http://arxiv.org/abs/2404.07839v1
Date: Thu, 11 Apr 2024 15:27:22 GMT
Title: RecurrentGemma: Moving Past Transformers for Efficient Open Language Models
Authors: Aleksandar Botev, Soham De, Samuel L Smith, Anushan Fernando, George-Cristian Muraru, Ruba Haroun, Leonard Berrada, Razvan Pascanu, Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot, Johan Ferret, Sertan Girgin, Olivier Bachem, Alek Andreev, Kathleen Kenealy, Thomas Mesnard, Cassidy Hardin, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Armand Joulin, Noah Fiedel, Evan Senter, Yutian Chen, Srivatsan Srinivasan, Guillaume Desjardins, David Budden, Arnaud Doucet, Sharad Vikram, Adam Paszke, Trevor Gale, Sebastian Borgeaud, Charlie Chen, Andy Brock, Antonia Paterson, Jenny Brennan, Meg Risdal, Raj Gundluru, Nesh Devanathan, Paul Mooney, Nilay Chauhan, Phil Culliton, Luiz Gustavo Martins, Elisa Bandy, David Huntsperger, Glenn Cameron, Arthur Zucker, Tris Warkentin, Ludovic Peran, Minh Giang, Zoubin Ghahramani, Clément Farabet, Koray Kavukcuoglu, Demis Hassabis, Raia Hadsell, Yee Whye Teh, Nando de Freitas
Abstract summary: We introduce RecurrentGemma, an open language model which uses Google's novel Griffin architecture. Griffin combines linear recurrences with local attention to achieve excellent performance on language.
Score: 103.59785165735727
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We introduce RecurrentGemma, an open language model which uses Google's novel Griffin architecture. Griffin combines linear recurrences with local attention to achieve excellent performance on language. It has a fixed-sized state, which reduces memory use and enables efficient inference on long sequences. We provide a pre-trained model with 2B non-embedding parameters, and an instruction tuned variant. Both models achieve comparable performance to Gemma-2B despite being trained on fewer tokens.
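The fixed-size-state claim in the abstract can be illustrated with a minimal sketch. This is a generic elementwise linear recurrence written for illustration, not the Griffin or RecurrentGemma layer; the function names, the `decay` value, and the toy `width` are all assumptions. It contrasts a recurrent state whose size is independent of sequence length with a cache that keeps one entry per token, as a global-attention model would.

```python
# Illustrative sketch only: a generic diagonal linear recurrence,
# NOT the actual Griffin / RecurrentGemma recurrent block.
# All names and constants here are hypothetical.

def linear_recurrence_step(state, x, decay):
    """One step of h_t = decay * h_{t-1} + (1 - decay) * x_t, elementwise."""
    return [decay * h + (1.0 - decay) * xi for h, xi in zip(state, x)]


def run_recurrent(tokens, width, decay=0.9):
    """Process a sequence while keeping only a fixed-size state vector."""
    state = [0.0] * width
    for x in tokens:
        state = linear_recurrence_step(state, x, decay)
    return state


def run_with_cache(tokens):
    """Stand-in for a global-attention cache: one stored entry per token."""
    cache = []
    for x in tokens:
        cache.append(x)
    return cache


if __name__ == "__main__":
    width = 4
    for seq_len in (8, 64):
        tokens = [[1.0] * width for _ in range(seq_len)]
        state = run_recurrent(tokens, width)
        cache = run_with_cache(tokens)
        # State size stays at `width`; cache size scales with seq_len.
        print(seq_len, len(state), len(cache) * width)
```

In this toy setting the recurrent state always holds `width` numbers, while the cache holds `seq_len * width`, which is the memory difference the abstract refers to for long sequences.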

Related papers

Taipan: Efficient and Expressive State Space Language Models with Selective Attention [100.16383527459429]
Long-context language modeling is a significant challenge in Natural Language Processing (NLP). Recent State Space Models (SSMs) such as Mamba offer alternatives with constant memory usage, but they underperform in tasks requiring extensive in-context retrieval. We introduce Taipan, a novel hybrid architecture that combines Mamba-2 with Selective Attention Layers (SALs). Our experiments demonstrate Taipan's superior performance across various scales and tasks, offering a promising solution for efficient long-context language modeling.
arXiv Detail & Related papers (2024-10-24T09:25:37Z)
ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets [106.7760874400261]
This paper presents ML-SUPERB 2.0, which is a new benchmark for evaluating pre-trained SSL and supervised speech models. We find performance improvements over the setup of ML-SUPERB, but performance depends on the downstream model design. Also, we find large performance differences between languages and datasets, suggesting the need for more targeted approaches.
arXiv Detail & Related papers (2024-06-12T21:01:26Z)
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models [101.70220733111164]
We propose Hawk, an RNN with gated linear recurrences, and Griffin, a hybrid model that mixes gated linear recurrences with local attention. Hawk exceeds the reported performance of Mamba on downstream tasks, while Griffin matches the performance of Llama-2 despite being trained on over 6 times fewer tokens. We scale Griffin up to 14B parameters, and explain how to shard our models for efficient distributed training.
arXiv Detail & Related papers (2024-02-29T18:24:46Z)
Simple linear attention language models balance the recall-throughput tradeoff [60.06020449520365]
We propose BASED, a simple architecture combining linear and sliding window attention. We train language models up to 1.3B parameters and show that BASED matches the strongest sub-quadratic models in perplexity and outperforms them on real-world recall-intensive tasks by 6.22 accuracy points.
arXiv Detail & Related papers (2024-02-28T19:28:27Z)
Towards A Unified View of Sparse Feed-Forward Network in Pretraining Large Language Model [58.9100867327305]
Large and sparse feed-forward layers (S-FFN) have proven effective in scaling up Transformer model size for pretraining large language models. We analyzed two major design choices of S-FFN: the memory block (a.k.a. expert) size and the memory block selection method. We found that a simpler selection method, Avg-K, which selects blocks through their mean aggregated hidden states, achieves lower perplexity in language model pretraining.
arXiv Detail & Related papers (2023-05-23T12:28:37Z)
Training Language Models with Memory Augmentation [28.4608705738799]
We present a novel training approach designed for training language models with memory augmentation. Our approach uses a training objective that directly takes in-batch examples as accessible memory. We demonstrate significant gains over previous memory-augmented approaches.
arXiv Detail & Related papers (2022-05-25T11:37:29Z)
Improving language models by retrieving from trillions of tokens [50.42630445476544]
We enhance auto-regressive language models by conditioning on document chunks retrieved from a large corpus. With a 2 trillion token database, our Retrieval-Enhanced Transformer (RETRO) obtains comparable performance to GPT-3 and Jurassic-1 on the Pile.
arXiv Detail & Related papers (2021-12-08T17:32:34Z)
CoreLM: Coreference-aware Language Model Fine-Tuning [0.0]
We propose a fine-tuning framework, named CoreLM, that extends the architecture of current pretrained language models. We make available information outside the contextual space of the model, which results in a better language model for a fraction of the computational cost. Our proposed model achieves a lower perplexity on the GUMBY and LAMBADA datasets when compared to GPT2 and a fine-tuned version of GPT2 without any changes.
arXiv Detail & Related papers (2021-11-04T08:44:31Z)
Adaptive Semiparametric Language Models [17.53604394786977]
We present a language model that combines a large parametric neural network (i.e., a transformer) with a non-parametric episodic memory component. Experiments on word-based and character-based language modeling datasets demonstrate the efficacy of our proposed method.
arXiv Detail & Related papers (2021-02-04T11:47:03Z)
Adding Recurrence to Pretrained Transformers for Improved Efficiency and Context Size [41.624797099537375]
We present a novel method for applying pretrained transformer language models. We find that our method attains better perplexity than an unmodified GPT-2 model on the PG-19 and WikiText-103 corpora.
arXiv Detail & Related papers (2020-08-16T23:19:30Z)

This list is automatically generated from the titles and abstracts of the papers on this site.