Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence
- URL: http://arxiv.org/abs/2404.05892v4
- Date: Thu, 26 Sep 2024 22:39:08 GMT
- Title: Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence
- Authors: Bo Peng, Daniel Goldstein, Quentin Anthony, Alon Albalak, Eric Alcaide, Stella Biderman, Eugene Cheah, Xingjian Du, Teddy Ferdinan, Haowen Hou, Przemysław Kazienko, Kranthi Kiran GV, Jan Kocoń, Bartłomiej Koptyra, Satyapriya Krishna, Ronald McClelland Jr., Jiaju Lin, Niklas Muennighoff, Fares Obeid, Atsushi Saito, Guangyu Song, Haoqin Tu, Cahya Wirawan, Stanisław Woźniak, Ruichong Zhang, Bingchen Zhao, Qihang Zhao, Peng Zhou, Jian Zhu, Rui-Jie Zhu
- Abstract summary: We present Eagle (RWKV-5) and Finch (RWKV-6), sequence models improving upon the RWKV (RWKV-4) architecture.
Our architectural design advancements include multi-headed matrix-valued states and a dynamic recurrence mechanism.
We introduce a new multilingual corpus with 1.12 trillion tokens and a fast tokenizer based on greedy matching for enhanced multilinguality.
- Score: 36.97507697713224
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Eagle (RWKV-5) and Finch (RWKV-6), sequence models improving upon the RWKV (RWKV-4) architecture. Our architectural design advancements include multi-headed matrix-valued states and a dynamic recurrence mechanism that improve expressivity while maintaining the inference efficiency characteristics of RNNs. We introduce a new multilingual corpus with 1.12 trillion tokens and a fast tokenizer based on greedy matching for enhanced multilinguality. We trained four Eagle models, ranging from 0.46 to 7.5 billion parameters, and two Finch models with 1.6 and 3.1 billion parameters and find that they achieve competitive performance across a wide variety of benchmarks. We release all our models on HuggingFace under the Apache 2.0 license. Models at: https://huggingface.co/RWKV Training code at: https://github.com/RWKV/RWKV-LM Inference code at: https://github.com/RWKV/ChatRWKV Time-parallel training code at: https://github.com/RWKV/RWKV-infctx-trainer
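The core mechanism named in the abstract can be illustrated with a short sketch. The code below is a minimal, illustrative rendering (not the authors' implementation) of one head's matrix-valued state recurrence: the state S is a D-by-D matrix that is decayed channel-wise and updated with a rank-one outer product at each step. In Eagle (RWKV-5) the decay w is a learned per-channel constant; in Finch (RWKV-6) it is recomputed from the input at every step, which is the "dynamic recurrence" referred to above. All variable names and the toy decay values here are assumptions for illustration; in Finch the per-step decay actually comes from a low-rank projection of the token input.

```python
import numpy as np

def matrix_state_head(r, k, v, w, u):
    """One head over a sequence of length T with head dimension D.

    r, k, v : (T, D) receptance / key / value vectors per time step
    w       : (D,) per-channel decay (Eagle-style) or (T, D) per-step decay (Finch-style)
    u       : (D,) "bonus" weighting for the current token's contribution
    returns : (T, D) head outputs
    """
    T, D = r.shape
    S = np.zeros((D, D))                       # matrix-valued state, one per head
    out = np.empty((T, D))
    for t in range(T):
        kv = np.outer(k[t], v[t])              # rank-one update k_t^T v_t
        out[t] = r[t] @ (S + np.diag(u) @ kv)  # read old state plus bonus term
        w_t = w if w.ndim == 1 else w[t]       # static (Eagle) vs dynamic (Finch)
        S = np.diag(w_t) @ S + kv              # decay old state, then add update
    return out

# Toy usage with a Finch-style per-step decay kept in (0, 1).
T, D = 8, 16
rng = np.random.default_rng(0)
r, k, v = (0.1 * rng.standard_normal((T, D)) for _ in range(3))
w = 1.0 / (1.0 + np.exp(-rng.standard_normal((T, D))))  # sigmoid keeps decay in (0, 1)
u = 0.1 * rng.standard_normal(D)
print(matrix_state_head(r, k, v, w, u).shape)  # (8, 16)
```

Because only S is carried between steps, per-token inference cost is constant in sequence length, which is the RNN-style efficiency the abstract refers to. The abstract also mentions a fast tokenizer based on greedy matching; a minimal sketch of greedy longest-prefix matching with a single-character fallback is shown below. The vocabulary here is made up, and the actual RWKV tokenizer uses its own vocabulary and an optimized matching structure.

```python
def greedy_tokenize(text, vocab, max_len=16):
    """Greedy longest-prefix matching with a single-character fallback."""
    tokens, i = [], 0
    while i < len(text):
        for L in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + L]
            if L == 1 or piece in vocab:   # unknown single characters always match
                tokens.append(piece)
                i += L
                break
    return tokens

print(greedy_tokenize("tokenization", {"token", "iza", "tion"}))
# ['token', 'iza', 'tion']
```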
Related papers
- ARWKV: Pretrain is not what we need, an RNN-Attention-Based Language Model Born from Transformer [0.6839746711757702]
We introduce our series of models distilled from Qwen 2.5, based on pure native RWKV-7 attention.
We work with QRWK 32B, based on the RWKV-6 architecture, another approach that reduces the entire knowledge processing time to just 8 hours.
In fact, the distillation process can utilize any LLM, not just Qwen, and enables knowledge transfer from larger LLMs to smaller ones with fewer tokens.
arXiv Detail & Related papers (2025-01-26T15:56:56Z) - NAMER: Non-Autoregressive Modeling for Handwritten Mathematical Expression Recognition [80.22784377150465]
Handwritten Mathematical Expression Recognition (HMER) has gained considerable attention in pattern recognition for its diverse applications in document understanding.
This paper makes the first attempt to build a novel bottom-up Non-AutoRegressive Modeling approach for HMER, called NAMER.
NAMER comprises a Visual Aware Tokenizer (VAT) and a Parallel Graph Decoder (PGD).
arXiv Detail & Related papers (2024-07-16T04:52:39Z) - VisualRWKV: Exploring Recurrent Neural Networks for Visual Language Models [10.272476734387977]
We introduce VisualRWKV, the first application of a linear RNN model to multimodal learning tasks.
We propose a data-dependent recurrence and sandwich prompts to enhance our modeling capabilities.
VisualRWKV achieves competitive performance compared to Transformer-based models like LLaVA-1.5 on various benchmarks.
arXiv Detail & Related papers (2024-06-19T09:07:31Z) - PointRWKV: Efficient RWKV-Like Model for Hierarchical Point Cloud Learning [56.14518823931901]
We present PointRWKV, a model of linear complexity derived from the RWKV model in the NLP field.
We first propose to explore the global processing capabilities within PointRWKV blocks using modified multi-headed matrix-valued states.
To extract local geometric features simultaneously, we design a parallel branch to encode the point cloud efficiently in a fixed radius near-neighbors graph with a graph stabilizer.
arXiv Detail & Related papers (2024-05-24T05:02:51Z) - RETVec: Resilient and Efficient Text Vectorizer [5.181952693002194]
RETVec combines a novel character encoding with an optional small embedding model to embed words into a 256-dimensional vector space.
The RETVec embedding model is pre-trained using pair-wise metric learning to be robust against typos and character-level adversarial attacks.
arXiv Detail & Related papers (2023-02-18T02:06:52Z) - From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z) - Squeezeformer: An Efficient Transformer for Automatic Speech Recognition [99.349598600887]
Conformer is the de facto backbone model for various downstream speech tasks, owing to its hybrid attention-convolution architecture.
We propose the Squeezeformer model, which consistently outperforms the state-of-the-art ASR models under the same training schemes.
arXiv Detail & Related papers (2022-06-02T06:06:29Z) - UniFormer: Unifying Convolution and Self-attention for Visual Recognition [69.68907941116127]
Convolutional neural networks (CNNs) and vision transformers (ViTs) have been the two dominant frameworks in the past few years.
We propose a novel Unified transFormer (UniFormer) which seamlessly integrates the merits of convolution and self-attention in a concise transformer format.
Our UniFormer achieves 86.3% top-1 accuracy on ImageNet-1K classification.
arXiv Detail & Related papers (2022-01-24T04:39:39Z) - COMBO: State-of-the-Art Morphosyntactic Analysis [0.0]
COMBO is a fully neural NLP system for accurate part-of-speech tagging, morphological analysis, lemmatisation, and (enhanced) dependency parsing.
It predicts categorical morphosyntactic features whilst also exposing their vector representations, extracted from hidden layers.
It is an easy-to-install Python package with automatically downloadable pre-trained models for over 40 languages.
arXiv Detail & Related papers (2021-09-11T20:00:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.