ARWKV: Pretrain is not what we need, an RNN-Attention-Based Language Model Born from Transformer
- URL: http://arxiv.org/abs/2501.15570v1
- Date: Sun, 26 Jan 2025 15:56:56 GMT
- Title: ARWKV: Pretrain is not what we need, an RNN-Attention-Based Language Model Born from Transformer
- Authors: Lin Yueyu, Li Zhiyuan, Peter Yue, Liu Xiao
- Abstract summary: We introduce our series of models distilled from Qwen 2.5, based on pure native RWKV-7 attention.
We also work with QRWK 32B, based on the RWKV-6 architecture, another approach that reduces the entire knowledge-processing time to just 8 hours.
In fact, the distillation process can utilize any LLM, not just Qwen, and enables knowledge transfer from larger LLMs to smaller ones with fewer tokens.
- Score: 0.6839746711757702
- License:
- Abstract: As is known, hybrid quadratic and subquadratic attention models in multi-head architectures have surpassed both Transformer and linear RNN models, with these works primarily focusing on reducing KV complexity and improving efficiency. For further research on expressiveness, we introduce our series of models distilled from Qwen 2.5, based on pure native RWKV-7 attention, which aims to make RNNs more expressive and demonstrates state-tracking ability beyond Transformers. We also work with QRWK 32B, based on the RWKV-6 architecture, another approach that reduces the entire knowledge-processing time to just 8 hours using 16 AMD MI300X GPUs while maintaining Qwen 2.5's performance. In fact, the distillation process can utilize any LLM, not just Qwen, and enables knowledge transfer from larger LLMs to smaller ones with fewer tokens. We will explain the detailed process and share our insights on building more powerful foundation models. Please note that this is ongoing work that will be updated continuously. The model checkpoints and source code are available at https://github.com/yynil/RWKVInside and https://huggingface.co/RWKV-Red-Team/ARWKV-7B-Preview-0.1.
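To make the recipe the abstract describes more concrete, here is a minimal, illustrative PyTorch sketch of the general pattern: replace a Transformer block's self-attention sublayer with a recurrent time-mixing module and train only that module to match the frozen teacher's output. This is not the RWKVInside code; the module name SimpleRecurrentMixer, the simplified gated linear recurrence standing in for RWKV-7 time mixing, the MSE hidden-state loss, and the nn.MultiheadAttention stand-in teacher are all assumptions made for illustration.

```python
# Minimal sketch (not the RWKVInside implementation): swap a Transformer
# block's self-attention sublayer for a recurrent time-mixing module and
# distill it against the frozen teacher's attention output. The recurrence
# below is a simplified gated linear recurrence standing in for RWKV-7
# time mixing; names and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleRecurrentMixer(nn.Module):
    """Drop-in replacement for a self-attention sublayer (illustrative only)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.receptance = nn.Linear(d_model, d_model, bias=False)
        self.key = nn.Linear(d_model, d_model, bias=False)
        self.value = nn.Linear(d_model, d_model, bias=False)
        self.decay = nn.Parameter(torch.zeros(d_model))  # per-channel decay
        self.out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); state is constant-size in seq_len.
        B, T, D = x.shape
        r = torch.sigmoid(self.receptance(x))
        k, v = self.key(x), self.value(x)
        w = torch.sigmoid(self.decay)  # decay in (0, 1)
        state = x.new_zeros(B, D)
        outs = []
        for t in range(T):  # sequential state update, O(T) time
            state = w * state + k[:, t] * v[:, t]
            outs.append(r[:, t] * state)
        return self.out(torch.stack(outs, dim=1))


def distill_step(student_mixer, teacher_attn, hidden, optimizer):
    """One alignment step: train the mixer to match the teacher sublayer."""
    with torch.no_grad():
        target = teacher_attn(hidden)  # frozen teacher attention output
    loss = F.mse_loss(student_mixer(hidden), target)  # hidden-state alignment
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Toy usage with random tensors; a MultiheadAttention layer stands in for
# one attention sublayer of the (hypothetical) teacher model.
d_model = 64
mixer = SimpleRecurrentMixer(d_model)
teacher = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
opt = torch.optim.AdamW(mixer.parameters(), lr=1e-3)
hidden = torch.randn(2, 16, d_model)
loss = distill_step(mixer, lambda h: teacher(h, h, h, need_weights=False)[0], hidden, opt)
```

The actual training stages, losses, and RWKV-7 time-mixing implementation used for ARWKV are the ones in the linked repository; the sketch only shows the attention-replacement-plus-alignment pattern.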
Related papers
- RWKV-Lite: Deeply Compressed RWKV for Resource-Constrained Devices [15.969537866628517]
We propose a suite of compression techniques, ranging from model architecture optimizations to post-training compression, tailored to the RWKV architecture.
Our techniques reduce the memory footprint of RWKV models by 3.4x -- 5x with only negligible degradation in accuracy.
arXiv Detail & Related papers (2024-12-14T15:11:07Z) - VisualRWKV: Exploring Recurrent Neural Networks for Visual Language Models [10.272476734387977]
We introduce VisualRWKV, the first application of a linear RNN model to multimodal learning tasks.
We propose a data-dependent recurrence and sandwich prompts to enhance our modeling capabilities.
VisualRWKV achieves competitive performance compared to Transformer-based models like LLaVA-1.5 on various benchmarks.
arXiv Detail & Related papers (2024-06-19T09:07:31Z) - PointRWKV: Efficient RWKV-Like Model for Hierarchical Point Cloud Learning [56.14518823931901]
We present PointRWKV, a model of linear complexity derived from the RWKV model in the NLP field.
We first propose to explore the global processing capabilities within PointRWKV blocks using modified multi-headed matrix-valued states.
To extract local geometric features simultaneously, we design a parallel branch that efficiently encodes the point cloud in a fixed-radius near-neighbor graph with a graph stabilizer.
arXiv Detail & Related papers (2024-05-24T05:02:51Z) - Attention as an RNN [66.5420926480473]
We show that attention can be viewed as a special Recurrent Neural Network (RNN) with the ability to compute its many-to-one RNN output efficiently.
We introduce a new efficient method of computing attention's many-to-many RNN output based on the parallel prefix scan algorithm (a minimal sketch of the recurrent view of attention appears after this list).
We show Aarens achieve comparable performance to Transformers on 38 datasets spread across four popular sequential problem settings.
arXiv Detail & Related papers (2024-05-22T19:45:01Z) - Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence [36.97507697713224]
We present Eagle (RWKV-5) and Finch (RWKV-6), sequence models improving upon the RWKV (RWKV-4) architecture.
Our architectural design advancements include multi-headed matrix-valued states and a dynamic recurrence mechanism.
We introduce a new multilingual corpus with 1.12 trillion tokens and a fast tokenizer based on greedy matching for enhanced multilinguality.
arXiv Detail & Related papers (2024-04-08T22:20:59Z) - An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models [65.37846460916042]
We find that the attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs.
We introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency.
arXiv Detail & Related papers (2024-03-11T14:35:32Z) - RWKV-TS: Beyond Traditional Recurrent Neural Network for Time Series Tasks [42.27646976600047]
Traditional Recurrent Neural Network (RNN) architectures have historically held prominence in time series tasks.
Recent advancements in time series forecasting have seen a shift away from RNNs toward architectures such as Transformers and CNNs.
We design an efficient RNN-based model for time series tasks, named RWKV-TS, with three distinctive features.
arXiv Detail & Related papers (2024-01-17T09:56:10Z) - EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications [68.35683849098105]
We introduce a split depth-wise transpose attention (SDTA) encoder that splits input tensors into multiple channel groups.
Our EdgeNeXt model with 1.3M parameters achieves 71.2% top-1 accuracy on ImageNet-1K.
Our EdgeNeXt model with 5.6M parameters achieves 79.4% top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2022-06-21T17:59:56Z) - An Empirical Study of Training End-to-End Vision-and-Language Transformers [50.23532518166621]
We present METER (Multimodal End-to-end TransformER), through which we investigate how to design and pre-train a fully transformer-based VL model.
Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), multimodal fusion (e.g., merged attention vs. co-attention), and so on.
arXiv Detail & Related papers (2021-11-03T17:55:36Z) - Investigating Transfer Learning Capabilities of Vision Transformers and CNNs by Fine-Tuning a Single Trainable Block [0.0]
Transformer-based architectures are surpassing the state of the art set by CNN architectures in accuracy but are computationally very expensive to train from scratch.
We study their transfer learning capabilities and compare them with CNNs to understand which architecture is better when applied to real-world problems with small data.
We find that transformer-based architectures not only achieve higher accuracy than CNNs, but some transformers even do so with around 4 times fewer parameters.
arXiv Detail & Related papers (2021-10-11T13:43:03Z) - Video Super-Resolution Transformer [85.11270760456826]
Video super-resolution (VSR), with the aim to restore a high-resolution video from its corresponding low-resolution version, is a spatial-temporal sequence prediction problem.
Recently, Transformer has been gaining popularity due to its parallel computing ability for sequence-to-sequence modeling.
In this paper, we present a spatial-temporal convolutional self-attention layer with a theoretical understanding to exploit the locality information.
arXiv Detail & Related papers (2021-06-12T20:00:32Z)
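As a companion to the "Attention as an RNN" entry above: softmax attention over a prefix can be computed recurrently by carrying only a running maximum, numerator, and denominator, which is the many-to-one RNN view. The sketch below is a minimal PyTorch illustration of that view, not the paper's Aaren module or its parallel-prefix-scan formulation; the function name attention_as_recurrence and the single-query setting are assumptions made for illustration.

```python
# Minimal sketch: single-query softmax attention computed as a recurrence
# over the key/value sequence, carrying only (running max, numerator,
# denominator). This is the "many-to-one RNN" view of attention; it is not
# the paper's Aaren module or its parallel prefix scan formulation.
import torch


def attention_as_recurrence(q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """q: (d,), K: (T, d), V: (T, d_v) -> attention output of shape (d_v,)."""
    m = torch.tensor(float("-inf"))  # running max of attention scores
    num = torch.zeros(V.shape[-1])   # running numerator
    den = torch.zeros(())            # running denominator
    for k_t, v_t in zip(K, V):       # one constant-size state update per token
        s_t = torch.dot(q, k_t)
        m_new = torch.maximum(m, s_t)
        scale = torch.exp(m - m_new)  # rescale old statistics to the new max
        num = num * scale + torch.exp(s_t - m_new) * v_t
        den = den * scale + torch.exp(s_t - m_new)
        m = m_new
    return num / den


# Sanity check against ordinary softmax attention.
torch.manual_seed(0)
q, K, V = torch.randn(8), torch.randn(5, 8), torch.randn(5, 4)
reference = torch.softmax(K @ q, dim=0) @ V
assert torch.allclose(attention_as_recurrence(q, K, V), reference, atol=1e-5)
```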
This list is automatically generated from the titles and abstracts of the papers in this site.