ARWKV: Pretrain is not what we need, an RNN-Attention-Based Language Model Born from Transformer
- URL: http://arxiv.org/abs/2501.15570v1
- Date: Sun, 26 Jan 2025 15:56:56 GMT
- Title: ARWKV: Pretrain is not what we need, an RNN-Attention-Based Language Model Born from Transformer
- Authors: Lin Yueyu, Li Zhiyuan, Peter Yue, Liu Xiao
- Abstract summary: We introduce our series of models distilled from Qwen 2.5, based on pure native RWKV-7 attention.
We also work with QRWK 32B, based on the RWKV-6 architecture, another approach that reduces the entire knowledge-processing time to just 8 hours.
In fact, the distillation process can utilize any LLM, not just Qwen, and enables knowledge transfer from larger LLMs to smaller ones with fewer tokens.
- Score: 0.6839746711757702
- License:
- Abstract: As is known, hybrid quadratic and subquadratic attention models in multi-head architectures have surpassed both Transformer and linear RNN models, with these works primarily focusing on reducing KV complexity and improving efficiency. For further research on expressiveness, we introduce our series of models distilled from Qwen 2.5, based on pure native RWKV-7 attention, which aims to make RNNs more expressive and demonstrates state-tracking ability beyond Transformers. We also work with QRWK 32B, based on the RWKV-6 architecture, another approach that reduces the entire knowledge-processing time to just 8 hours using 16 AMD MI300X GPUs while maintaining Qwen 2.5's performance. In fact, the distillation process can utilize any LLM, not just Qwen, and enables knowledge transfer from larger LLMs to smaller ones with fewer tokens. We will explain the detailed process and share our insights on building more powerful foundation models. Please note that this is ongoing work that will be updated continuously. The model checkpoints and source code are available at https://github.com/yynil/RWKVInside and https://huggingface.co/RWKV-Red-Team/ARWKV-7B-Preview-0.1.
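To make the recipe the abstract describes more concrete, here is a minimal, illustrative PyTorch sketch of the general pattern: replace a Transformer block's self-attention sublayer with a recurrent time-mixing module and train only that module to match the frozen teacher's output. This is not the RWKVInside code; the module name SimpleRecurrentMixer, the simplified gated linear recurrence standing in for RWKV-7 time mixing, the MSE hidden-state loss, and the nn.MultiheadAttention stand-in teacher are all assumptions made for illustration.

```python
# Minimal sketch (not the RWKVInside implementation): swap a Transformer
# block's self-attention sublayer for a recurrent time-mixing module and
# distill it against the frozen teacher's attention output. The recurrence
# below is a simplified gated linear recurrence standing in for RWKV-7
# time mixing; names and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleRecurrentMixer(nn.Module):
    """Drop-in replacement for a self-attention sublayer (illustrative only)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.receptance = nn.Linear(d_model, d_model, bias=False)
        self.key = nn.Linear(d_model, d_model, bias=False)
        self.value = nn.Linear(d_model, d_model, bias=False)
        self.decay = nn.Parameter(torch.zeros(d_model))  # per-channel decay
        self.out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); state is constant-size in seq_len.
        B, T, D = x.shape
        r = torch.sigmoid(self.receptance(x))
        k, v = self.key(x), self.value(x)
        w = torch.sigmoid(self.decay)  # decay in (0, 1)
        state = x.new_zeros(B, D)
        outs = []
        for t in range(T):  # sequential state update, O(T) time
            state = w * state + k[:, t] * v[:, t]
            outs.append(r[:, t] * state)
        return self.out(torch.stack(outs, dim=1))


def distill_step(student_mixer, teacher_attn, hidden, optimizer):
    """One alignment step: train the mixer to match the teacher sublayer."""
    with torch.no_grad():
        target = teacher_attn(hidden)  # frozen teacher attention output
    loss = F.mse_loss(student_mixer(hidden), target)  # hidden-state alignment
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Toy usage with random tensors; a MultiheadAttention layer stands in for
# one attention sublayer of the (hypothetical) teacher model.
d_model = 64
mixer = SimpleRecurrentMixer(d_model)
teacher = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
opt = torch.optim.AdamW(mixer.parameters(), lr=1e-3)
hidden = torch.randn(2, 16, d_model)
loss = distill_step(mixer, lambda h: teacher(h, h, h, need_weights=False)[0], hidden, opt)
```

The actual training stages, losses, and RWKV-7 time-mixing implementation used for ARWKV are the ones in the linked repository; the sketch only shows the attention-replacement-plus-alignment pattern.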
Related papers
- RWKV-Lite: Deeply Compressed RWKV for Resource-Constrained Devices [15.969537866628517]
We propose a suite of compression techniques, ranging from model architecture optimizations to post-training compression, tailored to the RWKV architecture.
Our techniques reduce the memory footprint of RWKV models by 3.4x -- 5x with only negligible degradation in accuracy.
arXiv Detail & Related papers (2024-12-14T15:11:07Z) - VisualRWKV: Exploring Recurrent Neural Networks for Visual Language Models [10.272476734387977]
We introduce VisualRWKV, the first application of a linear RNN model to multimodal learning tasks.
We propose a data-dependent recurrence and sandwich prompts to enhance our modeling capabilities.
VisualRWKV achieves competitive performance compared to Transformer-based models like LLaVA-1.5 on various benchmarks.
arXiv Detail & Related papers (2024-06-19T09:07:31Z) - PointRWKV: Efficient RWKV-Like Model for Hierarchical Point Cloud Learning [56.14518823931901]
We present PointRWKV, a model of linear complexity derived from the RWKV model in the NLP field.
We first propose to explore the global processing capabilities within PointRWKV blocks using modified multi-headed matrix-valued states.
To extract local geometric features simultaneously, we design a parallel branch that efficiently encodes the point cloud in a fixed-radius near-neighbor graph with a graph stabilizer.
arXiv Detail & Related papers (2024-05-24T05:02:51Z) - Attention as an RNN [66.5420926480473]
We show that attention can be viewed as a special Recurrent Neural Network (RNN) with the ability to compute its many-to-one RNN output efficiently.
We introduce a new efficient method of computing attention's many-to-many RNN output based on the parallel prefix scan algorithm (a minimal sketch of the recurrent view of attention appears after this list).
We show Aarens achieve comparable performance to Transformers on 38 datasets spread across four popular sequential problem settings.
arXiv Detail & Related papers (2024-05-22T19:45:01Z) - Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence [36.97507697713224]
We present Eagle (RWKV-5) and Finch (RWKV-6), sequence models improving upon the RWKV (RWKV-4) architecture.
Our architectural design advancements include multi-headed matrix-valued states and a dynamic recurrence mechanism.
We introduce a new multilingual corpus with 1.12 trillion tokens and a fast tokenizer based on greedy matching for enhanced multilinguality.
arXiv Detail & Related papers (2024-04-08T22:20:59Z) - An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models [65.37846460916042]
We find that the attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs.
We introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency.
arXiv Detail & Related papers (2024-03-11T14:35:32Z) - RWKV-TS: Beyond Traditional Recurrent Neural Network for Time Series Tasks [42.27646976600047]
Traditional Recurrent Neural Network (RNN) architectures have historically held prominence in time series tasks.
Recent advancements in time series forecasting have seen a shift away from RNNs toward architectures such as Transformers and CNNs.
We design an efficient RNN-based model for time series tasks, named RWKV-TS, with three distinctive features.
arXiv Detail & Related papers (2024-01-17T09:56:10Z) - EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications [68.35683849098105]
We introduce a split depth-wise transpose attention (SDTA) encoder that splits input tensors into multiple channel groups.
Our EdgeNeXt model with 1.3M parameters achieves 71.2% top-1 accuracy on ImageNet-1K.
Our EdgeNeXt model with 5.6M parameters achieves 79.4% top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2022-06-21T17:59:56Z) - An Empirical Study of Training End-to-End Vision-and-Language Transformers [50.23532518166621]
We present METER (Multimodal End-to-end TransformER), through which we investigate how to design and pre-train a fully transformer-based VL model.
Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), multimodal fusion (e.g., merged attention vs. co-attention), and so on.
arXiv Detail & Related papers (2021-11-03T17:55:36Z) - Investigating Transfer Learning Capabilities of Vision Transformers and CNNs by Fine-Tuning a Single Trainable Block [0.0]
Transformer-based architectures are surpassing the state of the art set by CNN architectures in accuracy but are computationally very expensive to train from scratch.
We study their transfer learning capabilities and compare them with CNNs to understand which architecture is better when applied to real-world problems with small data.
We find that transformer-based architectures not only achieve higher accuracy than CNNs, but some transformers even do so with around 4 times fewer parameters.
arXiv Detail & Related papers (2021-10-11T13:43:03Z) - Video Super-Resolution Transformer [85.11270760456826]
Video super-resolution (VSR), with the aim to restore a high-resolution video from its corresponding low-resolution version, is a spatial-temporal sequence prediction problem.
Recently, Transformer has been gaining popularity due to its parallel computing ability for sequence-to-sequence modeling.
In this paper, we present a spatial-temporal convolutional self-attention layer with a theoretical understanding to exploit the locality information.
arXiv Detail & Related papers (2021-06-12T20:00:32Z)
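As a companion to the "Attention as an RNN" entry above: softmax attention over a prefix can be computed recurrently by carrying only a running maximum, numerator, and denominator, which is the many-to-one RNN view. The sketch below is a minimal PyTorch illustration of that view, not the paper's Aaren module or its parallel-prefix-scan formulation; the function name attention_as_recurrence and the single-query setting are assumptions made for illustration.

```python
# Minimal sketch: single-query softmax attention computed as a recurrence
# over the key/value sequence, carrying only (running max, numerator,
# denominator). This is the "many-to-one RNN" view of attention; it is not
# the paper's Aaren module or its parallel prefix scan formulation.
import torch


def attention_as_recurrence(q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """q: (d,), K: (T, d), V: (T, d_v) -> attention output of shape (d_v,)."""
    m = torch.tensor(float("-inf"))  # running max of attention scores
    num = torch.zeros(V.shape[-1])   # running numerator
    den = torch.zeros(())            # running denominator
    for k_t, v_t in zip(K, V):       # one constant-size state update per token
        s_t = torch.dot(q, k_t)
        m_new = torch.maximum(m, s_t)
        scale = torch.exp(m - m_new)  # rescale old statistics to the new max
        num = num * scale + torch.exp(s_t - m_new) * v_t
        den = den * scale + torch.exp(s_t - m_new)
        m = m_new
    return num / den


# Sanity check against ordinary softmax attention.
torch.manual_seed(0)
q, K, V = torch.randn(8), torch.randn(5, 8), torch.randn(5, 4)
reference = torch.softmax(K @ q, dim=0) @ V
assert torch.allclose(attention_as_recurrence(q, K, V), reference, atol=1e-5)
```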
This list is automatically generated from the titles and abstracts of the papers in this site.