RWKV-X: A Linear Complexity Hybrid Language Model
- URL: http://arxiv.org/abs/2504.21463v1
- Date: Wed, 30 Apr 2025 09:38:17 GMT
- Title: RWKV-X: A Linear Complexity Hybrid Language Model
- Authors: Haowen Hou, Zhiyi Huang, Kaifeng Tan, Rongchang Lu, Fei Richard Yu
- Abstract summary: We introduce RWKV-X, a novel hybrid architecture that combines the efficiency of RWKV for short-range modeling with a sparse attention mechanism designed to capture long-range context. We demonstrate that RWKV-X, when continually pretrained on 64K-token sequences, achieves near-perfect accuracy on the 64K passkey retrieval benchmark. These results highlight RWKV-X as a scalable and efficient backbone for general-purpose language modeling, capable of decoding sequences up to 1 million tokens with stable speed and memory usage.
- Score: 7.74296978323232
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we introduce \textbf{RWKV-X}, a novel hybrid architecture that combines the efficiency of RWKV for short-range modeling with a sparse attention mechanism designed to capture long-range context. Unlike previous hybrid approaches that rely on full attention layers and retain quadratic complexity, RWKV-X achieves linear-time complexity in training and constant-time complexity in inference decoding. We demonstrate that RWKV-X, when continually pretrained on 64K-token sequences, achieves near-perfect accuracy on the 64K passkey retrieval benchmark. It consistently outperforms prior RWKV-7 models on long-context benchmarks, while maintaining strong performance on short-context tasks. These results highlight RWKV-X as a scalable and efficient backbone for general-purpose language modeling, capable of decoding sequences up to 1 million tokens with stable speed and memory usage. To facilitate further research and analysis, we have made the checkpoints and the associated code publicly accessible at: https://github.com/howard-hou/RWKV-X.
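To make the hybrid design concrete, below is a minimal sketch of how a block could pair a linear-time recurrent token mixer (a crude stand-in for RWKV time mixing) with sparse top-k attention for long-range context. It is an illustration under assumed module names and shapes, not the authors' implementation, and it omits causal masking and other details.

```python
# Hypothetical sketch of a hybrid block: a linear-time recurrent mixer
# (stand-in for RWKV-style time mixing) followed by sparse top-k attention
# for long-range context. All names and shapes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearRecurrentMixer(nn.Module):
    """Crude stand-in for RWKV time mixing: an exponentially decaying
    running state gives O(T) training time and O(1) state per decoded token."""
    def __init__(self, dim):
        super().__init__()
        self.decay = nn.Parameter(torch.zeros(dim))   # learned per-channel decay
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                  # x: (B, T, D)
        w = torch.sigmoid(self.decay)      # decay in (0, 1)
        state = torch.zeros_like(x[:, 0])
        outs = []
        for t in range(x.size(1)):         # recurrent scan, constant memory per step
            state = w * state + (1 - w) * x[:, t]
            outs.append(state)
        return self.proj(torch.stack(outs, dim=1))

class SparseTopKAttention(nn.Module):
    """Each query attends only to its top-k highest-scoring keys."""
    def __init__(self, dim, k=64):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.k = k

    def forward(self, x):                  # x: (B, T, D)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5   # (B, T, T)
        topk = min(self.k, scores.size(-1))
        vals, idx = scores.topk(topk, dim=-1)
        mask = torch.full_like(scores, float("-inf")).scatter(-1, idx, vals)
        return F.softmax(mask, dim=-1) @ v

class HybridBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.mixer = LinearRecurrentMixer(dim)
        self.sparse_attn = SparseTopKAttention(dim)

    def forward(self, x):
        x = x + self.mixer(x)              # short-range, linear time
        return x + self.sparse_attn(x)     # long-range, sparse
```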
Related papers
- Cross-attention for State-based model RWKV-7 [0.747193191854175]
CrossWKV is a novel cross-attention mechanism for the state-based RWKV-7 model. CrossWKV integrates text and image modalities in a single pass. The model's enhanced expressivity, combined with constant memory usage and linear scaling, positions it as a powerful solution for advanced cross-modal tasks.
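As a loose illustration only, the snippet below shows generic single-pass cross-attention that lets text tokens read image features; the actual CrossWKV formulation is state-based and differs, and all names here are assumptions.

```python
# Hypothetical sketch of cross-attention fusing two modalities in one pass.
# This is generic cross-attention, not the CrossWKV state-based formulation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)

    def forward(self, text, image):        # text: (B, Tt, D), image: (B, Ti, D)
        q = self.q(text)
        k, v = self.kv(image).chunk(2, dim=-1)
        attn = F.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)
        return text + attn @ v             # text tokens enriched with image context
```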
arXiv Detail & Related papers (2025-04-19T10:47:51Z)
- Millions of States: Designing a Scalable MoE Architecture with RWKV-7 Meta-learner [0.747193191854175]
State-based sequence models like RWKV-7 offer a compelling alternative to Transformer architectures. We propose Meta-State, a novel extension to RWKV-7 that replaces attention mechanisms with a fully state-driven approach.
arXiv Detail & Related papers (2025-04-11T04:14:32Z)
- Enhancing RWKV-based Language Models for Long-Sequence Text Generation [0.0]
This paper introduces an enhanced RWKV architecture with adaptive temporal gating mechanisms for improved long-context language modeling.
We propose two principal innovations: (1) a position-aware convolutional shift operator that captures local syntactic patterns while preserving global coherence, and (2) a neurally-gated information routing mechanism that dynamically regulates inter-token information flow.
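A hedged sketch of the two described components, with illustrative names: a depthwise causal convolution stands in for the position-aware shift operator, and a sigmoid gate for the information-routing mechanism.

```python
# Illustrative sketch: a causal convolutional token shift captures local
# patterns, and a learned gate regulates how much of that local signal
# flows into each token. Names and shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedConvShift(nn.Module):
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        # depthwise causal convolution over the time axis
        self.conv = nn.Conv1d(dim, dim, kernel_size, groups=dim)
        self.pad = kernel_size - 1
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, x):                          # x: (B, T, D)
        local = F.pad(x.transpose(1, 2), (self.pad, 0))
        local = self.conv(local).transpose(1, 2)   # causal local features
        g = torch.sigmoid(self.gate(torch.cat([x, local], dim=-1)))
        return g * local + (1 - g) * x             # gated routing of local vs. original signal
```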
arXiv Detail & Related papers (2025-02-21T14:18:18Z)
- Activation-aware Probe-Query: Effective Key-Value Retrieval for Long-Context LLMs Inference [56.71209737306054]
We propose ActQKV, a training-free, activation-aware approach that dynamically determines the probe query and leverages it to retrieve the relevant KV pairs for inference. Experiments on the Long-Bench and $\infty$ Benchmarks demonstrate its state-of-the-art performance with competitive inference quality and resource efficiency.
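A simplified, assumption-laden sketch of the general probe-query idea: build one probe vector from recent queries (here weighted by activation norm, an illustrative proxy) and keep only the cached key/value pairs it scores highest.

```python
# Hedged sketch of probe-query-style KV retrieval. This is a loose
# simplification, not the ActQKV algorithm; all names are hypothetical.
import torch

def retrieve_topk_kv(recent_queries, key_cache, value_cache, budget=256):
    """recent_queries: (W, D); key_cache/value_cache: (N, D) -> (budget, D) each."""
    # weight recent queries by their activation norm (illustrative proxy)
    weights = recent_queries.norm(dim=-1)
    probe = (weights[:, None] * recent_queries).sum(0) / weights.sum()
    scores = key_cache @ probe                        # relevance of each cached key
    idx = scores.topk(min(budget, key_cache.size(0))).indices
    idx = idx.sort().values                           # keep original token order
    return key_cache[idx], value_cache[idx]

# usage: keys, values = retrieve_topk_kv(q_window, k_cache, v_cache, budget=512)
```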
arXiv Detail & Related papers (2025-02-19T08:50:44Z)
- Tensor Product Attention Is All You Need [54.40495407154611]
Tensor Product Attention (TPA) is a novel attention mechanism that uses tensor decompositions to represent queries, keys, and values compactly.
TPA achieves improved model quality alongside memory efficiency.
We introduce the Tensor ProducT ATTenTion Transformer (T6), a new model architecture for sequence modeling.
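As a rough sketch of the general idea (not the exact TPA/T6 math), per-token keys can be built from a few rank factors so that the cache stores small factors instead of full per-head tensors; all names and shapes below are assumptions.

```python
# Rough sketch: each token's per-head keys are a sum of R outer products
# of a head factor and a channel factor, so the cache can store the
# factors instead of full (heads x head_dim) tensors. Not the TPA formulation.
import torch
import torch.nn as nn

class FactorizedKeys(nn.Module):
    def __init__(self, dim, heads=8, head_dim=64, rank=2):
        super().__init__()
        self.heads, self.head_dim, self.rank = heads, head_dim, rank
        self.head_factors = nn.Linear(dim, rank * heads)       # per-rank head factors
        self.dim_factors = nn.Linear(dim, rank * head_dim)     # per-rank channel factors

    def forward(self, x):                                      # x: (B, T, D)
        B, T, _ = x.shape
        a = self.head_factors(x).view(B, T, self.rank, self.heads)
        b = self.dim_factors(x).view(B, T, self.rank, self.head_dim)
        # sum of R outer products yields a (heads, head_dim) key per token
        k = torch.einsum("btrh,btrd->bthd", a, b) / self.rank
        return k   # caching a and b instead of k shrinks the KV cache
```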
arXiv Detail & Related papers (2025-01-11T03:37:10Z)
- RefreshKV: Updating Small KV Cache During Long-form Generation [54.00118604124301]
We propose a new inference method, RefreshKV, that flexibly alternates between full context attention and attention over a subset of input tokens during generation. Applying our method to off-the-shelf LLMs achieves comparable speedup to eviction-based methods while improving performance for various long-form generation tasks.
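A minimal sketch of the alternation pattern described, with hypothetical helper names: most decoding steps attend only to a retained subset of the KV cache, and a periodic full-context pass re-selects that subset.

```python
# Hedged sketch of alternating between full-context and subset attention
# during decoding. Helper names and the schedule are illustrative only.
import torch

def use_full_attention(step, refresh_every=32):
    return step % refresh_every == 0           # True -> full-context pass this step

def reselect_subset(q, full_keys, keep=512):
    """Pick the `keep` cache positions most relevant to the current query q (D,)."""
    scores = full_keys @ q                     # (N,) relevance per cached token
    return scores.topk(min(keep, full_keys.size(0))).indices.sort().values

# usage inside a decoding loop (sketch):
#   if use_full_attention(step): subset_idx = reselect_subset(q_t, k_cache)
#   attend over k_cache[subset_idx], v_cache[subset_idx] on the other steps
```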
arXiv Detail & Related papers (2024-11-08T18:57:07Z)
- KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches [52.02764371205856]
Long context capability is a crucial competency for large language models (LLMs).
This work provides a taxonomy of current methods and evaluates 10+ state-of-the-art approaches across seven categories of long-context tasks.
arXiv Detail & Related papers (2024-07-01T17:59:47Z)
- LoCoCo: Dropping In Convolutions for Long Context Compression [77.26610232994508]
This paper presents a novel approach, Dropping In Convolutions for Long Context Compression (LoCoCo).
LoCoCo employs only a fixed-size Key-Value (KV) cache, and can enhance efficiency in both inference and fine-tuning stages.
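An illustrative sketch (not the LoCoCo algorithm itself) of keeping a KV cache at a fixed size by folding overflow entries back in with a small convolution; the class name and the stride-2 depthwise merge are assumptions.

```python
# Hypothetical sketch: when the cache exceeds its budget, a depthwise
# stride-2 convolution merges neighbouring entries so memory stays bounded.
import torch
import torch.nn as nn

class FixedSizeKVCache(nn.Module):
    def __init__(self, dim, max_slots=1024):
        super().__init__()
        self.max_slots = max_slots
        self.compress = nn.Conv1d(dim, dim, kernel_size=2, stride=2, groups=dim)

    def append(self, cache, new_kv):                 # cache: (S, D), new_kv: (n, D)
        cache = torch.cat([cache, new_kv], dim=0)
        if cache.size(0) > self.max_slots:
            c = cache.t().unsqueeze(0)               # (1, D, S)
            cache = self.compress(c).squeeze(0).t()  # roughly halves the slots
        return cache
```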
arXiv Detail & Related papers (2024-06-08T01:35:11Z)
- Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence [36.97507697713224]
We present Eagle (RWKV-5) and Finch (RWKV-6), sequence models improving upon the RWKV (RWKV-4) architecture.
Our architectural design advancements include multi-headed matrix-valued states and a dynamic recurrence mechanism.
We introduce a new multilingual corpus with 1.12 trillion tokens and a fast tokenizer based on greedy matching for enhanced multilinguality.
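A hedged sketch of a multi-headed matrix-valued state with data-dependent decay, in the spirit of the description above but not the Eagle/Finch equations; all names are illustrative.

```python
# Illustrative sketch: each head keeps a (head_dim x head_dim) state matrix,
# decayed per step by a data-dependent factor and updated with a rank-1 term.
import torch
import torch.nn as nn

class MatrixStateRecurrence(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.heads, self.hd = heads, dim // heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.decay = nn.Linear(dim, dim)             # dynamic, per-token decay

    def forward(self, x):                            # x: (B, T, D)
        B, T, D = x.shape
        q, k, v = (t.view(B, T, self.heads, self.hd) for t in self.qkv(x).chunk(3, -1))
        w = torch.sigmoid(self.decay(x)).view(B, T, self.heads, self.hd)
        S = x.new_zeros(B, self.heads, self.hd, self.hd)   # matrix-valued state per head
        outs = []
        for t in range(T):
            # decay the state, then add the rank-1 update k_t v_t^T
            S = w[:, t, :, :, None] * S + k[:, t, :, :, None] * v[:, t, :, None, :]
            outs.append(torch.einsum("bhk,bhkd->bhd", q[:, t], S))
        return torch.stack(outs, dim=1).reshape(B, T, D)
```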
arXiv Detail & Related papers (2024-04-08T22:20:59Z)
- LOCOST: State-Space Models for Long Document Abstractive Summarization [76.31514220737272]
We propose LOCOST: an encoder-decoder architecture based on state-space models for conditional text generation with long context inputs.
With a computational complexity of $O(L \log L)$, this architecture can handle significantly longer sequences than state-of-the-art models based on sparse attention patterns.
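A minimal sketch of why state-space layers can reach $O(L \log L)$: the linear recurrence unrolls into a long convolution, which an FFT evaluates globally. The decaying kernel below is a toy stand-in for a trained SSM kernel, and the function name is an assumption.

```python
# Toy sketch: causal long convolution via FFT, the O(L log L) path used by
# convolutional views of state-space models. The kernel is a toy decaying
# filter, not a trained SSM kernel.
import torch

def long_conv_fft(x, kernel):
    """Causal convolution of x (B, L, D) with kernel (L, D) via FFT."""
    L = x.size(1)
    n = 2 * L                                   # zero-pad to avoid circular wrap-around
    Xf = torch.fft.rfft(x, n=n, dim=1)
    Kf = torch.fft.rfft(kernel, n=n, dim=0)
    y = torch.fft.irfft(Xf * Kf.unsqueeze(0), n=n, dim=1)
    return y[:, :L]                             # keep the causal part

# toy usage with an exponentially decaying per-channel kernel
x = torch.randn(2, 4096, 64)
kernel = 0.9 ** torch.arange(4096, dtype=torch.float32)[:, None] * torch.ones(1, 64)
y = long_conv_fft(x, kernel)
```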
arXiv Detail & Related papers (2024-01-31T15:33:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.