Millions of States: Designing a Scalable MoE Architecture with RWKV-7 Meta-learner
- URL: http://arxiv.org/abs/2504.08247v1
- Date: Fri, 11 Apr 2025 04:14:32 GMT
- Title: Millions of States: Designing a Scalable MoE Architecture with RWKV-7 Meta-learner
- Authors: Liu Xiao, Li Zhiyuan, Lin Yueyu
- Abstract summary: State-based sequence models like RWKV-7 offer a compelling alternative to Transformer architectures. We propose Meta-State, a novel extension to RWKV-7 that replaces attention mechanisms with a fully state-driven approach.
- Score: 0.747193191854175
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: State-based sequence models like RWKV-7 offer a compelling alternative to Transformer architectures, achieving linear complexity while demonstrating greater expressive power in short-context scenarios and enabling state tracking beyond the \(\text{TC}^0\) complexity class. However, RWKV-7 lacks mechanisms for token-parameter interactions and native scalability, limiting its adaptability and growth without retraining. In this paper, we propose \textbf{Meta-State}, a novel extension to RWKV-7 that replaces attention mechanisms with a fully state-driven approach, integrating token-parameter interactions through a \textbf{Self-State Encoder} (SSE) mechanism. The SSE repurposes a portion of the RWKV-7 Weighted Key-Value (WKV) state as transformation weights to encode token-parameter interactions in a linear, state-driven manner without introducing new trainable matrices or softmax operations, while preserving the autoregressive property of token processing. Meta-State supports progressive model scaling by expanding the WKV state and parameter tokens, reusing existing parameters without retraining. Our approach bridges the gap between state-based modeling, token-parameter interactions, and scalable architectures, offering a flexible framework for efficient and adaptable sequence modeling with linear complexity and constant memory usage.
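To make the mechanism concrete, below is a minimal, illustrative sketch of one recurrent step in the spirit of Meta-State: a slice of an RWKV-7-style WKV state is reused as linear transformation weights on the incoming token (the Self-State Encoder idea), and the state is then updated with a decay-plus-outer-product rule. The shapes, the half-state split, and the exact update rule are simplifying assumptions for illustration, not the paper's reference implementation.

```python
import numpy as np

def meta_state_step(wkv_state, x_t, w_decay, k_t, v_t):
    """One simplified recurrent step in the spirit of Meta-State.

    wkv_state : (d, d) matrix standing in for the RWKV-7 WKV state.
    x_t       : (d,) current token representation.
    w_decay   : (d,) per-channel decay factors in (0, 1).
    k_t, v_t  : (d,) key / value projections of the current token.

    The half-state split and the update rule below are illustrative
    assumptions, not the paper's formulation.
    """
    d = x_t.shape[0]
    half = d // 2

    # Self-State Encoder (illustrative): reuse a slice of the WKV state as
    # transformation weights on the token -- a linear, state-driven
    # token-parameter interaction with no softmax and no new trainable matrices.
    sse_weights = wkv_state[:half, :]        # (d/2, d) slice of the state
    encoded = np.tanh(sse_weights @ x_t)     # (d/2,) token-parameter interaction
    x_out = x_t.copy()
    x_out[:half] += encoded                  # fold the interaction back into the token

    # Linear-attention-style WKV update (per-channel decay + outer product),
    # keeping a fixed-size state: constant memory, linear time in sequence length.
    wkv_state = w_decay[:, None] * wkv_state + np.outer(v_t, k_t)
    return wkv_state, x_out

# Tiny autoregressive usage example with random data.
rng = np.random.default_rng(0)
d = 8
state = np.zeros((d, d))
for _ in range(4):  # process a short sequence token by token
    x = rng.standard_normal(d)
    state, x_out = meta_state_step(state, x,
                                   w_decay=np.full(d, 0.9),
                                   k_t=rng.standard_normal(d),
                                   v_t=rng.standard_normal(d))
```

Because each step only updates a fixed-size state matrix and applies it linearly to the token, memory stays constant and the per-sequence cost stays linear, matching the complexity claims in the abstract.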
Related papers
- RWKV-X: A Linear Complexity Hybrid Language Model [7.74296978323232]
We introduce RWKV-X, a novel hybrid architecture that combines the efficiency of RWKV for short-range modeling with a sparse attention mechanism designed to capture long-range context.
We demonstrate that RWKV-X, when continually pretrained on 64K-token sequences, achieves near-perfect accuracy on the 64K passkey retrieval benchmark.
These results highlight RWKV-X as a scalable and efficient backbone for general-purpose language modeling, capable of decoding sequences up to 1 million tokens with stable speed and memory usage.
arXiv Detail & Related papers (2025-04-30T09:38:17Z)
- Cross-attention for State-based model RWKV-7 [0.747193191854175]
CrossWKV is a novel cross-attention mechanism for the state-based RWKV-7 model.
CrossWKV integrates text and image modalities in a single pass.
The model's enhanced expressivity, combined with constant memory usage and linear scaling, positions it as a powerful solution for advanced cross-modal tasks.
arXiv Detail & Related papers (2025-04-19T10:47:51Z)
- Tensor Product Attention Is All You Need [54.40495407154611]
Tensor Product Attention (TPA) is a novel attention mechanism that uses tensor decompositions to represent queries, keys, and values compactly. TPA achieves improved model quality alongside memory efficiency. We introduce the Tensor ProducT ATTenTion Transformer (T6), a new model architecture for sequence modeling.
arXiv Detail & Related papers (2025-01-11T03:37:10Z)
- TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters [102.1116808722299]
We introduce TokenFormer, a scalable architecture that rethinks how Transformers are scaled. By treating model parameters as tokens, we replace all the linear projections in Transformers with attention between input tokens and parameter tokens. Our model scales from 124M to 1.4B parameters by incrementally adding new key-value parameter pairs; a minimal sketch of this token-parameter pattern appears after this list.
arXiv Detail & Related papers (2024-10-30T16:19:00Z)
- Symmetric Dot-Product Attention for Efficient Training of BERT Language Models [5.838117137253223]
We propose an alternative compatibility function for the self-attention mechanism introduced by the Transformer architecture.
When applied to the pre-training of BERT-like models, this new symmetric attention mechanism reaches a score of 79.36 on the GLUE benchmark against 78.74 for the traditional implementation.
arXiv Detail & Related papers (2024-06-10T15:24:15Z)
- Sparse Modular Activation for Efficient Sequence Modeling [94.11125833685583]
Recent models combining Linear State Space Models with self-attention mechanisms have demonstrated impressive results across a range of sequence modeling tasks.
Current approaches apply attention modules statically and uniformly to all elements in the input sequences, leading to sub-optimal quality-efficiency trade-offs.
We introduce Sparse Modular Activation (SMA), a general mechanism enabling neural networks to sparsely activate sub-modules for sequence elements in a differentiable manner.
arXiv Detail & Related papers (2023-06-19T23:10:02Z)
- RRWKV: Capturing Long-range Dependencies in RWKV [0.0]
The paper devises the Retrospected Receptance Weighted Key Value (RRWKV) architecture by incorporating a retrospection ability into RWKV so that it can effectively absorb information from earlier context.
RWKV exploits a linear tensor-product attention mechanism and achieves parallelized computation by operating in a time-sequential mode.
arXiv Detail & Related papers (2023-06-08T13:17:06Z)
- RWKV: Reinventing RNNs for the Transformer Era [54.716108899349614]
We propose a novel model architecture that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers.
arXiv Detail & Related papers (2023-05-22T13:57:41Z)
- FAENet: Frame Averaging Equivariant GNN for Materials Modeling [123.19473575281357]
We introduce a flexible framework relying on stochastic frame averaging (SFA) to make any model E(3)-equivariant or invariant through data transformations.
We prove the validity of our method theoretically and empirically demonstrate its superior accuracy and computational scalability in materials modeling.
arXiv Detail & Related papers (2023-04-28T21:48:31Z)
- Squeezeformer: An Efficient Transformer for Automatic Speech Recognition [99.349598600887]
Conformer is the de facto backbone model for various downstream speech tasks owing to its hybrid attention-convolution architecture.
We propose the Squeezeformer model, which consistently outperforms the state-of-the-art ASR models under the same training schemes.
arXiv Detail & Related papers (2022-06-02T06:06:29Z)
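Several entries above (TokenFormer in particular) and the progressive-scaling claim in the Meta-State abstract revolve around the same pattern: let input tokens attend to a set of "parameter tokens" so the model can grow by appending new key-value parameter pairs without retraining the existing ones. The sketch below illustrates only that pattern; it uses plain softmax cross-attention and invented shapes in place of the modified attention the cited papers actually use.

```python
import numpy as np

def token_parameter_attention(x, key_params, value_params):
    """Cross-attention from input tokens to a set of 'parameter tokens'.

    x            : (n, d) input token representations.
    key_params   : (m, d) learnable parameter keys.
    value_params : (m, d) learnable parameter values.

    Plain scaled softmax attention is used here purely for illustration; it
    stands in for the modified attention the cited papers actually use.
    """
    d = x.shape[1]
    scores = x @ key_params.T / np.sqrt(d)           # (n, m) token-parameter scores
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ value_params                    # (n, d): replaces a fixed linear projection

# Progressive scaling: grow capacity by appending key/value parameter pairs
# while leaving the existing parameters untouched (no retraining from scratch).
rng = np.random.default_rng(0)
n, d, m = 4, 16, 32
x = rng.standard_normal((n, d))
K = rng.standard_normal((m, d))
V = rng.standard_normal((m, d))
y_small = token_parameter_attention(x, K, V)

K_grown = np.concatenate([K, rng.standard_normal((8, d))])   # add 8 new parameter tokens
V_grown = np.concatenate([V, rng.standard_normal((8, d))])
y_grown = token_parameter_attention(x, K_grown, V_grown)     # same interface, larger capacity
assert y_small.shape == y_grown.shape == (n, d)
```

The design point shared by these works is that the number of parameter tokens, not the layer's interface, determines capacity, which is what makes incremental growth possible.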