Gecko: An Efficient Neural Architecture Inherently Processing Sequences with Arbitrary Lengths
- URL: http://arxiv.org/abs/2601.06463v1
- Date: Sat, 10 Jan 2026 07:12:41 GMT
- Title: Gecko: An Efficient Neural Architecture Inherently Processing Sequences with Arbitrary Lengths
- Authors: Xuezhe Ma, Shicheng Wen, Linghao Jin, Bilge Acun, Ruihang Lai, Bohan Hou, Will Lin, Hao Zhang, Songlin Yang, Ryan Lee, Mengxi Wu, Jonathan May, Luke Zettlemoyer, Carole-Jean Wu
- Abstract summary: We propose Gecko, a neural architecture that inherits the design of Mega and Megalodon. In a controlled pretraining comparison with Llama2 and Megalodon at the scale of 7 billion parameters and 2 trillion training tokens, Gecko achieves better efficiency and long-context scalability.
- Score: 72.23218641825571
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Designing a unified neural network that efficiently and inherently processes sequential data of arbitrary length is a central and challenging problem in sequence modeling. Design choices in the Transformer, including quadratic complexity and weak length extrapolation, limit its ability to scale to long sequences. In this work, we propose Gecko, a neural architecture that inherits the design of Mega and Megalodon (exponential moving average with gated attention) and further introduces several technical components to improve its ability to capture long-range dependencies, including timestep decay normalization, a sliding chunk attention mechanism, and adaptive working memory. In a controlled pretraining comparison with Llama2 and Megalodon at the scale of 7 billion parameters and 2 trillion training tokens, Gecko achieves better efficiency and long-context scalability. Gecko reaches a training loss of 1.68, significantly outperforming Llama2-7B (1.75) and Megalodon-7B (1.70) and landing close to Llama2-13B (1.67). Notably, without relying on any context-extension techniques, Gecko exhibits inherent long-context processing and retrieval capabilities, stably handling sequences of up to 4 million tokens and retrieving information from contexts up to $4\times$ longer than its attention window. Code: https://github.com/XuezheMax/gecko-llm
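The abstract names Gecko's components (EMA-gated attention, timestep decay normalization, sliding chunk attention, adaptive working memory) without spelling out how they work. As a rough, non-authoritative illustration of the sliding-chunk idea, the PyTorch sketch below restricts each chunk of queries to attend only to its own chunk and the one before it, so cost stays linear in sequence length. The chunking scheme, masking, and all names here are assumptions for illustration, not the paper's specification; see the repository linked above for the actual implementation.

```python
# Hypothetical sketch of sliding chunk attention: each chunk of queries attends
# only to its own chunk and the immediately preceding chunk, so total cost is
# O(n * chunk_size) instead of O(n^2). All details (overlap, masking, scaling)
# are assumptions, not taken from the Gecko paper.
import torch
import torch.nn.functional as F

def sliding_chunk_attention(q, k, v, chunk_size=128):
    """q, k, v: (batch, seq_len, dim); seq_len must be a multiple of chunk_size here."""
    b, n, d = q.shape
    c = chunk_size
    assert n % c == 0, "this sketch assumes seq_len is a multiple of chunk_size"
    nc = n // c
    # Reshape into chunks: (batch, num_chunks, chunk_size, dim).
    q = q.view(b, nc, c, d)
    k = k.view(b, nc, c, d)
    v = v.view(b, nc, c, d)
    # Keys/values for chunk i = concat(chunk i-1, chunk i); chunk -1 is zero padding.
    k_prev = F.pad(k, (0, 0, 0, 0, 1, 0))[:, :-1]
    v_prev = F.pad(v, (0, 0, 0, 0, 1, 0))[:, :-1]
    k_win = torch.cat([k_prev, k], dim=2)                    # (b, nc, 2c, d)
    v_win = torch.cat([v_prev, v], dim=2)
    scores = torch.einsum('bncd,bnkd->bnck', q, k_win) / d ** 0.5
    # Causal mask inside the 2c-wide window: query position j of the current chunk
    # may attend to window positions 0..c+j.
    idx_q = torch.arange(c).unsqueeze(1)                     # (c, 1)
    idx_k = torch.arange(2 * c).unsqueeze(0)                 # (1, 2c)
    causal = idx_k <= idx_q + c                              # (c, 2c)
    # Mask out the zero-padded "previous chunk" of the very first chunk.
    prev_valid = torch.ones(nc, dtype=torch.bool)
    prev_valid[0] = False
    window_valid = torch.cat(
        [prev_valid[:, None].expand(nc, c), torch.ones(nc, c, dtype=torch.bool)], dim=1
    )                                                         # (nc, 2c)
    mask = causal.unsqueeze(0) & window_valid.unsqueeze(1)    # (nc, c, 2c)
    scores = scores.masked_fill(~mask, float('-inf'))
    attn = scores.softmax(dim=-1)
    out = torch.einsum('bnck,bnkd->bncd', attn, v_win)
    return out.reshape(b, n, d)

# Example: a 1024-token sequence processed in 128-token chunks.
q = torch.randn(2, 1024, 64)
out = sliding_chunk_attention(q, q, q, chunk_size=128)
print(out.shape)  # torch.Size([2, 1024, 64])
```

Because each query sees at most `2 * chunk_size` keys, memory grows linearly with sequence length, which is the kind of property that lets such architectures scale to million-token contexts.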
Related papers
- LAWCAT: Efficient Distillation from Quadratic to Linear Attention with Convolution across Tokens for Long Context Modeling [27.045621004239067]
We propose LAWCAT, a novel linearization framework designed to efficiently transfer the capabilities of pre-trained transformers into a performant linear attention architecture. LAWCAT integrates causal Conv1D layers to enhance local dependency modeling and employs normalized gated linear attention to improve generalization across varying context lengths. Our evaluations demonstrate that distilling Mistral-7B with only 1K-length sequences yields over 90% passkey retrieval accuracy at up to 22K tokens.
arXiv Detail & Related papers (2025-09-22T22:43:44Z)
- Systems and Algorithms for Convolutional Multi-Hybrid Language Models at Scale [68.6602625868888]
We introduce convolutional multi-hybrid architectures, with a design grounded in two simple observations. Operators in hybrid models can be tailored to token manipulation tasks such as in-context recall, multi-token recall, and compression. We train end-to-end 1.2 to 2.9 times faster than optimized Transformers, and 1.1 to 1.4 times faster than previous-generation hybrids.
arXiv Detail & Related papers (2025-02-25T19:47:20Z)
- Taipan: Efficient and Expressive State Space Language Models with Selective Attention [100.16383527459429]
Long-context language modeling is a significant challenge in Natural Language Processing (NLP).
Recent State Space Models (SSMs) such as Mamba offer alternatives with constant memory usage, but they underperform in tasks requiring extensive in-context retrieval.
We introduce Taipan, a novel hybrid architecture that combines Mamba-2 with Selective Attention Layers (SALs).
Our experiments demonstrate Taipan's superior performance across various scales and tasks, offering a promising solution for efficient long-context language modeling.
arXiv Detail & Related papers (2024-10-24T09:25:37Z)
- Correlation-Aware Select and Merge Attention for Efficient Fine-Tuning and Context Length Extension [21.729875191721984]
We introduce correlation-aware selection and merging mechanisms to facilitate efficient sparse attention.
We also propose a novel data augmentation technique involving positional encodings to enhance generalization to unseen positions.
Our method achieves 100% accuracy on the passkey task with a context length of 4M and maintains stable perplexity at a 1M context length.
arXiv Detail & Related papers (2024-10-05T15:59:32Z)
- Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length [112.75694077842604]
We introduce Megalodon, a neural architecture for efficient sequence modeling with unlimited context length.
In a controlled head-to-head comparison with Llama2, Megalodon achieves better efficiency than the Transformer at the scale of 7 billion parameters and 2 trillion training tokens.
arXiv Detail & Related papers (2024-04-12T20:28:14Z)
- LongNet: Scaling Transformers to 1,000,000,000 Tokens [146.4077038371075]
LongNet is a Transformer variant that can scale sequence length to more than 1 billion tokens.
Our work opens up new possibilities for modeling very long sequences, e.g., treating a whole corpus or even the entire Internet as a sequence.
arXiv Detail & Related papers (2023-07-05T17:59:38Z)
- Hyena Hierarchy: Towards Larger Convolutional Language Models [115.82857881546089]
Hyena is a subquadratic drop-in replacement for attention constructed by interleaving implicitly parametrized long convolutions and data-controlled gating.
In recall and reasoning tasks on sequences of thousands to hundreds of thousands of tokens, Hyena improves accuracy by more than 50 points over operators relying on state-spaces and other implicit and explicit methods.
arXiv Detail & Related papers (2023-02-21T18:29:25Z)
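Several of the entries above (Hyena and the convolutional multi-hybrid models) replace attention with long convolutions combined with data-controlled gating. The sketch below is a minimal, assumption-based illustration of that operator class: a causal per-channel convolution over the whole sequence computed with FFTs in O(n log n), multiplied by a learned gate. The filter here is a plain learned tensor rather than Hyena's implicitly parametrized (MLP-generated) filter, and the module name and projections are hypothetical.

```python
# Minimal sketch of a gated long-convolution operator in the spirit of Hyena:
# a causal convolution over the full sequence computed via FFT (O(n log n)),
# combined with a data-controlled multiplicative gate. Simplified placeholder,
# not the published operator.
import torch
import torch.nn as nn

class GatedLongConv(nn.Module):
    def __init__(self, dim, seq_len):
        super().__init__()
        # One learned filter per channel (Hyena instead generates this implicitly).
        self.filter = nn.Parameter(torch.randn(dim, seq_len) * 0.02)
        self.in_proj = nn.Linear(dim, dim)
        self.gate_proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (batch, seq_len, dim)
        b, n, d = x.shape
        u = self.in_proj(x).transpose(1, 2)                  # (b, d, n)
        # Zero-pad to 2n so the circular FFT convolution becomes a linear,
        # causal convolution (filter index 0 aligns with the current timestep).
        fft_len = 2 * n
        u_f = torch.fft.rfft(u, n=fft_len)
        k_f = torch.fft.rfft(self.filter[:, :n], n=fft_len)
        y = torch.fft.irfft(u_f * k_f, n=fft_len)[..., :n]   # keep the causal part
        y = y.transpose(1, 2)                                # (b, n, d)
        gate = torch.sigmoid(self.gate_proj(x))              # data-controlled gate
        return gate * y

# Example usage on a 4096-token sequence.
m = GatedLongConv(dim=64, seq_len=4096)
x = torch.randn(1, 4096, 64)
print(m(x).shape)  # torch.Size([1, 4096, 64])
```

The FFT makes the per-layer cost grow as n log n rather than n^2, which is why this operator family appears repeatedly in the long-context architectures listed above.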