MaBERT: A Padding-Safe Interleaved Transformer-Mamba Hybrid Encoder for Efficient Extended-Context Masked Language Modeling
- URL: http://arxiv.org/abs/2603.03001v1
- Date: Tue, 03 Mar 2026 13:52:33 GMT
- Title: MaBERT: A Padding-Safe Interleaved Transformer-Mamba Hybrid Encoder for Efficient Extended-Context Masked Language Modeling
- Authors: Jinwoong Kim, Sangjin Park
- Abstract summary: MaBERT is a hybrid encoder that interleaves Transformer layers for global dependency modeling with Mamba layers for linear-time state updates. On GLUE, MaBERT achieves the best mean score on five of eight tasks, with strong performance on CoLA and the sentence-pair inference tasks.
- Score: 3.5795275871379704
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-attention encoders such as Bidirectional Encoder Representations from Transformers (BERT) scale quadratically with sequence length, making long-context modeling expensive. Linear-time state space models such as Mamba are efficient, but they show limitations in modeling global interactions and can suffer from padding-induced state contamination. We propose MaBERT, a hybrid encoder that interleaves Transformer layers for global dependency modeling with Mamba layers for linear-time state updates. This design alternates global contextual integration with fast state accumulation, enabling efficient training and inference on long inputs. To stabilize variable-length batching, we introduce padding-safe masking, which blocks state propagation through padded positions, and mask-aware attention pooling, which aggregates information only from valid tokens. On GLUE, MaBERT achieves the best mean score on five of the eight tasks, with strong performance on CoLA and the sentence-pair inference tasks. When extending the context from 512 to 4,096 tokens, MaBERT reduces training time and inference latency by 2.36x and 2.43x, respectively, relative to the average of encoder baselines, demonstrating a practical and efficient long-context encoder.
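The interleaving scheme described in the abstract alternates a quadratic global-attention layer with a linear-time layer. As a hedged illustration only (the paper's reference implementation is not shown on this page), the following PyTorch sketch wires up such an alternating stack; `ToyMambaLayer`, `InterleavedEncoder`, and all hyperparameters are hypothetical stand-ins, and the toy layer is a gated depthwise convolution rather than a real selective state space block.

```python
import torch
import torch.nn as nn

class ToyMambaLayer(nn.Module):
    """Stand-in for a Mamba block: gated depthwise conv with a causal trim.
    NOT the real selective state space layer, just a placeholder with the
    same (B, T, D) -> (B, T, D) interface."""
    def __init__(self, d_model):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3,
                              padding=2, groups=d_model)
        self.gate = nn.Linear(d_model, d_model)

    def forward(self, x):                                   # x: (B, T, D)
        T = x.size(1)
        h = self.conv(x.transpose(1, 2))[..., :T].transpose(1, 2)
        return x + torch.sigmoid(self.gate(x)) * h          # gated residual

class InterleavedEncoder(nn.Module):
    """Alternate global-attention layers with linear-time layers."""
    def __init__(self, d_model=256, n_heads=4, n_pairs=3):
        super().__init__()
        layers = []
        for _ in range(n_pairs):
            layers.append(nn.TransformerEncoderLayer(
                d_model, n_heads, batch_first=True))        # global mixing
            layers.append(ToyMambaLayer(d_model))           # linear-time mixing
        self.layers = nn.ModuleList(layers)

    def forward(self, x, pad_mask):                         # pad_mask: True = valid
        for layer in self.layers:
            if isinstance(layer, nn.TransformerEncoderLayer):
                # PyTorch expects True at positions to IGNORE, hence the flip.
                x = layer(x, src_key_padding_mask=~pad_mask)
            else:
                x = layer(x)
        return x

# Usage: one fully valid sequence and one padded sequence in the batch.
x = torch.randn(2, 16, 256)
pad_mask = torch.ones(2, 16, dtype=torch.bool)
pad_mask[1, 10:] = False
print(InterleavedEncoder()(x, pad_mask).shape)  # torch.Size([2, 16, 256])
```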
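The two stabilization mechanisms, padding-safe masking and mask-aware attention pooling, can likewise be sketched from their one-sentence descriptions in the abstract. This is a minimal interpretation, not the authors' code: `padding_safe_scan` freezes the recurrent state at padded positions so padding cannot contaminate it, `mask_aware_pool` gives padded tokens zero attention weight, and the scalar `decay` stands in for Mamba's learned selective parameters.

```python
import torch

def padding_safe_scan(x, pad_mask, decay):
    """Toy linear-time state accumulation with padding-safe masking.

    x:        (B, T, D) token features
    pad_mask: (B, T) bool, True at valid tokens, False at padding
    decay:    scalar in (0, 1), stand-in for a learned state decay
    """
    B, T, D = x.shape
    state = x.new_zeros(B, D)
    out = []
    m = pad_mask.float().unsqueeze(-1)                  # (B, T, 1)
    for t in range(T):
        upd = decay * state + x[:, t]                   # candidate update
        state = m[:, t] * upd + (1 - m[:, t]) * state   # freeze on padding
        out.append(state)
    return torch.stack(out, dim=1)                      # (B, T, D)

def mask_aware_pool(h, pad_mask, w):
    """Attention pooling that aggregates only over valid tokens.

    h: (B, T, D) encoder outputs; w: (D,) hypothetical scoring vector.
    """
    scores = (h @ w).masked_fill(~pad_mask, float("-inf"))
    attn = torch.softmax(scores, dim=-1)                # zero weight on padding
    return torch.einsum("bt,btd->bd", attn, h)          # (B, D) pooled vector

# Tiny usage example with one padded sequence in the batch.
x = torch.randn(2, 6, 8)
pad_mask = torch.tensor([[1, 1, 1, 1, 1, 1],
                         [1, 1, 1, 0, 0, 0]], dtype=torch.bool)
h = padding_safe_scan(x, pad_mask, decay=0.9)
pooled = mask_aware_pool(h, pad_mask, torch.randn(8))
print(h.shape, pooled.shape)  # torch.Size([2, 6, 8]) torch.Size([2, 8])
```

Freezing the state (rather than merely zeroing the padded inputs) matters because a zeroed input still advances the decayed state, whereas the mask above carries the last valid state forward unchanged.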
Related papers
- Stateful Token Reduction for Long-Video Hybrid VLMs [69.6930118088911]
We study query-conditioned token reduction for hybrid video vision-language models (VLMs). We propose a low-to-high progressive reduction schedule and a unified language-aware scoring mechanism for both attention and Mamba blocks. Under an aggressive compression setting, our approach delivers substantial prefilling speedups with near-baseline accuracy at test time.
arXiv Detail & Related papers (2026-02-27T08:11:06Z)
- SCOUT: Toward Sub-Quadratic Attention via Segment Compression for Optimized Utility in Transformers [15.142822497807236]
We propose SCOUT, a hybrid architecture that compresses tokens locally within fixed-size segments and applies attention only over these compressed representations. SCOUT retains much of the expressivity of full attention while substantially reducing the computational and memory cost. We analyze SCOUT's computational and memory efficiency and evaluate it empirically on long-context language modeling and reasoning tasks.
arXiv Detail & Related papers (2025-08-31T17:08:33Z)
- Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation [108.0657508755532]
We introduce the Gated Memory Unit (GMU), a simple yet effective mechanism for efficient memory sharing across layers. We apply it to create SambaY, a decoder-hybrid-decoder architecture that incorporates GMUs to share memory readout states from a Samba-based self-decoder.
arXiv Detail & Related papers (2025-07-09T07:27:00Z)
- Taipan: Efficient and Expressive State Space Language Models with Selective Attention [100.16383527459429]
Long-context language modeling is a significant challenge in Natural Language Processing (NLP).
Recent State Space Models (SSMs) such as Mamba offer alternatives with constant memory usage, but they underperform in tasks requiring extensive in-context retrieval.
We introduce Taipan, a novel hybrid architecture that combines Mamba-2 with Selective Attention Layers (SALs).
Our experiments demonstrate Taipan's superior performance across various scales and tasks, offering a promising solution for efficient long-context language modeling.
arXiv Detail & Related papers (2024-10-24T09:25:37Z)
- LaMamba-Diff: Linear-Time High-Fidelity Diffusion Models Based on Local Attention and Mamba [54.85262314960038]
Local Attentional Mamba blocks capture both global contexts and local details with linear complexity.
Our model exhibits exceptional scalability and surpasses the performance of DiT across various model scales on ImageNet at 256x256 resolution.
Compared to state-of-the-art diffusion models on ImageNet 256x256 and 512x512, our largest model presents notable advantages, such as a reduction of up to 62% GFLOPs.
arXiv Detail & Related papers (2024-08-05T16:39:39Z)
- Progressive Token Length Scaling in Transformer Encoders for Efficient Universal Segmentation [67.85309547416155]
A powerful architecture for universal segmentation relies on transformers that encode multi-scale image features and decode object queries into mask predictions. With efficiency being a high priority for scaling such models, we observed that the state-of-the-art method Mask2Former uses 50% of its compute only on the transformer encoder. This is due to the retention of a full-length token-level representation of all backbone feature scales at each encoder layer.
arXiv Detail & Related papers (2024-04-23T01:34:20Z)
- StableMask: Refining Causal Masking in Decoder-only Transformer [22.75632485195928]
The decoder-only Transformer architecture with causal masking and relative position encoding (RPE) has become the de facto choice in language modeling.
However, it requires all attention scores to be non-zero and sum to 1, even if the current embedding has sufficient self-contained information.
We propose StableMask: a parameter-free method to address both limitations by refining the causal mask.
arXiv Detail & Related papers (2024-02-07T12:01:02Z)
- MambaByte: Token-free Selective State Space Model [71.90159903595514]
MambaByte is a token-free adaptation of the Mamba SSM trained autoregressively on byte sequences.
We show MambaByte to be competitive with, and even to outperform, state-of-the-art subword Transformers on language modeling tasks.
arXiv Detail & Related papers (2024-01-24T18:53:53Z)