FBS: Modeling Native Parallel Reading inside a Transformer
- URL: http://arxiv.org/abs/2601.21708v1
- Date: Thu, 29 Jan 2026 13:39:55 GMT
- Title: FBS: Modeling Native Parallel Reading inside a Transformer
- Authors: Tongxi Wang
- Abstract summary: Large language models (LLMs) excel across many tasks, yet inference is still dominated by strictly token-by-token autoregression. We propose the Fovea-Block-Skip Transformer (FBS), which injects a causal, trainable loop into Transformers via a Parafovea-Attention Window (PAW), Chunk-Head (CH), and Skip-Gate (SG).
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) excel across many tasks, yet inference is still dominated by strictly token-by-token autoregression. Existing acceleration methods largely patch this pipeline and miss core human-reading ingredients: content-adaptive foresight, chunk-structure-aware compute allocation, and train-test consistency for preview/skimming. We propose the Fovea-Block-Skip Transformer (FBS), which injects a causal, trainable loop into Transformers via a Parafovea-Attention Window (PAW), Chunk-Head (CH), and Skip-Gate (SG). Across diverse benchmarks, FBS improves the quality-efficiency trade-off without increasing parameters, and ablations show the three modules are complementary.
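The abstract only names the three modules, so the following is a minimal, hypothetical PyTorch sketch of how one step of such a loop could look: a PAW that attends from the current fovea position over a short preview window, a CH that scores chunk length, and an SG that decides whether the previewed span can be skimmed. All shapes, module designs, and the decision rule below are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FBSStepSketch(nn.Module):
    """Hypothetical sketch of one FBS reading step, reconstructed from the
    abstract only: PAW previews a few draft positions ahead of the current
    fovea token, CH scores how far the current chunk extends, and SG decides
    whether that chunk can be skimmed in one step."""

    def __init__(self, dim: int, window: int = 4):
        super().__init__()
        self.paw = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.chunk_head = nn.Linear(dim, window)   # CH: chunk-length logits
        self.skip_gate = nn.Linear(dim, 1)         # SG: skim vs. read

    def forward(self, fovea, preview):
        # fovea:   (batch, 1, dim)       hidden state at the current position
        # preview: (batch, window, dim)  parafoveal (look-ahead) hidden states
        ctx, _ = self.paw(fovea, preview, preview)        # PAW: peek ahead
        chunk_len = self.chunk_head(ctx).argmax(-1) + 1   # CH: predicted span
        skim = torch.sigmoid(self.skip_gate(ctx)) > 0.5   # SG: skip decision
        return ctx, chunk_len.squeeze(1), skim.view(-1)


# Toy usage with random states.
ctx, span, skim = FBSStepSketch(64)(torch.randn(2, 1, 64), torch.randn(2, 4, 64))
```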
Related papers
- vLinear: A Powerful Linear Model for Multivariate Time Series Forecasting [28.587343014443576]
vecTrans is a lightweight module that utilizes a learnable vector to model multivariate correlations. WFMLoss is an effective plug-and-play objective, consistently improving existing forecasters.
arXiv Detail & Related papers (2026-01-20T09:23:10Z)
- Parallel Decoder Transformer: Model-Internal Parallel Decoding with Speculative Invariance via Note Conditioning [0.0]
We introduce the Parallel Decoder Transformer (PDT), a parameter-efficient architecture that embeds coordination primitives directly into the inference process of a frozen pre-trained model. PDT achieves effective self-correction, reaching 77.8% precision in coverage prediction and recovering approximate serial semantics without modifying the trunk weights.
arXiv Detail & Related papers (2025-12-10T20:19:10Z)
- OneTrans: Unified Feature Interaction and Sequence Modeling with One Transformer in Industrial Recommender [32.265739328468584]
OneTrans is a unified Transformer backbone that simultaneously performs user-behavior sequence modeling and feature interaction. We show that OneTrans scales efficiently with increasing parameters, consistently outperforms strong baselines, and yields a 5.68% lift in per-user GMV in online A/B tests.
arXiv Detail & Related papers (2025-10-30T03:30:12Z)
- BATR-FST: Bi-Level Adaptive Token Refinement for Few-Shot Transformers [2.5680214354539803]
We propose Bi-Level Adaptive Token Refinement for Few-Shot Transformers (BATR-FST). BATR-FST progressively improves token representations and maintains a robust inductive bias for few-shot classification. It achieves superior results in both 1-shot and 5-shot scenarios, improving few-shot classification with Transformers.
arXiv Detail & Related papers (2025-09-16T07:33:21Z)
- Mixture-of-Modules: Reinventing Transformers as Dynamic Assemblies of Modules [96.21649779507831]
We propose a novel architecture dubbed mixture-of-modules (MoM).
MoM is motivated by an intuition that any layer, regardless of its position, can be used to compute a token.
We show that MoM provides not only a unified framework for Transformers but also a flexible and learnable approach for reducing redundancy.
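As a toy illustration of that intuition, here is a hypothetical routing step in which each token picks one block from a shared pool, independent of the block's nominal depth; the block design and routing rule are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class MoMRouterSketch(nn.Module):
    """Toy mixture-of-modules step: a shared pool of blocks plus a per-token
    router that decides which block computes each token."""

    def __init__(self, dim: int, num_modules: int = 4):
        super().__init__()
        self.pool = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(num_modules)
        )
        self.router = nn.Linear(dim, num_modules)

    def forward(self, x):                         # x: (batch, tokens, dim)
        choice = self.router(x).argmax(dim=-1)    # (batch, tokens)
        outs = torch.stack([m(x) for m in self.pool], dim=-2)  # (b, t, M, d)
        idx = choice[..., None, None].expand(-1, -1, 1, x.size(-1))
        return x + outs.gather(-2, idx).squeeze(-2)


y = MoMRouterSketch(32)(torch.randn(2, 5, 32))
```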
arXiv Detail & Related papers (2024-07-09T08:50:18Z)
- SCHEME: Scalable Channel Mixer for Vision Transformers [52.605868919281086]
Vision Transformers have achieved impressive performance in many computer vision tasks. We show that the dense connections can be replaced with a sparse block diagonal structure that supports larger expansion ratios. We also propose the use of a lightweight, parameter-free, channel covariance attention mechanism as a parallel branch during training.
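A minimal sketch of the block-diagonal channel-mixing idea, implemented here with grouped 1x1 convolutions as the block-diagonal projections; the expansion ratio and block count are illustrative assumptions, not the authors' exact SCHEME module.

```python
import torch
import torch.nn as nn

class BlockDiagonalMLPSketch(nn.Module):
    """Channel-mixing MLP whose dense projections are replaced by a
    block-diagonal structure (grouped 1x1 convolutions), allowing a larger
    expansion ratio at similar cost."""

    def __init__(self, dim: int, expansion: int = 8, blocks: int = 4):
        super().__init__()
        self.up = nn.Conv1d(dim, dim * expansion, kernel_size=1, groups=blocks)
        self.act = nn.GELU()
        self.down = nn.Conv1d(dim * expansion, dim, kernel_size=1, groups=blocks)

    def forward(self, x):                # x: (batch, tokens, dim)
        x = x.transpose(1, 2)            # -> (batch, dim, tokens)
        x = self.down(self.act(self.up(x)))
        return x.transpose(1, 2)


y = BlockDiagonalMLPSketch(64)(torch.randn(2, 16, 64))
```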
arXiv Detail & Related papers (2023-12-01T08:22:34Z)
- RSF-Conv: Rotation-and-Scale Equivariant Fourier Parameterized Convolution for Retinal Vessel Segmentation [58.618797429661754]
We propose a rotation-and-scale equivariant Fourier parameterized convolution (RSF-Conv) specifically for retinal vessel segmentation.
As a general module, RSF-Conv can be integrated into existing networks in a plug-and-play manner.
To demonstrate the effectiveness of RSF-Conv, we also apply RSF-Conv+U-Net and RSF-Conv+Iter-Net to retinal artery/vein classification.
arXiv Detail & Related papers (2023-09-27T13:14:57Z)
- TranSFormer: Slow-Fast Transformer for Machine Translation [52.12212173775029]
We present a Slow-Fast two-stream learning model, referred to as TranSFormer.
Our TranSFormer shows consistent BLEU improvements (larger than 1 BLEU point) on several machine translation benchmarks.
arXiv Detail & Related papers (2023-05-26T14:37:38Z)
- Optimizing Non-Autoregressive Transformers with Contrastive Learning [74.46714706658517]
Non-autoregressive Transformers (NATs) reduce the inference latency of Autoregressive Transformers (ATs) by predicting words all at once rather than in sequential order.
In this paper, we propose to ease the difficulty of modality learning via sampling from the model distribution instead of the data distribution.
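For contrast with token-by-token autoregression, a minimal sketch of single-pass non-autoregressive decoding; the `model(src, tgt)` interface and the mask-placeholder scheme are assumptions for illustration, not this paper's training method.

```python
import torch

def nat_decode(model, src, tgt_len: int, mask_id: int):
    """Single-pass non-autoregressive decoding: every target position is
    filled with a placeholder token and all positions are predicted in one
    forward pass, instead of one token per step. `model(src, tgt)` is
    assumed to return logits of shape (batch, tgt_len, vocab)."""
    batch = src.size(0)
    tgt = torch.full((batch, tgt_len), mask_id, dtype=torch.long, device=src.device)
    logits = model(src, tgt)          # one forward pass for all positions
    return logits.argmax(dim=-1)      # (batch, tgt_len), decoded in parallel
```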
arXiv Detail & Related papers (2023-05-23T04:20:13Z)
- HETFORMER: Heterogeneous Transformer with Sparse Attention for Long-Text Extractive Summarization [57.798070356553936]
HETFORMER is a Transformer-based pre-trained model with multi-granularity sparse attentions for extractive summarization.
Experiments on both single- and multi-document summarization tasks show that HETFORMER achieves state-of-the-art performance in Rouge F1.
arXiv Detail & Related papers (2021-10-12T22:42:31Z)
- PnP-DETR: Towards Efficient Visual Analysis with Transformers [146.55679348493587]
Recently, DETR pioneered solving vision tasks with transformers: it directly translates the image feature map into the object detection result.
The recent transformer-based image recognition model ViT also shows a consistent efficiency gain.
arXiv Detail & Related papers (2021-09-15T01:10:30Z)
- FNet: Mixing Tokens with Fourier Transforms [0.578717214982749]
We show that Transformer encoder architectures can be massively sped up with limited accuracy costs.
We replace the self-attention sublayers with simple linear transformations that "mix" input tokens.
The resulting model, which we name FNet, scales very efficiently to long inputs.
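A minimal sketch of the FNet mixing sublayer: the self-attention sublayer is replaced by a 2D discrete Fourier transform whose real part is kept; residual connections, layer norms, and the feed-forward sublayer follow the standard Transformer encoder and are omitted here.

```python
import torch

def fnet_mix(x: torch.Tensor) -> torch.Tensor:
    """FNet token mixing: apply a DFT along the hidden dimension, then along
    the sequence dimension, and keep only the real part of the result."""
    # x: (batch, seq_len, hidden)
    return torch.fft.fft(torch.fft.fft(x, dim=-1), dim=-2).real


mixed = fnet_mix(torch.randn(2, 128, 64))
```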
arXiv Detail & Related papers (2021-05-09T03:32:48Z)
- Bayesian Transformer Language Models for Speech Recognition [59.235405107295655]
State-of-the-art neural language models (LMs) represented by Transformers are highly complex.
This paper proposes a full Bayesian learning framework for Transformer LM estimation.
arXiv Detail & Related papers (2021-02-09T10:55:27Z)