Subjective Depth and Timescale Transformers: Learning Where and When to Compute
- URL: http://arxiv.org/abs/2511.21408v1
- Date: Wed, 26 Nov 2025 14:00:18 GMT
- Title: Subjective Depth and Timescale Transformers: Learning Where and When to Compute
- Authors: Frederico Wieser, Martin Benfeghoul, Haitham Bou Ammar, Jun Wang, Zafeirios Fountas
- Abstract summary: We introduce Subjective Depth Transformers (SDT) and Subjective Timescale Transformers (STT). SDT and STT leverage Bayesian surprise signals to dynamically route computation, learning where and when to compute within decoder-only TFs. The proposed architectures establish a flexible framework for efficiency, reducing self-attention computation by 75% and KV-cache requirements by 50% within each compute-skipping layer, setting a pathway for more efficient models.
- Score: 15.164635408299304
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rigid, uniform allocation of computation in standard Transformer (TF) architectures can limit their efficiency and scalability, particularly for large-scale models and long sequences. Addressing this, we introduce Subjective Depth Transformers (SDT) and Subjective Timescale Transformers (STT), two distinct architectures that leverage Bayesian surprise signals to dynamically route computation, learning where and when to compute within decoder-only TFs. SDT augments a decoder-only stack with alternating Decision and Dynamic layers: a Decision layer computes a full block 'posterior' and a lightweight 'prior,' while a Dynamic layer employs fixed-capacity Top-K routing based on Bayesian surprise (Expected and Unexpected Change), maintaining a static compute graph. STT extends this conditional computation to the temporal domain: a transition network predicts residual updates, forming a temporal 'change hypothesis' that informs a router to dynamically execute or bypass TF blocks for each token, managing KV-cache contributions. Both architectures exhibit the predicted shift from novelty-driven to prediction-driven gating over training, suggesting alignment with surprise-based principles. While operating at reduced capacity, they offer preliminary insights into the compute-accuracy trade-offs of conditional computation. The proposed architectures establish a flexible framework for efficiency, reducing self-attention computation by 75% and KV-cache requirements by 50% within each compute-skipping layer, setting a pathway for more efficient models.
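The routing mechanism lends itself to a short illustration. The sketch below shows fixed-capacity Top-K routing gated by a surprise score; the class name `SurpriseTopKRouter`, the `k_ratio` parameter, and the use of the gap between a lightweight prior and the layer input as the surprise signal are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class SurpriseTopKRouter(nn.Module):
    """Illustrative sketch: route only the top-k most 'surprising' tokens through a
    full Transformer block; the rest take a cheap prior path.  The surprise score
    here is the L2 gap between a lightweight prior and the layer input (an
    assumption, not the paper's exact signal)."""

    def __init__(self, d_model: int, k_ratio: float = 0.25):
        super().__init__()
        self.prior = nn.Linear(d_model, d_model)   # lightweight 'prior'
        self.k_ratio = k_ratio                      # fixed capacity -> static compute graph

    def forward(self, x: torch.Tensor, block: nn.Module) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, t, d = x.shape
        k = max(1, int(self.k_ratio * t))

        prior = self.prior(x)                       # cheap prediction of the block's update
        surprise = (x - prior).norm(dim=-1)         # (batch, seq_len) surprise score

        topk = surprise.topk(k, dim=1).indices      # tokens worth full compute
        routed = torch.gather(x, 1, topk.unsqueeze(-1).expand(-1, -1, d))

        updated = block(routed)                     # full 'posterior' compute on selected tokens
        out = prior                                 # bypassed tokens keep the cheap prior path
        out = out.scatter(1, topk.unsqueeze(-1).expand(-1, -1, d), updated)
        return out

# Toy usage with a standard encoder layer standing in for the full block:
block = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
router = SurpriseTopKRouter(d_model=64, k_ratio=0.25)
y = router(torch.randn(2, 16, 64), block)   # only 4 of 16 tokens get full compute
```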
Related papers
- LiQSS: Post-Transformer Linear Quantum-Inspired State-Space Tensor Networks for Real-Time 6G [85.58816960936069]
Proactive and agentic control in Sixth-Generation (6G) Open Radio Access Networks (O-RAN) requires control-grade prediction under stringent Near-Real-Time (Near-RT) latency and computational constraints. This paper investigates a post-Transformer paradigm for efficient radio telemetry forecasting. We propose a quantum-inspired state-space tensor network that replaces self-attention with stable structured state-space dynamics kernels.
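For intuition, a minimal sketch of the general idea of replacing self-attention with a structured state-space recurrence is given below; the diagonal parameterization and the class name `DiagonalSSM` are assumptions and do not reflect LiQSS's specific quantum-inspired tensor-network kernel.

```python
import torch
import torch.nn as nn

class DiagonalSSM(nn.Module):
    """Generic structured state-space layer: a linear recurrence
    h_t = A * h_{t-1} + B * u_t, y_t = C . h_t, computed with an O(T) scan.
    It stands in for 'replacing self-attention with state-space dynamics'."""

    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        # Diagonal transition initialized for stability (|A| < 1).
        self.log_a = nn.Parameter(torch.full((d_model, d_state), -0.5))
        self.B = nn.Parameter(torch.randn(d_model, d_state) * 0.1)
        self.C = nn.Parameter(torch.randn(d_model, d_state) * 0.1)

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        # u: (batch, seq_len, d_model); sequential scan, no attention matrix.
        b, t, d = u.shape
        a = torch.exp(self.log_a)                              # (d_model, d_state), in (0, 1)
        h = u.new_zeros(b, d, a.shape[-1])
        ys = []
        for step in range(t):
            h = a * h + self.B * u[:, step].unsqueeze(-1)      # per-channel state update
            ys.append((h * self.C).sum(-1))                    # read-out y_t
        return torch.stack(ys, dim=1)
```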
arXiv Detail & Related papers (2026-01-18T12:08:38Z)
- Rethinking Vision Transformer Depth via Structural Reparameterization [16.12815682992294]
We propose a branch-based structural reparameterization technique that operates during the training phase. Our approach leverages parallel branches within transformer blocks that can be systematically consolidated into streamlined single-path models. When applied to ViT-Tiny, the framework successfully reduces the original 12-layer architecture to 6, 4, or as few as 3 layers while maintaining classification accuracy on ImageNet-1K.
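A minimal sketch of the underlying reparameterization idea, assuming purely linear parallel branches (the paper's actual branch design inside transformer blocks may differ): two branches trained in parallel are folded into a single equivalent layer after training.

```python
import torch
import torch.nn as nn

class ParallelBranchMLP(nn.Module):
    """Training-time block with two parallel linear branches whose outputs are summed.
    Because both branches are linear in the same input, they can be folded into one
    linear layer after training (generic structural reparameterization)."""

    def __init__(self, dim: int):
        super().__init__()
        self.branch_a = nn.Linear(dim, dim)
        self.branch_b = nn.Linear(dim, dim)

    def forward(self, x):
        return self.branch_a(x) + self.branch_b(x)

    @torch.no_grad()
    def reparameterize(self) -> nn.Linear:
        # W_a x + b_a + W_b x + b_b == (W_a + W_b) x + (b_a + b_b)
        merged = nn.Linear(self.branch_a.in_features, self.branch_a.out_features)
        merged.weight.copy_(self.branch_a.weight + self.branch_b.weight)
        merged.bias.copy_(self.branch_a.bias + self.branch_b.bias)
        return merged

# Sanity check: the merged single-path layer matches the two-branch block.
block = ParallelBranchMLP(8)
x = torch.randn(4, 8)
assert torch.allclose(block(x), block.reparameterize()(x), atol=1e-6)
```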
arXiv Detail & Related papers (2025-11-24T21:28:55Z)
- STAS: Spatio-Temporal Adaptive Computation Time for Spiking Transformers [5.234835661080496]
Spiking neural networks (SNNs) offer energy efficiency over artificial neural networks (ANNs) but suffer from high latency and computational overhead. We propose STAS (Spatio-Temporal Adaptive computation time for Spiking transformers), a framework that co-designs the static architecture and the dynamic computation policy.
arXiv Detail & Related papers (2025-08-19T13:18:21Z)
- Echo State Transformer: Attention Over Finite Memories [2.118933003468525]
We introduce Echo State Transformers (EST), a hybrid architecture for processing sequential data. EST integrates Transformer attention mechanisms with principles from Reservoir Computing to create a fixed-size window distributed memory system. EST ranks first overall in two of five categories, outperforming strong state-of-the-art baselines on classification and anomaly detection tasks.
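A rough sketch of the general idea, attention over a fixed number of reservoir-style memory units, follows; the leaky update rule, the frozen random weights, and the single-head attention are assumptions rather than EST's actual design.

```python
import torch
import torch.nn as nn

class ReservoirMemoryAttention(nn.Module):
    """Rough sketch: a fixed set of reservoir-style memory states is updated with a
    leaky non-linear recurrence (frozen random weights, as in Reservoir Computing),
    and each input token attends over that fixed-size memory."""

    def __init__(self, d_model: int, n_units: int = 8, leak: float = 0.3):
        super().__init__()
        self.leak = leak
        # Frozen random input weights, one recurrence per memory unit.
        self.w_in = nn.Parameter(torch.randn(n_units, d_model, d_model) * 0.1,
                                 requires_grad=False)
        self.attn = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); memory: (batch, n_units, d_model)
        b, t, d = x.shape
        mem = x.new_zeros(b, self.w_in.shape[0], d)
        outs = []
        for step in range(t):
            drive = torch.einsum('udc,bc->bud', self.w_in, x[:, step])
            mem = (1 - self.leak) * mem + self.leak * torch.tanh(drive + mem)
            y, _ = self.attn(x[:, step:step + 1], mem, mem)  # query: token, keys/values: memory
            outs.append(y)
        return torch.cat(outs, dim=1)
```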
arXiv Detail & Related papers (2025-06-25T09:56:25Z)
- DyTTP: Trajectory Prediction with Normalization-Free Transformers [0.0]
Transformer-based architectures have demonstrated significant promise in capturing complex dependencies for trajectory prediction. We present a two-fold approach to address these challenges. First, we integrate DynamicTanh (DyT), a recently proposed normalization-free technique for transformers, into the backbone, replacing traditional layer normalization. To our knowledge, this is the first work to apply DyT to the trajectory prediction task.
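For reference, a minimal implementation of DynamicTanh as a drop-in replacement for LayerNorm is sketched below, following the published DyT formulation gamma * tanh(alpha * x) + beta; the initial value of alpha here is an assumption.

```python
import torch
import torch.nn as nn

class DynamicTanh(nn.Module):
    """Drop-in replacement for LayerNorm: an element-wise tanh with a learnable
    input scale, plus the usual affine parameters.  No mean/variance statistics
    are computed, unlike LayerNorm."""

    def __init__(self, dim: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1,), alpha_init))
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gamma * torch.tanh(self.alpha * x) + self.beta

# e.g. swapping it into an existing block (illustrative):
# block.norm1 = DynamicTanh(block.norm1.normalized_shape[0])
```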
arXiv Detail & Related papers (2025-04-07T09:26:25Z)
- Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching [56.286064975443026]
We make an interesting and somehow surprising observation: the computation of a large proportion of layers in the diffusion transformer, through a caching mechanism, can be readily removed even without updating the model parameters.
We introduce a novel scheme, named Learning-to-Cache (L2C), that learns to conduct caching in a dynamic manner for diffusion transformers.
Experimental results show that L2C largely outperforms samplers such as DDIM and DPM-Solver, as well as prior cache-based methods, at the same inference speed.
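The caching idea can be illustrated with a small sketch: at each denoising step, some transformer layers reuse their activation from the previous step instead of recomputing it. The fixed `reuse_prob` schedule below stands in for L2C's learned router and is purely an assumption for illustration.

```python
import torch
import torch.nn as nn

def denoise_with_layer_cache(blocks, x, n_steps, reuse_prob):
    """Illustrative sketch of layer caching in a diffusion transformer: at each
    denoising step, layers flagged by `reuse_prob` reuse their output from the
    previous step instead of recomputing it."""
    cache = [None] * len(blocks)
    for step in range(n_steps):
        h = x
        for i, block in enumerate(blocks):
            if step > 0 and cache[i] is not None and reuse_prob[i] > 0.5:
                h = cache[i]                  # skip compute: reuse last step's activation
            else:
                h = block(h)                  # recompute and refresh the cache
                cache[i] = h
        x = h                                 # feed the estimate to the next step
    return x

# Toy usage: 4 MLP 'layers', layers 1 and 2 are cached after the first step.
blocks = nn.ModuleList(nn.Sequential(nn.Linear(16, 16), nn.GELU()) for _ in range(4))
out = denoise_with_layer_cache(blocks, torch.randn(2, 16), n_steps=3,
                               reuse_prob=[0.0, 0.9, 0.9, 0.0])
```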
arXiv Detail & Related papers (2024-06-03T18:49:57Z)
- TCCT-Net: Two-Stream Network Architecture for Fast and Efficient Engagement Estimation via Behavioral Feature Signals [58.865901821451295]
We present a novel two-stream feature fusion "Tensor-Convolution and Convolution-Transformer Network" (TCCT-Net) architecture.
To better learn the meaningful patterns in the temporal-spatial domain, we design a "CT" stream that integrates a hybrid convolutional-transformer.
In parallel, to efficiently extract rich patterns from the temporal-frequency domain, we introduce a "TC" stream that uses Continuous Wavelet Transform (CWT) to represent information in a 2D tensor form.
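A minimal example of that transformation, using PyWavelets to turn a 1-D behavioral signal into a 2-D scale-by-time scalogram, is shown below; the Morlet wavelet, the sampling rate, and the scale range are assumptions.

```python
import numpy as np
import pywt

# Sketch of the "TC" idea: a Continuous Wavelet Transform turns a 1-D behavioural
# signal into a 2-D (scale x time) tensor that a convolutional stream can consume.
fs = 30.0                                   # assumed 30 Hz behavioural feature signal
t = np.arange(0, 10, 1 / fs)
signal = np.sin(2 * np.pi * 1.5 * t) + 0.3 * np.random.randn(t.size)

scales = np.arange(1, 65)                   # 64 scales -> 64 rows in the tensor
coeffs, freqs = pywt.cwt(signal, scales, 'morl', sampling_period=1 / fs)
scalogram = np.abs(coeffs)                  # shape (64, len(signal)), ready for a 2-D CNN
```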
arXiv Detail & Related papers (2024-04-15T06:01:48Z)
- Iterative Soft Shrinkage Learning for Efficient Image Super-Resolution [91.3781512926942]
Image super-resolution (SR) has witnessed extensive neural network designs from CNN to transformer architectures.
This work investigates the potential of network pruning for super-resolution to take advantage of off-the-shelf network designs and reduce the underlying computational overhead.
We propose a novel Iterative Soft Shrinkage-Percentage (ISS-P) method that optimizes the sparse structure of a randomly initialized network at each iteration and tweaks unimportant weights by a small amount proportional to the magnitude scale on the fly.
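A minimal sketch of the soft-shrinkage idea follows: rather than hard-zeroing the smallest-magnitude weights, they are shrunk by a small amount proportional to their magnitude so they can recover later. The selection rule, shrink factor, and schedule are assumptions, not the exact ISS-P procedure.

```python
import torch

@torch.no_grad()
def soft_shrink_step(weight: torch.Tensor, prune_ratio: float, shrink: float = 0.1):
    """One illustrative soft-shrinkage step: the bottom `prune_ratio` of weights by
    magnitude are shrunk proportionally to their magnitude instead of being zeroed."""
    flat = weight.abs().flatten()
    k = int(prune_ratio * flat.numel())
    if k == 0:
        return weight
    threshold = flat.kthvalue(k).values                   # magnitude cut-off for the bottom k
    unimportant = weight.abs() <= threshold
    weight[unimportant] -= shrink * weight[unimportant]   # soft shrink, not hard zero
    return weight

w = torch.randn(256, 256)
for _ in range(20):                                       # applied on the fly across iterations
    soft_shrink_step(w, prune_ratio=0.3)
```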
arXiv Detail & Related papers (2023-03-16T21:06:13Z)
- FormerTime: Hierarchical Multi-Scale Representations for Multivariate Time Series Classification [53.55504611255664]
FormerTime is a hierarchical representation model for improving the classification capacity for the multivariate time series classification task.
It exhibits three merits: (1) learning hierarchical multi-scale representations from time series data, (2) inheriting the strengths of both transformers and convolutional networks, and (3) tackling the efficiency challenges incurred by the self-attention mechanism.
arXiv Detail & Related papers (2023-02-20T07:46:14Z)
- Towards Long-Term Time-Series Forecasting: Feature, Pattern, and Distribution [57.71199089609161]
Long-term time-series forecasting (LTTF) has become a pressing demand in many applications, such as wind power supply planning.
Transformer models have been adopted to deliver high prediction capacity, although the self-attention mechanism incurs high computational cost.
We propose an efficient Transformer-based model, named Conformer, which differentiates itself from existing methods for LTTF in three aspects.
arXiv Detail & Related papers (2023-01-05T13:59:29Z)
- TCCT: Tightly-Coupled Convolutional Transformer on Time Series Forecasting [6.393659160890665]
We propose the concept of the tightly-coupled convolutional Transformer (TCCT) and three TCCT architectures.
Our experiments on real-world datasets show that our TCCT architectures can greatly improve the performance of existing state-of-the-art Transformer models.
arXiv Detail & Related papers (2021-08-29T08:49:31Z)
- Layer Pruning on Demand with Intermediate CTC [50.509073206630994]
We present a training and pruning method for ASR based on the connectionist temporal classification (CTC).
We show that a Transformer-CTC model can be pruned in various depth on demand, improving real-time factor from 0.005 to 0.002 on GPU.
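A small sketch of depth-on-demand decoding follows: the encoder stops after a chosen number of layers and emits CTC log-probabilities from that intermediate depth. Sharing a single CTC head across depths and the layer/vocabulary sizes are assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn

class DepthOnDemandCTC(nn.Module):
    """Sketch of 'pruning on demand': the encoder can stop after any layer and emit
    CTC logits from that depth, so one model serves several compute budgets."""

    def __init__(self, d_model=64, n_layers=6, vocab=32):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(n_layers))
        self.ctc_head = nn.Linear(d_model, vocab)

    def forward(self, x: torch.Tensor, depth: int) -> torch.Tensor:
        for layer in self.layers[:depth]:         # run only the first `depth` layers
            x = layer(x)
        return self.ctc_head(x).log_softmax(-1)   # CTC log-probs from an intermediate depth

model = DepthOnDemandCTC()
feats = torch.randn(2, 50, 64)
fast = model(feats, depth=3)    # cheaper, lower-latency pass
full = model(feats, depth=6)    # full-depth pass
```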
arXiv Detail & Related papers (2021-06-17T02:40:18Z)