TDFormer: A Top-Down Attention-Controlled Spiking Transformer
- URL: http://arxiv.org/abs/2505.15840v2
- Date: Fri, 23 May 2025 03:45:06 GMT
- Title: TDFormer: A Top-Down Attention-Controlled Spiking Transformer
- Authors: Zizheng Zhu, Yingchao Yu, Zeqi Zheng, Zhaofei Yu, Yaochu Jin
- Abstract summary: We introduce TDFormer, a novel model with a top-down feedback structure that functions hierarchically. We find that these mechanisms together significantly and consistently improve the model performance on multiple datasets. In particular, our model achieves state-of-the-art performance on ImageNet with an accuracy of 86.83%.
- Score: 33.07648914591285
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditional spiking neural networks (SNNs) can be viewed as a combination of multiple subnetworks with each running for one time step, where the parameters are shared, and the membrane potential serves as the only information link between them. However, the implicit nature of the membrane potential limits its ability to effectively represent temporal information. As a result, each time step cannot fully leverage information from previous time steps, seriously limiting the model's performance. Inspired by the top-down mechanism in the brain, we introduce TDFormer, a novel model with a top-down feedback structure that functions hierarchically and leverages high-order representations from earlier time steps to modulate the processing of low-order information at later stages. The feedback structure plays a role from two perspectives: 1) During forward propagation, our model increases the mutual information across time steps, indicating that richer temporal information is being transmitted and integrated in different time steps. 2) During backward propagation, we theoretically prove that the feedback structure alleviates the problem of vanishing gradients along the time dimension. We find that these mechanisms together significantly and consistently improve the model performance on multiple datasets. In particular, our model achieves state-of-the-art performance on ImageNet with an accuracy of 86.83%.
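To make the top-down feedback concrete, below is a minimal PyTorch-style sketch of one way an earlier time step's high-order representation could gate the processing of a later step's input. The module names, the sigmoid gate, and the ReLU stand-in for a spiking neuron are all illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class TopDownFeedbackSNN(nn.Module):
    """Hypothetical sketch: a per-time-step encoder whose input is
    modulated by feedback computed from the previous step's high-order
    features. Surrogate-gradient spiking details are omitted."""

    def __init__(self, dim):
        super().__init__()
        self.encoder = nn.Linear(dim, dim)    # parameters shared across time
        self.feedback = nn.Linear(dim, dim)   # maps high-order state to a gate
        self.act = nn.ReLU()                  # stand-in for a spiking neuron

    def forward(self, x_seq):                 # x_seq: (T, B, dim)
        outputs, high_order = [], None
        for x_t in x_seq:
            if high_order is not None:
                # top-down modulation: gate the current input with features
                # carried over from the previous time step
                x_t = x_t * torch.sigmoid(self.feedback(high_order))
            high_order = self.act(self.encoder(x_t))
            outputs.append(high_order)
        return torch.stack(outputs)           # (T, B, dim)

# usage: T=4 time steps, batch 2, feature dim 8
y = TopDownFeedbackSNN(8)(torch.randn(4, 2, 8))
```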
Related papers
- Adaptive Fuzzy Time Series Forecasting via Partially Asymmetric Convolution and Sub-Sliding Window Fusion [0.0]
We propose a novel convolutional architecture with a partially asymmetric design based on the timing of the sliding window. The proposed method achieves state-of-the-art results on most popular time series datasets.
arXiv Detail & Related papers (2025-07-28T08:58:25Z)
- TDS-CLIP: Temporal Difference Side Network for Image-to-Video Transfer Learning [6.329214318116305]
We propose a memory-efficient Temporal Difference Side Network (TDS-CLIP) to balance knowledge transfer and temporal modeling.
Specifically, we introduce a Temporal Difference Adapter (TD-Adapter), which can effectively capture local temporal differences in motion features.
We also design a Side Motion Enhancement Adapter (SME-Adapter) to guide the proposed side network in efficiently learning the rich motion information in videos.
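As a rough illustration of the temporal-difference idea behind the TD-Adapter, the sketch below encodes adjacent-frame feature differences and fuses them back into the features; the module name and fusion scheme are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class TemporalDifference(nn.Module):
    """Hypothetical temporal-difference adapter: encodes x_{t+1} - x_t
    and adds it back as a motion cue (the real TD-Adapter may differ)."""

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (B, T, dim) frame features
        diff = x[:, 1:] - x[:, :-1]            # adjacent-frame differences
        diff = torch.cat([torch.zeros_like(x[:, :1]), diff], dim=1)  # pad t=0
        return x + self.proj(diff)             # fuse motion cue into features

feats = torch.randn(2, 8, 64)                  # batch 2, 8 frames, dim 64
out = TemporalDifference(64)(feats)            # same shape, motion-enhanced
```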
arXiv Detail & Related papers (2024-08-20T09:40:08Z)
- Mutual Information-driven Triple Interaction Network for Efficient Image Dehazing [54.168567276280505]
We propose a novel Mutual Information-driven Triple interaction Network (MITNet) for image dehazing.
The first stage, named amplitude-guided haze removal, aims to recover the amplitude spectrum of the hazy images for haze removal.
The second stage, named phase-guided structure refinement, is devoted to learning the transformation and refinement of the phase spectrum.
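For readers unfamiliar with this decomposition, here is a minimal sketch of splitting an image into its Fourier amplitude and phase spectra and recombining them; in MITNet the two stages learn to restore these spectra rather than passing them through unchanged.

```python
import torch

def split_spectrum(img):
    """Decompose an image into Fourier amplitude and phase (per channel)."""
    freq = torch.fft.fft2(img)                # img: (B, C, H, W)
    return freq.abs(), freq.angle()

def merge_spectrum(amplitude, phase):
    """Recombine amplitude and phase spectra back into an image."""
    freq = amplitude * torch.exp(1j * phase)
    return torch.fft.ifft2(freq).real

hazy = torch.rand(1, 3, 64, 64)
amp, pha = split_spectrum(hazy)
# stage 1 would predict a clean amplitude; stage 2 would refine the phase
restored = merge_spectrum(amp, pha)           # here: exact reconstruction
assert torch.allclose(restored, hazy, atol=1e-4)
```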
arXiv Detail & Related papers (2023-08-14T08:23:58Z)
- Deeply-Coupled Convolution-Transformer with Spatial-temporal Complementary Learning for Video-based Person Re-identification [91.56939957189505]
We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID.
Our framework attains better performance than most state-of-the-art methods.
arXiv Detail & Related papers (2023-04-27T12:16:44Z)
- Mutual Information-Based Temporal Difference Learning for Human Pose Estimation in Video [16.32910684198013]
We present a novel multi-frame human pose estimation framework, which employs temporal differences across frames to model dynamic contexts.
To be specific, we design multi-stage entangled learning sequences conditioned on multi-stage differences to derive informative motion representation sequences.
These results rank us No. 1 in the Crowd Pose Estimation in Complex Events Challenge on the HiEve benchmark.
arXiv Detail & Related papers (2023-03-15T09:29:03Z)
- A Generic Shared Attention Mechanism for Various Backbone Neural Networks [53.36677373145012]
Self-attention modules (SAMs) produce strongly correlated attention maps across different layers.
Dense-and-Implicit Attention (DIA) shares SAMs across layers and employs a long short-term memory module.
Our simple yet effective DIA can consistently enhance various network backbones.
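A minimal sketch of the layer-sharing idea: one attention module is instantiated once and reused after every layer, so its parameters are shared in depth. The paper's LSTM-based calibration is omitted here, so treat the details below as assumptions.

```python
import torch
import torch.nn as nn

class SharedAttentionNet(nn.Module):
    """Hypothetical DIA-style backbone: the SAME attention module is
    applied after every layer, sharing its parameters across depth."""

    def __init__(self, dim, depth):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))
        self.shared_attn = nn.MultiheadAttention(dim, num_heads=4,
                                                 batch_first=True)

    def forward(self, x):                      # x: (B, N, dim) tokens
        for layer in self.layers:
            x = torch.relu(layer(x))
            attn_out, _ = self.shared_attn(x, x, x)  # one module, every layer
            x = x + attn_out                   # residual connection
        return x

tokens = torch.randn(2, 16, 32)
out = SharedAttentionNet(dim=32, depth=4)(tokens)
```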
arXiv Detail & Related papers (2022-10-27T13:24:08Z)
- Efficient Modelling Across Time of Human Actions and Interactions [92.39082696657874]
We argue that the current fixed-size temporal kernels in 3D convolutional neural networks (CNNs) can be improved to better deal with temporal variations in the input.
We study how to better handle variations between classes of actions by enhancing their feature differences over different layers of the architecture.
The proposed approaches are evaluated on several benchmark action recognition datasets and show competitive results.
arXiv Detail & Related papers (2021-10-05T15:39:11Z)
- STAR: Sparse Transformer-based Action Recognition [61.490243467748314]
This work proposes a novel skeleton-based human action recognition model with sparse attention on the spatial dimension and segmented linear attention on the temporal dimension of data.
Experiments show that our model achieves comparable performance while using far fewer trainable parameters, with high speed in training and inference.
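The temporal linear attention can be illustrated with the standard kernel-feature trick, which brings the cost down from quadratic to linear in sequence length; the feature map below (elu + 1) is a common choice and an assumption here, not necessarily the one used in STAR.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Kernelized attention: O(T) in sequence length instead of O(T^2).
    q, k, v: (B, T, d). Uses phi(x) = elu(x) + 1 as the feature map."""
    q, k = F.elu(q) + 1, F.elu(k) + 1
    kv = torch.einsum("btd,bte->bde", k, v)     # sum_t phi(k_t) v_t^T
    z = 1.0 / (torch.einsum("btd,bd->bt", q, k.sum(dim=1)) + eps)
    return torch.einsum("btd,bde,bt->bte", q, kv, z)

q = k = v = torch.randn(2, 100, 16)
out = linear_attention(q, k, v)                 # (2, 100, 16)
```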
arXiv Detail & Related papers (2021-07-15T02:53:11Z)
- Gradient Forward-Propagation for Large-Scale Temporal Video Modelling [13.665160620951777]
Backpropagation blocks computations until the forward and backward passes are completed.
For temporal signals, this introduces high latency and hinders real-time learning.
In this paper, we build upon Sideways, which avoids blocking by propagating approximate gradients forward in time.
We show how to decouple computation and delegate individual neural modules to different devices, allowing distributed and parallel training.
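A speculative sketch of the temporal pipelining aspect alone: the network is split into stages that could each sit on a different device, and at each tick a stage consumes what its predecessor produced on the previous tick. The approximate forward-propagated gradients of Sideways are omitted.

```python
import torch
import torch.nn as nn

# Hypothetical three-stage split; with several GPUs each stage could be
# placed on its own device via .to("cuda:0"), .to("cuda:1"), ...
stages = nn.ModuleList([
    nn.Sequential(nn.Linear(32, 64), nn.ReLU()),
    nn.Sequential(nn.Linear(64, 64), nn.ReLU()),
    nn.Sequential(nn.Linear(64, 10)),
])

def pipelined_forward(frames):
    """Sideways-style temporal pipelining: at each tick, stage i consumes
    what stage i-1 produced on the PREVIOUS tick, so no stage blocks
    waiting for a full forward pass over the newest frame."""
    S = len(stages)
    buffers = [None] * S                   # activations held between ticks
    outputs = []
    stream = list(frames) + [frames[-1]] * (S - 1)   # extra ticks to flush
    for x in stream:
        nxt = [stages[0](x)]               # stage 0 sees the newest frame
        for i in range(1, S):
            nxt.append(None if buffers[i - 1] is None
                       else stages[i](buffers[i - 1]))
        buffers = nxt
        if buffers[-1] is not None:
            outputs.append(buffers[-1])
    return torch.stack(outputs)

preds = pipelined_forward(torch.randn(8, 2, 32))     # (8, 2, 10)
```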
arXiv Detail & Related papers (2021-06-15T17:50:22Z)
- Temporal Memory Relation Network for Workflow Recognition from Surgical Video [53.20825496640025]
We propose a novel end-to-end temporal memory relation network (TMNet) for relating long-range and multi-scale temporal patterns.
We have extensively validated our approach on two benchmark surgical video datasets.
arXiv Detail & Related papers (2021-03-30T13:20:26Z)
- Approximated Bilinear Modules for Temporal Modeling [116.6506871576514]
Two-layer subnets in CNNs can be converted to temporal bilinear modules by adding an auxiliary branch for sampling.
Our models can outperform most state-of-the-art methods on the Something-Something V1 and V2 datasets without pretraining.
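A minimal sketch of a temporal bilinear interaction: adjacent frames interact multiplicatively through two low-rank projections, a common approximation of full bilinear pooling; the exact factorization is an assumption, not the paper's module.

```python
import torch
import torch.nn as nn

class TemporalBilinear(nn.Module):
    """Hypothetical low-rank temporal bilinear module: frame t and frame
    t+1 interact via an elementwise product of two linear projections."""

    def __init__(self, dim, rank=32):
        super().__init__()
        self.u = nn.Linear(dim, rank)
        self.v = nn.Linear(dim, rank)
        self.out = nn.Linear(rank, dim)

    def forward(self, x):                        # x: (B, T, dim)
        cur, nxt = x[:, :-1], x[:, 1:]           # frame pairs (t, t+1)
        inter = self.u(cur) * self.v(nxt)        # bilinear interaction
        return x[:, :-1] + self.out(inter)       # residual, length T-1

clip = torch.randn(2, 8, 64)
out = TemporalBilinear(64)(clip)                 # (2, 7, 64)
```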
arXiv Detail & Related papers (2020-07-25T09:07:35Z)
- Temporal Convolutional Attention-based Network For Sequence Modeling [13.972755301732656]
We propose an exploratory architecture referred to as the Temporal Convolutional Attention-based Network (TCAN).
TCAN combines a temporal convolutional network with an attention mechanism.
We improve the state-of-the-art results of bpc/perplexity to 30.28 on word-level PTB, 1.092 on character-level PTB, and 9.20 on WikiText-2.
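A minimal sketch combining the two ingredients: a causal temporal convolution followed by causally masked self-attention. The kernel size, head count, and residual wiring are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TCANBlock(nn.Module):
    """Hypothetical TCAN-style block: causal temporal convolution
    followed by causally-masked self-attention."""

    def __init__(self, dim, kernel=3, heads=4):
        super().__init__()
        self.pad = kernel - 1                  # left-pad for causality
        self.conv = nn.Conv1d(dim, dim, kernel)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                      # x: (B, T, dim)
        h = nn.functional.pad(x.transpose(1, 2), (self.pad, 0))
        h = torch.relu(self.conv(h)).transpose(1, 2)
        T = h.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        out, _ = self.attn(h, h, h, attn_mask=mask)  # no peeking ahead
        return x + out                         # residual connection

seq = torch.randn(2, 20, 32)
y = TCANBlock(32)(seq)                         # (2, 20, 32)
```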
arXiv Detail & Related papers (2020-02-28T03:53:31Z)