Time-step Mixup for Efficient Spiking Knowledge Transfer from Appearance to Event Domain
- URL: http://arxiv.org/abs/2509.12959v1
- Date: Tue, 16 Sep 2025 11:02:36 GMT
- Title: Time-step Mixup for Efficient Spiking Knowledge Transfer from Appearance to Event Domain
- Authors: Yuqi Xie, Shuhan Ye, Chong Wang, Jiazhen Xu, Le Shen, Yuanbin Qian, Jiangbo Qian
- Abstract summary: Time-step Mixup knowledge transfer exploits the asynchronous nature of SNNs by interpolating RGB and DVS inputs at various time-steps. Our approach enables smoother knowledge transfer, alleviates modality shift during training, and achieves superior performance in spiking image classification tasks.
- Score: 9.691720154439375
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The integration of event cameras and spiking neural networks holds great promise for energy-efficient visual processing. However, the limited availability of event data and the sparse nature of DVS outputs pose challenges for effective training. Although some prior work has attempted to transfer semantic knowledge from RGB datasets to DVS, they often overlook the significant distribution gap between the two modalities. In this paper, we propose Time-step Mixup knowledge transfer (TMKT), a novel fine-grained mixing strategy that exploits the asynchronous nature of SNNs by interpolating RGB and DVS inputs at various time-steps. To enable label mixing in cross-modal scenarios, we further introduce modality-aware auxiliary learning objectives. These objectives support the time-step mixup process and enhance the model's ability to discriminate effectively across different modalities. Our approach enables smoother knowledge transfer, alleviates modality shift during training, and achieves superior performance in spiking image classification tasks. Extensive experiments demonstrate the effectiveness of our method across multiple datasets. The code will be released after the double-blind review process.
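To make the central idea concrete, here is a minimal, hypothetical PyTorch sketch of time-step mixup as the abstract describes it: the SNN consumes T input steps, a fraction of which is drawn from the RGB encoding and the rest from the DVS encoding, with the label mixed in the same proportion. The function name, the time-major [T, B, C, H, W] layout, and the front-loaded RGB schedule are illustrative assumptions, not the authors' released code.

```python
import torch

def time_step_mixup(rgb_steps, dvs_steps, lam):
    """Hypothetical sketch of time-step mixup for an SNN.

    rgb_steps, dvs_steps: [T, B, C, H, W] time-major inputs, already
    encoded into per-time-step tensors for the spiking network.
    lam: target fraction of time-steps drawn from the RGB modality.
    Returns the mixed step sequence and the effective mixing ratio.
    """
    T = rgb_steps.shape[0]
    n_rgb = int(round(lam * T))
    # One plausible schedule: early steps from RGB, later steps from DVS;
    # random interleaving is an equally plausible reading of the abstract.
    mixed = torch.cat([rgb_steps[:n_rgb], dvs_steps[n_rgb:]], dim=0)
    return mixed, n_rgb / T

# As in standard mixup, the classification loss would interpolate the
# two labels with the same ratio, e.g.:
# loss = lam_eff * ce(logits, y_rgb) + (1 - lam_eff) * ce(logits, y_dvs)
```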
Related papers
- Breaking the Modality Wall: Time-step Mixup for Efficient Spiking Knowledge Transfer from Static to Event Domain [28.746290149091706]
Time-step Mixup Knowledge Transfer (TMKT) is a cross-modal training framework with a probabilistic Time-step Mixup (TSM) strategy. TMKT exploits the asynchronous nature of SNNs by interpolating RGB and DVS inputs at various time steps to produce a smooth curriculum. TMKT enables smoother knowledge transfer, helps mitigate modality mismatch during training, and achieves superior performance in spiking image classification tasks.
arXiv Detail & Related papers (2025-11-15T10:31:22Z)
- Focus Through Motion: RGB-Event Collaborative Token Sparsification for Efficient Object Detection [56.88160531995454]
Existing RGB-Event detection methods process the low-information regions of both modalities uniformly during feature extraction and fusion. We propose FocusMamba, which performs adaptive collaborative sparsification of multimodal features. Experiments on the DSEC-Det and PKU-DAVIS-SOD datasets demonstrate that the proposed method achieves superior performance in both accuracy and efficiency.
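As a rough illustration of what collaborative token sparsification could look like (the scoring rule, names, and shapes below are assumptions, not FocusMamba's actual design), one could score patch tokens by event activity and keep only the top-k tokens in both modalities before fusion:

```python
import torch

def sparsify_tokens(rgb_tokens, event_tokens, keep_ratio=0.25):
    """Hypothetical sketch: keep only the most informative tokens.

    rgb_tokens, event_tokens: [B, N, D] patch tokens per modality.
    Tokens are scored by event activity (L2 norm), on the intuition
    that events fire mainly in motion-rich, information-dense regions.
    """
    B, N, D = event_tokens.shape
    k = max(1, int(N * keep_ratio))
    scores = event_tokens.norm(dim=-1)            # [B, N] activity proxy
    idx = scores.topk(k, dim=1).indices           # kept token indices
    gather = idx.unsqueeze(-1).expand(-1, -1, D)  # [B, k, D]
    return rgb_tokens.gather(1, gather), event_tokens.gather(1, gather)
```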
arXiv Detail & Related papers (2025-09-04T04:18:46Z)
- MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models. MoCa consistently improves performance across the MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z)
- Underlying Semantic Diffusion for Effective and Efficient In-Context Learning [113.4003355229632]
Underlying Semantic Diffusion (US-Diffusion) is an enhanced diffusion model that boosts underlying-semantics learning, computational efficiency, and in-context learning capabilities. We present a Feedback-Aided Learning (FAL) framework, which leverages feedback signals to guide the model in capturing semantic details. We also propose a plug-and-play Efficient Sampling Strategy (ESS) for dense sampling at time steps with high noise levels.
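One plausible reading of dense sampling at high-noise time steps is to bias the training-time distribution over diffusion steps toward large t; the weighting below is an assumed illustration, not the paper's actual ESS:

```python
import torch

def sample_timesteps(batch_size, T=1000, high_noise_bias=2.0):
    # Hypothetical sketch: draw diffusion time-steps with probability
    # increasing toward high-noise (large t) levels, spending more
    # training effort where the signal is hardest to recover.
    t = torch.arange(T, dtype=torch.float32)
    weights = ((t + 1) / T) ** high_noise_bias  # denser near t = T - 1
    return torch.multinomial(weights, batch_size, replacement=True)
```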
arXiv Detail & Related papers (2025-03-06T03:06:22Z)
- Cross-modal Orthogonal High-rank Augmentation for RGB-Event Transformer-trackers [58.802352477207094]
We explore the great potential of a pre-trained vision Transformer (ViT) to bridge the vast distribution gap between two modalities.
We propose a mask modeling strategy that randomly masks the tokens of one modality to encourage proactive interaction between tokens from different modalities.
Experiments demonstrate that our plug-and-play training augmentation techniques significantly boost state-of-the-art one-stream and two-stream trackers in terms of both tracking precision and success rate.
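A minimal sketch of such modality-specific token masking follows; the per-sample whole-modality granularity and the probabilities are assumptions rather than the paper's implementation:

```python
import torch

def mask_modality_tokens(rgb_tokens, event_tokens, mask_prob=0.3):
    """Hypothetical sketch: randomly zero out all tokens of one modality
    for a subset of samples, so the tracker must rely on cross-modal
    interaction to recover the missing information.

    rgb_tokens, event_tokens: [B, N, D] token sequences.
    """
    B = rgb_tokens.shape[0]
    device = rgb_tokens.device
    # For each sample, mask at most one modality.
    drop_rgb = torch.rand(B, device=device) < mask_prob
    drop_evt = (torch.rand(B, device=device) < mask_prob) & ~drop_rgb
    rgb_tokens = rgb_tokens * (~drop_rgb).float().view(B, 1, 1)
    event_tokens = event_tokens * (~drop_evt).float().view(B, 1, 1)
    return rgb_tokens, event_tokens
```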
arXiv Detail & Related papers (2023-07-09T08:58:47Z)
- A Dimensional Structure based Knowledge Distillation Method for Cross-Modal Learning [15.544134849816528]
We discover a correlation between feature discriminability and dimensional structure (DS) by analyzing features extracted from simple and hard tasks.
We propose a novel cross-modal knowledge distillation (CMKD) method for better supervised cross-modal learning (CML) performance.
The proposed method enforces output features to be channel-wise independent and intermediate ones to be uniformly distributed, thereby learning semantically irrelevant features from the hard task to boost its accuracy.
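As a rough sketch of those two constraints, one could pair a decorrelation penalty (channel-wise independence of output features) with the well-known uniformity objective of Wang and Isola (2020) for intermediate features; the exact formulations here are assumptions, not the paper's losses:

```python
import torch
import torch.nn.functional as F

def decorrelation_loss(feats):
    # Hypothetical sketch: penalize off-diagonal entries of the channel
    # covariance so output features become channel-wise independent.
    feats = feats - feats.mean(dim=0, keepdim=True)   # [B, C], centered
    cov = (feats.T @ feats) / (feats.shape[0] - 1)    # [C, C]
    off_diag = cov - torch.diag(torch.diag(cov))
    return (off_diag ** 2).sum()

def uniformity_loss(feats, t=2.0):
    # Hypothetical sketch: push normalized intermediate features to
    # spread uniformly over the unit hypersphere.
    feats = F.normalize(feats, dim=-1)
    sq_dists = torch.pdist(feats, p=2).pow(2)  # pairwise, no self-pairs
    return sq_dists.mul(-t).exp().mean().log()
```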
arXiv Detail & Related papers (2023-06-28T07:29:26Z)
- Surface EMG-Based Inter-Session/Inter-Subject Gesture Recognition by Leveraging Lightweight All-ConvNet and Transfer Learning [17.535392299244066]
Gesture recognition using low-resolution instantaneous HD-sEMG images opens up new avenues for the development of more fluid and natural muscle-computer interfaces.
Data variability across inter-session and inter-subject scenarios presents a major challenge.
Existing approaches employ very large and complex deep ConvNet or 2SRNN-based domain adaptation methods to approximate the distribution shift caused by this inter-session and inter-subject variability.
We propose All-ConvNet+TL, a model that combines a lightweight All-ConvNet with transfer learning (TL) to improve inter-session and inter-subject gesture recognition.
arXiv Detail & Related papers (2023-05-13T21:47:55Z)
- Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation [56.264157127549446]
Speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction.
One of the main challenges in SER is data scarcity.
We propose a transfer learning strategy combined with spectrogram augmentation.
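Spectrogram augmentation commonly means masking random frequency bands and time spans, in the style of SpecAugment; the sketch below is an assumed illustration of that idea, not the paper's exact recipe:

```python
import torch

def augment_spectrogram(spec, max_freq_mask=8, max_time_mask=20):
    """Hypothetical SpecAugment-style sketch: zero out a random band of
    frequency bins and a random span of time frames.

    spec: [F, T] log-mel spectrogram.
    """
    F, T = spec.shape
    spec = spec.clone()
    f = torch.randint(0, max_freq_mask + 1, (1,)).item()
    f0 = torch.randint(0, max(1, F - f), (1,)).item()
    spec[f0:f0 + f, :] = 0.0
    t = torch.randint(0, max_time_mask + 1, (1,)).item()
    t0 = torch.randint(0, max(1, T - t), (1,)).item()
    spec[:, t0:t0 + t] = 0.0
    return spec
```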
arXiv Detail & Related papers (2021-08-05T10:39:39Z)
- Modality Compensation Network: Cross-Modal Adaptation for Action Recognition [77.24983234113957]
We propose a Modality Compensation Network (MCN) to explore the relationships of different modalities.
Our model bridges data from source and auxiliary modalities with a modality adaptation block to achieve adaptive representation learning.
Experimental results reveal that MCN outperforms state-of-the-art approaches on four widely-used action recognition benchmarks.
arXiv Detail & Related papers (2020-01-31T04:51:55Z)