Temporal-Guided Visual Foundation Models for Event-Based Vision
- URL: http://arxiv.org/abs/2511.06238v1
- Date: Sun, 09 Nov 2025 05:45:25 GMT
- Title: Temporal-Guided Visual Foundation Models for Event-Based Vision
- Authors: Ruihao Xia, Junhong Cai, Luziwei Leng, Liuyi Wang, Chengju Liu, Ran Cheng, Yang Tang, Pan Zhou,
- Abstract summary: Event cameras offer unique advantages for vision tasks in challenging environments.<n>The potential of leveraging modern Visual Foundation Models pretrained on image data remains under-explored for event-based vision.<n>We propose TemporalGuided-FM (TGVFM), a novel framework that integrates Visual Foundation Models with our temporal context fusion.
- Score: 40.997738547677066
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Event cameras offer unique advantages for vision tasks in challenging environments, yet processing asynchronous event streams remains an open challenge. While existing methods rely on specialized architectures or resource-intensive training, the potential of leveraging modern Visual Foundation Models (VFMs) pretrained on image data remains under-explored for event-based vision. To address this, we propose Temporal-Guided VFM (TGVFM), a novel framework that integrates VFMs with our temporal context fusion block seamlessly to bridge this gap. Our temporal block introduces three key components: (1) Long-Range Temporal Attention to model global temporal dependencies, (2) Dual Spatiotemporal Attention for multi-scale frame correlation, and (3) Deep Feature Guidance Mechanism to fuse semantic-temporal features. By retraining event-to-video models on real-world data and leveraging transformer-based VFMs, TGVFM preserves spatiotemporal dynamics while harnessing pretrained representations. Experiments demonstrate SoTA performance across semantic segmentation, depth estimation, and object detection, with improvements of 16%, 21%, and 16% over existing methods, respectively. Overall, this work unlocks the cross-modality potential of image-based VFMs for event-based vision with temporal reasoning. Code is available at https://github.com/XiaRho/TGVFM.
Related papers
- VisionTS++: Cross-Modal Time Series Foundation Model with Continual Pre-trained Vision Backbones [35.2847156993469]
VisonTS++ is a TSFM based on continual pre-training of a vision model on large-scale time series.<n>Our approach introduces three key innovations: vision-model-based filtering, colorized multivariate conversion, and multi-quantile forecasting.<n> Experiments show that VisionTS++ achieves state-of-the-art performance in both in-distribution and out-of-distribution forecasting.
arXiv Detail & Related papers (2025-08-06T12:17:09Z) - DAMS:Dual-Branch Adaptive Multiscale Spatiotemporal Framework for Video Anomaly Detection [7.117824587276951]
This study offers a dual-path architecture called the Dual-Branch Adaptive Multiscale Stemporal Framework (DAMS), which is based on multilevel feature and decoupling fusion.<n>The main processing path integrates the Adaptive Multiscale Time Pyramid Network (AMTPN) with the Convolutional Block Attention Mechanism (CBAM)
arXiv Detail & Related papers (2025-07-28T08:42:00Z) - Reprogramming Vision Foundation Models for Spatio-Temporal Forecasting [12.591771385493509]
We present textST-VFM, a framework that systematically reprograms Vision Foundation Models (VFMs) for general-purpose robustness-temporal forecasting.<n>The framework integrates raw inputs with auxiliary ST flow, where the flow encodes lightweight temporal difference signals interpretable as dynamic cues.<n>The emphpre-VFM reprogramming applies a Temporal-Aware Token to align both branches into VFM-compatible feature spaces.<n>The emphpost-VFM reprogramming introduces a Bilateral CrossPrompt Coordination module, enabling dynamic interaction between branches.
arXiv Detail & Related papers (2025-07-14T08:33:34Z) - UniSTD: Towards Unified Spatio-Temporal Learning across Diverse Disciplines [64.84631333071728]
We introduce bfUnistage, a unified Transformer-based framework fortemporal modeling.<n>Our work demonstrates that a task-specific vision-text can build a generalizable model fortemporal learning.<n>We also introduce a temporal module to incorporate temporal dynamics explicitly.
arXiv Detail & Related papers (2025-03-26T17:33:23Z) - SuperFlow++: Enhanced Spatiotemporal Consistency for Cross-Modal Data Pretraining [62.433137130087445]
SuperFlow++ is a novel framework that integrates pretraining and downstream tasks using consecutive camera pairs.<n>We show that SuperFlow++ outperforms state-of-the-art methods across diverse tasks and driving conditions.<n>With strong generalizability and computational efficiency, SuperFlow++ establishes a new benchmark for data-efficient LiDAR-based perception in autonomous driving.
arXiv Detail & Related papers (2025-03-25T17:59:57Z) - Temporal Residual Guided Diffusion Framework for Event-Driven Video Reconstruction [23.493813870675197]
Event-based video reconstruction has garnered increasing attention due to its advantages, such as high dynamic range and rapid motion capture capabilities.
Current methods often prioritize the extraction of temporal information from continuous event flow, leading to an overemphasis on low-frequency texture features in the scene.
We introduce a novel approach, the Temporal Residual Guided Diffusion Framework, which effectively leverages both temporal and frequency-based event priors.
arXiv Detail & Related papers (2024-07-15T11:48:57Z) - ViTime: Foundation Model for Time Series Forecasting Powered by Vision Intelligence [49.60944381032587]
Time series forecasting (TSF) possesses great practical values in various fields, including power and energy, transportation, etc.<n>TSF models have long been known to be problem-specific and lacking application generalizability.<n>This paper proposes a vision intelligence-powered framework, ViTime, for the first time.
arXiv Detail & Related papers (2024-07-10T02:11:01Z) - Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion [57.232688209606515]
We present HTCL, a novel Temporal Temporal Context Learning paradigm for improving camera-based semantic scene completion.
Our method ranks $1st$ on the Semantic KITTI benchmark and even surpasses LiDAR-based methods in terms of mIoU.
arXiv Detail & Related papers (2024-07-02T09:11:17Z) - Recurrent Vision Transformers for Object Detection with Event Cameras [62.27246562304705]
We present Recurrent Vision Transformers (RVTs), a novel backbone for object detection with event cameras.
RVTs can be trained from scratch to reach state-of-the-art performance on event-based object detection.
Our study brings new insights into effective design choices that can be fruitful for research beyond event-based vision.
arXiv Detail & Related papers (2022-12-11T20:28:59Z) - RGB-Event Fusion for Moving Object Detection in Autonomous Driving [3.5397758597664306]
Moving Object Detection (MOD) is a critical vision task for successfully achieving safe autonomous driving.
Recent advances in sensor technologies, especially the Event camera, can naturally complement the conventional camera approach to better model moving objects.
We propose RENet, a novel RGB-Event fusion Network, that jointly exploits the two complementary modalities to achieve more robust MOD.
arXiv Detail & Related papers (2022-09-17T12:59:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.