TESPEC: Temporally-Enhanced Self-Supervised Pretraining for Event Cameras
- URL: http://arxiv.org/abs/2508.00913v1
- Date: Tue, 29 Jul 2025 19:52:48 GMT
- Title: TESPEC: Temporally-Enhanced Self-Supervised Pretraining for Event Cameras
- Authors: Mohammad Mohammadi, Ziyi Wu, Igor Gilitschenski
- Abstract summary: Long-term temporal information is crucial for event-based perception tasks. Current self-supervised pre-training methods largely mimic RGB image-based approaches. We introduce TESPEC, a self-supervised pre-training framework tailored for learning spatio-temporal information.
- Score: 18.05887838800614
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Long-term temporal information is crucial for event-based perception tasks, as raw events only encode pixel brightness changes. Recent works show that, when trained from scratch, recurrent models achieve better results than feedforward models on these tasks. However, when leveraging self-supervised pre-trained weights, feedforward models can outperform their recurrent counterparts. Current self-supervised learning (SSL) methods for event-based pre-training largely mimic RGB image-based approaches. They pre-train feedforward models on raw events within a short time interval, ignoring the temporal information of events. In this work, we introduce TESPEC, a self-supervised pre-training framework tailored for learning spatio-temporal information. TESPEC is well-suited for recurrent models, as it is the first framework to leverage long event sequences during pre-training. TESPEC employs the masked image modeling paradigm with a new reconstruction target. We design a novel method to accumulate events into pseudo grayscale videos that contain high-level semantic information about the underlying scene while being robust to sensor noise and exhibiting reduced motion blur. Reconstructing this target thus requires the model to reason about the long-term history of events. Extensive experiments demonstrate state-of-the-art results on downstream tasks, including object detection, semantic segmentation, and monocular depth estimation. Project webpage: https://mhdmohammadi.github.io/TESPEC_webpage.
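To make the accumulation idea concrete, the sketch below is a minimal NumPy illustration of turning chunks of events into pseudo grayscale frames with a leaky integrator, so that stale brightness fades rather than smearing into blur. This is an assumed toy formulation for intuition only; the function names, decay factor, and contrast step are illustrative and not taken from the paper.

```python
import numpy as np

def integrate_chunk(events, height, width, contrast=0.1):
    """Sum signed event polarities into a per-pixel brightness increment.

    events: (N, 4) float array of (x, y, t, polarity), polarity in {-1, +1}.
    """
    frame = np.zeros((height, width), dtype=np.float32)
    xs = events[:, 0].astype(int)
    ys = events[:, 1].astype(int)
    np.add.at(frame, (ys, xs), contrast * events[:, 3])
    return frame

def pseudo_grayscale_video(event_chunks, height, width, decay=0.9):
    """Leaky integration over successive event chunks: the running state is
    decayed before each update, so old edges fade out (reducing motion blur)
    and isolated noise events are gradually washed away."""
    state = np.zeros((height, width), dtype=np.float32)
    frames = []
    for chunk in event_chunks:
        state = decay * state + integrate_chunk(chunk, height, width)
        frames.append(np.clip(state, -1.0, 1.0))
    return frames
```

Under this reading, a recurrent encoder trained to reconstruct masked patches of such frames must carry information across chunks, which is the long-term reasoning the abstract describes.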
Related papers
- Revealing Latent Information: A Physics-inspired Self-supervised Pre-training Framework for Noisy and Sparse Events [25.348660233701708]
Event cameras record data with high temporal resolution and wide dynamic range. Event data is inherently sparse and noisy, mainly reflecting brightness changes. We propose a self-supervised pre-training framework to fully reveal the latent information in event data.
arXiv Detail & Related papers (2025-08-07T15:38:36Z) - MTP: Advancing Remote Sensing Foundation Model via Multi-Task Pretraining [73.81862342673894]
Foundation models have reshaped the landscape of Remote Sensing (RS) by enhancing various image interpretation tasks.
However, transferring the pretrained models to downstream tasks may encounter task discrepancies, since pretraining is formulated as an image classification or object discrimination task.
We conduct multi-task supervised pretraining on the SAMRS dataset, encompassing semantic segmentation, instance segmentation, and rotated object detection.
Our models are finetuned on various RS downstream tasks, such as scene classification, horizontal and rotated object detection, semantic segmentation, and change detection.
arXiv Detail & Related papers (2024-03-20T09:17:22Z) - Timer: Generative Pre-trained Transformers Are Large Time Series Models [83.03091523806668]
This paper aims at the early development of large time series models (LTSMs).
During pre-training, we curate large-scale datasets with up to 1 billion time points.
To meet diverse application needs, we convert forecasting, imputation, and anomaly detection of time series into a unified generative task.
arXiv Detail & Related papers (2024-02-04T06:55:55Z) - Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks [55.36987468073152]
This paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention mechanism.
The DG-SCT module incorporates trainable cross-modal interaction layers into pre-trained audio-visual encoders.
Our proposed model achieves state-of-the-art results across multiple downstream tasks, including AVE, AVVP, AVS, and AVQA.
arXiv Detail & Related papers (2023-11-09T05:24:20Z) - Masked Event Modeling: Self-Supervised Pretraining for Event Cameras [41.263606382601886]
Masked Event Modeling (MEM) is a self-supervised framework for events.
MEM pretrains a neural network on unlabeled events, which can originate from any event camera recording.
Our method reaches state-of-the-art classification accuracy across three datasets.
arXiv Detail & Related papers (2022-12-20T15:49:56Z) - Improved skin lesion recognition by a Self-Supervised Curricular Deep Learning approach [0.0]
State-of-the-art deep learning approaches for skin lesion recognition often require pretraining on larger and more varied datasets.
ImageNet is often used as the pretraining dataset, but its transfer potential is hindered by the domain gap between the source dataset and the target dermatoscopic scenario.
In this work, we introduce a novel pretraining approach that sequentially trains a series of Self-Supervised Learning pretext tasks.
arXiv Detail & Related papers (2021-12-22T17:45:47Z) - Few-Cost Salient Object Detection with Adversarial-Paced Learning [95.0220555274653]
This paper proposes to learn an effective salient object detection model from manual annotations on only a few training images.
We name this task few-cost salient object detection and propose an adversarial-paced learning (APL)-based framework to facilitate this few-cost learning scenario.
arXiv Detail & Related papers (2021-04-05T14:15:49Z) - Semi-supervised Facial Action Unit Intensity Estimation with Contrastive Learning [54.90704746573636]
Our method does not require manually selecting key frames, and produces state-of-the-art results with as little as $2\%$ of annotated frames.
We experimentally validate that our method outperforms existing methods when working with as little as $2\%$ of randomly chosen data.
arXiv Detail & Related papers (2020-11-03T17:35:57Z) - Learning Monocular Dense Depth from Events [53.078665310545745]
Event cameras report per-pixel brightness changes as a stream of asynchronous events instead of intensity frames.
Recent learning-based approaches have been applied to event-based tasks such as monocular depth prediction.
We propose a recurrent architecture to solve this task and show significant improvement over standard feed-forward methods.
arXiv Detail & Related papers (2020-10-16T12:36:23Z)
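As background for the event-based entries above: a common way (assumed here for illustration, not taken from any specific paper in this list) to feed an asynchronous event stream to a recurrent or feedforward network is to bin it into a fixed-size voxel grid per time window, with each event's polarity split bilinearly between the two nearest temporal bins.

```python
import numpy as np

def events_to_voxel_grid(events, num_bins, height, width):
    """Bin an asynchronous event stream into a (num_bins, H, W) tensor.

    events: (N, 4) float array of (x, y, t, polarity), polarity in {-1, +1}.
    """
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    t = events[:, 2]
    # Normalize timestamps to [0, num_bins - 1], guarding against a zero span.
    t_norm = (num_bins - 1) * (t - t.min()) / max(t.max() - t.min(), 1e-9)
    xs = events[:, 0].astype(int)
    ys = events[:, 1].astype(int)
    pol = events[:, 3]
    t0 = np.floor(t_norm).astype(int)
    frac = t_norm - t0
    # Split each event's polarity between its two nearest temporal bins.
    np.add.at(grid, (t0, ys, xs), pol * (1.0 - frac))
    valid = t0 + 1 < num_bins
    np.add.at(grid, (t0[valid] + 1, ys[valid], xs[valid]), pol[valid] * frac[valid])
    return grid
```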
This list is automatically generated from the titles and abstracts of the papers in this site.