Scaling Dense Event-Stream Pretraining from Visual Foundation Models
- URL: http://arxiv.org/abs/2603.03969v1
- Date: Wed, 04 Mar 2026 12:06:09 GMT
- Title: Scaling Dense Event-Stream Pretraining from Visual Foundation Models
- Authors: Zhiwen Chen, Junhui Hou, Zhiyu Zhu, Jinjian Wu, Guangming Shi
- Abstract summary: We introduce a novel self-supervised pretraining method that distills visual foundation models (VFMs) to push the boundaries of event representation at scale. We curate an extensive synchronized image-event collection to amplify cross-modal alignment. We extend the alignment objective to semantic structures provided off-the-shelf by VFMs, yielding a broader receptive field and stronger supervision.
- Score: 112.44243079477137
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Learning versatile, fine-grained representations from irregular event streams is pivotal yet nontrivial, primarily because heavy annotation requirements hinder scalability in dataset size, semantic richness, and application scope. To mitigate this dilemma, we introduce a novel self-supervised pretraining method that distills visual foundation models (VFMs) to push the boundaries of event representation at scale. Specifically, we curate an extensive synchronized image-event collection to amplify cross-modal alignment. Nevertheless, due to inherent mismatches in sparsity and granularity between the image and event domains, existing distillation paradigms are prone to semantic collapse in event representations, particularly at high resolutions. To bridge this gap, we propose to extend the alignment objective to semantic structures provided off-the-shelf by VFMs, yielding a broader receptive field and stronger supervision. The key ingredient of our method is a structure-aware distillation loss that grounds higher-quality image-event correspondences for alignment, optimizing dense event representations. Extensive experiments demonstrate that our approach delivers substantial gains on downstream benchmarks, significantly surpassing both traditional methods and existing pretraining techniques. These gains manifest as enhanced generalization, superior data efficiency, and elevated transferability.
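The structure-aware distillation loss is described above only at a high level. As a rough illustration, a loss of this flavor could combine a point-wise term that aligns each dense event feature with its VFM counterpart and a structure term that matches pairwise token-similarity maps; the PyTorch sketch below shows one such formulation. The function name, the temperature `tau`, and the equal weighting of the two terms are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def structure_aware_distill_loss(event_feat, vfm_feat, tau=0.1):
    """Hypothetical distillation loss between dense event features (student)
    and spatially aligned features from a frozen VFM (teacher).

    event_feat, vfm_feat: (B, N, D) feature grids flattened to N tokens.
    """
    # Normalize so alignment acts on feature direction, not magnitude.
    e = F.normalize(event_feat, dim=-1)
    v = F.normalize(vfm_feat, dim=-1)

    # Point-wise term: pull each event token toward its image counterpart.
    point_loss = (1.0 - (e * v).sum(dim=-1)).mean()

    # Structure term: match pairwise token-similarity distributions, so the
    # supervision reflects relations across the whole feature map rather than
    # isolated tokens (a wider effective receptive field).
    sim_e = e @ e.transpose(-2, -1) / tau   # (B, N, N)
    sim_v = v @ v.transpose(-2, -1) / tau
    struct_loss = F.kl_div(
        F.log_softmax(sim_e, dim=-1),
        F.softmax(sim_v, dim=-1),
        reduction="batchmean",
    )
    return point_loss + struct_loss

# Shape-only usage; in practice gradients flow into the event encoder while
# the VFM teacher stays frozen.
B, N, D = 4, 256, 768
event = torch.randn(B, N, D, requires_grad=True)  # stands in for student output
teacher = torch.randn(B, N, D)                    # stands in for frozen VFM features
structure_aware_distill_loss(event, teacher).backward()
```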
Related papers
- Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models [84.78794648147608]
A persistent geometric anomaly, the Modality Gap, remains. Prior approaches to bridge this gap are largely limited by oversimplified isotropic assumptions. We propose the Fixed-frame Modality Gap Theory, which decomposes the modality gap into stable biases and anisotropic residuals. We then introduce ReAlign, a training-free modality alignment strategy.
arXiv Detail & Related papers (2026-02-02T13:59:39Z) - Toward Generalizable Deblurring: Leveraging Massive Blur Priors with Linear Attention for Real-World Scenarios [9.82847623835017]
GLOWDeblur is a Generalizable reaL-wOrld lightWeight Deblur model that combines a convolution-based pre-reconstruction and domain-alignment module with a lightweight diffusion backbone. We propose Blur Pattern Pretraining (BPP), which acquires blur priors from simulation datasets and transfers them through joint fine-tuning on real data. We further introduce Motion and Semantic Guidance (MoSeG) to strengthen blur priors under severe degradation and integrate it into GLOWDeblur.
arXiv Detail & Related papers (2026-01-10T11:01:31Z) - Semantic-guided Fine-tuning of Foundation Model for Long-tailed Visual Recognition [38.74388860692423]
We propose a novel approach, Semantic-guided fine-tuning of foundation model for long-tailed visual recognition (Sage). We introduce an SG-Adapter that integrates class descriptions as semantic guidance for fine-tuning the visual encoder. Experiments on benchmark datasets demonstrate the effectiveness of the proposed Sage in enhancing long-tailed learning performance.
arXiv Detail & Related papers (2025-07-17T05:47:19Z) - G4Seg: Generation for Inexact Segmentation Refinement with Diffusion Models [38.44872934965588]
This paper considers the problem of utilizing a large-scale text-to-image diffusion model to tackle the Inexact Segmentation (IS) task. We exploit the pattern discrepancies between original images and mask-conditional generated images to facilitate a coarse-to-fine segmentation refinement.
arXiv Detail & Related papers (2025-06-02T11:05:28Z) - Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion [52.315729095824906]
MLLM Semantic-Corrected Ping-Pong-Ahead Diffusion (PPAD) is a novel framework that introduces a Multimodal Large Language Model (MLLM) as a semantic observer during inference. It performs real-time analysis on intermediate generations, identifies latent semantic inconsistencies, and translates feedback into controllable signals that actively guide the remaining denoising steps. Extensive experiments demonstrate PPAD's significant improvements.
arXiv Detail & Related papers (2025-05-26T14:42:35Z) - High-Precision Dichotomous Image Segmentation via Probing Diffusion Capacity [69.32473738284374]
Diffusion models have revolutionized text-to-image synthesis by delivering exceptional quality, fine detail resolution, and strong contextual awareness. We propose DiffDIS, a diffusion-driven segmentation model that taps into the potential of the pre-trained U-Net within diffusion models. Experiments on the DIS5K dataset demonstrate the superiority of DiffDIS, achieving state-of-the-art results through a streamlined inference process.
arXiv Detail & Related papers (2024-10-14T02:49:23Z) - Unleashing the Potential of the Diffusion Model in Few-shot Semantic Segmentation [56.87049651707208]
Few-shot Semantic Segmentation has evolved into an in-context task and has become a crucial element in assessing generalist segmentation models.
Our initial focus lies in how to facilitate interaction between the query image and the support image, which leads us to propose a KV fusion method within the self-attention framework (a minimal sketch of such fusion appears after this list).
Based on our analysis, we establish a simple and effective framework named DiffewS that maximally retains the generative framework of the original Latent Diffusion Model.
arXiv Detail & Related papers (2024-10-03T10:33:49Z) - Robust Latent Representation Tuning for Image-text Classification [9.789498730131607]
We propose a robust latent representation tuning method for large models.
Our approach introduces a modality latent translation module to maximize the correlation between modalities, resulting in a robust representation.
Within this framework, common semantics are refined during training, and robust performance is achieved even in the absence of one modality.
arXiv Detail & Related papers (2024-06-10T06:29:00Z) - Multi-Class Segmentation from Aerial Views using Recursive Noise Diffusion [16.86600007830682]
We propose an end-to-end multi-class semantic segmentation diffusion model.
Our method achieves promising results on the UAVid dataset.
Being the first iteration of this method, it shows great promise for future improvements.
arXiv Detail & Related papers (2022-12-01T18:59:26Z) - Learning Monocular Dense Depth from Events [53.078665310545745]
Event cameras output brightness changes as a stream of asynchronous events instead of intensity frames.
Recent learning-based approaches have been applied to event-based data for tasks such as monocular depth prediction.
We propose a recurrent architecture to solve this task and show significant improvement over standard feed-forward methods.
arXiv Detail & Related papers (2020-10-16T12:36:23Z)
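The KV fusion named in the DiffewS summary above is a concrete attention-level mechanism, so a brief sketch may help: one common realization lets the query image's tokens attend over keys and values built from both the query and support images, reusing the pretrained self-attention projections. The function name, shared projection weights, and head count below are illustrative assumptions rather than that paper's exact design.

```python
import torch
import torch.nn.functional as F

def kv_fused_self_attention(q_tokens, s_tokens, w_q, w_k, w_v, num_heads=8):
    """Hypothetical KV fusion: queries come from the query image, while
    keys/values are the concatenation of query- and support-image tokens."""
    B, Nq, D = q_tokens.shape
    q = q_tokens @ w_q                                # (B, Nq, D)
    kv_src = torch.cat([q_tokens, s_tokens], dim=1)   # (B, Nq+Ns, D)
    k = kv_src @ w_k
    v = kv_src @ w_v

    def heads(x):  # (B, N, D) -> (B, H, N, D // H)
        return x.view(B, x.shape[1], num_heads, D // num_heads).transpose(1, 2)

    out = F.scaled_dot_product_attention(heads(q), heads(k), heads(v))
    return out.transpose(1, 2).reshape(B, Nq, D)

# Shape-only usage; real projections would come from the pretrained layer.
B, Nq, Ns, D = 2, 196, 196, 64
w = [torch.randn(D, D) * D ** -0.5 for _ in range(3)]
fused = kv_fused_self_attention(torch.randn(B, Nq, D), torch.randn(B, Ns, D), *w)
```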
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.