Studying Image Diffusion Features for Zero-Shot Video Object Segmentation
- URL: http://arxiv.org/abs/2504.05468v1
- Date: Mon, 07 Apr 2025 19:58:25 GMT
- Title: Studying Image Diffusion Features for Zero-Shot Video Object Segmentation
- Authors: Thanos Delatolas, Vicky Kalogeiton, Dim P. Papadopoulos
- Abstract summary: This paper investigates the use of large-scale diffusion models for Zero-Shot Video Object Segmentation (ZS-VOS). We find that diffusion models trained on ImageNet outperform those trained on larger, more diverse datasets for ZS-VOS. Our approach performs on par with models trained on expensive image segmentation datasets.
- Score: 9.79891280451409
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper investigates the use of large-scale diffusion models for Zero-Shot Video Object Segmentation (ZS-VOS) without fine-tuning on video data or training on any image segmentation data. While diffusion models have demonstrated strong visual representations across various tasks, their direct application to ZS-VOS remains underexplored. Our goal is to find the optimal feature extraction process for ZS-VOS by identifying the most suitable time step and layer from which to extract features. We further analyze the affinity of these features and observe a strong correlation with point correspondences. Through extensive experiments on DAVIS-17 and MOSE, we find that diffusion models trained on ImageNet outperform those trained on larger, more diverse datasets for ZS-VOS. Additionally, we highlight the importance of point correspondences in achieving high segmentation accuracy, and we achieve state-of-the-art results in ZS-VOS. Finally, our approach performs on par with models trained on expensive image segmentation datasets.
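The recipe the abstract outlines (noise a frame to a chosen timestep, run a single U-Net pass, read features from one intermediate layer, then propagate masks through feature affinities) can be illustrated as below. This is a minimal sketch rather than the authors' implementation: the checkpoint, hooked layer, timestep, and top-k value are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler, UNet2DModel

MODEL = "google/ddpm-cifar10-32"  # stand-in checkpoint, not the paper's ImageNet model
unet = UNet2DModel.from_pretrained(MODEL).eval()
scheduler = DDPMScheduler.from_pretrained(MODEL)

feats = {}
def grab(_module, _inputs, output):
    # Cache the activations of the hooked block.
    feats["f"] = output[0] if isinstance(output, tuple) else output

# Assumed extraction point: one decoder (upsampling) block of the U-Net.
unet.up_blocks[1].register_forward_hook(grab)

@torch.no_grad()
def extract(frame: torch.Tensor, t: int = 50) -> torch.Tensor:
    """frame: (1, 3, H, W) in [-1, 1] -> L2-normalised features (1, C, h*w)."""
    noisy = scheduler.add_noise(frame, torch.randn_like(frame), torch.tensor([t]))
    unet(noisy, t)  # forward pass only to trigger the hook
    return F.normalize(feats["f"].flatten(2), dim=1)

@torch.no_grad()
def propagate(ref_mask, f_ref, f_tgt, topk: int = 5):
    """Carry an (h, w) reference mask to the target frame via feature affinity."""
    aff = torch.einsum("bcn,bcm->bnm", f_ref, f_tgt)  # cosine affinities (1, N, M)
    val, idx = aff.topk(topk, dim=1)                  # best reference pixels per target pixel
    w = F.softmax(val, dim=1)[0]                      # (topk, M) propagation weights
    labels = ref_mask.flatten().float()[idx[0]]       # (topk, M) gathered reference labels
    return (w * labels).sum(0).view(ref_mask.shape)   # soft mask for the target frame
```

The affinity-weighted label gather in `propagate` mirrors the abstract's observation that feature affinities correlate strongly with point correspondences.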
Related papers
- Video Dataset Condensation with Diffusion Models [7.44997213284633]
Video dataset distillation is a promising solution to generate a compact synthetic dataset that retains the essential information from a large real dataset. In this paper, we focus on video dataset distillation by employing a video diffusion model to generate high-quality synthetic videos. To enhance representativeness, we introduce Video Spatio-Temporal U-Net (VST-UNet), a model designed to select a diverse and informative subset of videos. We validate the effectiveness of our approach through extensive experiments on four benchmark datasets, demonstrating performance improvements of up to 10.61% over the state-of-the-art.
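The summary does not detail how VST-UNet scores videos; as a generic stand-in for the "diverse and informative subset" step, the sketch below applies k-center greedy selection over per-video embeddings (`embed_video` is a hypothetical embedding function, not part of the paper).

```python
# Generic diversity-driven subset selection (k-center greedy), standing in
# for VST-UNet's selection step, which the summary does not specify.
import numpy as np

def k_center_greedy(embeddings: np.ndarray, budget: int) -> list[int]:
    """Pick `budget` videos whose embeddings cover the set as evenly as possible."""
    n = embeddings.shape[0]
    chosen = [int(np.random.randint(n))]  # arbitrary first centre
    dist = np.linalg.norm(embeddings - embeddings[chosen[0]], axis=1)
    while len(chosen) < budget:
        nxt = int(dist.argmax())          # farthest-from-coverage video
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return chosen

# Usage: feats = np.stack([embed_video(v) for v in synthetic_videos])
#        subset = [synthetic_videos[i] for i in k_center_greedy(feats, budget=50)]
```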
arXiv Detail & Related papers (2025-05-10T15:12:19Z)
- SpecDM: Hyperspectral Dataset Synthesis with Pixel-level Semantic Annotations [27.391859339238906]
In this paper, we explore the potential of generative diffusion models in synthesizing hyperspectral images with pixel-level annotations. To the best of our knowledge, it is the first work to generate high-dimensional HSIs with annotations. We select two of the most widely used dense prediction tasks, semantic segmentation and change detection, and generate datasets suitable for these tasks.
arXiv Detail & Related papers (2025-02-24T11:13:37Z)
- High-Precision Dichotomous Image Segmentation via Probing Diffusion Capacity [69.32473738284374]
Diffusion models have revolutionized text-to-image synthesis by delivering exceptional quality, fine detail resolution, and strong contextual awareness.
We propose DiffDIS, a diffusion-driven segmentation model that taps into the potential of the pre-trained U-Net within diffusion models.
Experiments on the DIS5K dataset demonstrate the superiority of DiffDIS, achieving state-of-the-art results through a streamlined inference process.
arXiv Detail & Related papers (2024-10-14T02:49:23Z)
- Deep learning for fast segmentation and critical dimension metrology & characterization enabling AR/VR design and fabrication [0.0]
We report on the fine-tuning of a pre-trained Segment Anything Model (SAM) using a diverse dataset of electron microscopy images.
We employ methods such as low-rank adaptation (LoRA) to reduce training time and enhance the accuracy of ROI extraction.
The model's ability to generalize to unseen images facilitates zero-shot learning and supports a CD extraction model.
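A minimal sketch of the LoRA-based fine-tuning setup the summary describes, using the `peft` library; the checkpoint and target-module names are assumptions, not the paper's configuration.

```python
# LoRA fine-tuning of SAM: freeze the backbone, train small low-rank adapters.
import torch
from transformers import SamModel
from peft import LoraConfig, get_peft_model

model = SamModel.from_pretrained("facebook/sam-vit-base")  # assumed checkpoint
config = LoraConfig(
    r=8,                      # low-rank update dimension
    lora_alpha=16,            # scaling applied to the LoRA update
    target_modules=["qkv"],   # assumed: fused attention projection in the ViT encoder
    lora_dropout=0.1,
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the small LoRA matrices require gradients
```

Training then proceeds as usual on the electron microscopy dataset, with far fewer trainable parameters than full fine-tuning.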
arXiv Detail & Related papers (2024-09-20T23:54:58Z)
- Zero-Shot Video Semantic Segmentation based on Pre-Trained Diffusion Models [96.97910688908956]
We introduce the first zero-shot approach for Video Semantic Segmentation (VSS) based on pre-trained diffusion models.
We propose a framework tailored for VSS based on pre-trained image and video diffusion models.
Experiments show that our proposed approach outperforms existing zero-shot image semantic segmentation approaches.
arXiv Detail & Related papers (2024-05-27T08:39:38Z)
- Efficient Visual State Space Model for Image Deblurring [83.57239834238035]
Convolutional neural networks (CNNs) and Vision Transformers (ViTs) have achieved excellent performance in image restoration.
We propose a simple yet effective visual state space model (EVSSM) for image deblurring.
arXiv Detail & Related papers (2024-05-23T09:13:36Z)
- An Advanced Features Extraction Module for Remote Sensing Image Super-Resolution [0.5461938536945723]
We propose an advanced feature extraction module called Channel and Spatial Attention Feature Extraction (CSA-FE).
Our proposed method helps the model focus on the specific channels and spatial locations that contain high-frequency information, so that it attends to relevant features and suppresses irrelevant ones.
Our model achieved superior performance compared to various existing models.
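The summary does not specify the CSA-FE architecture; the following CBAM-style sketch only illustrates the stated idea of reweighting channels first and spatial locations second.

```python
# A CBAM-style channel-then-spatial attention block, as a generic
# illustration of the CSA-FE idea (not the paper's exact design).
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, _, _ = x.shape
        # Channel attention: squeeze spatial dims, excite informative channels.
        w = torch.sigmoid(self.channel_mlp(x.mean(dim=(2, 3)))).view(b, c, 1, 1)
        x = x * w
        # Spatial attention: highlight locations carrying high-frequency detail.
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial_conv(s))
```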
arXiv Detail & Related papers (2024-05-07T18:15:51Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- Text-to-Image Diffusion Models are Great Sketch-Photo Matchmakers [120.49126407479717]
This paper explores text-to-image diffusion models for Zero-Shot Sketch-based Image Retrieval (ZS-SBIR).
We highlight a pivotal discovery: the capacity of text-to-image diffusion models to seamlessly bridge the gap between sketches and photos.
arXiv Detail & Related papers (2024-03-12T00:02:03Z)
- Contrastive Multiview Coding with Electro-optics for SAR Semantic Segmentation [0.6445605125467573]
We propose multi-modal representation learning for SAR semantic segmentation.
Unlike previous studies, our method jointly uses EO imagery, SAR imagery, and a label mask.
Several experiments show that our approach is superior to the existing methods in model performance, sample efficiency, and convergence speed.
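A sketch of the symmetric InfoNCE objective that contrastive multiview coding typically uses, written here for paired EO/SAR embeddings; the encoders and batch pairing are assumptions, not the paper's exact formulation.

```python
# Cross-view contrastive loss: embeddings of co-registered EO and SAR
# patches are pulled together, all other pairs in the batch pushed apart.
import torch
import torch.nn.functional as F

def info_nce(eo_feats: torch.Tensor, sar_feats: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """eo_feats, sar_feats: (B, D) embeddings of matching EO/SAR patches."""
    eo = F.normalize(eo_feats, dim=1)
    sar = F.normalize(sar_feats, dim=1)
    logits = eo @ sar.T / temperature                     # (B, B) cross-view similarities
    targets = torch.arange(eo.size(0), device=eo.device)  # positives on the diagonal
    # Symmetric loss: EO->SAR and SAR->EO retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```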
arXiv Detail & Related papers (2021-08-31T23:55:41Z)
- Salient Objects in Clutter [130.63976772770368]
This paper identifies and addresses a serious design bias of existing salient object detection (SOD) datasets.
This design bias has led to a saturation in performance for state-of-the-art SOD models when evaluated on existing datasets.
We propose a new high-quality dataset and update the previous saliency benchmark.
arXiv Detail & Related papers (2021-05-07T03:49:26Z)