Zero-Shot Video Semantic Segmentation based on Pre-Trained Diffusion Models
- URL: http://arxiv.org/abs/2405.16947v1
- Date: Mon, 27 May 2024 08:39:38 GMT
- Title: Zero-Shot Video Semantic Segmentation based on Pre-Trained Diffusion Models
- Authors: Qian Wang, Abdelrahman Eldesokey, Mohit Mendiratta, Fangneng Zhan, Adam Kortylewski, Christian Theobalt, Peter Wonka
- Abstract summary: We introduce the first zero-shot approach for Video Semantic Segmentation (VSS) based on pre-trained diffusion models.
We propose a framework tailored for VSS based on pre-trained image and video diffusion models.
Experiments show that our proposed approach outperforms existing zero-shot image semantic segmentation approaches.
- Score: 96.97910688908956
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce the first zero-shot approach for Video Semantic Segmentation (VSS) based on pre-trained diffusion models. A growing research direction attempts to employ diffusion models to perform downstream vision tasks by exploiting their deep understanding of image semantics. Yet, the majority of these approaches have focused on image-related tasks like semantic correspondence and segmentation, with less emphasis on video tasks such as VSS. Ideally, diffusion-based image semantic segmentation approaches can be applied to videos in a frame-by-frame manner. However, we find their performance on videos to be subpar due to the absence of any modeling of temporal information inherent in the video data. To this end, we tackle this problem and introduce a framework tailored for VSS based on pre-trained image and video diffusion models. We propose building a scene context model based on the diffusion features, where the model is autoregressively updated to adapt to scene changes. This context model predicts per-frame coarse segmentation maps that are temporally consistent. To refine these maps further, we propose a correspondence-based refinement strategy that aggregates predictions temporally, resulting in more confident predictions. Finally, we introduce a masked modulation approach to upsample the coarse maps to the full resolution at a high quality. Experiments show that our proposed approach outperforms existing zero-shot image semantic segmentation approaches significantly on various VSS benchmarks without any training or fine-tuning. Moreover, it rivals supervised VSS approaches on the VSPW dataset despite not being explicitly trained for VSS.
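As a rough illustration of the scene-context idea described in the abstract, the following PyTorch sketch maintains per-class feature centroids that are autoregressively updated with an exponential moving average and used to predict coarse per-frame maps. The diffusion feature extractor, the context initialization, the momentum value, and the helper names are illustrative assumptions; the correspondence-based refinement and masked-modulation upsampling stages are not shown.
```python
import torch
import torch.nn.functional as F

def update_context(context, feats, coarse_labels, momentum=0.9):
    """Autoregressively update per-class feature centroids (a stand-in for the
    scene context model) with an exponential moving average over frames.
    context: (K, C) centroids, feats: (C, H, W), coarse_labels: (H, W) in [0, K)."""
    K, C = context.shape
    flat = feats.permute(1, 2, 0).reshape(-1, C)   # (H*W, C) per-pixel features
    labels = coarse_labels.reshape(-1)
    for k in range(K):
        sel = labels == k
        if sel.any():
            context[k] = momentum * context[k] + (1 - momentum) * flat[sel].mean(dim=0)
    return context

def coarse_segment(context, feats):
    """Coarse per-frame map from cosine similarity between pixel features and
    the context centroids."""
    C, H, W = feats.shape
    flat = F.normalize(feats.permute(1, 2, 0).reshape(-1, C), dim=-1)
    sim = flat @ F.normalize(context, dim=-1).t()  # (H*W, K)
    return sim.argmax(dim=-1).reshape(H, W)

# Per-frame loop (illustrative): predict with the current context, then update it.
# context = init_context_from_first_frame(...)      # hypothetical initialization
# for feats in per_frame_diffusion_features:        # hypothetical feature extractor
#     coarse = coarse_segment(context, feats)
#     context = update_context(context, feats, coarse)
```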
Related papers
- Unleashing the Potential of the Diffusion Model in Few-shot Semantic Segmentation [56.87049651707208]
Few-shot Semantic Segmentation has evolved into In-context tasks, becoming a crucial element in assessing generalist segmentation models.
We first study how to facilitate interaction between the query image and the support image, which leads to a KV fusion method within the self-attention framework.
Based on our analysis, we establish a simple and effective framework named DiffewS, maximally retaining the original Latent Diffusion Model's generative framework.
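A minimal sketch of one plausible reading of such a KV fusion step, where support-image tokens are concatenated into the keys and values of the query image's self-attention; the projection weights and shapes are illustrative, and this is not claimed to be DiffewS's exact layer.
```python
import torch

def kv_fusion_attention(q_feats, s_feats, w_q, w_k, w_v, scale):
    """Self-attention over query-image tokens whose keys/values are fused
    (concatenated) with support-image tokens, so the query can attend to
    support semantics.  q_feats: (Nq, C), s_feats: (Ns, C), w_*: (C, d)."""
    q = q_feats @ w_q                        # (Nq, d) queries from the query image
    k = torch.cat([q_feats, s_feats]) @ w_k  # (Nq+Ns, d) fused keys
    v = torch.cat([q_feats, s_feats]) @ w_v  # (Nq+Ns, d) fused values
    attn = torch.softmax((q @ k.t()) * scale, dim=-1)
    return attn @ v                          # (Nq, d) updated query tokens
```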
arXiv Detail & Related papers (2024-10-03T10:33:49Z) - CaRDiff: Video Salient Object Ranking Chain of Thought Reasoning for Saliency Prediction with Diffusion [35.26835471419003]
Video saliency prediction aims to identify regions in a video that attract human attention and gaze, driven by bottom-up features from the video and top-down processes like memory and cognition.
Existing methods primarily focus on modeling perceptual information while neglecting the reasoning process facilitated by language.
In this paper, we propose CaRDiff, a framework that imitates the process by integrating a multimodal large language model (MLLM), a grounding module, and a diffusion model.
arXiv Detail & Related papers (2024-08-21T21:40:30Z) - Few-Shot Medical Image Segmentation with High-Fidelity Prototypes [38.073371773707514]
We propose a novel Detail Self-refined Prototype Network (DSPNet) to construct high-fidelity prototypes representing the object foreground and the background more comprehensively.
To construct global semantics while maintaining the captured detail semantics, we learn the foreground prototypes by modelling the multi-modal structures with clustering and then fusing each in a channel-wise manner.
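As a loose illustration of building several foreground prototypes by clustering, the sketch below runs a small k-means over foreground features; DSPNet's actual prototype construction and channel-wise fusion differ, and the cluster count and iteration budget are arbitrary.
```python
import torch

def foreground_prototypes(feats, mask, num_clusters=3, iters=10):
    """Cluster foreground features into several detail prototypes.
    feats: (C, H, W) feature map, mask: (H, W) binary foreground mask."""
    fg = feats.permute(1, 2, 0)[mask.bool()]                   # (N, C) foreground pixels
    centers = fg[torch.randperm(fg.shape[0])[:num_clusters]]   # random init (k-means++ omitted)
    for _ in range(iters):
        assign = torch.cdist(fg, centers).argmin(dim=1)        # nearest-center assignment
        centers = torch.stack([fg[assign == k].mean(dim=0) if (assign == k).any()
                               else centers[k] for k in range(num_clusters)])
    return centers                                             # (num_clusters, C) prototypes
```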
arXiv Detail & Related papers (2024-06-26T05:06:14Z) - Semantic Segmentation on VSPW Dataset through Masked Video Consistency [19.851665554201407]
We present our solution for the PVUW competition, where we introduce masked video consistency (MVC) based on existing models.
MVC enforces consistency between predictions of masked random frames where patches are withheld.
Our approach achieves 67% mIoU performance on the VSPW dataset, ranking 2nd in the PVUW2024 VSS track.
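A hedged sketch of a masked-consistency term in this spirit: predictions on a frame with random patches withheld are pulled toward predictions on the full frame. The model interface, patch size, mask ratio, and KL form are assumptions, not the competition solution's exact loss.
```python
import torch
import torch.nn.functional as F

def masked_video_consistency_loss(model, frame, patch=32, mask_ratio=0.5):
    """Consistency between predictions on a full frame and on the same frame
    with random patches withheld.  frame: (1, 3, H, W); model returns (1, K, H, W) logits."""
    with torch.no_grad():
        target = model(frame).softmax(dim=1)                  # pseudo-target from the full frame
    _, _, H, W = frame.shape
    gh, gw = H // patch, W // patch
    keep = (torch.rand(1, 1, gh, gw, device=frame.device) > mask_ratio).float()
    mask = F.interpolate(keep, size=(H, W), mode="nearest")   # patch-wise keep mask
    pred = model(frame * mask)                                # predictions on the masked frame
    return F.kl_div(pred.log_softmax(dim=1), target, reduction="batchmean")
```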
arXiv Detail & Related papers (2024-06-07T14:41:24Z) - Training-Free Semantic Video Composition via Pre-trained Diffusion Model [96.0168609879295]
Current approaches, predominantly trained on videos with adjusted foreground color and lighting, struggle to address deep semantic disparities beyond superficial adjustments.
We propose a training-free pipeline employing a pre-trained diffusion model imbued with semantic prior knowledge.
Experimental results reveal that our pipeline successfully ensures the visual harmony and inter-frame coherence of the outputs.
arXiv Detail & Related papers (2024-01-17T13:07:22Z) - Multi-grained Temporal Prototype Learning for Few-shot Video Object Segmentation [156.4142424784322]
Few-Shot Video Object Segmentation (FSVOS) aims to segment objects in a query video with the same category defined by a few annotated support images.
We propose to leverage multi-grained temporal guidance information for handling the temporal correlation nature of video data.
Our proposed video IPMT model significantly outperforms previous models on two benchmark datasets.
arXiv Detail & Related papers (2023-09-20T09:16:34Z) - MeDM: Mediating Image Diffusion Models for Video-to-Video Translation with Temporal Correspondence Guidance [10.457759140533168]
This study introduces an efficient and effective method, MeDM, that utilizes pre-trained image Diffusion Models for video-to-video translation with consistent temporal flow.
We employ explicit optical flows to construct a practical coding that enforces physical constraints on generated frames and mediates independent frame-wise scores.
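The sketch below shows only the flow-warping ingredient: the previous frame's score map is backward-warped with an optical flow and blended with the current one. The flow channel order, the blend weight, and the helper names are assumptions; MeDM's actual coding and mediation are more involved.
```python
import torch
import torch.nn.functional as F

def warp_with_flow(prev, flow):
    """Backward-warp `prev` (1, C, H, W) with a dense flow (1, 2, H, W) given in
    pixels (assumed channel order: x-displacement, y-displacement)."""
    _, _, H, W = prev.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float().to(prev.device)  # (H, W, 2) pixel coords
    grid = grid + flow[0].permute(1, 2, 0)                        # follow the flow
    grid[..., 0] = 2 * grid[..., 0] / (W - 1) - 1                 # normalize to [-1, 1]
    grid[..., 1] = 2 * grid[..., 1] / (H - 1) - 1
    return F.grid_sample(prev, grid.unsqueeze(0), align_corners=True)

def mediate_scores(curr_score, prev_score, flow, alpha=0.5):
    """Blend the current frame-wise score with the flow-warped previous score."""
    return alpha * curr_score + (1 - alpha) * warp_with_flow(prev_score, flow)
```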
arXiv Detail & Related papers (2023-08-19T17:59:12Z) - Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be predicted consistently under transformations of the video.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
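One way to picture such a consistency constraint is sketched below: if the first few clips are cropped away, the predicted start/end indices should shift by the same amount. The model interface is hypothetical and the sketch assumes the target activity lies outside the cropped prefix; it is not the paper's exact loss.
```python
import torch.nn.functional as F

def equivariant_consistency_loss(model, video_feats, query_emb, offset):
    """video_feats: (T, C) clip features; model returns (2,) start/end clip indices.
    Cropping the first `offset` clips should move both boundaries earlier by `offset`."""
    pred = model(video_feats, query_emb)               # boundaries on the original video
    pred_aug = model(video_feats[offset:], query_emb)  # boundaries on the cropped video
    return F.smooth_l1_loss(pred_aug + offset, pred.detach())
```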
arXiv Detail & Related papers (2023-05-06T19:29:28Z) - Unleashing Text-to-Image Diffusion Models for Visual Perception [84.41514649568094]
VPD (Visual Perception with a pre-trained diffusion model) is a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks.
We show that vision models can be adapted to downstream visual perception tasks faster using the proposed VPD framework.
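A rough sketch of tapping frozen Stable Diffusion decoder features with the diffusers library, in the spirit of using a text-to-image diffusion model as a perception backbone; the model id, timestep, and hook placement are illustrative, and VPD's full design is not reproduced here.
```python
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel

# Frozen Stable Diffusion components (model id is illustrative).
vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")
unet = UNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="unet")

features = []
hooks = [blk.register_forward_hook(lambda m, i, o: features.append(o))
         for blk in unet.up_blocks]        # tap multi-scale decoder features

@torch.no_grad()
def extract_features(image, text_emb, t=100):
    """image: (1, 3, H, W) in [-1, 1]; text_emb: (1, 77, 768) CLIP text embeddings."""
    features.clear()
    latent = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor
    unet(latent, torch.tensor([t]), encoder_hidden_states=text_emb)
    return list(features)                  # feed these to a lightweight task head
```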
arXiv Detail & Related papers (2023-03-03T18:59:47Z) - B-SCST: Bayesian Self-Critical Sequence Training for Image Captioning [8.7660229706359]
We propose a Bayesian variant of policy-gradient based reinforcement learning technique for image captioning models.
We extend the well-known Self-Critical Sequence Training (SCST) approach for image captioning models by incorporating Bayesian inference.
We show that B-SCST improves CIDEr-D scores on Flickr30k, MS COCO and VizWiz image captioning datasets, compared to the SCST approach.
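For reference, a minimal sketch of the self-critical policy-gradient loss that SCST and this Bayesian variant build on; the baseline argument stands in for the greedy-decoding reward in plain SCST, or could be a mean reward over a distribution of sampled captions in a Bayesian treatment, and the tensor shapes are illustrative.
```python
def scst_loss(log_probs, sample_rewards, baseline_rewards):
    """Self-critical policy-gradient loss with a reward baseline.
    log_probs:        (B,) summed log-probability of each sampled caption
    sample_rewards:   (B,) e.g. CIDEr-D of the sampled captions
    baseline_rewards: (B,) per-image baseline reward"""
    advantage = sample_rewards - baseline_rewards
    return -(advantage.detach() * log_probs).mean()
```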
arXiv Detail & Related papers (2020-04-06T07:07:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.