Diff-VPS: Video Polyp Segmentation via a Multi-task Diffusion Network with Adversarial Temporal Reasoning
- URL: http://arxiv.org/abs/2409.07238v1
- Date: Wed, 11 Sep 2024 12:51:41 GMT
- Title: Diff-VPS: Video Polyp Segmentation via a Multi-task Diffusion Network with Adversarial Temporal Reasoning
- Authors: Yingling Lu, Yijun Yang, Zhaohu Xing, Qiong Wang, Lei Zhu
- Abstract summary: We present a novel diffusion-based network for the video polyp segmentation task, dubbed Diff-VPS.
We incorporate multi-task supervision into diffusion models to promote their discrimination on pixel-by-pixel segmentation.
To explore the temporal dependency, a Temporal Reasoning Module (TRM) is devised that reasons about and reconstructs the target frame from previous frames.
- Score: 12.37208687991656
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion Probabilistic Models have recently attracted significant attention in the computer vision community due to their outstanding performance. However, while a substantial amount of diffusion-based research has focused on generative tasks, no prior work has introduced diffusion models to advance polyp segmentation in videos, a task frequently challenged by polyps' high camouflage and redundant temporal cues. In this paper, we present a novel diffusion-based network for the video polyp segmentation task, dubbed Diff-VPS. We incorporate multi-task supervision into diffusion models to promote their discrimination on pixel-by-pixel segmentation. This integrates the contextual high-level information obtained from the joint classification and detection tasks. To explore the temporal dependency, a Temporal Reasoning Module (TRM) is devised that reasons about and reconstructs the target frame from previous frames. We further equip TRM with a generative adversarial self-supervised strategy to produce more realistic frames and thus capture better dynamic cues. Extensive experiments are conducted on SUN-SEG, and the results indicate that our proposed Diff-VPS achieves state-of-the-art performance. Code is available at https://github.com/lydia-yllu/Diff-VPS.
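As a rough illustration only, the sketch below shows how the three kinds of supervision the abstract describes could be combined in one training step. It is not the authors' implementation (see their repository for that); every module name, signature, and loss weight is a hypothetical placeholder.

```python
# Minimal sketch (not the authors' code) of one Diff-VPS-style training
# step. All module names, signatures, and loss weights are hypothetical.
import torch
import torch.nn.functional as F

def q_sample(x0, noise, t, alphas_cumprod):
    """Standard DDPM forward process: noise the clean mask to step t."""
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    return a.sqrt() * x0 + (1.0 - a).sqrt() * noise

def diff_vps_step(denoiser, cls_head, det_head, trm, disc,
                  frames, mask, label, boxes, alphas_cumprod):
    # frames: (B, K, C, H, W) clip; the last frame is the target frame.
    target, previous = frames[:, -1], frames[:, :-1]
    B = mask.size(0)

    # 1) Diffusion segmentation branch: denoise a noised ground-truth
    #    mask, conditioned on the target frame.
    t = torch.randint(0, len(alphas_cumprod), (B,), device=mask.device)
    noise = torch.randn_like(mask)
    feats, noise_pred = denoiser(q_sample(mask, noise, t, alphas_cumprod),
                                 t, target)
    loss_seg = F.mse_loss(noise_pred, noise)

    # 2) Multi-task supervision: classification and detection heads reuse
    #    the denoiser's features to inject high-level context.
    loss_cls = F.cross_entropy(cls_head(feats), label)
    loss_det = F.smooth_l1_loss(det_head(feats), boxes)

    # 3) Temporal reasoning: reconstruct the target frame from previous
    #    frames; an adversarial term pushes reconstructions toward realism.
    recon = trm(previous)
    loss_rec = F.l1_loss(recon, target)
    loss_adv = -disc(recon).mean()  # generator side of a GAN objective

    return loss_seg + loss_cls + loss_det + loss_rec + 0.1 * loss_adv
```

The sketch only mirrors the structure described in the abstract; in practice the loss weighting and the adversarial formulation would need tuning against the paper and its released code.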
Related papers
- DINTR: Tracking via Diffusion-based Interpolation [12.130669304428565]
This work proposes a novel diffusion-based methodology to formulate the tracking task.
Our INterpolation TrackeR (DINTR) presents a promising new paradigm and achieves superior results on seven benchmarks across five indicator representations.
arXiv Detail & Related papers (2024-10-14T00:41:58Z)
- Diffusion Models in Low-Level Vision: A Survey [82.77962165415153]
Diffusion model-based solutions have been widely acclaimed for their ability to produce samples of superior quality and diversity.
We present three generic diffusion modeling frameworks and explore their correlations with other deep generative models.
We summarize extended diffusion models applied in other tasks, including medical, remote sensing, and video scenarios.
arXiv Detail & Related papers (2024-06-17T01:49:27Z)
- Frame Interpolation with Consecutive Brownian Bridge Diffusion [21.17973023413981]
Recent work formulates Video Frame Interpolation (VFI) as a diffusion-based conditional image generation problem.
We propose our unique solution: Frame Interpolation with Consecutive Brownian Bridge Diffusion (a generic bridge sampler is sketched after this entry).
arXiv Detail & Related papers (2024-05-09T17:46:22Z)
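For orientation, a generic Brownian-bridge sampler pinned at two consecutive frames looks like this. It is a textbook construction, not the paper's exact "consecutive" formulation or noise schedule:

```python
# Generic Brownian-bridge sketch: sample an intermediate state pinned at
# consecutive frames x0 (time 0) and x1 (time T). Illustrative only; the
# paper's "consecutive" construction and schedule may differ.
import torch

def brownian_bridge_sample(x0, x1, t, T=1.0, sigma=1.0):
    # Mean interpolates linearly between the endpoints; the variance
    # sigma^2 * t * (T - t) / T vanishes at t = 0 and t = T, so the
    # process is pinned exactly at both frames.
    mean = (1.0 - t / T) * x0 + (t / T) * x1
    std = sigma * ((t * (T - t)) / T) ** 0.5
    return mean + std * torch.randn_like(x0)

# e.g. a noisy midpoint between two frames:
# x_mid = brownian_bridge_sample(frame0, frame1, t=0.5)
```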
- Diffusion-TS: Interpretable Diffusion for General Time Series Generation [6.639630994040322]
Diffusion-TS is a novel diffusion-based framework that generates time series samples of high quality.
The model is trained to directly reconstruct the sample, rather than the noise, at each diffusion step, combined with a Fourier-based loss term (sketched after this entry).
Results show that Diffusion-TS achieves state-of-the-art results on various realistic time-series analyses.
arXiv Detail & Related papers (2024-03-04T05:39:23Z)
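The sample-reconstruction objective with a Fourier term mentioned in the Diffusion-TS entry can be sketched as follows; the model signature, tensor layout, and weighting are assumptions, not the paper's exact losses:

```python
# Sketch of an x0-prediction diffusion loss with an added Fourier-domain
# term, in the spirit of the entry above. Model signature, tensor layout
# (B, L, D), and the 0.1 weight are hypothetical.
import torch
import torch.nn.functional as F

def x0_fourier_loss(model, x0, t, alphas_cumprod, freq_weight=0.1):
    # Noise the clean series x0 to step t (standard DDPM forward process).
    a = alphas_cumprod[t].view(-1, 1, 1)
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * torch.randn_like(x0)

    x0_pred = model(x_t, t)  # the network reconstructs the sample directly

    time_loss = F.mse_loss(x0_pred, x0)
    # Fourier-based term: match FFT magnitudes along the time axis.
    freq_loss = F.mse_loss(torch.fft.rfft(x0_pred, dim=1).abs(),
                           torch.fft.rfft(x0, dim=1).abs())
    return time_loss + freq_weight * freq_loss
```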
- Guided Diffusion from Self-Supervised Diffusion Features [49.78673164423208]
Guidance serves as a key concept in diffusion models, yet its effectiveness is often limited by the need for extra data annotation or pretraining.
We propose a framework to extract guidance from, and specifically for, diffusion models.
arXiv Detail & Related papers (2023-12-14T11:19:11Z)
- EpiDiff: Enhancing Multi-View Synthesis via Localized Epipolar-Constrained Diffusion [60.30030562932703]
EpiDiff is a localized interactive multiview diffusion model.
It generates 16 multiview images in just 12 seconds.
It surpasses previous methods in quality evaluation metrics.
arXiv Detail & Related papers (2023-12-11T05:20:52Z)
- Exploring Vision Transformers as Diffusion Learners [15.32238726790633]
We systematically explore vision Transformers as diffusion learners for various generative tasks.
With our improvements, the performance of the vanilla ViT-based backbone (IU-ViT) is boosted to be on par with traditional U-Net-based methods.
We are the first to successfully train a single diffusion model on a text-to-image task beyond 64x64 resolution.
arXiv Detail & Related papers (2022-12-28T10:32:59Z)
- Versatile Diffusion: Text, Images and Variations All in One Diffusion Model [76.89932822375208]
Versatile Diffusion handles multiple flows of text-to-image, image-to-text, and variations in one unified model.
Our code and models are open-sourced at https://github.com/SHI-Labs/Versatile-Diffusion.
arXiv Detail & Related papers (2022-11-15T17:44:05Z)
- A Survey on Generative Diffusion Model [75.93774014861978]
Diffusion models are an emerging class of deep generative models.
They have certain limitations, including a time-consuming iterative generation process and confinement to high-dimensional Euclidean space.
This survey presents a plethora of advanced techniques aimed at enhancing diffusion models.
arXiv Detail & Related papers (2022-09-06T16:56:21Z)
- Label-Efficient Semantic Segmentation with Diffusion Models [27.01899943738203]
We demonstrate that diffusion models can also serve as an instrument for semantic segmentation.
In particular, for several pretrained diffusion models, we investigate the intermediate activations from the networks that perform the Markov step of the reverse diffusion process.
We show that these activations effectively capture the semantic information from an input image and appear to be excellent pixel-level representations for the segmentation problem (a minimal feature-extraction sketch follows this entry).
arXiv Detail & Related papers (2021-12-06T15:55:30Z)
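A minimal sketch of that recipe under assumed module names: noise an image to a chosen step, run one denoising pass through a pretrained U-Net, and harvest intermediate activations as per-pixel features.

```python
# Sketch of the feature-extraction recipe above. The unet signature, the
# choice of layers, and the hook-based harvesting are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def diffusion_features(unet, image, t, layers, alphas_cumprod):
    """Collect decoder activations from one reverse-process (Markov) step."""
    grabbed = []
    hooks = [m.register_forward_hook(lambda _m, _i, out: grabbed.append(out))
             for m in layers]
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = a.sqrt() * image + (1.0 - a).sqrt() * torch.randn_like(image)
    unet(x_t, t)  # one denoising pass; we only want the activations
    for h in hooks:
        h.remove()
    # Upsample each activation map to image resolution and stack along
    # channels, giving one feature vector per pixel.
    H, W = image.shape[-2:]
    return torch.cat([F.interpolate(f, (H, W), mode="bilinear",
                                    align_corners=False) for f in grabbed], 1)

def pixel_classifier(feat_dim, num_classes):
    # Per-pixel head over the stacked features: a 1x1 convolution suffices.
    return nn.Conv2d(feat_dim, num_classes, kernel_size=1)
```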
- MuCAN: Multi-Correspondence Aggregation Network for Video Super-Resolution [63.02785017714131]
Video super-resolution (VSR) aims to utilize multiple low-resolution frames to generate a high-resolution prediction for each frame.
Inter- and intra-frame correspondences are the key sources of temporal and spatial information.
We build an effective multi-correspondence aggregation network (MuCAN) for VSR.
arXiv Detail & Related papers (2020-07-23T05:41:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.