Diff-VPS: Video Polyp Segmentation via a Multi-task Diffusion Network with Adversarial Temporal Reasoning
- URL: http://arxiv.org/abs/2409.07238v1
- Date: Wed, 11 Sep 2024 12:51:41 GMT
- Title: Diff-VPS: Video Polyp Segmentation via a Multi-task Diffusion Network with Adversarial Temporal Reasoning
- Authors: Yingling Lu, Yijun Yang, Zhaohu Xing, Qiong Wang, Lei Zhu,
- Abstract summary: We present a novel diffusion-based network for video polyp segmentation task, dubbed as Diff-VPS.
We incorporate multi-task supervision into diffusion models to promote the discrimination of diffusion models on pixel-by-pixel segmentation.
To explore the temporal dependency, Temporal Reasoning Module (TRM) is devised via reasoning and reconstructing the target frame from the previous frames.
- Score: 12.37208687991656
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion Probabilistic Models have recently attracted significant attention in the community of computer vision due to their outstanding performance. However, while a substantial amount of diffusion-based research has focused on generative tasks, no work introduces diffusion models to advance the results of polyp segmentation in videos, which is frequently challenged by polyps' high camouflage and redundant temporal cues.In this paper, we present a novel diffusion-based network for video polyp segmentation task, dubbed as Diff-VPS. We incorporate multi-task supervision into diffusion models to promote the discrimination of diffusion models on pixel-by-pixel segmentation. This integrates the contextual high-level information achieved by the joint classification and detection tasks. To explore the temporal dependency, Temporal Reasoning Module (TRM) is devised via reasoning and reconstructing the target frame from the previous frames. We further equip TRM with a generative adversarial self-supervised strategy to produce more realistic frames and thus capture better dynamic cues. Extensive experiments are conducted on SUN-SEG, and the results indicate that our proposed Diff-VPS significantly achieves state-of-the-art performance. Code is available at https://github.com/lydia-yllu/Diff-VPS.
Related papers
- Diffusion-based Perceptual Neural Video Compression with Temporal Diffusion Information Reuse [45.134271969594614]
DiffVC is a diffusion-based perceptual neural video compression framework.
It integrates foundational diffusion model with the video conditional coding paradigm.
We show that our proposed solution delivers excellent performance in both perception metrics and visual quality.
arXiv Detail & Related papers (2025-01-23T10:23:04Z) - ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer [95.80384464922147]
Continuous visual generation requires the full-sequence diffusion-based approach.
We present ACDiT, an Autoregressive blockwise Conditional Diffusion Transformer.
We demonstrate that ACDiT can be seamlessly used in visual understanding tasks despite being trained on the diffusion objective.
arXiv Detail & Related papers (2024-12-10T18:13:20Z) - DINTR: Tracking via Diffusion-based Interpolation [12.130669304428565]
This work proposes a novel diffusion-based methodology to formulate the tracking task.
Our INterpolation TrackeR (DINTR) presents a promising new paradigm and achieves a superior multiplicity on seven benchmarks across five indicator representations.
arXiv Detail & Related papers (2024-10-14T00:41:58Z) - Diffusion Models in Low-Level Vision: A Survey [82.77962165415153]
diffusion model-based solutions have emerged as widely acclaimed for their ability to produce samples of superior quality and diversity.
We present three generic diffusion modeling frameworks and explore their correlations with other deep generative models.
We summarize extended diffusion models applied in other tasks, including medical, remote sensing, and video scenarios.
arXiv Detail & Related papers (2024-06-17T01:49:27Z) - Frame Interpolation with Consecutive Brownian Bridge Diffusion [21.17973023413981]
Video Frame Interpolation (VFI) tries to formulate VFI as a diffusion-based conditional image generation problem.
We propose our unique solution: Frame Interpolation with Consecutive Brownian Bridge Diffusion.
arXiv Detail & Related papers (2024-05-09T17:46:22Z) - On the Multi-modal Vulnerability of Diffusion Models [56.08923332178462]
We propose MMP-Attack to manipulate the generation results of diffusion models by appending a specific suffix to the original prompt.
Our goal is to induce diffusion models to generate a specific object while simultaneously eliminating the original object.
arXiv Detail & Related papers (2024-02-02T12:39:49Z) - Exploring Vision Transformers as Diffusion Learners [15.32238726790633]
We systematically explore vision Transformers as diffusion learners for various generative tasks.
With our improvements the performance of vanilla ViT-based backbone (IU-ViT) is boosted to be on par with traditional U-Net-based methods.
We are the first to successfully train a single diffusion model on text-to-image task beyond 64x64 resolution.
arXiv Detail & Related papers (2022-12-28T10:32:59Z) - Versatile Diffusion: Text, Images and Variations All in One Diffusion
Model [76.89932822375208]
Versatile Diffusion handles multiple flows of text-to-image, image-to-text, and variations in one unified model.
Our code and models are open-sourced at https://github.com/SHI-Labs/Versatile-Diffusion.
arXiv Detail & Related papers (2022-11-15T17:44:05Z) - A Survey on Generative Diffusion Model [75.93774014861978]
Diffusion models are an emerging class of deep generative models.
They have certain limitations, including a time-consuming iterative generation process and confinement to high-dimensional Euclidean space.
This survey presents a plethora of advanced techniques aimed at enhancing diffusion models.
arXiv Detail & Related papers (2022-09-06T16:56:21Z) - Label-Efficient Semantic Segmentation with Diffusion Models [27.01899943738203]
We demonstrate that diffusion models can also serve as an instrument for semantic segmentation.
In particular, for several pretrained diffusion models, we investigate the intermediate activations from the networks that perform the Markov step of the reverse diffusion process.
We show that these activations effectively capture the semantic information from an input image and appear to be excellent pixel-level representations for the segmentation problem.
arXiv Detail & Related papers (2021-12-06T15:55:30Z) - MuCAN: Multi-Correspondence Aggregation Network for Video
Super-Resolution [63.02785017714131]
Video super-resolution (VSR) aims to utilize multiple low-resolution frames to generate a high-resolution prediction for each frame.
Inter- and intra-frames are the key sources for exploiting temporal and spatial information.
We build an effective multi-correspondence aggregation network (MuCAN) for VSR.
arXiv Detail & Related papers (2020-07-23T05:41:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.