Label-Efficient Data Augmentation with Video Diffusion Models for Guidewire Segmentation in Cardiac Fluoroscopy
- URL: http://arxiv.org/abs/2412.16050v4
- Date: Mon, 27 Jan 2025 21:13:10 GMT
- Title: Label-Efficient Data Augmentation with Video Diffusion Models for Guidewire Segmentation in Cardiac Fluoroscopy
- Authors: Shaoyan Pan, Yikang Liu, Lin Zhao, Eric Z. Chen, Xiao Chen, Terrence Chen, Shanhui Sun
- Abstract summary: Deep learning methods have demonstrated high accuracy and robustness in wire segmentation, but they require substantial annotated datasets for generalizability. We propose the Segmentation-guided Frame-consistency Video Diffusion Model (SF-VD) to generate large collections of labeled fluoroscopy videos.
- Score: 16.62770246342126
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The accurate segmentation of guidewires in interventional cardiac fluoroscopy videos is crucial for computer-aided navigation tasks. Although deep learning methods have demonstrated high accuracy and robustness in wire segmentation, they require substantial annotated datasets for generalizability, underscoring the need for extensive labeled data to enhance model performance. To address this challenge, we propose the Segmentation-guided Frame-consistency Video Diffusion Model (SF-VD) to generate large collections of labeled fluoroscopy videos, augmenting the training data for wire segmentation networks. SF-VD leverages videos with limited annotations by independently modeling scene distribution and motion distribution. It first samples the scene distribution by generating 2D fluoroscopy images with wires positioned according to a specified input mask, and then samples the motion distribution by progressively generating subsequent frames, ensuring frame-to-frame coherence through a frame-consistency strategy. A segmentation-guided mechanism further refines the process by adjusting wire contrast, ensuring a diverse range of visibility in the synthesized image. Evaluation on a fluoroscopy dataset confirms the superior quality of the generated videos and shows significant improvements in guidewire segmentation.
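To make the two-stage sampling scheme concrete, here is a minimal sketch of the generation loop the abstract describes: a scene model produces the first frame from a wire mask, then a motion model rolls out subsequent frames conditioned on the previous one. All interfaces and parameter names below are hypothetical stand-ins, not the authors' implementation.

```python
# Hedged sketch of SF-VD-style two-stage sampling (interfaces are hypothetical).
import torch
import torch.nn as nn

class DummyDiffusion(nn.Module):
    """Stand-in for a trained diffusion sampler; returns a random 'frame'."""
    def sample(self, shape, **cond):
        return torch.rand(shape)

def generate_labeled_clip(scene_model, motion_model, wire_mask, num_frames=8):
    """Synthesize a fluoroscopy clip whose wire follows `wire_mask` (1,1,H,W)."""
    # Stage 1: sample the scene distribution -- one 2D fluoroscopy image with
    # the wire positioned according to the input mask.
    frame = scene_model.sample(wire_mask.shape, mask=wire_mask)
    frames = [frame]
    # Stage 2: sample the motion distribution -- generate each subsequent frame
    # conditioned on the previous one (the frame-consistency strategy).
    for _ in range(num_frames - 1):
        frame = motion_model.sample(wire_mask.shape, mask=wire_mask, prev=frames[-1])
        frames.append(frame)
    # The mask that conditioned generation doubles as the segmentation label.
    return torch.cat(frames, dim=0), wire_mask

clip, label = generate_labeled_clip(DummyDiffusion(), DummyDiffusion(),
                                    torch.zeros(1, 1, 64, 64))
```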
Related papers
- PropFly: Learning to Propagate via On-the-Fly Supervision from Pre-trained Video Diffusion Models [35.59605874012795]
PropFly is a training pipeline for propagation-based video editing. It relies on pre-trained video diffusion models (VDMs) instead of requiring off-the-shelf or precomputed paired video editing datasets. The pipeline enables an adapter attached to the pre-trained VDM to learn to propagate edits via a Guidance-Modulated Flow Matching (GMFM) loss.
arXiv Detail & Related papers (2026-02-24T06:11:08Z) - UniE2F: A Unified Diffusion Framework for Event-to-Frame Reconstruction with Video Foundation Models [67.24086328473437]
Event cameras excel at recording relative intensity changes rather than absolute intensity. The resulting data streams suffer from a significant loss of spatial information and static texture details. UniE2F addresses this limitation by leveraging a pre-trained video diffusion model to reconstruct high-fidelity video frames from sparse event data.
arXiv Detail & Related papers (2026-02-22T14:06:49Z) - Image Diffusion Models Exhibit Emergent Temporal Propagation in Videos [13.824335238443334]
DRIFT is a framework for object tracking in videos that leverages a pretrained image diffusion model with SAM-guided mask refinement. The authors demonstrate the effectiveness of test-time optimization strategies (DDIM inversion, textual inversion, and adaptive head weighting) in adapting diffusion features for robust and consistent label propagation.
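Of the strategies listed, DDIM inversion is a standard procedure worth spelling out: it runs the deterministic DDIM update in reverse to map a clean image into the model's noise space so its diffusion features can be extracted consistently. A minimal sketch under the usual epsilon-prediction parameterization; `eps_model` and the `alpha_bar` schedule are generic placeholders, not DRIFT's code.

```python
# Standard DDIM inversion (eta = 0); illustrative, not DRIFT's exact code.
import torch

@torch.no_grad()
def ddim_invert(eps_model, x, alpha_bar):
    """Map image x to latent noise by reversing the DDIM sampling update.
    alpha_bar: 1-D tensor of cumulative noise-schedule products, low t first."""
    for t in range(len(alpha_bar) - 1):
        a_t, a_next = alpha_bar[t], alpha_bar[t + 1]
        eps = eps_model(x, t)                                # predicted noise
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()       # implied clean image
        x = a_next.sqrt() * x0 + (1 - a_next).sqrt() * eps   # step toward noise
    return x
```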
arXiv Detail & Related papers (2025-11-25T05:21:23Z) - HRVVS: A High-resolution Video Vasculature Segmentation Network via Hierarchical Autoregressive Residual Priors [18.951871257229055]
We introduce a high-quality, frame-by-frame annotated hepatic vasculature dataset containing 35 long hepatectomy videos and 11,442 high-resolution frames. We further propose a novel high-resolution video vasculature segmentation network, dubbed HRVVS, which significantly outperforms state-of-the-art methods.
arXiv Detail & Related papers (2025-07-30T09:57:38Z) - FOCUS: Unified Vision-Language Modeling for Interactive Editing Driven by Referential Segmentation [55.01077993490845]
Recent Large Vision-Language Models (LVLMs) demonstrate promising capabilities in unifying visual understanding and generative modeling. We introduce FOCUS, a unified LVLM that integrates segmentation-aware perception and controllable object-centric generation within an end-to-end framework.
arXiv Detail & Related papers (2025-06-20T07:46:40Z) - AccVideo: Accelerating Video Diffusion Model with Synthetic Dataset [55.82208863521353]
We propose AccVideo, which reduces the number of inference steps to accelerate video diffusion models using a synthetic dataset.
Our model achieves 8.5x improvements in generation speed compared to the teacher model.
Compared to previous accelerating methods, our approach is capable of generating videos with higher quality and resolution.
arXiv Detail & Related papers (2025-03-25T08:52:07Z) - Optical-Flow Guided Prompt Optimization for Coherent Video Generation [51.430833518070145]
We propose a framework called MotionPrompt that guides the video generation process via optical flow.
We optimize learnable token embeddings during reverse sampling steps by using gradients from a trained discriminator applied to random frame pairs.
This approach allows our method to generate visually coherent video sequences that closely reflect natural motion dynamics, without compromising the fidelity of the generated content.
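As a rough illustration of the mechanism described above (not the released code), one optimization step could look like this: decode frames from the current prompt embeddings, score a random frame pair with the motion discriminator, and update the embeddings by gradient descent. All interfaces here are assumptions.

```python
# Hypothetical single step of prompt optimization against a motion discriminator.
import torch

def prompt_step(prompt_emb, decode_frames, discriminator, lr=1e-3):
    prompt_emb = prompt_emb.detach().requires_grad_(True)
    frames = decode_frames(prompt_emb)          # differentiable decode, (T, C, H, W)
    i, j = torch.randint(0, frames.shape[0], (2,))
    # The discriminator (trained on optical flow of real videos, per the
    # summary) scores how natural the motion between the two frames looks.
    loss = -discriminator(frames[i], frames[j]).mean()
    loss.backward()
    with torch.no_grad():
        prompt_emb -= lr * prompt_emb.grad
    return prompt_emb.detach()
```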
arXiv Detail & Related papers (2024-11-23T12:26:52Z) - MSEG-VCUQ: Multimodal SEGmentation with Enhanced Vision Foundation Models, Convolutional Neural Networks, and Uncertainty Quantification for High-Speed Video Phase Detection Data [0.0]
High-speed video (HSV) phase detection (PD) segmentation is crucial for monitoring vapor, liquid, and microlayer phases in industrial processes.
CNN-based models like U-Net have shown success in simplified shadowgraphy-based two-phase flow (TPF) analysis.
MSEG-VCUQ integrates U-Net CNNs with the transformer-based Segment Anything Model (SAM) to achieve enhanced segmentation accuracy and cross-modality generalization.
arXiv Detail & Related papers (2024-11-12T00:54:26Z) - DeNVeR: Deformable Neural Vessel Representations for Unsupervised Video Vessel Segmentation [3.1977656204331684]
Deformable Neural Vessel Representations (DeNVeR) is an unsupervised approach for vessel segmentation in X-ray angiography videos without annotated ground truth. Key contributions include a novel layer bootstrapping technique, a parallel vessel motion loss, and the integration of Eulerian motion fields for modeling complex vessel dynamics.
arXiv Detail & Related papers (2024-06-03T17:59:34Z) - Zero-Shot Video Semantic Segmentation based on Pre-Trained Diffusion Models [96.97910688908956]
We introduce the first zero-shot approach for Video Semantic Segmentation (VSS) based on pre-trained diffusion models.
We propose a framework tailored for VSS based on pre-trained image and video diffusion models.
Experiments show that our proposed approach outperforms existing zero-shot image semantic segmentation approaches.
arXiv Detail & Related papers (2024-05-27T08:39:38Z) - Motion-Aware Video Frame Interpolation [49.49668436390514]
We introduce a Motion-Aware Video Frame Interpolation (MA-VFI) network, which directly estimates intermediate optical flow from consecutive frames.
It not only extracts global semantic relationships and spatial details from input frames with different receptive fields, but also effectively reduces the required computational cost and complexity.
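Flow-based interpolators of this kind typically synthesize the intermediate frame by warping the inputs with the estimated flow. Below is the standard backward-warping helper (generic PyTorch, not the authors' code) that such a pipeline would use:

```python
# Backward warping: sample a source frame at positions displaced by a flow field.
import torch
import torch.nn.functional as F

def backward_warp(frame, flow):
    """frame: (B, C, H, W); flow: (B, 2, H, W) in pixel units (x, y)."""
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(frame)    # (2, H, W)
    pos = grid.unsqueeze(0) + flow                           # displaced coordinates
    # Normalize to [-1, 1] as required by grid_sample.
    pos_x = 2.0 * pos[:, 0] / (w - 1) - 1.0
    pos_y = 2.0 * pos[:, 1] / (h - 1) - 1.0
    grid_n = torch.stack((pos_x, pos_y), dim=-1)             # (B, H, W, 2)
    return F.grid_sample(frame, grid_n, align_corners=True)
```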
arXiv Detail & Related papers (2024-02-05T11:00:14Z) - TrackDiffusion: Tracklet-Conditioned Video Generation via Diffusion Models [75.20168902300166]
We propose TrackDiffusion, a novel video generation framework affording fine-grained trajectory-conditioned motion control.
A pivotal component of TrackDiffusion is the instance enhancer, which explicitly ensures inter-frame consistency of multiple objects.
The video sequences generated by TrackDiffusion can be used as training data for visual perception models.
arXiv Detail & Related papers (2023-12-01T15:24:38Z) - MeDM: Mediating Image Diffusion Models for Video-to-Video Translation with Temporal Correspondence Guidance [10.457759140533168]
This study introduces an efficient and effective method, MeDM, that utilizes pre-trained image Diffusion Models for video-to-video translation with consistent temporal flow.
We employ explicit optical flows to construct a practical coding that enforces physical constraints on generated frames and mediates independent frame-wise scores.
arXiv Detail & Related papers (2023-08-19T17:59:12Z) - Denoising Diffusion Autoencoders are Unified Self-supervised Learners [58.194184241363175]
This paper shows that the networks in diffusion models, namely denoising diffusion autoencoders (DDAE), are unified self-supervised learners.
DDAE has already learned strongly linear-separable representations within its intermediate layers without auxiliary encoders.
Our diffusion-based approach achieves 95.9% and 50.0% linear evaluation accuracies on CIFAR-10 and Tiny-ImageNet, respectively.
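The linear evaluation protocol implied here is: noise an input to a fixed timestep, run the denoiser, grab an intermediate block's activations, and fit a linear classifier on the pooled features. A sketch with placeholder module names:

```python
# Extracting diffusion features for linear probing (names are illustrative).
import torch
import torch.nn as nn

def extract_feature(denoiser, block, x0, t, alpha_bar_t):
    """Pooled activations of an intermediate `block` of `denoiser` at noise level t."""
    acts = {}
    hook = block.register_forward_hook(lambda m, inp, out: acts.update(f=out))
    x_t = alpha_bar_t.sqrt() * x0 + (1 - alpha_bar_t).sqrt() * torch.randn_like(x0)
    with torch.no_grad():
        denoiser(x_t, t)                      # forward pass populates the hook
    hook.remove()
    return acts["f"].mean(dim=(2, 3))         # global average pool -> (B, C)

# "Linear evaluation" then trains only a linear head on these frozen features.
head = nn.Linear(512, 10)                     # feature dim / class count are placeholders
```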
arXiv Detail & Related papers (2023-03-17T04:20:47Z) - Domain Adaptive Video Segmentation via Temporal Pseudo Supervision [46.38660541271893]
Video semantic segmentation can mitigate data labelling constraints by adapting from a labelled source domain toward an unlabelled target domain.
We design temporal pseudo supervision (TPS), a simple and effective method that explores consistency training for learning effective representations from unlabelled target videos.
We show that TPS is simpler to implement, much more stable to train, and achieves superior video accuracy as compared with the state-of-the-art.
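A minimal reading of this consistency-training idea (hedged; TPS's actual cross-frame mechanics differ in detail): predict on one target frame, propagate the hard labels to a neighbouring frame as pseudo ground truth, and supervise the model's prediction on an augmented view of that frame.

```python
# Sketch of temporal pseudo supervision on unlabelled target video frames.
import torch
import torch.nn.functional as F

def tps_loss(model, warp, augment, frame_a, frame_b, flow_ab):
    """`warp` propagates labels along `flow_ab`; `augment` perturbs frame_b.
    Both are caller-supplied placeholders for the paper's actual components."""
    with torch.no_grad():
        pseudo = model(frame_a).argmax(dim=1)           # (B, H, W) hard labels
        pseudo = warp(pseudo, flow_ab)                  # propagate frame_a -> frame_b
    logits = model(augment(frame_b))                    # prediction on augmented view
    return F.cross_entropy(logits, pseudo)
```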
arXiv Detail & Related papers (2022-07-06T00:36:14Z) - Spatial-Temporal Frequency Forgery Clue for Video Forgery Detection in VIS and NIR Scenario [87.72258480670627]
Existing face forgery detection methods based on frequency domain find that the GAN forged images have obvious grid-like visual artifacts in the frequency spectrum compared to the real images.
This paper proposes a Cosine Transform-based Forgery Clue Augmentation Network (FCAN-DCT) to achieve a more comprehensive spatial-temporal feature representation.
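The grid-like frequency artifacts mentioned above are easy to visualize with a plain 2D discrete cosine transform; the snippet below computes such a spectrum (a generic illustration, not the FCAN-DCT architecture):

```python
# Log-magnitude 2D DCT spectrum of a grayscale frame; GAN-forged frames
# tend to show grid-like peaks here, per the observation above.
import numpy as np
from scipy.fft import dctn

def dct_spectrum(frame_gray: np.ndarray) -> np.ndarray:
    """frame_gray: (H, W) array in [0, 1]; returns its log DCT magnitude."""
    return np.log1p(np.abs(dctn(frame_gray, norm="ortho")))
```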
arXiv Detail & Related papers (2022-07-05T09:27:53Z) - Flow-Guided Sparse Transformer for Video Deblurring [124.11022871999423]
Flow-Guided Sparse Transformer (FGST) is a framework for video deblurring.
FGSW-MSA enjoys the guidance of the estimated optical flow to globally sample spatially sparse elements corresponding to the same scene patch in neighboring frames.
Our proposed FGST outperforms state-of-the-art methods on both the DVD and GOPRO datasets and yields more visually pleasing results in real video deblurring.
arXiv Detail & Related papers (2022-01-06T02:05:32Z) - Temporally stable video segmentation without video annotations [6.184270985214255]
We introduce a method to adapt still image segmentation models to video in an unsupervised manner.
We verify that the consistency measure is well correlated with human judgement via a user study.
We observe improvements in the generated segmented videos with minimal loss of accuracy.
arXiv Detail & Related papers (2021-10-17T18:59:11Z) - Improving Semantic Segmentation through Spatio-Temporal Consistency Learned from Videos [39.25927216187176]
We leverage unsupervised learning of depth, egomotion, and camera intrinsics to improve single-image semantic segmentation.
The predicted depth, egomotion, and camera intrinsics are used to provide an additional supervision signal to the segmentation model.
arXiv Detail & Related papers (2020-04-11T07:09:29Z)