Mission Balance: Generating Under-represented Class Samples using Video Diffusion Models
- URL: http://arxiv.org/abs/2505.09858v1
- Date: Wed, 14 May 2025 23:43:29 GMT
- Title: Mission Balance: Generating Under-represented Class Samples using Video Diffusion Models
- Authors: Danush Kumar Venkatesh, Isabel Funke, Micha Pfeiffer, Fiona Kolbinger, Hanna Maria Schmeiser, Juergen Weitz, Marius Distler, Stefanie Speidel
- Abstract summary: We propose a two-stage, text-based method to generate high-fidelity surgical videos for under-represented classes. We evaluate our method on two downstream tasks, surgical action recognition and intra-operative event prediction, demonstrating that incorporating the synthetic videos substantially enhances model performance.
- Score: 1.5678321653327674
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Computer-assisted interventions can improve intra-operative guidance, particularly through deep learning methods that harness the spatiotemporal information in surgical videos. However, the severe data imbalance often found in surgical video datasets hinders the development of high-performing models. In this work, we aim to overcome the data imbalance by synthesizing surgical videos. We propose a unique two-stage, text-conditioned diffusion-based method to generate high-fidelity surgical videos for under-represented classes. Our approach conditions the generation process on text prompts and decouples spatial and temporal modeling by utilizing a 2D latent diffusion model to capture spatial content and then integrating temporal attention layers to ensure temporal consistency. Furthermore, we introduce a rejection sampling strategy to select the most suitable synthetic samples, effectively augmenting existing datasets to address class imbalance. We evaluate our method on two downstream tasks, surgical action recognition and intra-operative event prediction, demonstrating that incorporating synthetic videos from our approach substantially enhances model performance. We open-source our implementation at https://gitlab.com/nct_tso_public/surgvgen.
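The decoupled spatial-temporal design described in the abstract can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation (which is available at the GitLab link above): it shows a temporal self-attention block that could be interleaved with the frozen spatial layers of a pretrained 2D latent diffusion UNet, so that spatial content comes from the 2D model while attention along the frame axis enforces temporal consistency. The tensor layout, module names, and hyperparameters are assumptions.

```python
# Illustrative sketch (not the paper's code): temporal self-attention over the
# frame axis, applied independently at each spatial position of video latents.
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Self-attention across frames, run independently per pixel position."""
    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width) latent features
        b, f, c, h, w = x.shape
        # Fold spatial positions into the batch so attention runs along time.
        x = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, f, c)
        residual = x
        x = self.norm(x)
        x, _ = self.attn(x, x, x)  # each pixel attends across all frames
        x = x + residual
        return x.reshape(b, h, w, f, c).permute(0, 3, 4, 1, 2)

# Usage: freeze the pretrained spatial UNet blocks and train only layers
# like this one on video latents.
layer = TemporalAttention(channels=320)
latents = torch.randn(2, 16, 320, 32, 32)  # 2 clips, 16 frames each
print(layer(latents).shape)  # torch.Size([2, 16, 320, 32, 32])
```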
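Similarly, the rejection-sampling step can be approximated as filtering generated clips with a pretrained downstream classifier. This is a hypothetical sketch under stated assumptions: the confidence criterion, the `classifier` interface, and the threshold are illustrative, not the paper's exact selection rule.

```python
# Hypothetical sketch of the rejection-sampling idea: keep only synthetic
# clips that a pretrained classifier confidently assigns to the intended
# under-represented class.
import torch

@torch.no_grad()
def select_samples(videos, target_class: int, classifier, threshold: float = 0.8):
    """Return the synthetic videos accepted for dataset augmentation."""
    accepted = []
    for video in videos:                       # video: (frames, C, H, W)
        logits = classifier(video.unsqueeze(0))
        prob = logits.softmax(dim=-1)[0, target_class]
        if prob >= threshold:                  # reject low-confidence samples
            accepted.append(video)
    return accepted
```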
Related papers
- Holistic Surgical Phase Recognition with Hierarchical Input Dependent State Space Models [56.2236083600999]
We propose a novel hierarchical input-dependent state space model for surgical video analysis. Our framework incorporates a temporally consistent visual feature extractor, which appends a state space model head to a visual feature extractor to propagate temporal information. Experiments have shown that our method outperforms the current state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2025-06-26T14:43:57Z)
- HieraSurg: Hierarchy-Aware Diffusion Model for Surgical Video Generation [44.37374628674769]
We propose HieraSurg, a hierarchy-aware surgical video generation framework consisting of two specialized diffusion models. The model exhibits particularly fine-grained adherence when provided with existing segmentation maps, suggesting its potential for practical surgical applications.
arXiv Detail & Related papers (2025-06-26T14:07:23Z)
- SimpleGVR: A Simple Baseline for Latent-Cascaded Video Super-Resolution [55.14432034345353]
We study key design principles for the latter, cascaded video super-resolution models, which remain underexplored. First, we propose two strategies to generate training pairs that better mimic the output characteristics of the base model, ensuring alignment between the VSR model and its upstream generator. Second, we provide critical insights into VSR model behavior through systematic analysis of (1) timestep sampling strategies and (2) noise augmentation effects on low-resolution (LR) inputs.
arXiv Detail & Related papers (2025-06-24T17:57:26Z)
- AccVideo: Accelerating Video Diffusion Model with Synthetic Dataset [55.82208863521353]
We propose AccVideo to reduce the number of inference steps and accelerate video diffusion models using a synthetic dataset. Our model achieves an 8.5x improvement in generation speed compared to the teacher model. Compared to previous acceleration methods, our approach can generate videos with higher quality and resolution.
arXiv Detail & Related papers (2025-03-25T08:52:07Z)
- Temporal-Consistent Video Restoration with Pre-trained Diffusion Models [51.47188802535954]
Video restoration (VR) aims to recover high-quality videos from degraded ones. Recent zero-shot VR methods using pre-trained diffusion models (DMs) suffer from approximation errors during reverse diffusion and insufficient temporal consistency. We present a novel Maximum a Posteriori (MAP) framework that directly parameterizes video frames in the seed space of DMs, eliminating approximation errors.
arXiv Detail & Related papers (2025-03-19T03:41:56Z)
- Towards Suturing World Models: Learning Predictive Models for Robotic Surgical Tasks [0.35087986342428684]
We introduce diffusion-based temporal models that capture the dynamics of fine-grained robotic sub-stitch actions. We fine-tune two state-of-the-art video diffusion models to generate high-fidelity surgical action sequences at high resolution with $\ge$49 frames. Our experimental results demonstrate that these world models can effectively capture the dynamics of suturing, potentially enabling improved training, skill assessment tools, and autonomous surgical systems.
arXiv Detail & Related papers (2025-03-16T14:51:12Z)
- GAUDA: Generative Adaptive Uncertainty-guided Diffusion-based Augmentation for Surgical Segmentation [1.0808810256442274]
We learn semantically comprehensive yet compact latent representations of the (image, mask) space. We show that our approach can effectively synthesise unseen high-quality paired segmentation data of remarkable semantic coherence.
arXiv Detail & Related papers (2025-01-18T16:40:53Z)
- Latent Drifting in Diffusion Models for Counterfactual Medical Image Synthesis [55.959002385347645]
Latent Drifting enables diffusion models to be conditioned on medical images for the complex task of counterfactual image generation. We evaluate our method on three public longitudinal benchmark datasets of brain MRI and chest X-rays for counterfactual image generation.
arXiv Detail & Related papers (2024-12-30T01:59:34Z)
- SurGen: Text-Guided Diffusion Model for Surgical Video Generation [0.6551407780976953]
SurGen is a text-guided diffusion model tailored for surgical video synthesis.
We validate the visual and temporal quality of the outputs using standard image and video generation metrics.
Our results demonstrate the potential of diffusion models to serve as valuable educational tools for surgical trainees.
arXiv Detail & Related papers (2024-08-26T05:38:27Z)
- Interactive Generation of Laparoscopic Videos with Diffusion Models [1.5488613349551188]
We show how to generate realistic laparoscopic images and videos by specifying a surgical action through text.
We demonstrate the performance of our approach using the publicly available Cholec dataset family.
We achieve an FID of 38.097 and an F1-score of 0.71.
arXiv Detail & Related papers (2024-04-23T12:36:07Z)
- Leaping Into Memories: Space-Time Deep Feature Synthesis [93.10032043225362]
We propose LEAPS, an architecture-independent method for synthesizing videos from the internal representations of video models.
We quantitatively and qualitatively evaluate the applicability of LEAPS by inverting a range of convolutional and attention-based architectures trained on Kinetics-400.
arXiv Detail & Related papers (2023-03-17T12:55:22Z)
- Relational Graph Learning on Visual and Kinematics Embeddings for Accurate Gesture Recognition in Robotic Surgery [84.73764603474413]
We propose a novel online multi-modal graph network (MRG-Net) that dynamically integrates visual and kinematics information.
The effectiveness of our method is demonstrated with state-of-the-art results on the public JIGSAWS dataset.
arXiv Detail & Related papers (2020-11-03T11:00:10Z)