HieraSurg: Hierarchy-Aware Diffusion Model for Surgical Video Generation
- URL: http://arxiv.org/abs/2506.21287v1
- Date: Thu, 26 Jun 2025 14:07:23 GMT
- Title: HieraSurg: Hierarchy-Aware Diffusion Model for Surgical Video Generation
- Authors: Diego Biagini, Nassir Navab, Azade Farshad
- Abstract summary: We propose HieraSurg, a hierarchy-aware surgical video generation framework consisting of two specialized diffusion models. The model exhibits particularly fine-grained adherence when provided with existing segmentation maps, suggesting its potential for practical surgical applications.
- Score: 44.37374628674769
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Surgical Video Synthesis has emerged as a promising research direction following the success of diffusion models in general-domain video generation. Although existing approaches achieve high-quality video generation, most are unconditional and fail to maintain consistency with surgical actions and phases, lacking the surgical understanding and fine-grained guidance necessary for factual simulation. We address these challenges by proposing HieraSurg, a hierarchy-aware surgical video generation framework consisting of two specialized diffusion models. Given a surgical phase and an initial frame, HieraSurg first predicts future coarse-grained semantic changes through a segmentation prediction model. The final video is then generated by a second-stage model that augments these temporal segmentation maps with fine-grained visual features, leading to effective texture rendering and integration of semantic information in the video space. Our approach leverages surgical information at multiple levels of abstraction, including surgical phase, action triplets, and panoptic segmentation maps. The experimental results on Cholecystectomy Surgical Video Generation demonstrate that the model significantly outperforms prior work both quantitatively and qualitatively, showing strong generalization capabilities and the ability to generate higher frame-rate videos. The model exhibits particularly fine-grained adherence when provided with existing segmentation maps, suggesting its potential for practical surgical applications.
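The abstract outlines a two-stage pipeline: a first diffusion model predicts future coarse-grained segmentation maps from a surgical phase and an initial frame, and a second model renders the final video from those maps plus fine-grained visual features. Below is a minimal PyTorch sketch of that two-stage interface; the tensor shapes, class counts, and stand-in convolutional networks are assumptions, since the abstract does not specify the diffusion models' internals.

    # A minimal sketch of HieraSurg's two-stage interface. The networks are
    # stand-in convolutions, not the paper's diffusion models; shapes and
    # vocabulary sizes (phases, panoptic classes, horizon) are assumptions.
    import torch
    import torch.nn as nn

    class SegmentationPredictor(nn.Module):
        """Stage 1: predict future panoptic segmentation maps from a
        surgical phase label and an initial frame."""
        def __init__(self, num_phases=7, num_classes=12, horizon=8):
            super().__init__()
            self.horizon, self.num_classes = horizon, num_classes
            self.phase_emb = nn.Embedding(num_phases, 16)
            self.net = nn.Conv2d(3 + 16, horizon * num_classes, 3, padding=1)

        def forward(self, frame, phase):
            b, _, h, w = frame.shape
            p = self.phase_emb(phase)[:, :, None, None].expand(b, 16, h, w)
            logits = self.net(torch.cat([frame, p], dim=1))
            return logits.view(b, self.horizon, self.num_classes, h, w).argmax(2)

    class VideoRenderer(nn.Module):
        """Stage 2: render frames conditioned on the temporal segmentation
        maps plus fine-grained visual features from the initial frame."""
        def __init__(self, num_classes=12):
            super().__init__()
            self.num_classes = num_classes
            self.net = nn.Conv2d(num_classes + 3, 3, 3, padding=1)

        def forward(self, frame, seg_maps):
            b, t, h, w = seg_maps.shape
            one_hot = torch.zeros(b, t, self.num_classes, h, w).scatter_(
                2, seg_maps.unsqueeze(2), 1.0)
            frames = [self.net(torch.cat([one_hot[:, i], frame], 1)) for i in range(t)]
            return torch.stack(frames, dim=1)

    frame = torch.randn(1, 3, 64, 64)              # initial laparoscopic frame
    phase = torch.tensor([2])                      # surgical phase index
    segs = SegmentationPredictor()(frame, phase)   # coarse semantic future
    video = VideoRenderer()(frame, segs)           # fine-grained rendering
    print(video.shape)                             # torch.Size([1, 8, 3, 64, 64])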
Related papers
- Holistic Surgical Phase Recognition with Hierarchical Input Dependent State Space Models [56.2236083600999]
We propose a novel hierarchical input-dependent state space model for surgical video analysis. Our framework incorporates a temporally consistent visual feature extractor, which appends a state space model head to a visual feature extractor to propagate temporal information. Experiments have shown that our method outperforms the current state-of-the-art methods by a large margin.
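The mechanism named in the summary, a state space head appended to per-frame visual features to propagate temporal information, can be sketched as below; the diagonal linear recurrence is an illustrative stand-in, not the paper's input-dependent model.

    # A minimal sketch of an SSM head over per-frame features:
    # h_t = a * h_{t-1} + B x_t, applied along the time axis. The recurrence
    # here is a simple diagonal stand-in, not the paper's exact design.
    import torch
    import torch.nn as nn

    class SSMHead(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.a = nn.Parameter(torch.full((dim,), 0.9))  # per-channel decay
            self.b = nn.Linear(dim, dim)
            self.out = nn.Linear(dim, dim)

        def forward(self, feats):                  # feats: (B, T, D)
            h = torch.zeros_like(feats[:, 0])
            outs = []
            for t in range(feats.shape[1]):
                h = self.a * h + self.b(feats[:, t])
                outs.append(self.out(h))
            return torch.stack(outs, dim=1)        # temporally consistent features

    feats = torch.randn(2, 16, 256)                # backbone features per frame
    print(SSMHead(256)(feats).shape)               # torch.Size([2, 16, 256])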
arXiv Detail & Related papers (2025-06-26T14:43:57Z) - SurgVidLM: Towards Multi-grained Surgical Video Understanding with Large Language Model [55.13206879750197]
SurgVidLM is the first video language model designed to address both full and fine-grained surgical video comprehension. We introduce the StageFocus mechanism, a two-stage framework performing multi-grained, progressive understanding of surgical videos. Experimental results demonstrate that SurgVidLM significantly outperforms state-of-the-art Vid-LLMs in both full and fine-grained video understanding tasks.
arXiv Detail & Related papers (2025-06-22T02:16:18Z) - Surgical Foundation Model Leveraging Compression and Entropy Maximization for Image-Guided Surgical Assistance [50.486523249499115]
Real-time video understanding is critical to guide procedures in minimally invasive surgery (MIS). We propose Compress-to-Explore (C2E), a novel self-supervised framework to learn compact, informative representations from surgical videos. C2E uses entropy-maximizing decoders to compress images while preserving clinically relevant details, improving encoder performance without labeled data.
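The blurb does not define the entropy-maximizing objective precisely; one plausible reading, sketched below, trades reconstruction fidelity against high entropy of latent code usage. The soft-codebook formulation and weighting are assumptions, not the C2E objective.

    # A speculative sketch of an entropy-maximizing compression loss:
    # minimize reconstruction error while maximizing the entropy of the
    # batch-averaged code usage (an assumed reading of the blurb).
    import torch
    import torch.nn.functional as F

    def entropy_max_loss(recon, target, code_logits, entropy_weight=0.1):
        """code_logits: (B, K) soft assignments over K latent codes."""
        usage = code_logits.softmax(dim=-1).mean(dim=0)        # code usage
        entropy = -(usage * usage.clamp_min(1e-9).log()).sum()
        return F.mse_loss(recon, target) - entropy_weight * entropy

    recon, target = torch.randn(4, 3, 32, 32), torch.randn(4, 3, 32, 32)
    print(entropy_max_loss(recon, target, torch.randn(4, 64)))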
arXiv Detail & Related papers (2025-05-16T14:02:24Z) - Mission Balance: Generating Under-represented Class Samples using Video Diffusion Models [1.5678321653327674]
We propose a two-stage, text-based method to generate high-fidelity surgical videos for under-represented classes. We evaluate our method on two downstream tasks: action recognition and intra-operative event prediction.
arXiv Detail & Related papers (2025-05-14T23:43:29Z) - SASVi - Segment Any Surgical Video [2.330834737588252]
We propose SASVi, a novel re-prompting mechanism based on a frame-wise Mask R-CNN Overseer model. This model automatically re-prompts the foundation model SAM2 when the scene constellation changes.
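The re-prompting control loop the summary describes can be sketched as follows: a frame-wise overseer reports the current object set, and the segmenter is re-prompted whenever that set (the scene constellation) changes. The stub classes below stand in for Mask R-CNN and SAM2, whose real APIs differ.

    # A minimal sketch of overseer-driven re-prompting; FakeOverseer and
    # FakeSAM2 are stubs, not the real Mask R-CNN or SAM2 interfaces.
    def track_video(frames, overseer, sam2):
        tracked, masks = set(), []
        for frame in frames:
            detections = overseer(frame)              # {class_id: box}
            if set(detections) != tracked:            # constellation changed
                tracked = set(detections)
                sam2.reset()
                for cls, box in detections.items():   # fresh box prompts
                    sam2.add_prompt(cls, box)
            masks.append(sam2.propagate(frame))
        return masks

    class FakeOverseer:
        def __call__(self, t):                        # a second tool appears at t=3
            return {0: (0, 0, 8, 8)} if t < 3 else {0: (0, 0, 8, 8), 1: (4, 4, 12, 12)}

    class FakeSAM2:
        def reset(self): self.prompts = {}
        def add_prompt(self, cls, box): self.prompts[cls] = box
        def propagate(self, t): return set(self.prompts)

    masks = track_video(range(6), FakeOverseer(), FakeSAM2())
    print(masks[0], masks[-1])                        # {0} {0, 1}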
arXiv Detail & Related papers (2025-02-12T00:29:41Z) - Is Segment Anything Model 2 All You Need for Surgery Video Segmentation? A Systematic Evaluation [25.459372606957736]
In this paper, we systematically evaluate the performance of the SAM2 model on the zero-shot surgical video segmentation task. We conduct experiments under different configurations, including varied prompting strategies and robustness tests.
arXiv Detail & Related papers (2024-12-31T16:20:05Z) - VISAGE: Video Synthesis using Action Graphs for Surgery [34.21344214645662]
We introduce the novel task of future video generation in laparoscopic surgery.
Our proposed method, VISAGE, leverages the power of action scene graphs to capture the sequential nature of laparoscopic procedures.
Our experimental results demonstrate high-fidelity video generation for laparoscopic procedures.
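One simple way to realize the action-graph conditioning described above is to embed each (instrument, verb, target) triplet edge and pool the edges into a conditioning vector; the vocabulary sizes and mean pooling below are assumptions, not VISAGE's actual encoder.

    # A minimal sketch of encoding an action scene graph into a single
    # conditioning vector for a video generator (assumed design).
    import torch
    import torch.nn as nn

    class ActionGraphEncoder(nn.Module):
        def __init__(self, num_nodes=32, num_verbs=10, dim=64):
            super().__init__()
            self.node = nn.Embedding(num_nodes, dim)
            self.verb = nn.Embedding(num_verbs, dim)

        def forward(self, triplets):        # (E, 3): instrument, verb, target
            e = (self.node(triplets[:, 0]) + self.verb(triplets[:, 1])
                 + self.node(triplets[:, 2]))
            return e.mean(dim=0)            # pooled conditioning vector

    g = torch.tensor([[0, 1, 2], [3, 1, 2]])   # e.g. two triplet edges
    print(ActionGraphEncoder()(g).shape)       # torch.Size([64])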
arXiv Detail & Related papers (2024-10-23T10:28:17Z) - SurGen: Text-Guided Diffusion Model for Surgical Video Generation [0.6551407780976953]
SurGen is a text-guided diffusion model tailored for surgical video synthesis.
We validate the visual and temporal quality of the outputs using standard image and video generation metrics.
Our results demonstrate the potential of diffusion models to serve as valuable educational tools for surgical trainees.
arXiv Detail & Related papers (2024-08-26T05:38:27Z) - Endora: Video Generation Models as Endoscopy Simulators [53.72175969751398]
This paper introduces Endora, an innovative approach to generating medical videos that simulate clinical endoscopy scenes.
We also pioneer the first public benchmark for endoscopy simulation with video generation models.
Endora marks a notable breakthrough in the deployment of generative AI for clinical endoscopy research.
arXiv Detail & Related papers (2024-03-17T00:51:59Z) - GLSFormer: Gated - Long, Short Sequence Transformer for Step Recognition in Surgical Videos [57.93194315839009]
We propose a vision transformer-based approach to learn temporal features directly from sequence-level patches.
We extensively evaluate our approach on two cataract surgery video datasets, Cataract-101 and D99, and demonstrate superior performance compared to various state-of-the-art methods.
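The title suggests gated fusion of long- and short-range temporal streams over patch tokens; the sketch below illustrates that pattern with attention over a full window and a recent window fused by a learned sigmoid gate. Window sizes and the fusion rule are assumptions, not the paper's design.

    # A minimal sketch of gated long/short temporal fusion (assumed design).
    import torch
    import torch.nn as nn

    class GatedLongShort(nn.Module):
        def __init__(self, dim=64, short=4):
            super().__init__()
            self.short = short
            self.long_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
            self.short_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
            self.gate = nn.Linear(2 * dim, dim)

        def forward(self, tokens):                     # (B, T, D) patch tokens
            long_out, _ = self.long_attn(tokens, tokens, tokens)
            recent = tokens[:, -self.short:]           # short temporal window
            short_out, _ = self.short_attn(recent, recent, recent)
            short_out = short_out.mean(1, keepdim=True).expand_as(long_out)
            g = torch.sigmoid(self.gate(torch.cat([long_out, short_out], -1)))
            return g * long_out + (1 - g) * short_out  # gated fusion

    x = torch.randn(2, 16, 64)
    print(GatedLongShort()(x).shape)                   # torch.Size([2, 16, 64])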
arXiv Detail & Related papers (2023-07-20T17:57:04Z) - Multimodal Semantic Scene Graphs for Holistic Modeling of Surgical Procedures [70.69948035469467]
We take advantage of the latest computer vision methodologies for generating 3D graphs from camera views.
We then introduce the Multimodal Semantic Scene Graph (MSSG), which aims to provide a unified symbolic and semantic representation of surgical procedures.
arXiv Detail & Related papers (2021-06-09T14:35:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.