SurgSora: Object-Aware Diffusion Model for Controllable Surgical Video Generation
- URL: http://arxiv.org/abs/2412.14018v3
- Date: Sun, 22 Jun 2025 15:21:04 GMT
- Title: SurgSora: Object-Aware Diffusion Model for Controllable Surgical Video Generation
- Authors: Tong Chen, Shuya Yang, Junyi Wang, Long Bai, Hongliang Ren, Luping Zhou
- Abstract summary: SurgSora is a framework that generates high-fidelity, motion-controllable surgical videos from a single input frame and user-specified motion cues. By conditioning enriched object and flow features within the Stable Video Diffusion model, SurgSora achieves state-of-the-art visual authenticity and controllability.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Surgical video generation can enhance medical education and research, but existing methods lack fine-grained motion control and realism. We introduce SurgSora, a framework that generates high-fidelity, motion-controllable surgical videos from a single input frame and user-specified motion cues. Unlike prior approaches that treat objects indiscriminately or rely on ground-truth segmentation masks, SurgSora leverages self-predicted object features and depth information to refine RGB appearance and optical flow for precise video synthesis. It consists of three key modules: (1) the Dual Semantic Injector, which extracts object-specific RGB-D features and segmentation cues to enhance spatial representations; (2) the Decoupled Flow Mapper, which fuses multi-scale optical flow with semantic features for realistic motion dynamics; and (3) the Trajectory Controller, which estimates sparse optical flow and enables user-guided object movement. By conditioning these enriched features within the Stable Video Diffusion model, SurgSora achieves state-of-the-art visual authenticity and controllability, advancing surgical video synthesis as demonstrated by extensive quantitative and qualitative comparisons. Our human evaluation in collaboration with expert surgeons further demonstrates the high realism of SurgSora-generated videos, highlighting the potential of our method for surgical training and education. Our project is available at https://surgsora.github.io/surgsora.github.io.
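The abstract outlines a three-module conditioning pipeline. Below is a minimal PyTorch sketch of how such a pipeline could be wired together; the module interfaces, tensor shapes, and fusion order are illustrative assumptions based only on the abstract, not the authors' released code.

```python
# Minimal sketch of the three SurgSora modules as described in the abstract.
# All interfaces, shapes, and the fusion order are illustrative assumptions.
import torch
import torch.nn as nn

class DualSemanticInjector(nn.Module):
    """Assumed interface: fuses object-specific RGB-D features with segmentation cues."""
    def __init__(self, dim=256):
        super().__init__()
        self.rgbd_proj = nn.Conv2d(4, dim, kernel_size=3, padding=1)  # RGB + predicted depth
        self.seg_proj = nn.Conv2d(1, dim, kernel_size=3, padding=1)   # self-predicted object mask

    def forward(self, rgb, depth, seg):
        rgbd = torch.cat([rgb, depth], dim=1)             # (B, 4, H, W)
        return self.rgbd_proj(rgbd) + self.seg_proj(seg)  # enriched spatial features

class DecoupledFlowMapper(nn.Module):
    """Assumed interface: fuses optical flow with semantic features (single scale shown for brevity)."""
    def __init__(self, dim=256):
        super().__init__()
        self.flow_proj = nn.Conv2d(2, dim, kernel_size=3, padding=1)  # (dx, dy) flow field
        self.fuse = nn.Conv2d(2 * dim, dim, kernel_size=1)

    def forward(self, flow, semantic):
        return self.fuse(torch.cat([self.flow_proj(flow), semantic], dim=1))

def trajectory_to_sparse_flow(trajectories, h, w):
    """Trajectory Controller stand-in: rasterize user drags into a sparse flow map."""
    flow = torch.zeros(1, 2, h, w)
    for (x, y), (dx, dy) in trajectories:  # user-specified motion cues
        flow[0, 0, y, x], flow[0, 1, y, x] = dx, dy
    return flow

# Usage: the fused conditioning would be fed to a Stable Video Diffusion backbone.
B, H, W = 1, 64, 64
rgb, depth, seg = torch.rand(B, 3, H, W), torch.rand(B, 1, H, W), torch.rand(B, 1, H, W)
semantic = DualSemanticInjector()(rgb, depth, seg)
flow = trajectory_to_sparse_flow([((10, 20), (2.0, -1.0))], H, W)
cond = DecoupledFlowMapper()(flow, semantic)
print(cond.shape)  # torch.Size([1, 256, 64, 64])
```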
Related papers
- HieraSurg: Hierarchy-Aware Diffusion Model for Surgical Video Generation [44.37374628674769]
We propose HieraSurg, a hierarchy-aware surgical video generation framework consisting of two specialized diffusion models.
The model exhibits particularly fine-grained adherence when provided with existing segmentation maps, suggesting its potential for practical surgical applications.
arXiv Detail & Related papers (2025-06-26T14:07:23Z)
- SurgVidLM: Towards Multi-grained Surgical Video Understanding with Large Language Model [55.13206879750197]
SurgVidLM is the first video language model designed to address both full and fine-grained surgical video comprehension.
We introduce the StageFocus mechanism, a two-stage framework performing multi-grained, progressive understanding of surgical videos.
Experimental results demonstrate that SurgVidLM significantly outperforms state-of-the-art Vid-LLMs in both full and fine-grained video understanding tasks.
arXiv Detail & Related papers (2025-06-22T02:16:18Z)
- SG2VID: Scene Graphs Enable Fine-Grained Control for Video Synthesis [1.0808810256442274]
We introduce SG2VID, the first diffusion-based video model that leverages Scene Graphs for both precise video synthesis and fine-grained human control.
We demonstrate SG2VID's capabilities across three public datasets featuring cataract and cholecystectomy surgery.
arXiv Detail & Related papers (2025-06-03T17:02:38Z)
- Instrument-Splatting: Controllable Photorealistic Reconstruction of Surgical Instruments Using Gaussian Splatting [15.51259636712844]
Real2Sim is becoming increasingly important with the rapid development of surgical artificial intelligence (AI) and autonomy.
We propose a novel Real2Sim methodology, Instrument-Splatting, that leverages 3D Gaussian Splatting to provide fully controllable 3D reconstruction of surgical instruments from monocular surgical videos.
arXiv Detail & Related papers (2025-03-06T04:37:09Z)
- SurGrID: Controllable Surgical Simulation via Scene Graph to Image Diffusion [0.8680185045005854]
We introduce SurGrID, a Scene Graph to Image Diffusion Model, allowing for controllable surgical scene synthesis.
Scene Graphs encode the spatial and semantic information of a surgical scene's components, which is then translated into an intermediate representation.
Our proposed method improves the fidelity of generated images and their coherence with the graph input over the state-of-the-art.
arXiv Detail & Related papers (2025-02-11T20:49:13Z)
- VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation [62.64811405314847]
VidCRAFT3 is a novel framework for precise image-to-video generation.
It enables control over camera motion, object motion, and lighting direction simultaneously.
It produces high-quality video content, outperforming state-of-the-art methods in control granularity and visual coherence.
arXiv Detail & Related papers (2025-02-11T13:11:59Z)
- Motion Prompting: Controlling Video Generation with Motion Trajectories [57.049252242807874]
We train a video generation model conditioned on sparse or dense video trajectories.
We translate high-level user requests into detailed, semi-dense motion prompts.
We demonstrate our approach through various applications, including camera and object motion control, "interacting" with an image, motion transfer, and image editing.
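As a rough illustration of trajectory conditioning in this spirit, the sketch below rasterizes a few point tracks into a per-frame motion-prompt tensor that a video diffusion model could consume as extra conditioning channels. The Gaussian-blob encoding is an assumption for illustration, not the paper's actual representation.

```python
# Sketch: rasterize sparse point tracks into a per-frame "motion prompt"
# tensor. The Gaussian-splat encoding below is an illustrative assumption.
import torch

def tracks_to_motion_prompt(tracks, num_frames, h, w, sigma=2.0):
    """tracks: one list of per-frame (x, y) positions per tracked point."""
    ys = torch.arange(h).float().view(h, 1)
    xs = torch.arange(w).float().view(1, w)
    prompt = torch.zeros(num_frames, 1, h, w)
    for track in tracks:
        for t, (x, y) in enumerate(track[:num_frames]):
            # Soft blob marking where this point sits at frame t.
            prompt[t, 0] += torch.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma**2))
    return prompt  # (T, 1, H, W); e.g., concatenated to latents as extra channels

# One point dragged diagonally across 8 frames:
track = [(8 + 2 * t, 8 + t) for t in range(8)]
print(tracks_to_motion_prompt([track], num_frames=8, h=32, w=32).shape)
# torch.Size([8, 1, 32, 32])
```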
arXiv Detail & Related papers (2024-12-03T18:59:56Z)
- VISAGE: Video Synthesis using Action Graphs for Surgery [34.21344214645662]
We introduce the novel task of future video generation in laparoscopic surgery.
Our proposed method, VISAGE, leverages the power of action scene graphs to capture the sequential nature of laparoscopic procedures.
Results of our experiments demonstrate high-fidelity video generation for laparoscopic procedures.
arXiv Detail & Related papers (2024-10-23T10:28:17Z)
- Multi-Layer Gaussian Splatting for Immersive Anatomy Visualization [1.0580610673031074]
In medical image visualization, path tracing of volumetric medical data like CT scans produces lifelike visualizations.
We propose a novel approach utilizing Gaussian Splatting (GS) to create an efficient but static intermediate representation of CT scans.
Our approach achieves interactive frame rates while preserving anatomical structures, with quality adjustable to the target hardware.
arXiv Detail & Related papers (2024-10-22T12:56:58Z)
- SurGen: Text-Guided Diffusion Model for Surgical Video Generation [0.6551407780976953]
SurGen is a text-guided diffusion model tailored for surgical video synthesis.
We validate the visual and temporal quality of the outputs using standard image and video generation metrics.
Our results demonstrate the potential of diffusion models to serve as valuable educational tools for surgical trainees.
arXiv Detail & Related papers (2024-08-26T05:38:27Z)
- Animate Your Thoughts: Decoupled Reconstruction of Dynamic Natural Vision from Slow Brain Activity [13.04953215936574]
We propose a two-stage model named Mind-Animator to reconstruct human dynamic vision from brain activity.
During the fMRI-to-feature stage, we decouple semantic, structure, and motion features from fMRI.
In the feature-to-video stage, these features are integrated into videos using an inflated Stable Diffusion.
arXiv Detail & Related papers (2024-05-06T08:56:41Z)
- FLex: Joint Pose and Dynamic Radiance Fields Optimization for Stereo Endoscopic Videos [79.50191812646125]
Reconstruction of endoscopic scenes is an important asset for various medical applications, from post-surgery analysis to educational training.
We address the challenging setup of a moving endoscope within a highly dynamic environment of deforming tissue.
We propose an implicit scene separation into multiple overlapping 4D neural radiance fields (NeRFs) and a progressive optimization scheme jointly optimizing for reconstruction and camera poses from scratch.
This improves ease of use and allows reconstruction to scale in time, processing surgical videos of 5,000 frames and more: an improvement of more than ten times over the state of the art, while remaining agnostic to external tracking information.
arXiv Detail & Related papers (2024-03-18T19:13:02Z)
- TrackDiffusion: Tracklet-Conditioned Video Generation via Diffusion Models [75.20168902300166]
We propose TrackDiffusion, a novel video generation framework affording fine-grained trajectory-conditioned motion control.
A pivotal component of TrackDiffusion is the instance enhancer, which explicitly ensures inter-frame consistency of multiple objects.
Video sequences generated by our TrackDiffusion can be used as training data for visual perception models.
arXiv Detail & Related papers (2023-12-01T15:24:38Z)
- Interactive Character Control with Auto-Regressive Motion Diffusion Models [18.727066177880708]
We propose A-MDM (Auto-regressive Motion Diffusion Model) for real-time motion synthesis.
Our conditional diffusion model takes an initial pose as input and auto-regressively generates successive motion frames conditioned on the previous frame.
We introduce a suite of techniques for incorporating interactive controls into A-MDM, such as task-oriented sampling, in-painting, and hierarchical reinforcement learning.
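A minimal sketch of the auto-regressive rollout this summary describes is given below: each motion frame is sampled by a conditional diffusion model given the previous frame, then fed back in. The denoiser here is a toy placeholder standing in for the trained network.

```python
# Toy sketch of an auto-regressive motion-diffusion rollout: sample each
# frame with a conditional denoiser given the previous frame, then feed it
# back in. The denoiser is a placeholder for the trained network.
import torch

def denoise_step(x, cond, t):
    """Placeholder reverse-diffusion step; a real model predicts the noise."""
    return x - 0.1 * (x - cond)

def sample_next_frame(prev_frame, num_steps=50):
    x = torch.randn_like(prev_frame)        # start from Gaussian noise
    for t in reversed(range(num_steps)):
        x = denoise_step(x, prev_frame, t)  # condition on the previous frame
    return x

def rollout(initial_pose, horizon=30):
    frames, frame = [initial_pose], initial_pose
    for _ in range(horizon):
        frame = sample_next_frame(frame)    # auto-regressive feedback
        frames.append(frame)
    return torch.stack(frames)

motion = rollout(torch.zeros(69))           # 69-D pose vector (assumed size)
print(motion.shape)                         # torch.Size([31, 69])
```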
arXiv Detail & Related papers (2023-06-01T07:48:34Z)
- Controllable Mind Visual Diffusion Model [58.83896307930354]
Brain signal visualization has emerged as an active research area, serving as a critical interface between the human visual system and computer vision models.
We propose a novel approach, referred to as the Controllable Mind Visual Diffusion Model (CMVDM).
CMVDM extracts semantic and silhouette information from fMRI data using attribute alignment and assistant networks.
We then leverage a control model to fully exploit the extracted information for image synthesis, resulting in generated images that closely resemble the visual stimuli in terms of semantics and silhouette.
arXiv Detail & Related papers (2023-05-17T11:36:40Z)
- Motion-Conditioned Diffusion Model for Controllable Video Synthesis [75.367816656045]
We introduce MCDiff, a conditional diffusion model that generates a video from a starting image frame and a set of strokes.
We show that MCDiff achieves state-of-the-art visual quality in stroke-guided controllable video synthesis.
arXiv Detail & Related papers (2023-04-27T17:59:32Z)
- Improving Unsupervised Video Object Segmentation with Motion-Appearance Synergy [52.03068246508119]
We present IMAS, a method that segments the primary objects in videos without manual annotation in training or inference.
IMAS achieves Improved UVOS with Motion-Appearance Synergy.
We demonstrate its effectiveness in tuning critical hyperparameters that were previously tuned with human annotation or hand-crafted, hyperparameter-specific metrics.
arXiv Detail & Related papers (2022-12-17T06:47:30Z)
- Multimodal Semantic Scene Graphs for Holistic Modeling of Surgical Procedures [70.69948035469467]
We take advantage of the latest computer vision methodologies for generating 3D graphs from camera views.
We then introduce the Multimodal Semantic Scene Graph (MSSG), which aims to provide a unified symbolic and semantic representation of surgical procedures.
arXiv Detail & Related papers (2021-06-09T14:35:44Z)
- Multi-frame Feature Aggregation for Real-time Instrument Segmentation in Endoscopic Video [11.100734994959419]
We propose a novel Multi-frame Feature Aggregation (MFFA) module to aggregate video frame features temporally and spatially.
We also develop a method that can randomly synthesize a surgical frame sequence from a single labeled frame to assist network training.
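As a hedged sketch of that synthesis idea, the snippet below applies the same random geometric jitter to an image and its mask to fabricate an aligned pseudo-sequence from one labeled frame. The augmentation parameters and the use of torchvision affine transforms are assumptions for illustration, not the paper's actual method.

```python
# Sketch: fabricate a pseudo video sequence from one labeled frame by applying
# the same random geometric jitter to image and mask. Parameters are illustrative.
import random
import torch
import torchvision.transforms.functional as TF

def synthesize_sequence(image, mask, length=5, max_shift=8, max_angle=5.0):
    frames = []
    for _ in range(length):
        angle = random.uniform(-max_angle, max_angle)
        dx = random.randint(-max_shift, max_shift)
        dy = random.randint(-max_shift, max_shift)
        # Identical transform keeps the image/mask pair aligned.
        frames.append((
            TF.affine(image, angle=angle, translate=[dx, dy], scale=1.0, shear=[0.0]),
            TF.affine(mask, angle=angle, translate=[dx, dy], scale=1.0, shear=[0.0]),
        ))
    return frames

img = torch.rand(3, 128, 128)
msk = torch.randint(0, 2, (1, 128, 128)).float()
seq = synthesize_sequence(img, msk)
print(len(seq), seq[0][0].shape)  # 5 torch.Size([3, 128, 128])
```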
arXiv Detail & Related papers (2020-11-17T16:27:27Z)
- Relational Graph Learning on Visual and Kinematics Embeddings for Accurate Gesture Recognition in Robotic Surgery [84.73764603474413]
We propose a novel online approach of multi-modal graph network (i.e., MRG-Net) to dynamically integrate visual and kinematics information.
The effectiveness of our method is demonstrated with state-of-the-art results on the public JIGSAWS dataset.
arXiv Detail & Related papers (2020-11-03T11:00:10Z)