SG2VID: Scene Graphs Enable Fine-Grained Control for Video Synthesis
- URL: http://arxiv.org/abs/2506.03082v2
- Date: Fri, 13 Jun 2025 17:00:16 GMT
- Title: SG2VID: Scene Graphs Enable Fine-Grained Control for Video Synthesis
- Authors: Ssharvien Kumar Sivakumar, Yannik Frisch, Ghazal Ghazaei, Anirban Mukhopadhyay
- Abstract summary: We introduce SG2VID, the first diffusion-based video model that leverages Scene Graphs for both precise video synthesis and fine-grained human control. We demonstrate SG2VID's capabilities across three public datasets featuring cataract and cholecystectomy surgery.
- Score: 1.0808810256442274
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Surgical simulation plays a pivotal role in training novice surgeons, accelerating their learning curve and reducing intra-operative errors. However, conventional simulation tools fall short in providing the necessary photorealism and the variability of human anatomy. In response, current methods are shifting towards generative model-based simulators. Yet, these approaches primarily focus on using increasingly complex conditioning for precise synthesis while neglecting fine-grained human control. To address this gap, we introduce SG2VID, the first diffusion-based video model that leverages Scene Graphs for both precise video synthesis and fine-grained human control. We demonstrate SG2VID's capabilities across three public datasets featuring cataract and cholecystectomy surgery. SG2VID not only outperforms previous methods qualitatively and quantitatively, it also enables precise synthesis, providing accurate control over the size and movement of tools and anatomy, the entrance of new tools, and the overall scene layout. We qualitatively motivate how SG2VID can be used for generative augmentation and present an experiment demonstrating its ability to improve a downstream phase detection task when the training set is extended with our synthetic videos. Finally, to showcase SG2VID's ability to retain human control, we interact with the Scene Graphs to generate new video samples depicting major yet rare intra-operative irregularities.
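The abstract describes conditioning a video diffusion model on Scene Graphs so that nodes (tools, anatomy) and their relations steer synthesis. As a rough illustration only, the PyTorch sketch below shows one way such conditioning could be wired: a small message-passing encoder turns a per-frame scene graph into tokens, and a toy denoiser cross-attends its noisy video latents to those tokens. All module names, dimensions, and the encoder design are assumptions for illustration; this is not SG2VID's actual architecture.

```python
# Hypothetical sketch (not the authors' code) of scene-graph conditioning
# for a video diffusion denoiser. Node/edge features are toy placeholders.
import torch
import torch.nn as nn


class SceneGraphEncoder(nn.Module):
    """Encodes a scene graph (nodes = tools/anatomy, edges = relations)
    into conditioning tokens via one round of simple message passing."""

    def __init__(self, node_dim=16, edge_dim=8, hidden=64):
        super().__init__()
        self.node_proj = nn.Linear(node_dim, hidden)
        self.msg = nn.Linear(2 * hidden + edge_dim, hidden)
        self.update = nn.GRUCell(hidden, hidden)

    def forward(self, nodes, edges, edge_index):
        # nodes: (N, node_dim), edges: (E, edge_dim), edge_index: (2, E)
        h = torch.relu(self.node_proj(nodes))
        src, dst = edge_index
        m = torch.relu(self.msg(torch.cat([h[src], h[dst], edges], dim=-1)))
        agg = torch.zeros_like(h).index_add_(0, dst, m)  # sum messages per node
        return self.update(agg, h)                       # (N, hidden) graph tokens


class GraphConditionedDenoiser(nn.Module):
    """Toy denoiser: cross-attends noisy video latents to scene-graph tokens."""

    def __init__(self, latent_dim=32, hidden=64):
        super().__init__()
        self.to_hidden = nn.Linear(latent_dim, hidden)
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.out = nn.Linear(hidden, latent_dim)

    def forward(self, noisy_latents, graph_tokens):
        # noisy_latents: (B, T, latent_dim); graph_tokens: (B, N, hidden)
        q = self.to_hidden(noisy_latents)
        ctx, _ = self.attn(q, graph_tokens, graph_tokens)
        return self.out(ctx)  # predicted noise, same shape as the latents


if __name__ == "__main__":
    enc, denoiser = SceneGraphEncoder(), GraphConditionedDenoiser()
    nodes = torch.randn(5, 16)                  # e.g. tools, anatomy, background
    edges = torch.randn(6, 8)                   # spatial/interaction relations
    edge_index = torch.randint(0, 5, (2, 6))
    tokens = enc(nodes, edges, edge_index).unsqueeze(0)  # (1, 5, 64)
    latents = torch.randn(1, 8, 32)                      # 8-frame latent clip
    print(denoiser(latents, tokens).shape)               # torch.Size([1, 8, 32])
```

Editing the graph tokens (e.g. adding a node for a new tool or changing a node's size attribute) is the kind of interaction through which fine-grained control over the generated video could be exercised.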
Related papers
- Vidar: Embodied Video Diffusion Model for Generalist Bimanual Manipulation [21.424029706788883]
We introduce Video Diffusion for Action Reasoning (Vidar). We pre-train the video diffusion model on 750K multi-view videos from three real-world bimanual robot platforms. With only 20 minutes of human demonstrations on an unseen robot platform, Vidar generalizes to unseen tasks and backgrounds with strong semantic understanding.
arXiv Detail & Related papers (2025-07-17T08:31:55Z) - Mission Balance: Generating Under-represented Class Samples using Video Diffusion Models [1.5678321653327674]
We propose a two-stage, text-based method to generate high-fidelity surgical videos for under-represented classes. We evaluate our method on two downstream tasks, action recognition and intra-operative event prediction, demonstrating its effectiveness.
arXiv Detail & Related papers (2025-05-14T23:43:29Z) - Instrument-Splatting: Controllable Photorealistic Reconstruction of Surgical Instruments Using Gaussian Splatting [15.51259636712844]
Real2Sim is becoming increasingly important with the rapid development of surgical artificial intelligence (AI) and autonomy. We propose a novel Real2Sim methodology, Instrument-Splatting, that leverages 3D Gaussian Splatting to provide fully controllable 3D reconstruction of surgical instruments from monocular surgical videos.
arXiv Detail & Related papers (2025-03-06T04:37:09Z) - SurgSora: Decoupled RGBD-Flow Diffusion Model for Controllable Surgical Video Generation [25.963369099780113]
SurgSora is a motion-controllable surgical video generation framework that uses a single input frame and user-controllable motion cues. SurgSora consists of three key modules: the Dual Semantic Injector (DSI), which extracts object-relevant RGB and depth features from the input frame; the Decoupled Flow Mapper (DFM), which fuses optical flow with semantic RGBD features at multiple scales to enhance temporal understanding and object dynamics; and the Trajectory Controller (TC), which allows users to specify motion directions and estimates sparse optical flow, guiding the video generation process.
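For intuition, here is a minimal, hypothetical sketch of the kind of motion cueing and flow-feature fusion this summary describes: user clicks are rasterised into a sparse flow map and concatenated with RGBD features at two scales. The function and module names, shapes, and fusion scheme are illustrative assumptions, not SurgSora's released code.

```python
# Hypothetical sketch: turn sparse user motion cues into a flow map and
# fuse it with multi-scale RGBD features for conditioning a generator.
import torch
import torch.nn as nn
import torch.nn.functional as F


def sparse_flow_from_clicks(h, w, clicks):
    """Rasterise user-specified (x, y, dx, dy) motion cues into a 2-channel flow map."""
    flow = torch.zeros(1, 2, h, w)
    for x, y, dx, dy in clicks:
        flow[0, 0, y, x] = dx
        flow[0, 1, y, x] = dy
    return flow


class FlowRGBDFusion(nn.Module):
    """Concatenates a downsampled flow map with RGBD features at each scale."""

    def __init__(self, feat_ch=32, num_scales=2):
        super().__init__()
        self.fuse = nn.ModuleList(
            [nn.Conv2d(feat_ch + 2, feat_ch, kernel_size=3, padding=1)
             for _ in range(num_scales)]
        )

    def forward(self, rgbd_feats, flow):
        # rgbd_feats: list of (B, feat_ch, H_s, W_s) feature maps, one per scale
        fused = []
        for conv, feat in zip(self.fuse, rgbd_feats):
            f = F.interpolate(flow, size=feat.shape[-2:], mode="bilinear",
                              align_corners=False)
            fused.append(torch.relu(conv(torch.cat([feat, f], dim=1))))
        return fused


if __name__ == "__main__":
    flow = sparse_flow_from_clicks(64, 64, [(10, 20, 3.0, -1.0), (40, 30, -2.0, 2.0)])
    feats = [torch.randn(1, 32, 64, 64), torch.randn(1, 32, 32, 32)]
    print([t.shape for t in FlowRGBDFusion()(feats, flow)])
    # [torch.Size([1, 32, 64, 64]), torch.Size([1, 32, 32, 32])]
```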
arXiv Detail & Related papers (2024-12-18T16:34:51Z) - VidMan: Exploiting Implicit Dynamics from Video Diffusion Model for Effective Robot Manipulation [79.00294932026266]
VidMan is a novel framework that employs a two-stage training mechanism to enhance stability and improve data utilization efficiency.
Our framework outperforms the state-of-the-art baseline model GR-1 on the CALVIN benchmark, achieving an 11.7% relative improvement, and demonstrates over 9% precision gains on the OXE small-scale dataset.
arXiv Detail & Related papers (2024-11-14T03:13:26Z) - VISAGE: Video Synthesis using Action Graphs for Surgery [34.21344214645662]
We introduce the novel task of future video generation in laparoscopic surgery.
Our proposed method, VISAGE, leverages the power of action scene graphs to capture the sequential nature of laparoscopic procedures.
Results of our experiments demonstrate high-fidelity video generation for laparoscopic procedures.
arXiv Detail & Related papers (2024-10-23T10:28:17Z) - Expressive Gaussian Human Avatars from Monocular RGB Video [69.56388194249942]
We introduce EVA, a drivable human model that meticulously sculpts fine details based on 3D Gaussians and SMPL-X.
We highlight the critical importance of aligning the SMPL-X model with RGB frames for effective avatar learning.
We propose a context-aware adaptive density control strategy, which adaptively adjusts the gradient thresholds.
arXiv Detail & Related papers (2024-07-03T15:36:27Z) - S^2Former-OR: Single-Stage Bi-Modal Transformer for Scene Graph Generation in OR [50.435592120607815]
Scene graph generation (SGG) of surgical procedures is crucial in enhancing holistic cognitive intelligence in the operating room (OR).
Previous works have primarily relied on multi-stage learning, where the generated semantic scene graphs depend on intermediate processes with pose estimation and object detection.
In this study, we introduce a novel single-stage bi-modal transformer framework for SGG in the OR, termed S2Former-OR.
arXiv Detail & Related papers (2024-02-22T11:40:49Z) - SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes [59.23385953161328]
Novel view synthesis for dynamic scenes is still a challenging problem in computer vision and graphics.
We propose a new representation that explicitly decomposes the motion and appearance of dynamic scenes into sparse control points and dense Gaussians.
Our method can enable user-controlled motion editing while retaining high-fidelity appearances.
arXiv Detail & Related papers (2023-12-04T11:57:14Z) - Skeleton2Humanoid: Animating Simulated Characters for Physically-plausible Motion In-betweening [59.88594294676711]
Modern deep learning based motion synthesis approaches barely consider the physical plausibility of synthesized motions.
We propose a system, "Skeleton2Humanoid", which performs physics-oriented motion correction at test time.
Experiments on the challenging LaFAN1 dataset show our system can outperform prior methods significantly in terms of both physical plausibility and accuracy.
arXiv Detail & Related papers (2022-10-09T16:15:34Z) - One to Many: Adaptive Instrument Segmentation via Meta Learning and Dynamic Online Adaptation in Robotic Surgical Video [71.43912903508765]
MDAL is a dynamic online adaptive learning scheme for instrument segmentation in robot-assisted surgery.
It learns general knowledge of instruments and a fast adaptation ability through a video-specific meta-learning paradigm.
It outperforms other state-of-the-art methods on two datasets.
arXiv Detail & Related papers (2021-03-24T05:02:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.