SurGen: Text-Guided Diffusion Model for Surgical Video Generation
- URL: http://arxiv.org/abs/2408.14028v3
- Date: Tue, 24 Sep 2024 23:23:50 GMT
- Title: SurGen: Text-Guided Diffusion Model for Surgical Video Generation
- Authors: Joseph Cho, Samuel Schmidgall, Cyril Zakka, Mrudang Mathur, Dhamanpreet Kaur, Rohan Shad, William Hiesinger,
- Abstract summary: SurGen is a text-guided diffusion model tailored for surgical video synthesis.
We validate the visual and temporal quality of the outputs using standard image and video generation metrics.
Our results demonstrate the potential of diffusion models to serve as valuable educational tools for surgical trainees.
- Score: 0.6551407780976953
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Diffusion-based video generation models have made significant strides, producing outputs with improved visual fidelity, temporal coherence, and user control. These advancements hold great promise for improving surgical education by enabling more realistic, diverse, and interactive simulation environments. In this study, we introduce SurGen, a text-guided diffusion model tailored for surgical video synthesis. SurGen produces videos with the highest resolution and longest duration among existing surgical video generation models. We validate the visual and temporal quality of the outputs using standard image and video generation metrics. Additionally, we assess their alignment to the corresponding text prompts through a deep learning classifier trained on surgical data. Our results demonstrate the potential of diffusion models to serve as valuable educational tools for surgical trainees.
Related papers
- VISAGE: Video Synthesis using Action Graphs for Surgery [34.21344214645662]
We introduce the novel task of future video generation in laparoscopic surgery.
Our proposed method, VISAGE, leverages the power of action scene graphs to capture the sequential nature of laparoscopic procedures.
Results of our experiments demonstrate high-fidelity video generation for laparoscopy procedures.
arXiv Detail & Related papers (2024-10-23T10:28:17Z) - Surgical-LLaVA: Toward Surgical Scenario Understanding via Large Language and Vision Models [1.4042211166197214]
We introduce an LVLM specifically designed for surgical scenarios.
We establish a LVLM model, Surgical-LLaVA, fine-tuned on instruction following data of surgical scenarios.
Experiments demonstrate that Surgical-LLaVA exhibits impressive multi-modal chat abilities in surgical contexts.
arXiv Detail & Related papers (2024-10-13T07:12:35Z) - Bora: Biomedical Generalist Video Generation Model [20.572771714879856]
This paper introduces Bora, first model designed for text-guided biomedical video generation.
It is fine-tuned through model alignment and instruction tuning using a newly established medical video corpus.
Bora is capable of generating high-quality video data across four distinct biomedical domains.
arXiv Detail & Related papers (2024-07-12T03:00:25Z) - Interactive Generation of Laparoscopic Videos with Diffusion Models [1.5488613349551188]
We show how to generate realistic laparoscopic images and videos by specifying a surgical action through text.
We demonstrate the performance of our approach using the publicly available Cholec dataset family.
We achieve an FID of 38.097 and an F1-score of 0.71.
arXiv Detail & Related papers (2024-04-23T12:36:07Z) - Endora: Video Generation Models as Endoscopy Simulators [53.72175969751398]
This paper introduces model, an innovative approach to generate medical videos that simulate clinical endoscopy scenes.
We also pioneer the first public benchmark for endoscopy simulation with video generation models.
Endora marks a notable breakthrough in the deployment of generative AI for clinical endoscopy research.
arXiv Detail & Related papers (2024-03-17T00:51:59Z) - RenAIssance: A Survey into AI Text-to-Image Generation in the Era of
Large Model [93.8067369210696]
Text-to-image generation (TTI) refers to the usage of models that could process text input and generate high fidelity images based on text descriptions.
Diffusion models are one prominent type of generative model used for the generation of images through the systematic introduction of noises with repeating steps.
In the era of large models, scaling up model size and the integration with large language models have further improved the performance of TTI models.
arXiv Detail & Related papers (2023-09-02T03:27:20Z) - Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures [51.78027546947034]
Recent advancements in surgical computer vision have been driven by vision-only models, which lack language semantics.
We propose leveraging surgical video lectures from e-learning platforms to provide effective vision and language supervisory signals.
We address surgery-specific linguistic challenges using multiple automatic speech recognition systems for text transcriptions.
arXiv Detail & Related papers (2023-07-27T22:38:12Z) - XrayGPT: Chest Radiographs Summarization using Medical Vision-Language
Models [60.437091462613544]
We introduce XrayGPT, a novel conversational medical vision-language model.
It can analyze and answer open-ended questions about chest radiographs.
We generate 217k interactive and high-quality summaries from free-text radiology reports.
arXiv Detail & Related papers (2023-06-13T17:59:59Z) - Imagen Video: High Definition Video Generation with Diffusion Models [64.06483414521222]
Imagen Video is a text-conditional video generation system based on a cascade of video diffusion models.
We find Imagen Video capable of generating videos of high fidelity, but also having a high degree of controllability and world knowledge.
arXiv Detail & Related papers (2022-10-05T14:41:38Z) - Relational Graph Learning on Visual and Kinematics Embeddings for
Accurate Gesture Recognition in Robotic Surgery [84.73764603474413]
We propose a novel online approach of multi-modal graph network (i.e., MRG-Net) to dynamically integrate visual and kinematics information.
The effectiveness of our method is demonstrated with state-of-the-art results on the public JIGSAWS dataset.
arXiv Detail & Related papers (2020-11-03T11:00:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.