Related papers: SurGen: Text-Guided Diffusion Model for Surgical Video Generation

Improving Video Diffusion Transformer Training by Multi-Feature Fusion and Alignment from Self-Supervised Vision Encoders [59.98236644320787]
We show that training video diffusion models can benefit from aligning the intermediate features of the video generator with feature representations of pre-trained vision encoders. We present Align4Gen, which provides a novel multi-feature fusion and alignment method integrated into video diffusion model training.
arXiv Detail & Related papers (2025-09-11T15:39:27Z)
SurgLLM: A Versatile Large Multimodal Model with Spatial Focus and Temporal Awareness for Surgical Video Understanding [75.00667948967848]
The SurgLLM framework is a large multimodal model tailored for versatile surgical video understanding tasks. To empower the spatial focus of surgical videos, we first devise Surgical Context-aware Multimodal Pretraining (Surg-Pretrain) for the video encoder of SurgLLM. To incorporate surgical temporal knowledge into SurgLLM, we further propose Temporal-aware Multimodal Tuning (TM-Tuning) to enhance temporal reasoning with interleaved multimodal embeddings.
arXiv Detail & Related papers (2025-08-30T04:36:41Z)
EndoGen: Conditional Autoregressive Endoscopic Video Generation [51.97720772069513]
We propose the first conditional endoscopic video generation framework, namely EndoGen. Specifically, we build an autoregressive model with a tailored Spatiotemporal Grid-Frame Patterning strategy. We demonstrate the effectiveness of our framework in generating high-quality, conditionally guided endoscopic content.
arXiv Detail & Related papers (2025-07-23T10:32:20Z)
HieraSurg: Hierarchy-Aware Diffusion Model for Surgical Video Generation [44.37374628674769]
We propose HieraSurg, a hierarchy-aware surgical video generation framework consisting of two specialized diffusion models. The model exhibits particularly fine-grained adherence when provided with existing segmentation maps, suggesting its potential for practical surgical applications.
arXiv Detail & Related papers (2025-06-26T14:07:23Z)
SurgVidLM: Towards Multi-grained Surgical Video Understanding with Large Language Model [55.13206879750197]
SurgVidLM is the first video language model designed to address both full and fine-grained surgical video comprehension. We introduce the StageFocus mechanism, a two-stage framework that performs multi-grained, progressive understanding of surgical videos. Experimental results demonstrate that SurgVidLM significantly outperforms state-of-the-art Vid-LLMs in both full and fine-grained video understanding tasks.
arXiv Detail & Related papers (2025-06-22T02:16:18Z)
Mission Balance: Generating Under-represented Class Samples using Video Diffusion Models [1.5678321653327674]
We propose a two-stage, text-based method to generate high-fidelity surgical videos for under-represented classes. We evaluate our method on two downstream tasks: action recognition and intra-operative event prediction.
arXiv Detail & Related papers (2025-05-14T23:43:29Z)
Enhancing Surgical Documentation through Multimodal Visual-Temporal Transformers and Generative AI [15.513949299806582]
The automatic summarization of surgical videos is essential for enhancing procedural documentation, supporting surgical training, and facilitating post-operative analysis. We propose a multi-modal framework that leverages recent advancements in computer vision and large language models to generate comprehensive video summaries. We evaluate our method on the CholecT50 dataset, using instrument and action annotations from 50 laparoscopic videos.
arXiv Detail & Related papers (2025-04-28T15:46:02Z)
Towards Suturing World Models: Learning Predictive Models for Robotic Surgical Tasks [0.35087986342428684]
We introduce diffusion-based temporal models that capture the dynamics of fine-grained robotic sub-stitch actions. We fine-tune two state-of-the-art video diffusion models to generate high-fidelity surgical action sequences at $\ge$Lox resolution and $\ge$49 frames. Our experimental results demonstrate that these world models can effectively capture the dynamics of suturing, potentially enabling improved training, skill assessment tools, and autonomous surgical systems.
arXiv Detail & Related papers (2025-03-16T14:51:12Z)
EndoChat: Grounded Multimodal Large Language Model for Endoscopic Surgery [52.992415247012296]
We introduce EndoChat to address various dialogue paradigms and subtasks in surgical scene understanding. Our model achieves state-of-the-art performance across five dialogue paradigms and eight surgical scene understanding tasks.
arXiv Detail & Related papers (2025-01-20T09:12:06Z)
VISAGE: Video Synthesis using Action Graphs for Surgery [34.21344214645662]
We introduce the novel task of future video generation in laparoscopic surgery. Our proposed method, VISAGE, leverages the power of action scene graphs to capture the sequential nature of laparoscopic procedures. Results of our experiments demonstrate high-fidelity video generation for laparoscopy procedures.
arXiv Detail & Related papers (2024-10-23T10:28:17Z)
Surgical-LLaVA: Toward Surgical Scenario Understanding via Large Language and Vision Models [1.4042211166197214]
We introduce an LVLM specifically designed for surgical scenarios. We establish an LVLM, Surgical-LLaVA, fine-tuned on instruction-following data of surgical scenarios. Experiments demonstrate that Surgical-LLaVA exhibits impressive multi-modal chat abilities in surgical contexts.
arXiv Detail & Related papers (2024-10-13T07:12:35Z)
Bora: Biomedical Generalist Video Generation Model [20.572771714879856]
This paper introduces Bora, the first model designed for text-guided biomedical video generation. It is fine-tuned through model alignment and instruction tuning using a newly established medical video corpus. Bora is capable of generating high-quality video data across four distinct biomedical domains.
arXiv Detail & Related papers (2024-07-12T03:00:25Z)
Interactive Generation of Laparoscopic Videos with Diffusion Models [1.5488613349551188]
We show how to generate realistic laparoscopic images and videos by specifying a surgical action through text. We demonstrate the performance of our approach using the publicly available Cholec dataset family. We achieve an FID of 38.097 and an F1-score of 0.71.
arXiv Detail & Related papers (2024-04-23T12:36:07Z)
Endora: Video Generation Models as Endoscopy Simulators [53.72175969751398]
This paper introduces Endora, an innovative approach to generate medical videos that simulate clinical endoscopy scenes. We also pioneer the first public benchmark for endoscopy simulation with video generation models. Endora marks a notable breakthrough in the deployment of generative AI for clinical endoscopy research.
arXiv Detail & Related papers (2024-03-17T00:51:59Z)
RenAIssance: A Survey into AI Text-to-Image Generation in the Era of Large Model [93.8067369210696]
Text-to-image generation (TTI) refers to the use of models that can process text input and generate high-fidelity images based on text descriptions. Diffusion models are one prominent type of generative model, generating images through the systematic introduction of noise over repeated steps. In the era of large models, scaling up model size and integrating large language models have further improved the performance of TTI models.
arXiv Detail & Related papers (2023-09-02T03:27:20Z)
Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures [51.78027546947034]
Recent advancements in surgical computer vision have been driven by vision-only models, which lack language semantics. We propose leveraging surgical video lectures from e-learning platforms to provide effective vision and language supervisory signals. We address surgery-specific linguistic challenges using multiple automatic speech recognition systems for text transcriptions.
arXiv Detail & Related papers (2023-07-27T22:38:12Z)
XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models [60.437091462613544]
We introduce XrayGPT, a novel conversational medical vision-language model. It can analyze and answer open-ended questions about chest radiographs. We generate 217k interactive and high-quality summaries from free-text radiology reports.
arXiv Detail & Related papers (2023-06-13T17:59:59Z)
Imagen Video: High Definition Video Generation with Diffusion Models [64.06483414521222]
Imagen Video is a text-conditional video generation system based on a cascade of video diffusion models. We find Imagen Video capable of generating videos of high fidelity, but also having a high degree of controllability and world knowledge.
arXiv Detail & Related papers (2022-10-05T14:41:38Z)
Relational Graph Learning on Visual and Kinematics Embeddings for Accurate Gesture Recognition in Robotic Surgery [84.73764603474413]
We propose a novel online approach of multi-modal graph network (i.e., MRG-Net) to dynamically integrate visual and kinematics information. The effectiveness of our method is demonstrated with state-of-the-art results on the public JIGSAWS dataset.
arXiv Detail & Related papers (2020-11-03T11:00:10Z)

This list is automatically generated from the titles and abstracts of the papers on this site.