Surgical Text-to-Image Generation
- URL: http://arxiv.org/abs/2407.09230v1
- Date: Fri, 12 Jul 2024 12:49:11 GMT
- Title: Surgical Text-to-Image Generation
- Authors: Chinedu Innocent Nwoye, Rupak Bose, Kareem Elgohary, Lorenzo Arboit, Giorgio Carlino, Joël L. Lavanchy, Pietro Mascagni, Nicolas Padoy
- Abstract summary: We conduct an in-depth analysis of adapting text-to-image generative models to the surgical domain.
We investigate various language models and find that T5 offers more distinctive features for differentiating surgical actions from triplet-based textual inputs.
We develop Surgical Imagen to generate photorealistic and activity-aligned surgical images from triplet-based textual prompts.
- Score: 1.958913666074613
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Acquiring surgical data for research and development is significantly hindered by high annotation costs and practical and ethical constraints. Utilizing synthetically generated images could offer a valuable alternative. In this work, we conduct an in-depth analysis of adapting text-to-image generative models to the surgical domain, leveraging the CholecT50 dataset, which provides surgical images annotated with surgical action triplets (instrument, verb, target). We investigate various language models and find that T5 offers more distinctive features for differentiating surgical actions from triplet-based textual inputs. Our analysis demonstrates strong alignment between long and triplet-based captions, supporting the use of triplet-based labels. We address the challenge of training text-to-image models on triplet-based captions without additional input signals by first uncovering that triplet text embeddings are instrument-centric in the latent space, and then designing an instrument-based class-balancing technique that counteracts the imbalance and skewness of the surgical data and improves training convergence (a hedged sampling sketch follows below). Extending Imagen, a diffusion-based generative model, we develop Surgical Imagen to generate photorealistic and activity-aligned surgical images from triplet-based textual prompts. We evaluate the model with diverse metrics, including human expert surveys and automated measures such as FID and CLIP scores, and assess its performance on key aspects: quality, alignment, reasoning, knowledge, and robustness, demonstrating the effectiveness of our approach as a realistic alternative to real data collection.
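The instrument-centric class balancing described in the abstract can be approximated with a frequency-weighted sampler. A minimal sketch, assuming each training sample carries an (instrument, verb, target) label; the triplet list and the exact weighting are illustrative, not the paper's scheme.

```python
# Hedged sketch: oversample rare instruments so each instrument class is
# seen roughly equally often during training. Triplet labels are hypothetical.
from collections import Counter
import torch
from torch.utils.data import WeightedRandomSampler

triplets = [("grasper", "retract", "gallbladder"),
            ("grasper", "grasp", "gallbladder"),
            ("hook", "dissect", "cystic_duct"),
            ("clipper", "clip", "cystic_artery")]

inst_freq = Counter(inst for inst, _, _ in triplets)
weights = torch.tensor([1.0 / inst_freq[inst] for inst, _, _ in triplets],
                       dtype=torch.double)
sampler = WeightedRandomSampler(weights, num_samples=len(triplets),
                                replacement=True)
# Pass `sampler` to a DataLoader so rare instruments are drawn as often as
# common ones, which the paper reports improves training convergence.
```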
Related papers
- Realistic Surgical Image Dataset Generation Based On 3D Gaussian Splatting [3.5351922399745166]
This research introduces a novel method that employs 3D Gaussian Splatting to generate synthetic surgical datasets.
We developed a data recording system capable of acquiring images alongside tool and camera poses in a surgical scene.
Using this pose data, we synthetically replicate the scene, thereby enabling direct comparisons of synthetic image quality.
arXiv Detail & Related papers (2024-07-20T11:20:07Z)
- Surgical Triplet Recognition via Diffusion Model [59.50938852117371]
Surgical triplet recognition is an essential building block to enable next-generation context-aware operating rooms.
We propose Difft, a new generative framework for surgical triplet recognition employing a diffusion model.
Experiments on the CholecT45 and CholecT50 datasets show the superiority of the proposed method in achieving a new state-of-the-art performance for surgical triplet recognition.
arXiv Detail & Related papers (2024-06-19T04:43:41Z)
- Language Guided Domain Generalized Medical Image Segmentation [68.93124785575739]
Single-source domain generalization holds promise for more reliable and consistent image segmentation across real-world clinical settings.
We propose an approach that explicitly leverages textual information by incorporating a contrastive learning mechanism guided by the text encoder features.
Our approach achieves favorable performance against existing methods in the literature; a hedged loss sketch follows this entry.
arXiv Detail & Related papers (2024-04-01T17:48:15Z)
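One plausible reading of the text-guided contrastive mechanism above is an InfoNCE-style loss that pulls visual features toward frozen text-encoder embeddings of their class names. A minimal sketch; the shapes, temperature, and loss form are assumptions, not the paper's exact formulation.

```python
# Hedged sketch of a text-guided contrastive loss: each visual feature is
# classified against text embeddings of the class names (InfoNCE-style).
import torch
import torch.nn.functional as F

def text_guided_contrastive_loss(feats, labels, class_text_emb, tau=0.07):
    """feats: (N, D) visual features; labels: (N,) class ids;
    class_text_emb: (C, D) text-encoder embeddings of class names."""
    feats = F.normalize(feats, dim=1)
    texts = F.normalize(class_text_emb, dim=1)
    logits = feats @ texts.t() / tau  # (N, C) image-to-text similarities
    return F.cross_entropy(logits, labels)

# Toy usage with random tensors standing in for real encoder outputs.
loss = text_guided_contrastive_loss(
    torch.randn(8, 512), torch.randint(0, 4, (8,)), torch.randn(4, 512))
```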
- Navigating the Synthetic Realm: Harnessing Diffusion-based Models for Laparoscopic Text-to-Image Generation [3.2039076408339353]
We present an intuitive approach for generating synthetic laparoscopic images from short text prompts using diffusion-based generative models.
Results on fidelity and diversity demonstrate that diffusion-based models can acquire knowledge about the style and semantics of image-guided surgery (an inference sketch follows this entry).
arXiv Detail & Related papers (2023-12-05T16:20:22Z)
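For context, short-prompt generation with a diffusion model looks like the following diffusers inference sketch. The checkpoint is a generic Stable Diffusion model used purely for illustration, not the model trained in the paper above.

```python
# Illustrative inference only; requires a GPU and downloads a public checkpoint.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

prompt = "laparoscopic view, grasper retracting the gallbladder"
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("synthetic_laparoscopy.png")
```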
- Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate these methods with training datasets of paired chest X-rays and radiological reports of varying size; see the retrieval-metric sketch after this entry.
arXiv Detail & Related papers (2023-03-30T18:20:00Z)
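Text-to-image retrieval, the benchmark in the entry above, amounts to ranking image embeddings by similarity to each text embedding and checking whether the paired image lands in the top k. A self-contained sketch with random unit vectors standing in for real features:

```python
# Recall@k for text-to-image retrieval; row i of each matrix is a paired
# text/image embedding, assumed L2-normalized.
import numpy as np

def recall_at_k(text_emb, img_emb, k=5):
    sims = text_emb @ img_emb.T        # cosine similarities (unit vectors)
    ranks = np.argsort(-sims, axis=1)  # best-matching images first
    hits = [i in ranks[i, :k] for i in range(len(text_emb))]
    return float(np.mean(hits))

rng = np.random.default_rng(0)
t = rng.normal(size=(100, 512)); t /= np.linalg.norm(t, axis=1, keepdims=True)
v = rng.normal(size=(100, 512)); v /= np.linalg.norm(v, axis=1, keepdims=True)
print(recall_at_k(t, v, k=5))  # ~0.05 by chance for 100 random pairs
```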
- CholecTriplet2022: Show me a tool and tell me the triplet -- an endoscopic vision challenge for surgical action triplet detection [41.66666272822756]
This paper presents the CholecTriplet2022 challenge, which extends surgical action triplet modeling from recognition to detection.
It includes weakly-supervised bounding-box localization of every visible surgical instrument (or tool) as the key actors, and the modeling of each tool activity in the form of an <instrument, verb, target> triplet, illustrated by the sketch after this entry.
arXiv Detail & Related papers (2023-02-13T11:53:14Z)
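To make the detection target concrete, here is a hypothetical record in the <instrument, verb, target> form with a weakly-supervised instrument box; the field names and box convention are illustrative, not the challenge's official schema.

```python
# Hypothetical triplet-detection record, for illustration only.
from dataclasses import dataclass
from typing import Tuple

@dataclass
class TripletDetection:
    instrument: str  # e.g. "grasper" (the localized key actor)
    verb: str        # e.g. "retract"
    target: str      # e.g. "gallbladder"
    box: Tuple[float, float, float, float]  # instrument (x, y, w, h), normalized
    score: float     # detection confidence

pred = TripletDetection("grasper", "retract", "gallbladder",
                        box=(0.42, 0.31, 0.18, 0.22), score=0.87)
```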
- Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
arXiv Detail & Related papers (2023-01-11T16:35:33Z)
- RoentGen: Vision-Language Foundation Model for Chest X-ray Generation [7.618389245539657]
We develop a strategy to overcome the large natural-medical distributional shift by adapting a pre-trained latent diffusion model on a corpus of publicly available chest X-rays.
We investigate the model's ability to generate high-fidelity, diverse synthetic CXR images conditioned on text prompts.
We present evidence that the resulting model (RoentGen) is able to create visually convincing, diverse synthetic CXR images; a fine-tuning sketch follows this entry.
arXiv Detail & Related papers (2022-11-23T06:58:09Z)
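Adapting a pre-trained latent diffusion model to a medical domain, as RoentGen does, typically means fine-tuning the denoising UNet on domain image-text pairs. A hedged single-training-step sketch using Hugging Face diffusers; the base checkpoint, learning rate, and data handling are assumptions, not the paper's recipe.

```python
# Hedged sketch of one fine-tuning step for a Stable-Diffusion-style latent
# diffusion model on domain image-text pairs (e.g., CXRs and reports).
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTextModel, CLIPTokenizer

repo = "runwayml/stable-diffusion-v1-5"  # assumed base checkpoint
vae = AutoencoderKL.from_pretrained(repo, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet")
text_encoder = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder")
tokenizer = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")
scheduler = DDPMScheduler.from_pretrained(repo, subfolder="scheduler")
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-5)

def train_step(pixel_values, captions):
    # Encode images into the VAE latent space, scaled as in SD training.
    latents = vae.encode(pixel_values).latent_dist.sample()
    latents = latents * vae.config.scaling_factor
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps, (latents.shape[0],))
    noisy = scheduler.add_noise(latents, noise, t)
    # Condition the UNet's denoising on the paired report/caption.
    tokens = tokenizer(captions, padding="max_length", truncation=True,
                       max_length=tokenizer.model_max_length, return_tensors="pt")
    text_emb = text_encoder(tokens.input_ids)[0]
    pred = unet(noisy, t, encoder_hidden_states=text_emb).sample
    loss = torch.nn.functional.mse_loss(pred, noise)  # standard noise objective
    loss.backward(); optimizer.step(); optimizer.zero_grad()
    return loss.item()
```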
- Rethinking Surgical Instrument Segmentation: A Background Image Can Be All You Need [18.830738606514736]
Data scarcity and imbalance have heavily affected model accuracy and limited the design and deployment of deep-learning-based surgical applications.
We propose a one-to-many data generation solution that gets rid of the complicated and expensive process of data collection and annotation from robotic surgery.
Our empirical analysis suggests that decent surgical instrument segmentation performance can be achieved without the high cost of data collection and annotation (a compositing sketch follows this entry).
arXiv Detail & Related papers (2022-06-23T16:22:56Z)
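The one-to-many idea above can be illustrated by alpha-compositing instrument foregrounds (with masks) onto background frames, which yields a free segmentation label per paste. A simplified sketch; file paths and placement are placeholders, and the paper's blending pipeline may differ.

```python
# Paste a masked tool crop onto a background; the paste mask doubles as the
# segmentation label. Assumes the crop fits inside the background at (x, y).
import numpy as np
import cv2

def composite(background_path, tool_path, mask_path, x, y):
    bg = cv2.imread(background_path).astype(np.float32)
    tool = cv2.imread(tool_path).astype(np.float32)
    mask = cv2.imread(mask_path, cv2.IMREAD_GRAYSCALE).astype(np.float32) / 255.0
    h, w = tool.shape[:2]
    roi = bg[y:y + h, x:x + w]
    roi[:] = mask[..., None] * tool + (1.0 - mask[..., None]) * roi  # alpha blend
    label = np.zeros(bg.shape[:2], dtype=np.uint8)
    label[y:y + h, x:x + w] = (mask > 0.5).astype(np.uint8)
    return bg.astype(np.uint8), label

image, seg_label = composite("background.png", "tool.png", "tool_mask.png", 120, 80)
```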
- Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)
- Towards Unsupervised Learning for Instrument Segmentation in Robotic Surgery with Cycle-Consistent Adversarial Networks [54.00217496410142]
We propose an unpaired image-to-image translation approach whose goal is to learn the mapping between an input endoscopic image and a corresponding annotation.
Our approach allows training image segmentation models without the need to acquire expensive annotations.
We test the proposed method on the EndoVis 2017 challenge dataset and show that it is competitive with supervised segmentation methods; a cycle-consistency sketch follows this entry.
arXiv Detail & Related papers (2020-07-09T01:39:39Z)
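The cycle-consistency constraint behind this unpaired translation can be written compactly: a generator G maps images to annotations, F maps annotations back to images, and both round trips are penalized in L1. The toy generators below are stand-ins for the paper's networks.

```python
# Cycle-consistency loss sketch with toy single-layer "generators".
import torch
import torch.nn as nn

G = nn.Conv2d(3, 3, 3, padding=1)   # image -> annotation (stand-in)
F_ = nn.Conv2d(3, 3, 3, padding=1)  # annotation -> image (stand-in)
l1 = nn.L1Loss()

def cycle_loss(x_img, y_ann, lam=10.0):
    # Round-trip reconstructions in both directions.
    return lam * (l1(F_(G(x_img)), x_img) + l1(G(F_(y_ann)), y_ann))

loss = cycle_loss(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64))
loss.backward()  # gradients flow into both generators
```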
This list is automatically generated from the titles and abstracts of the papers on this site.