Surgical Instruction Generation with Transformers
- URL: http://arxiv.org/abs/2107.06964v1
- Date: Wed, 14 Jul 2021 19:54:50 GMT
- Title: Surgical Instruction Generation with Transformers
- Authors: Jinglu Zhang, Yinyu Nie, Jian Chang, and Jian Jun Zhang
- Abstract summary: We introduce a transformer-backboned encoder-decoder network with self-critical reinforcement learning to generate instructions from surgical images.
We evaluate the effectiveness of our method on the DAISI dataset, which includes 290 procedures from various medical disciplines.
- Score: 6.97857490403095
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic surgical instruction generation is a prerequisite for
intra-operative context-aware surgical assistance. However, generating
instructions from surgical scenes is challenging, as it requires jointly
understanding the surgical activity of the current view and modelling the
relationships between visual information and textual descriptions. Inspired by
neural machine translation and image captioning in the open domain, we
introduce a transformer-backboned encoder-decoder network with self-critical
reinforcement learning to generate instructions from surgical images. We
evaluate the effectiveness of our method on the DAISI dataset, which includes
290 procedures
from various medical disciplines. Our approach outperforms the existing
baseline across all caption evaluation metrics. The results demonstrate the
benefits of the transformer-backboned encoder-decoder structure in handling
multimodal context.
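The abstract pairs a transformer encoder-decoder with self-critical reinforcement learning, a policy-gradient recipe for captioning in which the reward of a sampled caption is baselined by the reward of the greedily decoded one. Below is a minimal PyTorch sketch of that loss term; the function name and tensor layout are illustrative assumptions, not the authors' code.

```python
import torch

def scst_loss(sample_logprobs, sample_reward, greedy_reward, mask):
    """Self-critical sequence training loss (Rennie et al., 2017).

    sample_logprobs: (B, T) log-probs of the tokens in the sampled caption
    sample_reward:   (B,) caption-metric score (e.g. CIDEr) of the samples
    greedy_reward:   (B,) same metric for greedy decoding, used as baseline
    mask:            (B, T) 1.0 for real tokens, 0.0 for padding
    """
    # Advantage: how much the sampled caption beat the greedy baseline.
    advantage = (sample_reward - greedy_reward).unsqueeze(1)   # (B, 1)
    # REINFORCE with a baseline: raise the log-probability of captions
    # that scored above the baseline, lower it for those below.
    loss = -(advantage * sample_logprobs * mask).sum() / mask.sum()
    return loss
```

In practice the rewards come from a caption metric such as CIDEr computed against the reference instruction, consistent with the caption evaluation metrics the paper reports.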
Related papers
- OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining [55.15365161143354]
OphCLIP is a hierarchical retrieval-augmented vision-language pretraining framework for ophthalmic surgical workflow understanding.
OphCLIP learns both fine-grained and long-term visual representations by aligning short video clips with detailed narrative descriptions and full videos with structured titles.
OphCLIP also employs retrieval-augmented pretraining to leverage large-scale silent surgical procedure videos, an otherwise underexplored resource (a generic sketch of the underlying video-text contrastive alignment follows this entry).
arXiv Detail & Related papers (2024-11-23T02:53:08Z)
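As a companion to the OphCLIP entry above, here is a minimal sketch of the symmetric video-text contrastive objective that CLIP-style pretraining builds on; applying it at two granularities (clip/narration and video/title) with equal weighting is an assumption, not OphCLIP's exact formulation.

```python
import torch
import torch.nn.functional as F

def video_text_contrastive(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired video/text embeddings.

    video_emb, text_emb: (B, D); row i of each tensor is a matched pair.
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                   # (B, B) similarities
    targets = torch.arange(v.size(0), device=v.device)
    # Match each video to its text and each text to its video.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))

# Hierarchical use (assumed): the same loss at two granularities.
# loss = video_text_contrastive(clip_emb, narration_emb) \
#      + video_text_contrastive(video_emb, title_emb)
```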
- Surgical Scene Segmentation by Transformer With Asymmetric Feature Enhancement [7.150163844454341]
Vision-specific transformer methods are a promising direction for surgical scene understanding.
We propose a novel Transformer-based framework with an Asymmetric Feature Enhancement module (TAFE).
The proposed method outperforms state-of-the-art methods on several surgical segmentation tasks and additionally demonstrates its ability to recognize fine-grained structures.
arXiv Detail & Related papers (2024-10-23T07:58:47Z)
- Procedure-Aware Surgical Video-language Pretraining with Hierarchical Knowledge Augmentation [51.222684687924215]
Surgical video-language pretraining faces unique challenges due to the knowledge domain gap and the scarcity of multi-modal data.
We propose a hierarchical knowledge augmentation approach and a novel Procedure-Encoded Surgical Knowledge-Augmented Video-Language Pretraining framework to tackle these issues.
arXiv Detail & Related papers (2024-09-30T22:21:05Z)
- Hypergraph-Transformer (HGT) for Interactive Event Prediction in Laparoscopic and Robotic Surgery [50.3022015601057]
We propose a predictive neural network that is capable of understanding and predicting critical interactive aspects of surgical workflow from intra-abdominal video.
We verify our approach on established surgical datasets and applications, including the detection and prediction of action triplets.
Our results demonstrate the superiority of our approach compared to unstructured alternatives.
arXiv Detail & Related papers (2024-02-03T00:58:05Z)
- Cataract-1K: Cataract Surgery Dataset for Scene Segmentation, Phase Recognition, and Irregularity Detection [5.47960852753243]
We present the largest cataract surgery video dataset, which addresses diverse requirements for building computerized surgical workflow analysis.
We validate the quality of annotations by benchmarking the performance of several state-of-the-art neural network architectures.
The dataset and annotations will be publicly available upon acceptance of the paper.
arXiv Detail & Related papers (2023-12-11T10:53:05Z)
- Event Recognition in Laparoscopic Gynecology Videos with Hybrid Transformers [4.371909393924804]
We introduce a dataset tailored for relevant event recognition in laparoscopic videos.
Our dataset includes annotations for critical events associated with major intra-operative challenges and post-operative complications.
We evaluate a hybrid transformer architecture coupled with a customized training-inference framework to recognize four specific events in laparoscopic surgery videos.
arXiv Detail & Related papers (2023-12-01T13:57:29Z)
- Visual-Kinematics Graph Learning for Procedure-agnostic Instrument Tip Segmentation in Robotic Surgeries [29.201385352740555]
We propose a novel visual-kinematics graph learning framework to accurately segment the instrument tip across various surgical procedures.
Specifically, a graph learning framework is proposed to encode relational features of instrument parts from both images and kinematics.
A cross-modal contrastive loss is designed to incorporate a robust geometric prior from kinematics into the image representation for tip segmentation (a generic sketch of one such graph layer follows this entry).
arXiv Detail & Related papers (2023-09-02T14:52:58Z)
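The visual-kinematics entry above combines graph learning over instrument parts with a cross-modal geometric prior. Below is a minimal sketch of one mean-aggregation message-passing layer over part nodes carrying concatenated visual and kinematic features; the layer sizes and aggregation scheme are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class PartGraphLayer(nn.Module):
    """One round of message passing over instrument-part nodes whose
    features concatenate visual and kinematic descriptors."""

    def __init__(self, vis_dim, kin_dim, hid_dim):
        super().__init__()
        self.encode = nn.Linear(vis_dim + kin_dim, hid_dim)
        self.message = nn.Linear(hid_dim, hid_dim)
        self.update = nn.Linear(2 * hid_dim, hid_dim)

    def forward(self, vis_feats, kin_feats, adj):
        # vis_feats: (N, vis_dim), kin_feats: (N, kin_dim),
        # adj: (N, N) 0/1 adjacency between instrument parts.
        h = torch.relu(self.encode(torch.cat([vis_feats, kin_feats], -1)))
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        neighbours = (adj @ self.message(h)) / deg   # mean over neighbours
        return torch.relu(self.update(torch.cat([h, neighbours], -1)))
```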
- Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures [51.78027546947034]
Recent advancements in surgical computer vision have been driven by vision-only models, which lack language semantics.
We propose leveraging surgical video lectures from e-learning platforms to provide effective vision and language supervisory signals.
We address surgery-specific linguistic challenges using multiple automatic speech recognition systems for text transcriptions.
arXiv Detail & Related papers (2023-07-27T22:38:12Z)
- Text Promptable Surgical Instrument Segmentation with Vision-Language Models [16.203166812021045]
We propose a novel text promptable surgical instrument segmentation approach to overcome challenges associated with diversity and differentiation of surgical instruments.
We leverage pretrained image and text encoders as our model backbone and design a text promptable mask decoder.
Experiments on several surgical instrument segmentation datasets demonstrate our model's superior performance and promising generalization capability (a generic sketch of a text-conditioned mask head follows this entry).
arXiv Detail & Related papers (2023-06-15T16:26:20Z)
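For the text-promptable segmentation entry above, here is a minimal sketch of a mask head that scores per-pixel image embeddings against a prompt embedding from a pretrained text encoder; the projection dimensions and cosine-similarity decoding are assumptions in the spirit of the approach, not the paper's decoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextPromptedMaskHead(nn.Module):
    """Produces mask logits by matching each pixel embedding against a
    text-prompt embedding (cosine similarity per pixel)."""

    def __init__(self, img_dim, txt_dim, emb_dim=256):
        super().__init__()
        self.img_proj = nn.Conv2d(img_dim, emb_dim, kernel_size=1)
        self.txt_proj = nn.Linear(txt_dim, emb_dim)

    def forward(self, img_feats, txt_emb):
        # img_feats: (B, img_dim, H, W) from a pretrained image encoder
        # txt_emb:   (B, txt_dim) from a pretrained text encoder
        pix = F.normalize(self.img_proj(img_feats), dim=1)   # (B, E, H, W)
        txt = F.normalize(self.txt_proj(txt_emb), dim=-1)    # (B, E)
        # Per-pixel similarity with the prompt -> (B, H, W) mask logits.
        return torch.einsum('behw,be->bhw', pix, txt)
```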
- Live image-based neurosurgical guidance and roadmap generation using unsupervised embedding [53.992124594124896]
We present a method for live image-only guidance leveraging a large data set of annotated neurosurgical videos.
A generated roadmap encodes the common anatomical paths taken in surgeries in the training set.
We trained and evaluated the proposed method with a data set of 166 transsphenoidal adenomectomy procedures.
arXiv Detail & Related papers (2023-03-31T12:52:24Z)
- Multimodal Semantic Scene Graphs for Holistic Modeling of Surgical Procedures [70.69948035469467]
We take advantage of the latest computer vision methodologies for generating 3D graphs from camera views.
We then introduce the Multimodal Semantic Scene Graph (MSSG), which aims to provide a unified symbolic and semantic representation of surgical procedures.
arXiv Detail & Related papers (2021-06-09T14:35:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.