Pixel-Wise Recognition for Holistic Surgical Scene Understanding
- URL: http://arxiv.org/abs/2401.11174v2
- Date: Fri, 26 Jan 2024 04:24:07 GMT
- Title: Pixel-Wise Recognition for Holistic Surgical Scene Understanding
- Authors: Nicolás Ayobi and Santiago Rodríguez and Alejandra Pérez and
Isabela Hernández and Nicolás Aparicio and Eugénie Dessevres and
Sebastián Peña and Jessica Santander and Juan Ignacio Caicedo and
Nicolás Fernández and Pablo Arbeláez
- Abstract summary: This paper presents the Holistic and Multi-Granular Surgical Scene Understanding of Prostatectomies (GraSP) dataset.
GraSP is a curated benchmark that models surgical scene understanding as a hierarchy of complementary tasks with varying levels of granularity.
We introduce the Transformers for Actions, Phases, Steps, and Instrument Segmentation (TAPIS) model, a general architecture that combines a global video feature extractor with localized region proposals from an instrument segmentation model.
- Score: 31.338288460529046
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents the Holistic and Multi-Granular Surgical Scene
Understanding of Prostatectomies (GraSP) dataset, a curated benchmark that
models surgical scene understanding as a hierarchy of complementary tasks with
varying levels of granularity. Our approach enables a multi-level comprehension
of surgical activities, encompassing long-term tasks such as surgical phases
and steps recognition and short-term tasks including surgical instrument
segmentation and atomic visual actions detection. To exploit our proposed
benchmark, we introduce the Transformers for Actions, Phases, Steps, and
Instrument Segmentation (TAPIS) model, a general architecture that combines a
global video feature extractor with localized region proposals from an
instrument segmentation model to tackle the multi-granularity of our benchmark.
Through extensive experimentation, we demonstrate the impact of including
segmentation annotations in short-term recognition tasks, highlight the varying
granularity requirements of each task, and establish TAPIS's superiority over
previously proposed baselines and conventional CNN-based models. Additionally,
we validate the robustness of our method across multiple public benchmarks,
confirming the reliability and applicability of our dataset. This work
represents a significant step forward in Endoscopic Vision, offering a novel
and comprehensive framework for future research towards a holistic
understanding of surgical procedures.
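Reading the abstract, TAPIS pairs a clip-level video representation (for long-term phase and step recognition) with per-instrument region features (for instrument segmentation and atomic action detection). The snippet below is a minimal, hypothetical sketch of such a fusion head; the module names, feature dimensions, and class counts are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of the two-stream idea described in the abstract: a
# global video embedding drives the long-term heads, while region features
# from an instrument segmentation model drive the short-term heads.
# Dimensions and class counts are placeholders, not GraSP's actual label spaces.
import torch
import torch.nn as nn


class TwoStreamSurgicalHead(nn.Module):
    def __init__(self, video_dim=768, region_dim=256, hidden=512,
                 num_phases=10, num_steps=20, num_instruments=7, num_actions=16):
        super().__init__()
        # Long-term heads operate on the clip-level (global) video embedding.
        self.phase_head = nn.Linear(video_dim, num_phases)
        self.step_head = nn.Linear(video_dim, num_steps)
        # Short-term heads operate on per-region features, conditioned on the
        # global context via simple concatenation.
        self.fuse = nn.Sequential(
            nn.Linear(video_dim + region_dim, hidden), nn.ReLU())
        self.instrument_head = nn.Linear(hidden, num_instruments)
        self.action_head = nn.Linear(hidden, num_actions)

    def forward(self, clip_embedding, region_features):
        # clip_embedding: (B, video_dim), e.g. pooled output of a video transformer
        # region_features: (B, R, region_dim), e.g. R instrument region proposals
        phase_logits = self.phase_head(clip_embedding)
        step_logits = self.step_head(clip_embedding)
        context = clip_embedding.unsqueeze(1).expand(-1, region_features.size(1), -1)
        fused = self.fuse(torch.cat([context, region_features], dim=-1))
        return (phase_logits, step_logits,
                self.instrument_head(fused), self.action_head(fused))


# Toy usage with random tensors standing in for real backbone outputs.
head = TwoStreamSurgicalHead()
clip_emb = torch.randn(2, 768)    # stand-in for a video transformer embedding
regions = torch.randn(2, 5, 256)  # stand-in for segmentation-model proposals
phases, steps, instruments, actions = head(clip_emb, regions)
print(phases.shape, steps.shape, instruments.shape, actions.shape)
```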
Related papers
- Hypergraph-Transformer (HGT) for Interactive Event Prediction in
Laparoscopic and Robotic Surgery [50.3022015601057]
We propose a predictive neural network that is capable of understanding and predicting critical interactive aspects of surgical workflow from intra-abdominal video.
We verify our approach on established surgical datasets and applications, including the detection and prediction of action triplets.
Our results demonstrate the superiority of our approach compared to unstructured alternatives.
arXiv Detail & Related papers (2024-02-03T00:58:05Z) - SAR-RARP50: Segmentation of surgical instrumentation and Action
Recognition on Robot-Assisted Radical Prostatectomy Challenge [72.97934765570069]
We release the first multimodal, publicly available, in-vivo dataset for surgical action recognition and semantic instrumentation segmentation, containing 50 suturing video segments of Robotic Assisted Radical Prostatectomy (RARP).
The aim of the challenge is to enable researchers to leverage the scale of the provided dataset and develop robust and highly accurate single-task action recognition and tool segmentation approaches in the surgical domain.
A total of 12 teams participated in the challenge, contributing 7 action recognition methods, 9 instrument segmentation techniques, and 4 multitask approaches that integrated both action recognition and instrument segmentation.
arXiv Detail & Related papers (2023-12-31T13:32:18Z) - GLSFormer : Gated - Long, Short Sequence Transformer for Step
Recognition in Surgical Videos [57.93194315839009]
We propose a vision transformer-based approach to learn temporal features directly from sequence-level patches.
We extensively evaluate our approach on two cataract surgery video datasets, Cataract-101 and D99, and demonstrate superior performance compared to various state-of-the-art methods.
arXiv Detail & Related papers (2023-07-20T17:57:04Z) - Text Promptable Surgical Instrument Segmentation with Vision-Language
Models [16.203166812021045]
We propose a novel text promptable surgical instrument segmentation approach to overcome challenges associated with diversity and differentiation of surgical instruments.
We leverage pretrained image and text encoders as our model backbone and design a text promptable mask decoder.
Experiments on several surgical instrument segmentation datasets demonstrate our model's superior performance and promising generalization capability.
arXiv Detail & Related papers (2023-06-15T16:26:20Z) - Towards Holistic Surgical Scene Understanding [1.004785607987398]
We present a new experimental framework towards holistic surgical scene understanding.
First, we introduce the Phase, Step, Instrument, and Atomic Visual Action recognition (PSI-AVA) dataset.
Second, we present Transformers for Action, Phase, Instrument, and Steps Recognition (TAPIR) as a strong baseline for surgical scene understanding.
arXiv Detail & Related papers (2022-12-08T22:15:27Z) - CholecTriplet2021: A benchmark challenge for surgical action triplet
recognition [66.51610049869393]
This paper presents CholecTriplet2021: an endoscopic vision challenge organized at MICCAI 2021 for the recognition of surgical action triplets in laparoscopic videos.
We present the challenge setup and assessment of the state-of-the-art deep learning methods proposed by the participants during the challenge.
A total of 4 baseline methods and 19 new deep learning algorithms are presented to recognize surgical action triplets directly from surgical videos, achieving mean average precision (mAP) ranging from 4.2% to 38.1%.
arXiv Detail & Related papers (2022-04-10T18:51:55Z) - FUN-SIS: a Fully UNsupervised approach for Surgical Instrument
Segmentation [16.881624842773604]
We present FUN-SIS, a Fully UNsupervised approach for binary Surgical Instrument Segmentation.
We train a per-frame segmentation model on completely unlabelled endoscopic videos, by relying on implicit motion information and instrument shape-priors.
The obtained fully-unsupervised results for surgical instrument segmentation are almost on par with the ones of fully-supervised state-of-the-art approaches.
arXiv Detail & Related papers (2022-02-16T15:32:02Z) - Multimodal Semantic Scene Graphs for Holistic Modeling of Surgical
Procedures [70.69948035469467]
We take advantage of the latest computer vision methodologies for generating 3D graphs from camera views.
We then introduce the Multimodal Semantic Scene Graph (MSSG), which aims at providing a unified symbolic and semantic representation of surgical procedures.
arXiv Detail & Related papers (2021-06-09T14:35:44Z) - Simulation-to-Real domain adaptation with teacher-student learning for
endoscopic instrument segmentation [1.1047993346634768]
We introduce a teacher-student learning approach that learns jointly from annotated simulation data and unlabeled real data (a generic sketch of this scheme appears after this list).
Empirical results on three datasets highlight the effectiveness of the proposed framework.
arXiv Detail & Related papers (2021-03-02T09:30:28Z) - Robust Medical Instrument Segmentation Challenge 2019 [56.148440125599905]
Intraoperative tracking of laparoscopic instruments is often a prerequisite for computer- and robotic-assisted interventions.
Our challenge was based on a surgical data set comprising 10,040 annotated images acquired from a total of 30 surgical procedures.
The results confirm the initial hypothesis, namely that algorithm performance degrades with an increasing domain gap.
arXiv Detail & Related papers (2020-03-23T14:35:08Z)
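The simulation-to-real entry above describes teacher-student learning from annotated simulation frames and unlabelled real frames. The following is a minimal, generic sketch of that kind of training scheme, with an EMA teacher producing pseudo-labels on real data; the loop, loss weighting, and toy model are common-practice assumptions rather than that paper's exact procedure.

```python
# Generic teacher-student sim-to-real sketch for instrument segmentation:
# the student learns from labelled simulation frames plus teacher
# pseudo-labels on unlabelled real frames; the teacher is an EMA copy of the
# student. Shown for illustration only, not the cited paper's implementation.
import copy
import torch
import torch.nn.functional as F


@torch.no_grad()
def ema_update(teacher, student, momentum=0.99):
    # Teacher weights track the student as an exponential moving average.
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)


def train_step(student, teacher, optimizer, sim_images, sim_masks,
               real_images, pseudo_weight=0.5):
    # Supervised loss on annotated simulation data.
    loss_sim = F.cross_entropy(student(sim_images), sim_masks)

    # Pseudo-labels from the teacher on unlabelled real data.
    with torch.no_grad():
        pseudo_masks = teacher(real_images).argmax(dim=1)
    loss_real = F.cross_entropy(student(real_images), pseudo_masks)

    loss = loss_sim + pseudo_weight * loss_real
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(teacher, student)
    return loss.item()


# Toy usage: a 1x1-conv "segmentation model" and random tensors as stand-ins.
student = torch.nn.Conv2d(3, 2, kernel_size=1)
teacher = copy.deepcopy(student)
optimizer = torch.optim.SGD(student.parameters(), lr=0.01)
sim_x, sim_y = torch.randn(2, 3, 64, 64), torch.randint(0, 2, (2, 64, 64))
real_x = torch.randn(2, 3, 64, 64)
print(train_step(student, teacher, optimizer, sim_x, sim_y, real_x))
```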