Video-based Surgical Tool-tip and Keypoint Tracking using Multi-frame Context-driven Deep Learning Models
- URL: http://arxiv.org/abs/2501.18361v1
- Date: Thu, 30 Jan 2025 14:06:19 GMT
- Title: Video-based Surgical Tool-tip and Keypoint Tracking using Multi-frame Context-driven Deep Learning Models
- Authors: Bhargav Ghanekar, Lianne R. Johnson, Jacob L. Laughlin, Marcia K. O'Malley, Ashok Veeraraghavan,
- Abstract summary: We propose a novel, multi-frame context-driven deep learning framework to localize and track tool keypoints in surgical videos.<n>We achieve 90% keypoint detection accuracy and a localization RMS error of 5.27 pixels.<n>Such a framework paves the way for accurately tracking surgical instrument keypoints, enabling further downstream use cases.
- Score: 13.395436436900399
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automated tracking of surgical tool keypoints in robotic surgery videos is an essential task for various downstream use cases such as skill assessment, expertise assessment, and the delineation of safety zones. In recent years, the explosion of deep learning for vision applications has led to many works in surgical instrument segmentation, while lesser focus has been on tracking specific tool keypoints, such as tool tips. In this work, we propose a novel, multi-frame context-driven deep learning framework to localize and track tool keypoints in surgical videos. We train and test our models on the annotated frames from the 2015 EndoVis Challenge dataset, resulting in state-of-the-art performance. By leveraging sophisticated deep learning models and multi-frame context, we achieve 90\% keypoint detection accuracy and a localization RMS error of 5.27 pixels. Results on a self-annotated JIGSAWS dataset with more challenging scenarios also show that the proposed multi-frame models can accurately track tool-tip and tool-base keypoints, with ${<}4.2$-pixel RMS error overall. Such a framework paves the way for accurately tracking surgical instrument keypoints, enabling further downstream use cases. Project and dataset webpage: https://tinyurl.com/mfc-tracker
Related papers
- Enhancing Surgical Documentation through Multimodal Visual-Temporal Transformers and Generative AI [15.513949299806582]
The automatic summarization of surgical videos is essential for enhancing procedural documentation, supporting surgical training, and facilitating post-operative analysis.
We propose a multi-modal framework that leverages recent advancements in computer vision and large language models to generate comprehensive video summaries.
We evaluate our method on the CholecT50 dataset, using instrument and action annotations from 50 laparoscopic videos.
arXiv Detail & Related papers (2025-04-28T15:46:02Z) - Keypoint Abstraction using Large Models for Object-Relative Imitation Learning [78.92043196054071]
Generalization to novel object configurations and instances across diverse tasks and environments is a critical challenge in robotics.
Keypoint-based representations have been proven effective as a succinct representation for essential object capturing features.
We propose KALM, a framework that leverages large pre-trained vision-language models to automatically generate task-relevant and cross-instance consistent keypoints.
arXiv Detail & Related papers (2024-10-30T17:37:31Z) - DivScene: Benchmarking LVLMs for Object Navigation with Diverse Scenes and Objects [84.73092715537364]
In this paper, we study a new task of navigating to diverse target objects in a large number of scene types.
We build an end-to-end embodied agent, NatVLM, by fine-tuning a Large Vision Language Model (LVLM) through imitation learning.
Our agent achieves a success rate that surpasses GPT-4o by over 20%.
arXiv Detail & Related papers (2024-10-03T17:49:28Z) - SurgiTrack: Fine-Grained Multi-Class Multi-Tool Tracking in Surgical Videos [2.3810490830445183]
This work sets a new standard in surgical tool tracking, providing dynamic trajectories for more adaptable and precise assistance in minimally invasive surgeries.<n>We propose SurgiTrack, a novel deep learning method that leverages YOLOv7 for precise tool detection and employs an attention mechanism to model the originating direction of the tools, as a proxy to their operators, for tool re-identification.
arXiv Detail & Related papers (2024-05-30T17:59:10Z) - Segment Anything Meets Point Tracking [116.44931239508578]
This paper presents a novel method for point-centric interactive video segmentation, empowered by SAM and long-term point tracking.
We highlight the merits of point-based tracking through direct evaluation on the zero-shot open-world Unidentified Video Objects (UVO) benchmark.
Our experiments on popular video object segmentation and multi-object segmentation tracking benchmarks, including DAVIS, YouTube-VOS, and BDD100K, suggest that a point-based segmentation tracker yields better zero-shot performance and efficient interactions.
arXiv Detail & Related papers (2023-07-03T17:58:01Z) - SurgMAE: Masked Autoencoders for Long Surgical Video Analysis [4.866110274299399]
Masked autoencoders (MAE) got the attention in self-supervised paradigm for Vision Transformers (ViTs)
In this paper, we first investigate whether MAE can learn transferrable representations in surgical video domain.
We propose SurgMAE, which is a novel architecture with a masking strategy on sampling high-temporal tokens for MAE.
arXiv Detail & Related papers (2023-05-19T06:12:50Z) - End-to-end Reinforcement Learning of Robotic Manipulation with Robust
Keypoints Representation [7.374994747693731]
We present an end-to-end Reinforcement Learning framework for robotic manipulation tasks, using a robust and efficient keypoints representation.
The proposed method learns keypoints from camera images as the state representation, through a self-supervised autoencoder architecture.
We demonstrate the effectiveness of the proposed method on robotic manipulation tasks including grasping and pushing, in different scenarios.
arXiv Detail & Related papers (2022-02-12T09:58:09Z) - MetaGraspNet: A Large-Scale Benchmark Dataset for Vision-driven Robotic
Grasping via Physics-based Metaverse Synthesis [78.26022688167133]
We present a large-scale benchmark dataset for vision-driven robotic grasping via physics-based metaverse synthesis.
The proposed dataset contains 100,000 images and 25 different object types.
We also propose a new layout-weighted performance metric alongside the dataset for evaluating object detection and segmentation performance.
arXiv Detail & Related papers (2021-12-29T17:23:24Z) - HighlightMe: Detecting Highlights from Human-Centric Videos [52.84233165201391]
We present a domain- and user-preference-agnostic approach to detect highlightable excerpts from human-centric videos.
We use an autoencoder network equipped with spatial-temporal graph convolutions to detect human activities and interactions.
We observe a 4-12% improvement in the mean average precision of matching the human-annotated highlights over state-of-the-art methods.
arXiv Detail & Related papers (2021-10-05T01:18:15Z) - One to Many: Adaptive Instrument Segmentation via Meta Learning and
Dynamic Online Adaptation in Robotic Surgical Video [71.43912903508765]
MDAL is a dynamic online adaptive learning scheme for instrument segmentation in robot-assisted surgery.
It learns the general knowledge of instruments and the fast adaptation ability through the video-specific meta-learning paradigm.
It outperforms other state-of-the-art methods on two datasets.
arXiv Detail & Related papers (2021-03-24T05:02:18Z) - Pose Estimation for Robot Manipulators via Keypoint Optimization and
Sim-to-Real Transfer [10.369766652751169]
Keypoint detection is an essential building block for many robotic applications.
Deep learning methods have the ability to detect user-defined keypoints in a marker-less manner.
We propose a new and autonomous way to define the keypoint locations that overcomes these challenges.
arXiv Detail & Related papers (2020-10-15T22:38:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.