Related papers: Chain-of-Look Spatial Reasoning for Dense Surgical Instrument Counting

Chain-of-Look Spatial Reasoning for Dense Surgical Instrument Counting

URL: http://arxiv.org/abs/2602.11024v1
Date: Wed, 11 Feb 2026 16:49:37 GMT
Title: Chain-of-Look Spatial Reasoning for Dense Surgical Instrument Counting
Authors: Rishikesh Bhyri, Brian R Quaranto, Philip J Seger, Kaity Tung, Brendan Fox, Gene Yang, Steven D. Schwaitzberg, Junsong Yuan, Nan Xi, Peter C W Kim,
Abstract summary: We introduce Chain-of-Look, a novel visual reasoning framework that mimics the sequential human counting process.<n>This visual chain guides the model to count along a coherent spatial trajectory, improving accuracy in complex scenes.<n>We also present SurgCount-HD, a new dataset comprising 1,464 high-density surgical instrument images.
Score: 15.430935719365793
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Accurate counting of surgical instruments in Operating Rooms (OR) is a critical prerequisite for ensuring patient safety during surgery. Despite recent progress of large visual-language models and agentic AI, accurately counting such instruments remains highly challenging, particularly in dense scenarios where instruments are tightly clustered. To address this problem, we introduce Chain-of-Look, a novel visual reasoning framework that mimics the sequential human counting process by enforcing a structured visual chain, rather than relying on classic object detection which is unordered. This visual chain guides the model to count along a coherent spatial trajectory, improving accuracy in complex scenes. To further enforce the physical plausibility of the visual chain, we introduce the neighboring loss function, which explicitly models the spatial constraints inherent to densely packed surgical instruments. We also present SurgCount-HD, a new dataset comprising 1,464 high-density surgical instrument images. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches for counting (e.g., CountGD, REC) as well as Multimodality Large Language Models (e.g., Qwen, ChatGPT) in the challenging task of dense surgical instrument counting.

Related papers

Future Slot Prediction for Unsupervised Object Discovery in Surgical Video [10.984331138780682]
Object-centric slot attention is an emerging paradigm for unsupervised learning of structured, interpretable object-centric representations.<n>Current approaches with an adaptive slot count perform well on images, but their performance on surgical videos is low.<n>We propose a dynamic temporal slot transformer (DTST) module that is trained both for temporal reasoning and for predicting the optimal future slot.
arXiv Detail & Related papers (2025-07-02T16:52:16Z)
SurgVidLM: Towards Multi-grained Surgical Video Understanding with Large Language Model [67.8359850515282]
SurgVidLM is the first video language model designed to address both full and fine-grained surgical video comprehension.<n>We show that SurgVidLM significantly outperforms state-of-the-art Vid-LLMs of comparable parameter scale in both full and fine-grained video understanding tasks.
arXiv Detail & Related papers (2025-06-22T02:16:18Z)
Surgical Foundation Model Leveraging Compression and Entropy Maximization for Image-Guided Surgical Assistance [50.486523249499115]
Real-time video understanding is critical to guide procedures in minimally invasive surgery (MIS)<n>We propose Compress-to-Explore (C2E), a novel self-supervised framework to learn compact, informative representations from surgical videos.<n>C2E uses entropy-maximizing decoders to compress images while preserving clinically relevant details, improving encoder performance without labeled data.
arXiv Detail & Related papers (2025-05-16T14:02:24Z)
Eye-gaze Guided Multi-modal Alignment for Medical Representation Learning [65.54680361074882]
Eye-gaze Guided Multi-modal Alignment (EGMA) framework harnesses eye-gaze data for better alignment of medical visual and textual features. We conduct downstream tasks of image classification and image-text retrieval on four medical datasets.
arXiv Detail & Related papers (2024-03-19T03:59:14Z)
Visual-Kinematics Graph Learning for Procedure-agnostic Instrument Tip Segmentation in Robotic Surgeries [29.201385352740555]
We propose a novel visual-kinematics graph learning framework to accurately segment the instrument tip given various surgical procedures. Specifically, a graph learning framework is proposed to encode relational features of instrument parts from both image and kinematics. A cross-modal contrastive loss is designed to incorporate robust geometric prior from kinematics to image for tip segmentation.
arXiv Detail & Related papers (2023-09-02T14:52:58Z)
GLSFormer : Gated - Long, Short Sequence Transformer for Step Recognition in Surgical Videos [57.93194315839009]
We propose a vision transformer-based approach to learn temporal features directly from sequence-level patches. We extensively evaluate our approach on two cataract surgery video datasets, Cataract-101 and D99, and demonstrate superior performance compared to various state-of-the-art methods.
arXiv Detail & Related papers (2023-07-20T17:57:04Z)
TraSeTR: Track-to-Segment Transformer with Contrastive Query for Instance-level Instrument Segmentation in Robotic Surgery [60.439434751619736]
We propose TraSeTR, a Track-to-Segment Transformer that exploits tracking cues to assist surgical instrument segmentation. TraSeTR jointly reasons about the instrument type, location, and identity with instance-level predictions. The effectiveness of our method is demonstrated with state-of-the-art instrument type segmentation results on three public datasets.
arXiv Detail & Related papers (2022-02-17T05:52:18Z)
FUN-SIS: a Fully UNsupervised approach for Surgical Instrument Segmentation [16.881624842773604]
We present FUN-SIS, a Fully-supervised approach for binary Surgical Instrument. We train a per-frame segmentation model on completely unlabelled endoscopic videos, by relying on implicit motion information and instrument shape-priors. The obtained fully-unsupervised results for surgical instrument segmentation are almost on par with the ones of fully-supervised state-of-the-art approaches.
arXiv Detail & Related papers (2022-02-16T15:32:02Z)
ST-MTL: Spatio-Temporal Multitask Learning Model to Predict Scanpath While Tracking Instruments in Robotic Surgery [14.47768738295518]
Learning of the task-oriented attention while tracking instrument holds vast potential in image-guided robotic surgery. We propose an end-to-end Multi-Task Learning (ST-MTL) model with a shared encoder and Sink-temporal decoders for the real-time surgical instrument segmentation and task-oriented saliency detection. We tackle the problem with a novel asynchronous-temporal optimization technique by calculating independent gradients for each decoder. Compared to the state-of-the-art segmentation and saliency methods, our model most outperforms the evaluation metrics and produces an outstanding performance in challenge
arXiv Detail & Related papers (2021-12-10T15:20:27Z)
Towards Unsupervised Learning for Instrument Segmentation in Robotic Surgery with Cycle-Consistent Adversarial Networks [54.00217496410142]
We propose an unpaired image-to-image translation where the goal is to learn the mapping between an input endoscopic image and a corresponding annotation. Our approach allows to train image segmentation models without the need to acquire expensive annotations. We test our proposed method on Endovis 2017 challenge dataset and show that it is competitive with supervised segmentation methods.
arXiv Detail & Related papers (2020-07-09T01:39:39Z)
Robust Medical Instrument Segmentation Challenge 2019 [56.148440125599905]
Intraoperative tracking of laparoscopic instruments is often a prerequisite for computer and robotic-assisted interventions. Our challenge was based on a surgical data set comprising 10,040 annotated images acquired from a total of 30 surgical procedures. The results confirm the initial hypothesis, namely that algorithm performance degrades with an increasing domain gap.
arXiv Detail & Related papers (2020-03-23T14:35:08Z)
Towards Better Surgical Instrument Segmentation in Endoscopic Vision: Multi-Angle Feature Aggregation and Contour Supervision [22.253074722129053]
We propose a general embeddable approach to improve current deep neural network (DNN) segmentation models. The proposed method is validated with ablation experiments on the novel Sinus-Surgery datasets collected from surgeons' operations.
arXiv Detail & Related papers (2020-02-25T05:28:46Z)

This list is automatically generated from the titles and abstracts of the papers in this site.