YOLOv10-Based Multi-Task Framework for Hand Localization and Laterality Classification in Surgical Videos
- URL: http://arxiv.org/abs/2602.18959v1
- Date: Sat, 21 Feb 2026 21:41:56 GMT
- Title: YOLOv10-Based Multi-Task Framework for Hand Localization and Laterality Classification in Surgical Videos
- Authors: Kedi Sun, Le Zhang
- Abstract summary: We propose a framework that simultaneously localizes hands and classifies their laterality in complex surgical scenes. The model is trained on the Trauma THOMPSON Challenge 2025 Task 2 dataset.
- Score: 5.504955093712013
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Real-time hand tracking in trauma surgery is essential for supporting rapid and precise intraoperative decisions. We propose a YOLOv10-based framework that simultaneously localizes hands and classifies their laterality (left or right) in complex surgical scenes. The model is trained on the Trauma THOMPSON Challenge 2025 Task 2 dataset, consisting of first-person surgical videos with annotated hand bounding boxes. Extensive data augmentation and a multi-task detection design improve robustness against motion blur, lighting variations, and diverse hand appearances. Evaluation shows left-hand and right-hand classification accuracies of 67% and 71%, respectively, while distinguishing hands from the background remains challenging. The model achieves an $mAP_{[0.5:0.95]}$ of 0.33 and maintains real-time inference, highlighting its potential for intraoperative deployment. This work establishes a foundation for advanced hand-instrument interaction analysis in emergency surgical procedures.
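The abstract names the full loop: a YOLOv10 detector, two laterality classes, augmentation against blur and lighting changes, and real-time inference. Below is a minimal sketch of how such a two-class setup could look with the Ultralytics API; the dataset YAML, class names, and all hyperparameters are illustrative assumptions, not the authors' released configuration. One detail worth noting: horizontal flipping must be disabled (or paired with label swapping), because flipping a frame inverts laterality.

```python
# Minimal sketch: two-class hand detector (left/right) with the Ultralytics
# API. The dataset YAML, class names, and hyperparameters are illustrative
# assumptions, not the authors' released setup.
from ultralytics import YOLO

# hands.yaml (hypothetical) would define:
#   path: /data/thompson_task2
#   train: images/train
#   val: images/val
#   names: {0: left_hand, 1: right_hand}
model = YOLO("yolov10n.pt")  # pretrained YOLOv10 nano weights

model.train(
    data="hands.yaml",
    epochs=100,
    imgsz=640,
    # augmentations targeting motion blur and lighting variation
    hsv_v=0.4,      # brightness jitter
    degrees=10.0,   # small rotations
    fliplr=0.0,     # disable horizontal flip: it would swap laterality!
)

# Real-time inference on a surgical video stream
for result in model.predict(source="surgery.mp4", stream=True, conf=0.25):
    for box in result.boxes:
        label = result.names[int(box.cls)]     # "left_hand" / "right_hand"
        x1, y1, x2, y2 = box.xyxy[0].tolist()  # pixel coordinates
        print(label, float(box.conf), (x1, y1, x2, y2))
```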
Related papers
- UniSurg: A Video-Native Foundation Model for Universal Understanding of Surgical Videos [81.9180187964947]
We present UniSurg, a foundation model that shifts the learning paradigm from pixel-level reconstruction to latent motion prediction. To enable large-scale pretraining, we curate the largest surgical video dataset to date, comprising 3,658 hours of video from 50 sources across 13 anatomical regions. These results establish UniSurg as a new standard for universal, motion-oriented surgical video understanding.
arXiv Detail & Related papers (2026-02-05T13:18:33Z)
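For readers unfamiliar with the paradigm UniSurg's summary refers to, here is a hedged PyTorch sketch of latent motion prediction: the model is trained to predict the latent embedding of the next clip rather than reconstruct its pixels. The module shapes and the frozen-target simplification are assumptions for illustration, not UniSurg's actual architecture.

```python
# Sketch of latent motion prediction: predict the *latent embedding* of the
# next clip instead of reconstructing future pixels. Shapes are assumptions.
import torch
import torch.nn as nn

class LatentMotionPredictor(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        # target encoder held fixed here for simplicity (real systems
        # often use an EMA copy instead)
        self.encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim))
        self.predictor = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                       nn.Linear(dim, dim))

    def forward(self, clip_t, clip_t1):
        z_t = self.encoder(clip_t)        # latent of the current clip
        with torch.no_grad():
            z_t1 = self.encoder(clip_t1)  # target latent of the next clip
        pred = self.predictor(z_t)        # predicted future latent
        # the loss lives in latent space, not pixel space
        return nn.functional.mse_loss(pred, z_t1)
```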
- MoE-ACT: Improving Surgical Imitation Learning Policies through Supervised Mixture-of-Experts [1.6646268910871171]
We present a supervised Mixture-of-Experts architecture designed for phase-structured surgical manipulation tasks. We show that a lightweight action decoder policy can learn complex, long-horizon manipulation from fewer than 150 demonstrations. We present preliminary results of policy roll-outs during in vivo porcine surgery.
arXiv Detail & Related papers (2026-01-29T16:50:14Z)
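A minimal sketch of what a supervised Mixture-of-Experts for phase-structured tasks could look like: the gating network doubles as a phase classifier trained with cross-entropy on phase labels, so each expert specializes in one phase. Layer sizes, the number of phases, and the action head are assumptions, not MoE-ACT's published design.

```python
# Supervised MoE sketch: the gate is trained against phase labels so that
# each expert specializes in one surgical phase. Sizes are illustrative.
import torch
import torch.nn as nn

class SupervisedMoE(nn.Module):
    def __init__(self, obs_dim=64, act_dim=7, n_phases=4):
        super().__init__()
        self.gate = nn.Linear(obs_dim, n_phases)  # doubles as phase classifier
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                          nn.Linear(128, act_dim))
            for _ in range(n_phases))

    def forward(self, obs, phase=None):
        logits = self.gate(obs)                       # (B, n_phases)
        weights = logits.softmax(-1)
        acts = torch.stack([e(obs) for e in self.experts], dim=1)  # (B,P,A)
        action = (weights.unsqueeze(-1) * acts).sum(dim=1)         # (B, A)
        # supervised gating loss, available only when phase labels exist
        gate_loss = (nn.functional.cross_entropy(logits, phase)
                     if phase is not None else None)
        return action, gate_loss
```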
- A Multi-View Pipeline and Benchmark Dataset for 3D Hand Pose Estimation in Surgery [1.120882117110929]
We propose a robust pipeline for 3D hand pose estimation in surgical contexts. The pipeline integrates reliable person detection, whole-body pose estimation, and state-of-the-art 2D hand keypoint prediction. We introduce a novel surgical benchmark dataset comprising over 68,000 frames and 3,000 manually annotated 2D hand poses.
arXiv Detail & Related papers (2026-01-22T12:48:24Z)
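The pipeline summary above ends at 2D hand keypoints; lifting them to 3D from multiple calibrated views is typically done with direct linear transform (DLT) triangulation. The sketch below shows that standard step under the assumption of known camera projection matrices; it is not code from the paper.

```python
# Standard DLT triangulation of one 2D keypoint seen in several calibrated
# views. Camera projection matrices are assumed to be known.
import numpy as np

def triangulate(projections, points_2d):
    """projections: list of 3x4 camera matrices P_i;
    points_2d: list of (x, y) detections of the same keypoint."""
    rows = []
    for P, (x, y) in zip(projections, points_2d):
        rows.append(x * P[2] - P[0])   # each view contributes two
        rows.append(y * P[2] - P[1])   # linear constraints on X
    A = np.stack(rows)
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]                         # null-space solution (homogeneous)
    return X[:3] / X[3]                # back to Euclidean coordinates
```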
- Surg-SegFormer: A Dual Transformer-Based Model for Holistic Surgical Scene Segmentation [6.285713987996377]
We introduce Surg-SegFormer, a novel prompt-free model that outperforms current state-of-the-art techniques. By providing robust and automated surgical scene comprehension, this model significantly reduces the tutoring burden on expert surgeons.
arXiv Detail & Related papers (2025-07-06T09:04:25Z)
- SurgVidLM: Towards Multi-grained Surgical Video Understanding with Large Language Model [67.8359850515282]
SurgVidLM is the first video language model designed to address both full and fine-grained surgical video comprehension. We show that SurgVidLM significantly outperforms state-of-the-art Vid-LLMs of comparable parameter scale in both full and fine-grained video understanding tasks.
arXiv Detail & Related papers (2025-06-22T02:16:18Z)
- Procedure-Aware Surgical Video-language Pretraining with Hierarchical Knowledge Augmentation [51.222684687924215]
Surgical video-language pretraining faces unique challenges due to the knowledge domain gap and the scarcity of multi-modal data. We propose a hierarchical knowledge augmentation approach and a novel Procedure-Encoded Surgical Knowledge-Augmented Video-Language Pretraining framework to tackle these issues.
arXiv Detail & Related papers (2024-09-30T22:21:05Z)
- SAR-RARP50: Segmentation of surgical instrumentation and Action Recognition on Robot-Assisted Radical Prostatectomy Challenge [72.97934765570069]
We release the first multimodal, publicly available, in-vivo dataset for surgical action recognition and semantic instrumentation segmentation, containing 50 suturing video segments of Robotic Assisted Radical Prostatectomy (RARP).
The aim of the challenge is to enable researchers to leverage the scale of the provided dataset and develop robust and highly accurate single-task action recognition and tool segmentation approaches in the surgical domain.
A total of 12 teams participated in the challenge, contributing 7 action recognition methods, 9 instrument segmentation techniques, and 4 multitask approaches that integrated both action recognition and instrument segmentation.
arXiv Detail & Related papers (2023-12-31T13:32:18Z)
- Visual-Kinematics Graph Learning for Procedure-agnostic Instrument Tip Segmentation in Robotic Surgeries [29.201385352740555]
We propose a novel visual-kinematics graph learning framework to accurately segment the instrument tip across various surgical procedures.
Specifically, a graph learning framework is proposed to encode relational features of instrument parts from both image and kinematics.
A cross-modal contrastive loss is designed to incorporate a robust geometric prior from kinematics into the image domain for tip segmentation.
arXiv Detail & Related papers (2023-09-02T14:52:58Z)
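The cross-modal contrastive loss mentioned above can be illustrated with a standard symmetric InfoNCE between per-frame image and kinematics embeddings; the exact formulation in the paper may differ, and the temperature and dimensions here are assumptions.

```python
# Symmetric InfoNCE between image and kinematics embeddings of the same
# frames: matched pairs are pulled together, mismatched pairs pushed apart.
import torch
import torch.nn.functional as F

def cross_modal_infonce(img_emb, kin_emb, temperature=0.07):
    img = F.normalize(img_emb, dim=-1)     # (B, D) image features
    kin = F.normalize(kin_emb, dim=-1)     # (B, D) kinematics features
    logits = img @ kin.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    # symmetric loss: image->kinematics and kinematics->image
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```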
- GLSFormer: Gated - Long, Short Sequence Transformer for Step Recognition in Surgical Videos [57.93194315839009]
We propose a vision transformer-based approach to learn temporal features directly from sequence-level patches.
We extensively evaluate our approach on two cataract surgery video datasets, Cataract-101 and D99, and demonstrate superior performance compared to various state-of-the-art methods.
arXiv Detail & Related papers (2023-07-20T17:57:04Z)
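As a rough illustration of "sequence-level patches": rather than patchifying frames independently, a clip can be cut into spatio-temporal tubelets that a transformer encoder consumes as tokens. The sketch below shows that generic idea only; GLSFormer's gated long/short streams are not reproduced, and all sizes are assumptions.

```python
# Generic sequence-level patch encoder: a 3D convolution folds
# (time, height, width) blocks of a clip into single tokens for a
# standard transformer encoder. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

class SequencePatchEncoder(nn.Module):
    def __init__(self, frames=8, patch=16, dim=256):
        super().__init__()
        self.to_tokens = nn.Conv3d(3, dim,
                                   kernel_size=(frames, patch, patch),
                                   stride=(frames, patch, patch))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, clip):                         # clip: (B, 3, T, H, W)
        tokens = self.to_tokens(clip)                # (B, dim, T', H', W')
        tokens = tokens.flatten(2).transpose(1, 2)   # (B, N, dim)
        return self.encoder(tokens)                  # spatio-temporal features
```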
- Next-generation Surgical Navigation: Marker-less Multi-view 6DoF Pose Estimation of Surgical Instruments [64.59698930334012]
First, we present a multi-camera capture setup consisting of static and head-mounted cameras. Second, we publish a multi-view RGB-D video dataset of ex-vivo spine surgeries, captured in a surgical wet lab and a real operating theatre. Third, we evaluate three state-of-the-art single-view and multi-view methods for the task of 6DoF pose estimation of surgical instruments.
arXiv Detail & Related papers (2023-05-05T13:42:19Z)
- Using Hand Pose Estimation To Automate Open Surgery Training Feedback [0.0]
This research aims to facilitate the use of state-of-the-art computer vision algorithms for the automated training of surgeons.
By estimating 2D hand poses, we model the movement of the practitioner's hands, and their interaction with surgical instruments.
arXiv Detail & Related papers (2022-11-13T21:47:31Z)
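Once per-frame 2D hand poses exist, training feedback of the kind described above usually reduces to motion statistics over keypoint trajectories. A small sketch of two common skill proxies, total path length and mean speed, follows; the wrist-keypoint input format is an assumption.

```python
# Motion statistics from a 2D keypoint trajectory: path length and mean
# speed of the wrist, two common proxies for surgical skill.
import numpy as np

def motion_stats(wrist_xy, fps=30.0):
    """wrist_xy: (T, 2) array of per-frame wrist positions in pixels."""
    steps = np.diff(wrist_xy, axis=0)        # frame-to-frame displacement
    dists = np.linalg.norm(steps, axis=1)    # pixels per frame
    path_length = dists.sum()                # total travel, pixels
    mean_speed = dists.mean() * fps          # pixels per second
    return path_length, mean_speed
```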
- Temporally Guided Articulated Hand Pose Tracking in Surgical Videos [22.752654546694334]
Articulated hand pose tracking is an under-explored problem that carries the potential for use in an extensive number of applications. We propose a novel hand pose estimation model, CondPose, which improves detection and tracking accuracy by incorporating a pose prior into its prediction.
arXiv Detail & Related papers (2021-01-12T03:44:04Z)
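Finally, a very loose sketch of what "incorporating a pose prior" can mean in a temporal tracker: blending the current prediction with the previous frame's pose, weighted by detection confidence. This is an illustrative stand-in, not CondPose's actual conditioning mechanism.

```python
# Loose sketch of a temporal pose prior: low-confidence keypoints lean on
# the previous frame's pose, confident keypoints stay where predicted.
import numpy as np

def temporally_guided_pose(pred_xy, pred_conf, prev_xy, alpha=0.7):
    """pred_xy: (K, 2) current keypoints; pred_conf: (K,) confidences;
    prev_xy: (K, 2) previous-frame keypoints serving as the prior."""
    w = alpha * pred_conf[:, None] + (1 - alpha)  # per-keypoint blend weight
    return w * pred_xy + (1 - w) * prev_xy        # damp implausible jumps
```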