Related papers: Cataract-LMM: Large-Scale, Multi-Source, Multi-Task Benchmark for Deep Learning in Surgical Video Analysis

URL: http://arxiv.org/abs/2510.16371v1
Date: Sat, 18 Oct 2025 06:48:29 GMT
Title: Cataract-LMM: Large-Scale, Multi-Source, Multi-Task Benchmark for Deep Learning in Surgical Video Analysis
Authors: Mohammad Javad Ahmadi, Iman Gandomi, Parisa Abdi, Seyed-Farzad Mohammadi, Amirhossein Taslimi, Mehdi Khodaparast, Hassan Hashemi, Mahdi Tavakoli, Hamid D. Taghirad
Abstract summary: We present a dataset of 3,000 cataract surgery videos from two surgical centers, performed by surgeons with a range of experience levels. This resource is enriched with four annotation layers: temporal surgical phases, instance segmentation of instruments and anatomical structures, instrument-tissue interaction tracking, and quantitative skill scores. The technical quality of the dataset is supported by a series of benchmarking experiments for key surgical AI tasks.
Score: 4.318540086708654
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: The development of computer-assisted surgery systems depends on large-scale, annotated datasets. Current resources for cataract surgery often lack the diversity and annotation depth needed to train generalizable deep-learning models. To address this gap, we present a dataset of 3,000 phacoemulsification cataract surgery videos from two surgical centers, performed by surgeons with a range of experience levels. This resource is enriched with four annotation layers: temporal surgical phases, instance segmentation of instruments and anatomical structures, instrument-tissue interaction tracking, and quantitative skill scores based on established competency rubrics such as the ICO-OSCAR. The technical quality of the dataset is supported by a series of benchmarking experiments for key surgical AI tasks, including workflow recognition, scene segmentation, and automated skill assessment. Furthermore, we establish a domain adaptation baseline for the phase recognition task by training a model on a subset of surgical centers and evaluating its performance on a held-out center. The dataset and annotations are available via a Google Form (https://docs.google.com/forms/d/e/1FAIpQLSfmyMAPSTGrIy2sTnz0-TMw08ZagTimRulbAQcWdaPwDy187A/viewform?usp=dialog).
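
The held-out-center evaluation mentioned in the abstract can be illustrated with a minimal sketch. It assumes each video carries a center label; the function name `leave_one_center_out`, the `(video_id, center)` pairs, and the sample IDs are hypothetical and are not taken from the dataset or the authors' code.

```python
from collections import defaultdict


def leave_one_center_out(videos):
    """Yield (held_out_center, train_ids, test_ids) once per center.

    `videos` is a list of (video_id, center) pairs. Both fields are
    hypothetical and do not reflect the dataset's actual metadata schema.
    """
    by_center = defaultdict(list)
    for video_id, center in videos:
        by_center[center].append(video_id)

    centers = sorted(by_center)
    for held_out in centers:
        # Test on the held-out center, train on every other center.
        test_ids = list(by_center[held_out])
        train_ids = [v for c in centers if c != held_out for v in by_center[c]]
        yield held_out, train_ids, test_ids


# Illustrative (made-up) video IDs from two centers, "A" and "B".
videos = [("v001", "A"), ("v002", "A"), ("v003", "B"), ("v004", "B"), ("v005", "B")]
splits = list(leave_one_center_out(videos))
```

Splitting by center rather than by video keeps all footage from the evaluation site out of training, which is what makes the result a measure of cross-center generalization.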

Related papers

Data-Efficient Learning for Generalizable Surgical Video Understanding [0.0]
This research aims to bridge the gap between deep learning-based surgical video analysis in research and real-world clinical environments. I benchmarked state-of-the-art neural network architectures to identify the most effective designs for each task. We developed semi-supervised frameworks that improve model performance across tasks by leveraging large amounts of unlabeled surgical video.
arXiv Detail & Related papers (2025-08-13T22:00:23Z)
Surgical Foundation Model Leveraging Compression and Entropy Maximization for Image-Guided Surgical Assistance [50.486523249499115]
Real-time video understanding is critical to guide procedures in minimally invasive surgery (MIS). We propose Compress-to-Explore (C2E), a novel self-supervised framework to learn compact, informative representations from surgical videos. C2E uses entropy-maximizing decoders to compress images while preserving clinically relevant details, improving encoder performance without labeled data.
arXiv Detail & Related papers (2025-05-16T14:02:24Z)
SurgXBench: Explainable Vision-Language Model Benchmark for Surgery [4.068223793121694]
Vision-Language Models (VLMs) have brought transformative advances in reasoning across visual and textual modalities. Existing models show limited performance, highlighting the need for benchmark studies to assess their capabilities and limitations. We benchmark the zero-shot performance of several advanced VLMs on two public robotic-assisted laparoscopic datasets for instrument and action classification.
arXiv Detail & Related papers (2025-05-16T00:42:18Z)
Surgeons vs. Computer Vision: A comparative analysis on surgical phase recognition capabilities [65.66373425605278]
Automated Surgical Phase Recognition (SPR) uses Artificial Intelligence (AI) to segment the surgical workflow into its key events. Previous research has focused on short and linear surgical procedures and has not explored whether temporal context influences experts' ability to better classify surgical phases. This research addresses these gaps, focusing on Robot-Assisted Partial Nephrectomy (RAPN) as a highly non-linear procedure.
arXiv Detail & Related papers (2025-04-26T15:37:22Z)
OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining [60.75854609803651]
OphCLIP is a hierarchical retrieval-augmented vision-language pretraining framework for ophthalmic surgical workflow understanding. OphCLIP learns both fine-grained and long-term visual representations by aligning short video clips with detailed narrative descriptions and full videos with structured titles. Our OphCLIP also designs a retrieval-augmented pretraining framework to leverage the underexplored large-scale silent surgical procedure videos.
arXiv Detail & Related papers (2024-11-23T02:53:08Z)
Cataract-1K: Cataract Surgery Dataset for Scene Segmentation, Phase Recognition, and Irregularity Detection [5.47960852753243]
We present the largest cataract surgery video dataset that addresses diverse requisites for constructing computerized surgical workflow analysis. We validate the quality of annotations by benchmarking the performance of several state-of-the-art neural network architectures. The dataset and annotations will be publicly available upon acceptance of the paper.
arXiv Detail & Related papers (2023-12-11T10:53:05Z)
Hierarchical Semi-Supervised Learning Framework for Surgical Gesture Segmentation and Recognition Based on Multi-Modality Data [2.8770761243361593]
We develop a hierarchical semi-supervised learning framework for surgical gesture segmentation using multi-modality data. A Transformer-based network with a pre-trained ResNet-18 backbone is used to extract visual features from the surgical operation videos. The proposed approach has been evaluated using data from the publicly available JIGSAWS database, including Suturing, Needle Passing, and Knot Tying tasks.
arXiv Detail & Related papers (2023-07-31T21:17:59Z)
Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures [50.09187683845788]
Recent advancements in surgical computer vision applications have been driven by vision-only models. These methods rely on manually annotated surgical videos to predict a fixed set of object categories. In this work, we put forward the idea that the surgical video lectures available through open surgical e-learning platforms can provide effective vision and language supervisory signals.
arXiv Detail & Related papers (2023-07-27T22:38:12Z)
Dissecting Self-Supervised Learning Methods for Surgical Computer Vision [51.370873913181605]
Self-Supervised Learning (SSL) methods have begun to gain traction in the general computer vision community. The effectiveness of SSL methods in more complex and impactful domains, such as medicine and surgery, remains limited and unexplored. We present an extensive analysis of the performance of these methods on the Cholec80 dataset for two fundamental and popular tasks in surgical context understanding, phase recognition and tool presence detection.
arXiv Detail & Related papers (2022-07-01T14:17:11Z)
Aggregating Long-Term Context for Learning Laparoscopic and Robot-Assisted Surgical Workflows [40.48632897750319]
We propose a new temporal network structure that leverages task-specific network representation to collect long-term sufficient statistics. We demonstrate superior results over existing and novel state-of-the-art segmentation techniques on two laparoscopic cholecystectomy datasets.
arXiv Detail & Related papers (2020-09-01T20:29:14Z)
m2caiSeg: Semantic Segmentation of Laparoscopic Images using Convolutional Neural Networks [4.926395463398194]
We propose a deep learning based semantic segmentation algorithm to identify and label the tissues and organs in the endoscopic video feed of the human torso region. We present an annotated dataset, m2caiSeg, created from endoscopic video feeds of real-world surgical procedures.
arXiv Detail & Related papers (2020-08-23T23:30:15Z)
LRTD: Long-Range Temporal Dependency based Active Learning for Surgical Workflow Recognition [67.86810761677403]
We propose a novel active learning method for cost-effective surgical video analysis. Specifically, we propose a non-local recurrent convolutional network (NL-RCNet), which introduces a non-local block to capture the long-range temporal dependency. We validate our approach on a large surgical video dataset (Cholec80) by performing the surgical workflow recognition task.
arXiv Detail & Related papers (2020-04-21T09:21:22Z)

This list is automatically generated from the titles and abstracts of the papers on this site.