Related papers: CurConMix+: A Unified Spatio-Temporal Framework for Hierarchical Surgical Workflow Understanding

URL: http://arxiv.org/abs/2601.12312v1
Date: Sun, 18 Jan 2026 08:42:10 GMT
Title: CurConMix+: A Unified Spatio-Temporal Framework for Hierarchical Surgical Workflow Understanding
Authors: Yongjun Jeon, Jongmin Shin, Kanggil Park, Seonmin Park, Soyoung Lim, Jung Yong Kim, Jinsoo Rhu, Jongman Kim, Gyu-Seong Choi, Namkee Oh, Kyu-Hwan Jung
Abstract summary: Surgical action triplet recognition aims to understand fine-grained surgical behaviors by modeling interactions among instruments, actions, and anatomical targets. Despite its clinical importance for workflow analysis and skill assessment, progress has been hindered by severe class imbalance, subtle visual variations, and the interdependence among triplet components. This study builds upon CurConMix, a spatial representation framework, to address these challenges.
Score: 1.0692208281858493
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Surgical action triplet recognition aims to understand fine-grained surgical behaviors by modeling the interactions among instruments, actions, and anatomical targets. Despite its clinical importance for workflow analysis and skill assessment, progress has been hindered by severe class imbalance, subtle visual variations, and the semantic interdependence among triplet components. Existing approaches often address only a subset of these challenges rather than tackling them jointly, which limits their ability to form a holistic understanding. This study builds upon CurConMix, a spatial representation framework. At its core, a curriculum-guided contrastive learning strategy learns discriminative and progressively correlated features, further enhanced by structured hard-pair sampling and feature-level mixup. Its temporal extension, CurConMix+, integrates a Multi-Resolution Temporal Transformer (MRTT) that achieves robust, context-aware understanding by adaptively fusing multi-scale temporal features and dynamically balancing spatio-temporal cues. Furthermore, we introduce LLS48, a new, hierarchically annotated benchmark for complex laparoscopic left lateral sectionectomy, providing step-, task-, and action-level annotations. Extensive experiments on CholecT45 and LLS48 demonstrate that CurConMix+ not only outperforms state-of-the-art approaches in triplet recognition, but also exhibits strong cross-level generalization, as its fine-grained features effectively transfer to higher-level phase and step recognition tasks. Together, the framework and dataset provide a unified foundation for hierarchy-aware, reproducible, and interpretable surgical workflow understanding. The code and dataset will be publicly released on GitHub to facilitate reproducibility and further research.

Related papers

Fully Differentiable Bidirectional Dual-Task Synergistic Learning for Semi-Supervised 3D Medical Image Segmentation [3.9950415168730107]
Semi-supervised learning relaxes the need for large pixel-wise labeled datasets for image segmentation by leveraging unlabeled data. We propose a fully Differentiable Bidirectional Synergistic Learning (DBiSL) framework, which seamlessly integrates and enhances four critical SSL components.
arXiv Detail & Related papers (2026-02-10T03:44:24Z)
Surgical Video Understanding with Label Interpolation [3.880707330499936]
Robot-assisted surgery (RAS) has become a critical paradigm in modern surgery, promoting patient recovery and reducing the burden on surgeons. Previous studies have predominantly focused on single-task approaches, but real surgical scenes involve complex temporal dynamics and diverse instrument interactions. We propose a novel framework that combines optical flow-based segmentation label interpolation with multi-task learning.
arXiv Detail & Related papers (2025-09-23T08:49:07Z)
MEJO: MLLM-Engaged Surgical Triplet Recognition via Inter- and Intra-Task Joint Optimization [52.149337961205624]
We propose a framework that empowers both inter- and intra-task optimization for surgical triplet recognition. For inter-task optimization, we introduce the Shared-Specific-Disentangled (S$^2$D) learning scheme that decomposes representations into task-shared and task-specific components. For intra-task optimization conflicts, we develop a Coordinated Gradient Learning (CGL) strategy, which dissects and rebalances the positive-negative ambiguities.
arXiv Detail & Related papers (2025-09-16T09:48:52Z)
Foundation Model for Skeleton-Based Human Action Understanding [56.89025287217221]
This paper presents a Unified Skeleton-based Dense Representation Learning (USDRL) framework. USDRL consists of a Transformer-based Dense Spatio-Temporal Encoder (DSTE), Multi-Grained Feature Decorrelation (MG-FD), and Multi-Perspective Consistency Training (MPCT).
arXiv Detail & Related papers (2025-08-18T02:42:16Z)
NEARL-CLIP: Interacted Query Adaptation with Orthogonal Regularization for Medical Vision-Language Understanding [51.63264715941068]
NEARL-CLIP (iNteracted quEry Adaptation with oRthogonaL Regularization) is a novel cross-modality interaction VLM-based framework.
arXiv Detail & Related papers (2025-08-06T05:44:01Z)
CmFNet: Cross-modal Fusion Network for Weakly-supervised Segmentation of Medical Images [15.499686354040774]
We propose CmFNet, a novel 3D weakly supervised cross-modal medical image segmentation approach. CmFNet consists of three main components: a modality-specific feature learning network, a cross-modal feature learning network, and a hybrid-supervised learning strategy. Our approach effectively mitigates overfitting, delivering robust segmentation results.
arXiv Detail & Related papers (2025-06-22T14:02:27Z)
SurgRAW: Multi-Agent Workflow with Chain-of-Thought Reasoning for Surgical Intelligence [16.584722724845182]
Integration of Vision-Language Models in surgical intelligence is hindered by hallucinations, domain knowledge gaps, and limited understanding of task interdependencies. We present SurgRAW, a CoT-driven multi-agent framework that delivers transparent, interpretable insights for most tasks in robotic-assisted surgery.
arXiv Detail & Related papers (2025-03-13T11:23:13Z)
Neuron: Learning Context-Aware Evolving Representations for Zero-Shot Skeleton Action Recognition [64.56321246196859]
We propose a novel dyNamically Evolving dUal skeleton-semantic syneRgistic framework. We first construct the spatial-temporal evolving micro-prototypes and integrate dynamic context-aware side information. We introduce the spatial compression and temporal memory mechanisms to guide the growth of spatial-temporal micro-prototypes.
arXiv Detail & Related papers (2024-11-18T05:16:11Z)
Robust Semi-supervised Multimodal Medical Image Segmentation via Cross Modality Collaboration [21.97457095780378]
We propose a novel semi-supervised multimodal segmentation framework that is robust to scarce labeled data and misaligned modalities. Our framework employs a novel cross modality collaboration strategy to distill modality-independent knowledge, which is inherently associated with each modality. It also integrates contrastive consistent learning to regulate anatomical structures, facilitating anatomical-wise prediction alignment on unlabeled data.
arXiv Detail & Related papers (2024-08-14T07:34:12Z)
Learning Multiscale Consistency for Self-supervised Electron Microscopy Instance Segmentation [48.267001230607306]
We propose a pretraining framework that enhances multiscale consistency in EM volumes. Our approach leverages a Siamese network architecture, integrating strong and weak data augmentations. It effectively captures voxel and feature consistency, showing promise for learning transferable representations for EM analysis.
arXiv Detail & Related papers (2023-08-19T05:49:13Z)
HiCLRE: A Hierarchical Contrastive Learning Framework for Distantly Supervised Relation Extraction [24.853265244512954]
We propose a Hierarchical Contrastive Learning framework for Distantly Supervised Relation Extraction (HiCLRE) to reduce noisy sentences. Specifically, we propose a three-level hierarchical learning framework that interacts across levels, generating de-noised, context-aware representations. Experiments demonstrate that HiCLRE significantly outperforms strong baselines on various mainstream DSRE datasets.
arXiv Detail & Related papers (2022-02-27T12:48:26Z)
Provable Hierarchy-Based Meta-Reinforcement Learning [50.17896588738377]
We analyze HRL in the meta-RL setting, where a learner learns latent hierarchical structure during meta-training for use in a downstream task. We provide "diversity conditions" which, together with a tractable optimism-based algorithm, guarantee sample-efficient recovery of this natural hierarchy. Our bounds incorporate common notions in HRL literature such as temporal and state/action abstractions, suggesting that our setting and analysis capture important features of HRL in practice.
arXiv Detail & Related papers (2021-10-18T17:56:02Z)
Cascaded Human-Object Interaction Recognition [175.60439054047043]
We introduce a cascade architecture for a multi-stage, coarse-to-fine HOI understanding. At each stage, an instance localization network progressively refines HOI proposals and feeds them into an interaction recognition network. With our carefully-designed human-centric relation features, these two modules work collaboratively towards effective interaction understanding.
arXiv Detail & Related papers (2020-03-09T17:05:04Z)

This list is automatically generated from the titles and abstracts of the papers on this site.