Related papers: CauCLIP: Bridging the Sim-to-Real Gap in Surgical Video Understanding via Causality-Inspired Vision-Language Modeling

CauCLIP: Bridging the Sim-to-Real Gap in Surgical Video Understanding via Causality-Inspired Vision-Language Modeling

URL: http://arxiv.org/abs/2602.06619v1
Date: Fri, 06 Feb 2026 11:23:17 GMT
Title: CauCLIP: Bridging the Sim-to-Real Gap in Surgical Video Understanding via Causality-Inspired Vision-Language Modeling
Authors: Yuxin He, An Li, Cheng Xue,
Abstract summary: We propose a causality-inspired vision-language framework that learns domain-invariant representations for surgical phase recognition without access to target domain data.<n>Our approach integrates a frequency-based augmentation strategy to perturb domain-specific attributes while preserving semantic structures, and a causal suppression loss that mitigates non-causal biases and reinforces causal surgical features.
Score: 15.329483562973536
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Surgical phase recognition is a critical component for context-aware decision support in intelligent operating rooms, yet training robust models is hindered by limited annotated clinical videos and large domain gaps between synthetic and real surgical data. To address this, we propose CauCLIP, a causality-inspired vision-language framework that leverages CLIP to learn domain-invariant representations for surgical phase recognition without access to target domain data. Our approach integrates a frequency-based augmentation strategy to perturb domain-specific attributes while preserving semantic structures, and a causal suppression loss that mitigates non-causal biases and reinforces causal surgical features. These components are combined in a unified training framework that enables the model to focus on stable causal factors underlying surgical workflows. Experiments on the SurgVisDom hard adaptation benchmark demonstrate that our method substantially outperforms all competing approaches, highlighting the effectiveness of causality-guided vision-language models for domain-generalizable surgical video understanding.

Related papers

Self-Supervised Anatomical Consistency Learning for Vision-Grounded Medical Report Generation [61.350584471060756]
Vision-grounded medical report generation aims to produce clinically accurate descriptions of medical images.<n>We propose Self-Supervised Anatomical Consistency Learning (SS-ACL) to align generated reports with corresponding anatomical regions.<n>SS-ACL constructs a hierarchical anatomical graph inspired by the invariant top-down inclusion structure of human anatomy.
arXiv Detail & Related papers (2025-09-30T08:59:06Z)
SCOPE: Speech-guided COllaborative PErception Framework for Surgical Scene Segmentation [4.97436124491469]
We introduce a speech-guided collaborative perception framework that integrates reasoning capabilities of large language model (LLM) with perception capabilities of open-set VFMs.<n>A key component of this framework is a collaborative perception agent, which generates top candidates of VFM-generated segmentation.<n> instruments themselves serve as interactive pointers to label additional elements of the surgical scene.
arXiv Detail & Related papers (2025-09-12T23:36:52Z)
Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception [71.26728044621458]
DeCLIP is a novel framework that enhances CLIP by decoupling the self-attention module to obtain content'' and context'' features respectively.<n>It consistently achieves state-of-the-art performance across a broad spectrum of tasks, including 2D detection and segmentation, 3D instance segmentation, video instance segmentation, and 6D object pose estimation.
arXiv Detail & Related papers (2025-08-15T06:43:51Z)
Multimodal Causal-Driven Representation Learning for Generalizable Medical Image Segmentation [56.52520416420957]
We propose Multimodal Causal-Driven Representation Learning (MCDRL) to tackle domain generalization in medical image segmentation.<n>MCDRL consistently outperforms competing methods, yielding superior segmentation accuracy and exhibiting robust generalizability.
arXiv Detail & Related papers (2025-08-07T03:41:41Z)
Advancing Vision-based Human Action Recognition: Exploring Vision-Language CLIP Model for Generalisation in Domain-Independent Tasks [1.0889037375410424]
Human action recognition plays a critical role in healthcare and medicine, supporting applications such as patient behavior monitoring, fall detection, surgical robot supervision, and procedural skill assessment.<n>Recent advancements in vision-language models, especially the transformer-based CLIP model, offer promising capabilities for generalizing action recognition from video data.
arXiv Detail & Related papers (2025-07-24T13:13:28Z)
Surgical Foundation Model Leveraging Compression and Entropy Maximization for Image-Guided Surgical Assistance [50.486523249499115]
Real-time video understanding is critical to guide procedures in minimally invasive surgery (MIS)<n>We propose Compress-to-Explore (C2E), a novel self-supervised framework to learn compact, informative representations from surgical videos.<n>C2E uses entropy-maximizing decoders to compress images while preserving clinically relevant details, improving encoder performance without labeled data.
arXiv Detail & Related papers (2025-05-16T14:02:24Z)
SurgRAW: Multi-Agent Workflow with Chain-of-Thought Reasoning for Surgical Intelligence [16.584722724845182]
Integration of Vision-Language Models in surgical intelligence is hindered by hallucinations, domain knowledge gaps, and limited understanding of task interdependencies.<n>We present SurgRAW, a CoT-driven multi-agent framework that delivers transparent, interpretable insights for most tasks in robotic-assisted surgery.
arXiv Detail & Related papers (2025-03-13T11:23:13Z)
Procedure-Aware Surgical Video-language Pretraining with Hierarchical Knowledge Augmentation [51.222684687924215]
Surgical video-language pretraining faces unique challenges due to the knowledge domain gap and the scarcity of multi-modal data.<n>We propose a hierarchical knowledge augmentation approach and a novel Procedure-Encoded Surgical Knowledge-Augmented Video-Language Pretraining framework to tackle these issues.
arXiv Detail & Related papers (2024-09-30T22:21:05Z)
Hypergraph-Transformer (HGT) for Interactive Event Prediction in Laparoscopic and Robotic Surgery [47.47211257890948]
We propose a predictive neural network that is capable of understanding and predicting critical interactive aspects of surgical workflow from intra-abdominal video.<n>We verify our approach on established surgical datasets and applications, including the detection and prediction of action triplets.<n>Our results demonstrate the superiority of our approach compared to unstructured alternatives.
arXiv Detail & Related papers (2024-02-03T00:58:05Z)
Event Recognition in Laparoscopic Gynecology Videos with Hybrid Transformers [4.371909393924804]
We introduce a dataset tailored for relevant event recognition in laparoscopic videos. Our dataset includes annotations for critical events associated with major intra-operative challenges and post-operative complications. We evaluate a hybrid transformer architecture coupled with a customized training-inference framework to recognize four specific events in laparoscopic surgery videos.
arXiv Detail & Related papers (2023-12-01T13:57:29Z)
Simulation-to-Real domain adaptation with teacher-student learning for endoscopic instrument segmentation [1.1047993346634768]
We introduce a teacher-student learning approach that learns jointly from annotated simulation data and unlabeled real data. Empirical results on three datasets highlight the effectiveness of the proposed framework.
arXiv Detail & Related papers (2021-03-02T09:30:28Z)
Few-shot Medical Image Segmentation using a Global Correlation Network with Discriminative Embedding [60.89561661441736]
We propose a novel method for few-shot medical image segmentation. We construct our few-shot image segmentor using a deep convolutional network trained episodically. We enhance discriminability of deep embedding to encourage clustering of the feature domains of the same class.
arXiv Detail & Related papers (2020-12-10T04:01:07Z)

This list is automatically generated from the titles and abstracts of the papers in this site.