fine-CLIP: Enhancing Zero-Shot Fine-Grained Surgical Action Recognition with Vision-Language Models
- URL: http://arxiv.org/abs/2503.19670v1
- Date: Tue, 25 Mar 2025 13:57:02 GMT
- Title: fine-CLIP: Enhancing Zero-Shot Fine-Grained Surgical Action Recognition with Vision-Language Models
- Authors: Saurav Sharma, Didier Mutter, Nicolas Padoy
- Abstract summary: We propose fine-CLIP, which learns object-centric features and leverages the hierarchy in triplet formulation. fine-CLIP shows significant improvements in F1 and mAP, enhancing zero-shot recognition of novel surgical triplets.
- Score: 3.8352069691069084
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: While vision-language models like CLIP have advanced zero-shot surgical phase recognition, they struggle with fine-grained surgical activities, especially action triplets. This limitation arises because current CLIP formulations rely on global image features, which overlook the fine-grained semantics and contextual details crucial for complex tasks like zero-shot triplet recognition. Furthermore, these models do not explore the hierarchical structure inherent in triplets, reducing their ability to generalize to novel triplets. To address these challenges, we propose fine-CLIP, which learns object-centric features and leverages the hierarchy in triplet formulation. Our approach integrates three components: hierarchical prompt modeling to capture shared semantics, LoRA-based vision backbone adaptation for enhanced feature extraction, and a graph-based condensation strategy that groups similar patch features into meaningful object clusters. Since triplet classification is a challenging task, we introduce an alternative yet meaningful base-to-novel generalization benchmark with two settings on the CholecT50 dataset: Unseen-Target, assessing adaptability to triplets with novel anatomical structures, and Unseen-Instrument-Verb, where models need to generalize to novel instrument-verb interactions. fine-CLIP shows significant improvements in F1 and mAP, enhancing zero-shot recognition of novel surgical triplets.
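The abstract mentions LoRA-based adaptation of the frozen CLIP vision backbone. As a minimal sketch of that general idea (not fine-CLIP's actual implementation), a LoRA-adapted linear layer keeps the pretrained weight frozen and adds a trainable low-rank update; all variable names and dimensions below are illustrative:

```python
import numpy as np

# Sketch of a LoRA-adapted linear layer, the general technique named in the
# abstract for adapting a frozen vision backbone. Shapes and the alpha/rank
# scaling follow the standard LoRA formulation; this is not fine-CLIP's code.
rng = np.random.default_rng(0)
d_in, d_out, rank, alpha = 8, 8, 2, 4

W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight
A = rng.normal(size=(rank, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, rank))               # trainable up-projection, zero-init

def lora_forward(x):
    """y = W x + (alpha/rank) * B (A x); only A and B receive gradients."""
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B zero-initialized, the adapted layer initially reproduces the
# frozen layer exactly, so training starts from the pretrained behavior.
print(np.allclose(lora_forward(x), W @ x))
```

Zero-initializing `B` is the standard design choice: it guarantees the adapter is a no-op at the start of training, so only the low-rank matrices need to learn the task-specific correction.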
Related papers
- Leveraging MLLM Embeddings and Attribute Smoothing for Compositional Zero-Shot Learning [21.488599805772054]
Compositional zero-shot learning aims to recognize novel compositions of attributes and objects learned from seen compositions.
Previous works disentangle attribute and object by extracting shared and exclusive parts between image pairs sharing the same attribute (object)
We propose a novel framework named Multimodal Large Language Model (MLLM) embeddings and attribute smoothing guided disentanglement (TRIDENT) for CZSL.
arXiv Detail & Related papers (2024-11-18T07:55:54Z) - Surgical Triplet Recognition via Diffusion Model [59.50938852117371]
Surgical triplet recognition is an essential building block to enable next-generation context-aware operating rooms.
We propose Difft, a new generative framework for surgical triplet recognition employing the diffusion model.
Experiments on the CholecT45 and CholecT50 datasets show the superiority of the proposed method in achieving a new state-of-the-art performance for surgical triplet recognition.
arXiv Detail & Related papers (2024-06-19T04:43:41Z) - Dual-Image Enhanced CLIP for Zero-Shot Anomaly Detection [58.228940066769596]
We introduce a Dual-Image Enhanced CLIP approach, leveraging a joint vision-language scoring system.
Our methods process pairs of images, utilizing each as a visual reference for the other, thereby enriching the inference process with visual context.
Our approach significantly exploits the potential of vision-language joint anomaly detection and demonstrates comparable performance with current SOTA methods across various datasets.
arXiv Detail & Related papers (2024-05-08T03:13:20Z) - Dual-Modal Prompting for Sketch-Based Image Retrieval [76.12076969949062]
We propose a dual-modal CLIP (DP-CLIP) network, in which an adaptive prompting strategy is designed.
We employ a set of images within the target category and the textual category label to respectively construct a set of category-adaptive prompt tokens and channel scales.
Our DP-CLIP outperforms the state-of-the-art fine-grained zero-shot method by 7.3% in Acc.@1 on the Sketchy dataset.
arXiv Detail & Related papers (2024-04-29T13:43:49Z) - Surgical Action Triplet Detection by Mixed Supervised Learning of Instrument-Tissue Interactions [5.033722555649178]
Surgical action triplets describe instrument-tissue interactions as (instrument, verb, target) combinations.
This work focuses on surgical action triplet detection, which is challenging but more precise than the traditional triplet recognition task.
We propose MCIT-IG, a two-stage network, that stands for Multi-Class Instrument-aware Transformer-Interaction Graph.
arXiv Detail & Related papers (2023-07-18T18:47:48Z) - Language-free Compositional Action Generation via Decoupling Refinement [67.50452446686725]
We introduce a novel framework to generate compositional actions without reliance on language auxiliaries.
Our approach consists of three main components: Action Coupling, Conditional Action Generation, and Decoupling Refinement.
arXiv Detail & Related papers (2023-07-07T12:00:38Z) - Triplet Contrastive Learning for Unsupervised Vehicle Re-identification [55.445358749042384]
Part feature learning is a critical technology for fine semantic understanding in vehicle re-identification.
We propose a novel Triplet Contrastive Learning framework (TCL) which leverages cluster features to bridge the part features and global features.
arXiv Detail & Related papers (2023-01-23T15:52:12Z) - Spatio-temporal Relation Modeling for Few-shot Action Recognition [100.3999454780478]
We propose a few-shot action recognition framework, STRM, which enhances class-specific feature discriminability while simultaneously learning higher-order temporal representations.
Our approach achieves an absolute gain of 3.5% in classification accuracy, as compared to the best existing method in the literature.
arXiv Detail & Related papers (2021-12-09T18:59:14Z) - Rendezvous: Attention Mechanisms for the Recognition of Surgical Action Triplets in Endoscopic Videos [12.725586100227337]
Action triplet recognition stands out as the only one aiming to provide truly fine-grained and comprehensive information on surgical activities.
We introduce our new model, the Rendezvous (RDV), which recognizes triplets directly from surgical videos by leveraging attention at two different levels.
Our proposed RDV model significantly improves the triplet prediction mAP by over 9% compared to the state-of-the-art methods on this dataset.
arXiv Detail & Related papers (2021-09-07T17:52:52Z) - Learning Embeddings for Image Clustering: An Empirical Study of Triplet Loss Approaches [10.42820615166362]
We evaluate two different image clustering objectives, k-means clustering and correlation clustering, in the context of Triplet Loss induced feature space embeddings.
We train a convolutional neural network to learn discriminative features by optimizing two popular versions of the Triplet Loss.
We propose a new, simple Triplet Loss formulation, which shows desirable properties with respect to formal clustering objectives and outperforms the existing methods.
arXiv Detail & Related papers (2020-07-06T23:38:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.