Adaptation of Multi-modal Representation Models for Multi-task Surgical Computer Vision
- URL: http://arxiv.org/abs/2507.05020v2
- Date: Thu, 10 Jul 2025 12:21:47 GMT
- Title: Adaptation of Multi-modal Representation Models for Multi-task Surgical Computer Vision
- Authors: Soham Walimbe, Britty Baby, Vinkle Srivastav, Nicolas Padoy
- Abstract summary: MML-SurgAdapt is a unified multi-task framework to handle diverse surgical tasks through natural language supervision. A key challenge in multi-task learning is the presence of partial annotations when integrating different tasks. Our framework extends this approach to integrate data from multiple surgical tasks within a single procedure, enabling effective learning despite incomplete or noisy annotations.
- Score: 1.890063512530524
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Surgical AI often involves multiple tasks within a single procedure, like phase recognition or assessing the Critical View of Safety in laparoscopic cholecystectomy. Traditional models, built for one task at a time, lack flexibility, requiring a separate model for each. To address this, we introduce MML-SurgAdapt, a unified multi-task framework with Vision-Language Models (VLMs), specifically CLIP, to handle diverse surgical tasks through natural language supervision. A key challenge in multi-task learning is the presence of partial annotations when integrating different tasks. To overcome this, we employ Single Positive Multi-Label (SPML) learning, which traditionally reduces annotation burden by training models with only one positive label per instance. Our framework extends this approach to integrate data from multiple surgical tasks within a single procedure, enabling effective learning despite incomplete or noisy annotations. We demonstrate the effectiveness of our model on a combined dataset consisting of Cholec80, Endoscapes2023, and CholecT50, utilizing custom prompts. Extensive evaluation shows that MML-SurgAdapt performs comparably to task-specific benchmarks, with the added advantage of handling noisy annotations. It also outperforms the existing SPML frameworks for the task. By reducing the required labels by 23%, our approach proposes a more scalable and efficient labeling process, significantly easing the annotation burden on clinicians. To our knowledge, this is the first application of SPML to integrate data from multiple surgical tasks, presenting a novel and generalizable solution for multi-task learning in surgical computer vision. Implementation is available at: https://github.com/CAMMA-public/MML-SurgAdapt
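The core SPML idea described in the abstract, training with a single observed positive label per frame while treating all unobserved labels as negatives, can be sketched with the common "assume negative" loss. This is a generic illustration only; the label count and loss variant are assumptions, not the authors' implementation, which additionally pairs CLIP image features with text prompts:

```python
import numpy as np

def assume_negative_loss(logits, positive_idx):
    """SPML 'assume negative' binary cross-entropy: the single observed
    label is treated as positive, every unobserved label as negative.

    logits: (num_labels,) raw per-label scores for one frame
    positive_idx: index of the one annotated positive label
    """
    probs = 1.0 / (1.0 + np.exp(-logits))   # per-label sigmoid
    targets = np.zeros_like(probs)          # unobserved labels -> negative
    targets[positive_idx] = 1.0             # the single annotated positive
    eps = 1e-12
    bce = -(targets * np.log(probs + eps)
            + (1.0 - targets) * np.log(1.0 - probs + eps))
    return bce.mean()

# Example: 5 candidate labels pooled from several surgical tasks
# (e.g. phases, CVS criteria); only one positive is annotated per frame.
logits = np.array([2.0, -1.0, 0.5, -2.0, -0.5])
loss = assume_negative_loss(logits, positive_idx=0)
```

Because unobserved labels are only assumed negative, this loss tolerates the missing and noisy annotations the abstract mentions; stronger SPML variants reweight or relabel the assumed negatives.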
Related papers
- Single-Input Multi-Output Model Merging: Leveraging Foundation Models for Dense Multi-Task Learning [46.51245338355645]
Model merging is a flexible and computationally tractable approach to merge single-task checkpoints into a multi-task model. We show that it qualitatively differs from the single-input-multiple-output model merging settings studied in the literature due to the existence of task-specific decoders. We propose two simple and efficient fixes for the SIMO setting to re-align the feature representation after merging.
arXiv Detail & Related papers (2025-04-15T15:10:46Z)
- Partially Supervised Unpaired Multi-Modal Learning for Label-Efficient Medical Image Segmentation [53.723234136550055]
We term this new learning paradigm Partially Supervised Unpaired Multi-Modal Learning (PSUMML). We propose a novel Decomposed partial class adaptation with snapshot Ensembled Self-Training (DEST) framework for it. Our framework consists of a compact segmentation network with modality-specific normalization layers for learning with partially labeled unpaired multi-modal data.
arXiv Detail & Related papers (2025-03-07T07:22:42Z)
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment [58.94611347128066]
Multimodal large language models (MLLMs) struggle with fine-grained or precise understanding of visuals. Recent studies either develop tool use or unify specific visual tasks into an autoregressive framework, often at the expense of overall multimodal performance. We propose Task Preference Optimization (TPO), a novel method that utilizes differentiable task preferences derived from typical fine-grained visual tasks.
arXiv Detail & Related papers (2024-12-26T18:56:05Z)
- MetaGPT: Merging Large Language Models Using Model Exclusive Task Arithmetic [6.46176287368784]
We propose Model Exclusive Task Arithmetic for merging GPT-scale models.
Our proposed MetaGPT is data-agnostic and bypasses the heavy search process, making it cost-effective and easy to implement for LLMs.
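The task arithmetic underlying this entry can be sketched generically: each fine-tuned checkpoint contributes a "task vector" (its weights minus the base weights), and the merged model adds scaled task vectors to the base. MetaGPT derives the scaling per task in closed form; in this illustrative sketch the coefficients are supplied by hand:

```python
import numpy as np

def merge_task_arithmetic(base, finetuned_models, lambdas):
    """Generic task arithmetic: add scaled task vectors
    (finetuned - base) to the base weights. The lambdas here are
    hand-picked for illustration, not MetaGPT's closed-form values.
    """
    merged = {name: w.copy() for name, w in base.items()}
    for task_weights, lam in zip(finetuned_models, lambdas):
        for name, w in task_weights.items():
            merged[name] += lam * (w - base[name])  # scaled task vector
    return merged

base = {"layer.w": np.array([1.0, 1.0])}
task_a = {"layer.w": np.array([2.0, 1.0])}   # task vector (1, 0)
task_b = {"layer.w": np.array([1.0, 3.0])}   # task vector (0, 2)
merged = merge_task_arithmetic(base, [task_a, task_b], lambdas=[0.5, 0.5])
# merged["layer.w"] == [1.5, 2.0]
```

The approach is data-agnostic in the sense that merging touches only the checkpoints, never the training data.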
arXiv Detail & Related papers (2024-06-17T10:12:45Z)
- Exploring the Transferability of Visual Prompting for Multimodal Large Language Models [47.162575147632396]
Transferable Visual Prompting (TVP) is a simple and effective approach to generate visual prompts that transfer to different models and improve their performance on downstream tasks after being trained on only one model.
We introduce two strategies to address the issue of cross-model feature corruption of existing visual prompting methods and enhance the transferability of the learned prompts.
arXiv Detail & Related papers (2024-04-17T09:39:07Z)
- MING-MOE: Enhancing Medical Multi-Task Learning in Large Language Models with Sparse Mixture of Low-Rank Adapter Experts [22.596827147978598]
This paper introduces MING-MOE, a novel Mixture-of-Experts (MoE)-based medical large language model.
It is designed to manage diverse and complex medical tasks without requiring task-specific annotations.
It achieves state-of-the-art (SOTA) performance on over 20 medical tasks, illustrating a significant improvement over existing models.
arXiv Detail & Related papers (2024-04-13T15:28:52Z)
- Distribution Matching for Multi-Task Learning of Classification Tasks: a Large-Scale Study on Faces & Beyond [62.406687088097605]
Multi-Task Learning (MTL) is a framework where multiple related tasks are learned jointly and benefit from a shared representation space.
We show that MTL can be successful with classification tasks that have little or even non-overlapping annotation.
We propose a novel approach, where knowledge exchange is enabled between the tasks via distribution matching.
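A distribution-matching penalty of the kind this summary describes can be illustrated with a symmetric KL divergence between two predicted label distributions. How the paper pairs distributions across tasks is its contribution and is not reproduced here; this sketch only shows the matching term itself on same-dimensional toy distributions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distribution_matching_penalty(logits_a, logits_b):
    """Symmetric KL divergence between two predicted distributions
    over the same batch. Minimizing it pushes the two predictors to
    agree, enabling knowledge exchange even where one side lacks
    labels. Generic sketch; the paper's exact loss may differ.
    """
    p, q = softmax(logits_a), softmax(logits_b)
    kl_pq = (p * np.log(p / q)).sum(axis=-1)
    kl_qp = (q * np.log(q / p)).sum(axis=-1)
    return 0.5 * (kl_pq + kl_qp).mean()
```

The penalty is zero when both predictors agree exactly and grows as their distributions diverge.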
arXiv Detail & Related papers (2024-01-02T14:18:11Z)
- FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks [129.49630356651454]
We propose a novel FAshion-focused Multi-task Efficient learning method for Vision-and-Language tasks (FAME-ViL).
Our FAME-ViL can save 61.5% of parameters over alternatives, while significantly outperforming the conventional independently trained single-task models.
arXiv Detail & Related papers (2023-03-04T19:07:48Z) - OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist
Models [72.8156832931841]
Generalist models are capable of performing diverse multi-modal tasks in a task-agnostic way within a single model.
We release a generalist model learning system, OFASys, built on top of a declarative task interface named multi-modal instruction.
arXiv Detail & Related papers (2022-12-08T17:07:09Z)
- Task Adaptive Parameter Sharing for Multi-Task Learning [114.80350786535952]
Task Adaptive Parameter Sharing (TAPS) is a method for tuning a base model to a new task by adaptively modifying a small, task-specific subset of layers.
Compared to other methods, TAPS retains high accuracy on downstream tasks while introducing few task-specific parameters.
We evaluate our method on a suite of fine-tuning tasks and architectures (ResNet, DenseNet, ViT) and show that it achieves state-of-the-art performance while being simple to implement.
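The core mechanism, tuning only a small task-specific subset of layers while the rest stay shared, can be sketched as a per-layer gate that decides whether a task-specific weight delta is applied. In TAPS the gates and deltas are learned jointly under a sparsity penalty; here the gates are fixed booleans purely for illustration:

```python
import numpy as np

def task_adapted_forward(x, base_layers, task_deltas, active):
    """Forward pass where only gated layers use task-specific weights.

    base_layers: list of shared weight matrices
    task_deltas: same-shaped task-specific residual weights
    active: boolean per layer; True applies the task delta
            (learned via a sparsity-inducing relaxation in TAPS,
            fixed here for illustration)
    """
    for w, d, on in zip(base_layers, task_deltas, active):
        w_eff = w + d if on else w        # task-specific subset only
        x = np.maximum(w_eff @ x, 0.0)    # simple ReLU layer
    return x

rng = np.random.default_rng(0)
base = [rng.standard_normal((4, 4)) for _ in range(3)]
deltas = [0.1 * rng.standard_normal((4, 4)) for _ in range(3)]
out = task_adapted_forward(rng.standard_normal(4), base, deltas,
                           active=[False, True, False])
```

Only the deltas of active layers need to be stored per task, which is why the method introduces few task-specific parameters.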
arXiv Detail & Related papers (2022-03-30T23:16:07Z)
- ST-MAML: A Stochastic-Task based Method for Task-Heterogeneous Meta-Learning [12.215288736524268]
This paper proposes a novel method, ST-MAML, that empowers model-agnostic meta-learning (MAML) to learn from multiple task distributions.
We demonstrate that ST-MAML matches or outperforms the state-of-the-art on two few-shot image classification tasks, one curve regression benchmark, one image completion problem, and a real-world temperature prediction application.
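For orientation, the MAML mechanism that ST-MAML builds on is an inner adaptation step per task followed by a meta-update on post-adaptation query loss. The sketch below is plain first-order MAML on toy linear regression; ST-MAML's stochastic task encoding is not reproduced here:

```python
import numpy as np

def maml_step(theta, tasks, inner_lr=0.01, outer_lr=0.001):
    """One first-order MAML meta-update for linear regression
    y = w * x + b. theta: np.array([w, b]); tasks: list of
    (x_support, y_support, x_query, y_query) arrays.
    """
    def grad(th, x, y):
        err = th[0] * x + th[1] - y
        return 2.0 * np.array([(err * x).mean(), err.mean()])

    meta_grad = np.zeros_like(theta)
    for xs, ys, xq, yq in tasks:
        adapted = theta - inner_lr * grad(theta, xs, ys)  # inner step
        meta_grad += grad(adapted, xq, yq)                # outer gradient
    return theta - outer_lr * meta_grad / len(tasks)
```

Repeating this step over tasks drawn from several distributions yields an initialization that adapts quickly to each; ST-MAML additionally encodes which distribution a task came from.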
arXiv Detail & Related papers (2021-09-27T18:54:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.