Tuning Vision Foundation Model via Test-Time Prompt-Guided Training for VFSS Segmentations
- URL: http://arxiv.org/abs/2501.18474v1
- Date: Thu, 30 Jan 2025 16:48:02 GMT
- Title: Tuning Vision Foundation Model via Test-Time Prompt-Guided Training for VFSS Segmentations
- Authors: Chengxi Zeng, David Smithard, Alberto M Gambaruto, Tilo Burghardt,
- Abstract summary: We propose a novel test-time training paradigm that enhances the performance of foundation models on downstream datasets without requiring full annotations. Specifically, our method employs simple point prompts to guide a test-time semi-self-supervised training task. This approach directly tackles challenges in the medical imaging field, where acquiring annotations is both time-intensive and expensive.
- Score: 1.8142185304787555
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Vision foundation models have demonstrated exceptional generalization capabilities in segmentation tasks for both generic and specialized images. However, a performance gap persists between foundation models and task-specific, specialized models. Fine-tuning foundation models on downstream datasets is often necessary to bridge this gap. Unfortunately, obtaining fully annotated ground truth for downstream datasets is both challenging and costly. To address this limitation, we propose a novel test-time training paradigm that enhances the performance of foundation models on downstream datasets without requiring full annotations. Specifically, our method employs simple point prompts to guide a test-time semi-self-supervised training task. The model learns by resolving the ambiguity of the point prompt through various augmentations. This approach directly tackles challenges in the medical imaging field, where acquiring annotations is both time-intensive and expensive. We conducted extensive experiments on our new Videofluoroscopy dataset (VFSS-5k) for the instance segmentation task, achieving an average Dice coefficient of 0.868 across 12 anatomies with a single model.
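As a concrete illustration of the paradigm described above, the following is a minimal sketch only, assuming a PyTorch setting: a toy promptable network stands in for the pretrained vision foundation model, simple flips stand in for the paper's augmentations, and a soft Dice agreement between predictions on augmented views serves as the test-time objective. All names (PromptableSegmenter, point_heatmap, test_time_adapt) are hypothetical, and this is not the authors' method or released code.

```python
# Hypothetical sketch of test-time prompt-guided training: a point prompt guides a
# semi-self-supervised consistency objective over augmented views of one test image.
import torch
import torch.nn as nn

class PromptableSegmenter(nn.Module):
    """Toy promptable model: RGB image + point-prompt heatmap -> mask logits."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 1),
        )

    def forward(self, image, prompt_heatmap):
        # The point prompt is encoded as an extra input channel.
        return self.net(torch.cat([image, prompt_heatmap], dim=1))

def soft_dice(p, q, eps=1e-6):
    """Soft Dice agreement between two probability maps (1.0 = identical)."""
    inter = (p * q).sum(dim=(1, 2, 3))
    denom = p.pow(2).sum(dim=(1, 2, 3)) + q.pow(2).sum(dim=(1, 2, 3))
    return (2 * inter + eps) / (denom + eps)

def point_heatmap(h, w, yx, sigma=5.0):
    """Gaussian heatmap centred on a single clicked point (y, x)."""
    ys = torch.arange(h).float().view(-1, 1)
    xs = torch.arange(w).float().view(1, -1)
    d2 = (ys - yx[0]) ** 2 + (xs - yx[1]) ** 2
    return torch.exp(-d2 / (2 * sigma ** 2)).view(1, 1, h, w)

def test_time_adapt(model, image, prompt, steps=10, lr=1e-4):
    """Adapt on ONE test image: enforce Dice consistency between the prediction on
    the original view and predictions on flipped views (warped back before comparing)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        base = torch.sigmoid(model(image, prompt))
        loss = 0.0
        for dim in (2, 3):  # vertical / horizontal flips as simple augmentations
            aug = torch.sigmoid(model(image.flip([dim]), prompt.flip([dim])))
            loss = loss + (1 - soft_dice(base, aug.flip([dim]))).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.sigmoid(model(image, prompt))

if __name__ == "__main__":
    torch.manual_seed(0)
    frame = torch.rand(1, 3, 64, 64)            # stand-in for a VFSS frame
    click = point_heatmap(64, 64, (32, 40))     # simulated single point prompt
    mask = test_time_adapt(PromptableSegmenter(), frame, click)
    print("predicted mask shape:", tuple(mask.shape))
```

In the paper's setting the adapted backbone would be a large promptable foundation model and the point prompt would come from a single click per anatomy in a VFSS frame; the sketch only conveys how one click plus augmentation consistency can define a self-supervised training signal on the test data itself.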
Related papers
- Adapting Vision Foundation Models for Real-time Ultrasound Image Segmentation [20.009670139005085]
Existing ultrasound segmentation methods often struggle with adaptability to new tasks.
We introduce an adaptive framework that leverages the vision foundation model Hiera to extract multi-scale features.
These enriched features are then decoded to produce precise and robust segmentation.
arXiv Detail & Related papers (2025-03-31T17:47:42Z)
- AI-Assisted Colonoscopy: Polyp Detection and Segmentation using Foundation Models [0.10037949839020764]
In colonoscopy, 80% of the missed polyps could be detected with the help of Deep Learning models.
In the search for algorithms capable of addressing this challenge, foundation models emerge as promising candidates.
Their zero-shot or few-shot learning capabilities facilitate generalization to new data or tasks without extensive fine-tuning.
A comprehensive evaluation of foundation models for polyp segmentation was conducted, assessing both detection and delimitation.
arXiv Detail & Related papers (2025-03-31T14:20:53Z)
- UniSTD: Towards Unified Spatio-Temporal Learning across Diverse Disciplines [64.84631333071728]
We introduce UniSTD, a unified Transformer-based framework for spatio-temporal modeling.
Our work demonstrates that a task-specific vision-text model can build a generalizable model for spatio-temporal learning.
We also introduce a temporal module to incorporate temporal dynamics explicitly.
arXiv Detail & Related papers (2025-03-26T17:33:23Z)
- Where's Waldo: Diffusion Features for Personalized Segmentation and Retrieval [31.48981364573974]
Self-supervised foundation models have been introduced for these tasks, showing results comparable to supervised methods.
A significant flaw in these models is evident: they struggle to locate a desired instance when other instances of the same class are present.
We propose a novel approach called PDM for Personalized Features Diffusion Matching, which leverages intermediate features of pre-trained text-to-image models for personalization tasks without any additional training.
arXiv Detail & Related papers (2024-05-28T10:13:18Z)
- Combating Missing Modalities in Egocentric Videos at Test Time [92.38662956154256]
Real-world applications often face challenges with incomplete modalities due to privacy concerns, efficiency needs, or hardware issues.
We propose MiDl, a novel approach that addresses this issue at test time without requiring retraining.
MiDl represents the first self-supervised, online solution for handling missing modalities exclusively at test time.
arXiv Detail & Related papers (2024-04-23T16:01:33Z)
- No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance [68.18779562801762]
Multimodal models require exponentially more data to achieve linear improvements in downstream "zero-shot" performance.
Our study reveals an exponential need for training data, which implies that the key to "zero-shot" generalization capabilities under large-scale training paradigms remains to be found.
arXiv Detail & Related papers (2024-04-04T17:58:02Z)
- Few-shot Online Anomaly Detection and Segmentation [29.693357653538474]
This paper focuses on addressing the challenging yet practical few-shot online anomaly detection and segmentation (FOADS) task.
Under the FOADS framework, models are first trained on a few-shot normal dataset and then improve their capabilities by inspecting unlabeled streaming data that contains both normal and abnormal samples.
In order to achieve improved performance with limited training samples, we employ multi-scale feature embedding extracted from a CNN pre-trained on ImageNet to obtain a robust representation.
arXiv Detail & Related papers (2024-03-27T02:24:00Z)
- Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations underscore the model's agility in adapting to various tasks while requiring a significantly smaller parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z)
- Fantastic Gains and Where to Find Them: On the Existence and Prospect of General Knowledge Transfer between Any Pretrained Model [74.62272538148245]
We show that for arbitrary pairings of pretrained models, one model extracts significant data context unavailable in the other.
We investigate if it is possible to transfer such "complementary" knowledge from one model to another without performance degradation.
arXiv Detail & Related papers (2023-10-26T17:59:46Z)
- A Critical Look at the Current Usage of Foundation Model for Dense Recognition Task [26.938332354370814]
Large models trained on huge amounts of cross-modality data, usually termed foundation models, have achieved conspicuous accomplishments in many fields.
It remains unclear whether these foundation models can be applied to other downstream tasks.
arXiv Detail & Related papers (2023-07-06T08:57:53Z)
- Universal Domain Adaptation from Foundation Models: A Baseline Study [58.51162198585434]
We conduct empirical studies of state-of-the-art UniDA methods using foundation models.
We introduce CLIP distillation, a parameter-free method specifically designed to distill target knowledge from CLIP models; a rough sketch of this general idea appears right after this list.
Although simple, our method outperforms previous approaches in most benchmark tasks.
arXiv Detail & Related papers (2023-05-18T16:28:29Z)
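As a rough illustration of what distilling target knowledge from a CLIP model can look like, here is a minimal, assumption-laden PyTorch sketch: CLIP-style image-text similarities on unlabeled target images provide soft pseudo-labels, and a KL-divergence term distills them into a target classifier. The function names, temperatures, and the random tensors standing in for CLIP embeddings are illustrative; this is not the cited paper's CLIP distillation implementation.

```python
# Hypothetical sketch of distilling zero-shot knowledge from a CLIP-style teacher:
# image-text similarities on unlabeled target data become soft pseudo-labels.
import torch
import torch.nn.functional as F

def clip_soft_labels(image_feats, text_feats, temperature=0.01):
    """Zero-shot class distribution from CLIP-style features.
    image_feats: (N, D); text_feats: (C, D), one embedding per class-name prompt."""
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature  # cosine similarity, sharpened
    return logits.softmax(dim=-1)

def distillation_loss(student_logits, teacher_probs, tau=2.0):
    """KL divergence between softened student predictions and the CLIP pseudo-labels."""
    log_student = F.log_softmax(student_logits / tau, dim=-1)
    return F.kl_div(log_student, teacher_probs, reduction="batchmean") * tau * tau

if __name__ == "__main__":
    torch.manual_seed(0)
    n, d, c = 8, 512, 5                          # unlabeled target batch, feature dim, classes
    image_feats = torch.randn(n, d)              # stand-in for CLIP image embeddings
    text_feats = torch.randn(c, d)               # stand-in for class-prompt text embeddings
    student_logits = torch.randn(n, c, requires_grad=True)
    loss = distillation_loss(student_logits, clip_soft_labels(image_feats, text_feats))
    loss.backward()
    print("distillation loss:", float(loss))
```

The sketch adds no trainable distillation module beyond the student itself, loosely echoing the "parameter-free" description above.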
This list is automatically generated from the titles and abstracts of the papers on this site.