Related papers: Adapting Vision Foundation Models for Real-time Ultrasound Image Segmentation

Adapting Vision Foundation Models for Real-time Ultrasound Image Segmentation

URL: http://arxiv.org/abs/2503.24368v1
Date: Mon, 31 Mar 2025 17:47:42 GMT
Title: Adapting Vision Foundation Models for Real-time Ultrasound Image Segmentation
Authors: Xiaoran Zhang, Eric Z. Chen, Lin Zhao, Xiao Chen, Yikang Liu, Boris Maihe, James S. Duncan, Terrence Chen, Shanhui Sun,
Abstract summary: Existing ultrasound segmentation methods often struggle with adaptability to new tasks.<n>We introduce an adaptive framework that leverages the vision foundation model Hiera to extract multi-scale features.<n>These enriched features are then decoded to produce precise and robust segmentation.
Score: 20.009670139005085
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We propose a novel approach that adapts hierarchical vision foundation models for real-time ultrasound image segmentation. Existing ultrasound segmentation methods often struggle with adaptability to new tasks, relying on costly manual annotations, while real-time approaches generally fail to match state-of-the-art performance. To overcome these limitations, we introduce an adaptive framework that leverages the vision foundation model Hiera to extract multi-scale features, interleaved with DINOv2 representations to enhance visual expressiveness. These enriched features are then decoded to produce precise and robust segmentation. We conduct extensive evaluations on six public datasets and one in-house dataset, covering both cardiac and thyroid ultrasound segmentation. Experiments show that our approach outperforms state-of-the-art methods across multiple datasets and excels with limited supervision, surpassing nnUNet by over 20\% on average in the 1\% and 10\% data settings. Our method achieves $\sim$77 FPS inference speed with TensorRT on a single GPU, enabling real-time clinical applications.

Related papers

AutoMiSeg: Automatic Medical Image Segmentation via Test-Time Adaptation of Foundation Models [7.382887784956608]
This paper introduces a zero-shot and automatic segmentation pipeline that combines vision-language and segmentation foundation models.<n>By proper decomposition and test-time adaptation, our fully automatic pipeline performs competitively with weakly-prompted interactive foundation models.
arXiv Detail & Related papers (2025-05-23T14:07:21Z)
Enhancing SAM with Efficient Prompting and Preference Optimization for Semi-supervised Medical Image Segmentation [30.524999223901645]
We propose an enhanced Segment Anything Model (SAM) framework that utilizes annotation-efficient prompts generated in a fully unsupervised fashion.<n>We adopt the direct preference optimization technique to design an optimal policy that enables the model to generate high-fidelity segmentations.<n>State-of-the-art performance of our framework in tasks such as lung segmentation, breast tumor segmentation, and organ segmentation across various modalities, including X-ray, ultrasound, and abdominal CT, justifies its effectiveness in low-annotation data scenarios.
arXiv Detail & Related papers (2025-03-06T17:28:48Z)
Tuning Vision Foundation Model via Test-Time Prompt-Guided Training for VFSS Segmentations [1.8142185304787555]
We propose a novel test-time training paradigm that enhances the performance of foundation models on downstream datasets without requiring full annotations.<n>Specifically, our method employs simple point prompts to guide a test-time semi-self-supervised training task.<n>This approach directly tackles challenges in the medical imaging field, where acquiring annotations is both time-intensive and expensive.
arXiv Detail & Related papers (2025-01-30T16:48:02Z)
Self-adaptive vision-language model for 3D segmentation of pulmonary artery and vein [18.696258519327095]
This paper proposes a novel framework called Language-guided self-adaptive Cross-Attention Fusion Framework.<n>Our method adopts pre-trained CLIP as a strong feature extractor for generating the segmentation of 3D CT scans.<n>We extensively validate our method on a local dataset, which is the largest pulmonary artery-vein CT dataset to date.
arXiv Detail & Related papers (2025-01-07T12:03:02Z)
PMT: Progressive Mean Teacher via Exploring Temporal Consistency for Semi-Supervised Medical Image Segmentation [51.509573838103854]
We propose a semi-supervised learning framework, termed Progressive Mean Teachers (PMT), for medical image segmentation. Our PMT generates high-fidelity pseudo labels by learning robust and diverse features in the training process. Experimental results on two datasets with different modalities, i.e., CT and MRI, demonstrate that our method outperforms the state-of-the-art medical image segmentation approaches.
arXiv Detail & Related papers (2024-09-08T15:02:25Z)
Semantic Segmentation Refiner for Ultrasound Applications with Zero-Shot Foundation Models [1.8142288667655782]
We propose a prompt-less segmentation method harnessing the ability of segmentation foundation models to segment abstract shapes. Our method's advantages are brought to light in experiments on a small-scale musculoskeletal ultrasound images dataset.
arXiv Detail & Related papers (2024-04-25T04:21:57Z)
Dual-scale Enhanced and Cross-generative Consistency Learning for Semi-supervised Medical Image Segmentation [49.57907601086494]
Medical image segmentation plays a crucial role in computer-aided diagnosis. We propose a novel Dual-scale Enhanced and Cross-generative consistency learning framework for semi-supervised medical image (DEC-Seg)
arXiv Detail & Related papers (2023-12-26T12:56:31Z)
ARHNet: Adaptive Region Harmonization for Lesion-aware Augmentation to Improve Segmentation Performance [61.04246102067351]
We propose a foreground harmonization framework (ARHNet) to tackle intensity disparities and make synthetic images look more realistic. We demonstrate the efficacy of our method in improving the segmentation performance using real and synthetic images.
arXiv Detail & Related papers (2023-07-02T10:39:29Z)
Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space. We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains. Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z)
Cluster-level pseudo-labelling for source-free cross-domain facial expression recognition [94.56304526014875]
We propose the first Source-Free Unsupervised Domain Adaptation (SFUDA) method for Facial Expression Recognition (FER) Our method exploits self-supervised pretraining to learn good feature representations from the target data. We validate the effectiveness of our method in four adaptation setups, proving that it consistently outperforms existing SFUDA methods when applied to FER.
arXiv Detail & Related papers (2022-10-11T08:24:50Z)
Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based Action Recognition [88.34182299496074]
Action labels are only available on a source dataset, but unavailable on a target dataset in the training stage. We utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets. By segmenting and permuting temporal segments or human body parts, we design two self-supervised learning classification tasks.
arXiv Detail & Related papers (2022-07-17T07:05:39Z)
A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model [61.58071099082296]
It is unclear how to make zero-shot recognition working well on broader vision problems, such as object detection and semantic segmentation. In this paper, we target for zero-shot semantic segmentation, by building it on an off-the-shelf pre-trained vision-language model, i.e., CLIP. Our experimental results show that this simple framework surpasses previous state-of-the-arts by a large margin.
arXiv Detail & Related papers (2021-12-29T18:56:18Z)
The Little W-Net That Could: State-of-the-Art Retinal Vessel Segmentation with Minimalistic Models [19.089445797922316]
We show that a minimalistic version of a standard U-Net with several orders of magnitude less parameters closely approximates the performance of current best techniques. We also propose a simple extension, dubbed W-Net, which reaches outstanding performance on several popular datasets. We also test our approach on the Artery/Vein segmentation problem, where we again achieve results well-aligned with the state-of-the-art.
arXiv Detail & Related papers (2020-09-03T19:59:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.