SimTxtSeg: Weakly-Supervised Medical Image Segmentation with Simple Text Cues
- URL: http://arxiv.org/abs/2406.19364v3
- Date: Wed, 25 Sep 2024 12:03:54 GMT
- Title: SimTxtSeg: Weakly-Supervised Medical Image Segmentation with Simple Text Cues
- Authors: Yuxin Xie, Tao Zhou, Yi Zhou, Geng Chen
- Abstract summary: We present a novel framework, SimTxtSeg, that leverages simple text cues to generate high-quality pseudo-labels.
We evaluate our framework on two medical image segmentation tasks: colonic polyp segmentation and MRI brain tumor segmentation.
- Score: 11.856041847833666
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weakly-supervised medical image segmentation is a challenging task that aims to reduce annotation cost while maintaining segmentation performance. In this paper, we present a novel framework, SimTxtSeg, that leverages simple text cues to generate high-quality pseudo-labels and simultaneously studies cross-modal fusion in training segmentation models. Our contribution consists of two key components: an effective Textual-to-Visual Cue Converter that produces visual prompts from text prompts on medical images, and a text-guided segmentation model with Text-Vision Hybrid Attention that fuses text and image features. We evaluate our framework on two medical image segmentation tasks: colonic polyp segmentation and MRI brain tumor segmentation, and achieve consistent state-of-the-art performance. Source code is available at: https://github.com/xyx1024/SimTxtSeg.
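As a rough illustration only: the Text-Vision Hybrid Attention named in the abstract is a cross-modal fusion module, and a standard way to fuse text and image features is cross-attention from image tokens to text tokens. The PyTorch sketch below shows that generic pattern; the class name, dimensions, and layer layout are assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class TextVisionHybridAttention(nn.Module):
    """Illustrative cross-modal fusion block (hypothetical, not the authors' code).

    Image tokens first self-attend, then attend to text tokens via
    cross-attention, and a feed-forward layer refines the fused features.
    """

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, img_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
        # img_tokens: (B, N_img, dim); txt_tokens: (B, N_txt, dim)
        x = self.norm1(img_tokens + self.self_attn(img_tokens, img_tokens, img_tokens)[0])
        x = self.norm2(x + self.cross_attn(x, txt_tokens, txt_tokens)[0])  # text guides vision
        return self.norm3(x + self.ffn(x))

# Usage: fuse 1024 image tokens with a 12-token text prompt embedding.
fused = TextVisionHybridAttention()(torch.randn(2, 1024, 256), torch.randn(2, 12, 256))
print(fused.shape)  # torch.Size([2, 1024, 256])
```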
Related papers
- Organ-aware Multi-scale Medical Image Segmentation Using Text Prompt Engineering [17.273290949721975]
Existing medical image segmentation methods rely on uni-modal visual inputs, such as images or videos, requiring labor-intensive manual annotations.
Medical imaging techniques capture multiple intertwined organs within a single scan, further complicating accurate segmentation.
To address these challenges, MedSAM was developed to enhance segmentation accuracy by integrating image features with user-provided prompts.
arXiv Detail & Related papers (2025-03-18T01:35:34Z)
- Text-Promptable Propagation for Referring Medical Image Sequence Segmentation [20.724643106195852]
Ref-MISS aims to segment anatomical structures in medical image sequences based on natural language descriptions.
Existing 2D and 3D segmentation models struggle to explicitly track objects of interest across medical image sequences.
We propose Text-Promptable Propagation (TPP), a model designed for referring medical image sequence segmentation.
arXiv Detail & Related papers (2025-02-16T12:13:11Z)
- Text4Seg: Reimagining Image Segmentation as Text Generation [32.230379277018194]
We introduce Text4Seg, a novel text-as-mask paradigm that casts image segmentation as a text generation problem.
The key innovation is semantic descriptors, a new textual representation of segmentation masks in which each image patch is mapped to its corresponding text label (see the sketch after this entry).
We show that Text4Seg achieves state-of-the-art performance on multiple datasets by fine-tuning different MLLM backbones.
arXiv Detail & Related papers (2024-10-13T14:28:16Z)
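As a toy illustration of the text-as-mask idea described above, the sketch below round-trips a grid of per-patch labels through a plain text sequence. The separator characters and label names are assumptions, not Text4Seg's actual serialization format.

```python
from typing import List

def mask_to_text(labels: List[List[str]]) -> str:
    """Serialize per-patch labels into one text sequence:
    patches joined by '|', rows joined by newlines (hypothetical format)."""
    return "\n".join("|".join(row) for row in labels)

def text_to_mask(text: str) -> List[List[str]]:
    """Parse the text sequence back into the per-patch label grid."""
    return [line.split("|") for line in text.splitlines()]

grid = [["background", "background", "polyp"],
        ["background", "polyp", "polyp"]]
encoded = mask_to_text(grid)
assert text_to_mask(encoded) == grid  # lossless round trip
print(encoded)
```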
- Image-Text Co-Decomposition for Text-Supervised Semantic Segmentation [28.24883865053459]
This paper aims to learn a model capable of segmenting arbitrary visual concepts within images by using only image-text pairs without dense annotations.
Existing methods have demonstrated that contrastive learning on image-text pairs effectively aligns visual segments with the meanings of texts.
A text often consists of multiple semantic concepts, whereas semantic segmentation strives to create semantically homogeneous segments.
arXiv Detail & Related papers (2024-04-05T17:25:17Z)
- ScribblePrompt: Fast and Flexible Interactive Segmentation for Any Biomedical Image [4.076537350106898]
We present ScribblePrompt, a flexible neural-network-based interactive segmentation tool for biomedical imaging.
In a user study with domain experts, ScribblePrompt reduced annotation time by 28% while improving Dice by 15% compared to the next best method.
We showcase ScribblePrompt in an interactive demo, provide code, and release a dataset of scribble annotations at https://scribbleprompt.csail.mit.edu.
arXiv Detail & Related papers (2023-12-12T15:57:03Z)
- Segment Everything Everywhere All at Once [124.90835636901096]
We present SEEM, a promptable and interactive model for segmenting everything everywhere all at once in an image.
We propose a novel decoding mechanism that enables diverse prompting for all types of segmentation tasks.
We conduct a comprehensive empirical study to validate the effectiveness of SEEM across diverse segmentation tasks.
arXiv Detail & Related papers (2023-04-13T17:59:40Z)
- Increasing Textual Context Size Boosts Medical Image-Text Matching [7.39915548392375]
We analyze the use of OpenAI's CLIP, a general image-text matching model, and observe that CLIP's limited textual input size has a negative impact on downstream performance.
We thus train and release ClipMD, which is trained with a simple sliding-window technique to encode textual captions (a sketch follows this entry).
The results show that ClipMD outperforms other models on both datasets by a large margin.
arXiv Detail & Related papers (2023-03-23T15:20:05Z)
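One plausible reading of the sliding-window idea above, as a sketch: split a caption that exceeds the text encoder's context window into overlapping chunks, encode each chunk, and mean-pool the results. The 77-token window matches CLIP's context length, but the stride, the pooling, and the stand-in encoder are assumptions, not ClipMD's released code.

```python
import torch
import torch.nn as nn

def encode_long_caption(tokens: torch.Tensor, encoder: nn.Module,
                        window: int = 77, stride: int = 38) -> torch.Tensor:
    """Slide an overlapping window over a long token sequence, encode
    each chunk, and mean-pool the chunk embeddings into one vector."""
    chunks = [tokens[i:i + window] for i in range(0, len(tokens), stride)]
    chunks = [c for c in chunks if len(c) > 0]
    embs = torch.stack([encoder(c.unsqueeze(0)).squeeze(0) for c in chunks])
    return embs.mean(dim=0)

# Stand-in encoder: embeds token ids and mean-pools.
# A real CLIP text tower would go here instead.
class ToyTextEncoder(nn.Module):
    def __init__(self, vocab: int = 1000, dim: int = 64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        return self.emb(ids).mean(dim=1)  # (B, dim)

caption_ids = torch.randint(0, 1000, (200,))  # caption far beyond a 77-token limit
vec = encode_long_caption(caption_ids, ToyTextEncoder())
print(vec.shape)  # torch.Size([64])
```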
- Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts [63.84720380390935]
There exist two typical types, i.e., the fusion-encoder type and the dual-encoder type, depending on whether a heavy fusion module is used.
We propose an effective yet straightforward scheme named PTUnifier to unify the two types.
We first unify the input format by introducing visual and textual prompts, which serve as a feature bank that stores the most representative images/texts.
arXiv Detail & Related papers (2023-02-17T15:43:42Z)
- ViewCo: Discovering Text-Supervised Segmentation Masks via Multi-View Semantic Consistency [126.88107868670767]
We propose multi-View Consistent learning (ViewCo) for text-supervised semantic segmentation.
We first propose text-to-views consistency modeling to learn correspondence for multiple views of the same input image.
We also propose cross-view segmentation consistency modeling to address the ambiguity issue of text supervision (see the consistency sketch after this entry).
arXiv Detail & Related papers (2023-01-31T01:57:52Z)
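As a generic illustration of cross-view consistency (not ViewCo's exact objective), the sketch below penalizes disagreement between segmentations predicted for two augmented views of the same image.

```python
import torch
import torch.nn.functional as F

def cross_view_consistency(logits_view1: torch.Tensor,
                           logits_view2: torch.Tensor) -> torch.Tensor:
    """Generic consistency loss between two views' segmentation maps.
    logits_*: (B, C, H, W) class logits for two augmentations of the
    same image, assumed warped back into a common frame."""
    p1 = F.softmax(logits_view1, dim=1)
    p2 = F.softmax(logits_view2, dim=1)
    return F.mse_loss(p1, p2)  # agreement encourages view-invariant masks

loss = cross_view_consistency(torch.randn(2, 21, 32, 32), torch.randn(2, 21, 32, 32))
print(loss.item())
```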
- Self-Supervised Correction Learning for Semi-Supervised Biomedical Image Segmentation [84.58210297703714]
We propose a self-supervised correction learning paradigm for semi-supervised biomedical image segmentation.
We design a dual-task network with a shared encoder and two independent decoders, one for segmentation and one for lesion-region inpainting (a sketch follows this entry).
Experiments on three medical image segmentation datasets for different tasks demonstrate the outstanding performance of our method.
arXiv Detail & Related papers (2023-01-12T08:19:46Z)
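A minimal sketch of such a shared-encoder, dual-decoder layout; channel sizes and layer choices are assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class DualTaskNet(nn.Module):
    """Shared encoder with two independent decoders: one predicts a
    segmentation mask, the other inpaints the masked-out lesion region."""

    def __init__(self, ch: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1), nn.ReLU(),
        )

        def make_decoder(out_ch: int) -> nn.Sequential:
            return nn.Sequential(
                nn.ConvTranspose2d(ch * 2, ch, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(ch, out_ch, 4, stride=2, padding=1),
            )

        self.seg_decoder = make_decoder(1)      # segmentation logits
        self.inpaint_decoder = make_decoder(3)  # reconstructed image

    def forward(self, x: torch.Tensor):
        z = self.encoder(x)  # shared representation drives both tasks
        return self.seg_decoder(z), self.inpaint_decoder(z)

seg, recon = DualTaskNet()(torch.randn(1, 3, 64, 64))
print(seg.shape, recon.shape)  # (1, 1, 64, 64) (1, 3, 64, 64)
```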
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment (see the sketch after this entry).
Our proposed framework significantly outperforms the state of the art without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
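A minimal sketch of CLIP-style text-to-pixel scoring of the kind described above; the normalization, temperature, and loss choice are assumptions rather than CRIS's exact formulation.

```python
import torch
import torch.nn.functional as F

def text_to_pixel_logits(pixel_feats: torch.Tensor, text_feat: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """Score every pixel embedding against one sentence embedding.

    pixel_feats: (B, D, H, W) dense visual features
    text_feat:   (B, D) sentence embedding
    Returns per-pixel logits of shape (B, H, W)."""
    p = F.normalize(pixel_feats, dim=1)  # unit-norm along the feature axis
    t = F.normalize(text_feat, dim=1)
    return torch.einsum("bdhw,bd->bhw", p, t) / temperature

pixel_feats = torch.randn(2, 256, 28, 28)
text_feat = torch.randn(2, 256)
logits = text_to_pixel_logits(pixel_feats, text_feat)
target = (torch.rand(2, 28, 28) > 0.5).float()  # stand-in ground-truth mask
loss = F.binary_cross_entropy_with_logits(logits, target)  # pull matching pixels toward the text
print(logits.shape, loss.item())
```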
- Semantic Segmentation with Generative Models: Semi-Supervised Learning and Strong Out-of-Domain Generalization [112.68171734288237]
We propose a novel framework for discriminative pixel-level tasks using a generative model of both images and labels.
We learn a generative adversarial network that captures the joint image-label distribution and is trained efficiently using a large set of unlabeled images.
We demonstrate strong in-domain performance compared to several baselines, and are the first to showcase extreme out-of-domain generalization.
arXiv Detail & Related papers (2021-04-12T21:41:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.