PPBoost: Progressive Prompt Boosting for Text-Driven Medical Image Segmentation
- URL: http://arxiv.org/abs/2511.21984v1
- Date: Wed, 26 Nov 2025 23:49:44 GMT
- Title: PPBoost: Progressive Prompt Boosting for Text-Driven Medical Image Segmentation
- Authors: Xuchen Li, Hengrui Gu, Mohan Zhang, Qin Liu, Zhen Tan, Xinyuan Zhu, Huixue Zhou, Tianlong Chen, Kaixiong Zhou,
- Abstract summary: PPBoost transforms weak text-derived signals into strong, spatially grounded visual prompts.<n>It operates under a strict zero-shot regime with no image- or pixel-level segmentation labels.<n>It consistently improves Dice and Normalized Surface Distance over text- and visual-prompted baselines.
- Score: 56.238478239463575
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-prompted foundation models for medical image segmentation offer an intuitive way to delineate anatomical structures from natural language queries, but their predictions often lack spatial precision and degrade under domain shift. In contrast, visual-prompted models achieve strong segmentation performance across diverse modalities by leveraging spatial cues of precise bounding-box (bbox) prompts to guide the segmentation of target lesions. However, it is costly and challenging to obtain the precise visual prompts in clinical practice. We propose PPBoost (Progressive Prompt-Boosting), a framework that bridges these limitations by transforming weak text-derived signals into strong, spatially grounded visual prompts, operating under a strict zero-shot regime with no image- or pixel-level segmentation labels. PPBoost first uses a vision-language model to produce initial pseudo-bboxes conditioned on the textual object descriptions and applies an uncertainty-aware criterion to filter unreliable predictions. The retained image-bboxes pairs are then leveraged to train a pseudo-labeled detector, producing the high-quality bboxes for the query images. During inference, PPBoost further refines the generated bboxes by appropriately expanding them to tightly cover the target anatomical structures. The enhanced spatially-grounding bbox prompts guide existing segmentation models to generate final dense masks, effectively amplifying weak text cues into strong spatial guidance. Across three datasets spanning diverse modalities and anatomies, PPBoost consistently improves Dice and Normalized Surface Distance over text- and visual-prompted baselines and, notably, surpasses few-shot segmentation models without using labeled data. PPBoost can generalize to multiple typical visual segmentation model backbones.
Related papers
- Boosting Point-supervised Temporal Action Localization via Text Refinement and Alignment [66.80402022104074]
We propose a Text Refinement and Alignment (TRA) framework that effectively utilizes textual features from visual descriptions to complement the visual features as they are semantically rich.<n>This is achieved by designing two new modules for the original point-supervised framework: a Point-based Text Refinement module (PTR) and a Point-based Multimodal Alignment module (PMA)
arXiv Detail & Related papers (2026-02-01T14:35:46Z) - ResAgent: Entropy-based Prior Point Discovery and Visual Reasoning for Referring Expression Segmentation [21.87321809019825]
Referring Expression (RES) is a core vision-language segmentation task that enables pixel-level understanding of targets via free-form linguistic expressions.<n>textbfmodel is a novel RES framework integrating textbfEntropy-textbfBased Point textbfDiscovery (textbfEBD) and textbfVision-textbfBased textbfReasoning (textbfVBR)<n>model implements a coarse-to
arXiv Detail & Related papers (2026-01-23T01:56:04Z) - SGDiff: Scene Graph Guided Diffusion Model for Image Collaborative SegCaptioning [53.638998508418545]
This paper introduces a new task Image Collaborative and Captioning'' (SegCaptioning)<n>SegCaptioning aims to translate a straightforward prompt, like a bounding box around an object, into diverse semantic interpretations represented by (caption, masks) pairs.<n>This task poses significant challenges, including accurately capturing a user's intention from a minimal prompt while simultaneously predicting multiple semantically aligned caption words and masks.
arXiv Detail & Related papers (2025-12-01T18:33:04Z) - Text4Seg++: Advancing Image Segmentation via Generative Language Modeling [52.07442359419673]
We propose a novel text-as-mask paradigm that casts image segmentation as a text generation problem.<n>Key innovation is semantic descriptors, a new textual representation of segmentation masks.<n>Experiments on natural and remote sensing datasets show that Text4Seg++ consistently outperforms state-of-the-art models.
arXiv Detail & Related papers (2025-09-08T04:07:14Z) - GS: Generative Segmentation via Label Diffusion [59.380173266566715]
Language-driven image segmentation is a fundamental task in vision-language understanding, requiring models to segment regions of an image corresponding to natural language expressions.<n>Recent diffusion models have been introduced to this domain, but existing approaches remain image-centric.<n>We propose GS (Generative Label), a novel framework that formulates segmentation itself as a generative task.<n> Experimental results show that GS significantly outperforms existing discriminative and diffusion-based methods, setting a new state-of-the-art for language-driven segmentation.
arXiv Detail & Related papers (2025-08-27T16:28:15Z) - Constrained Prompt Enhancement for Improving Zero-Shot Generalization of Vision-Language Models [57.357091028792325]
Vision-language models (VLMs) pre-trained on web-scale data exhibit promising zero-shot generalization but often suffer from semantic misalignment.<n>We propose a novel constrained prompt enhancement (CPE) method to improve visual-textual alignment.<n>Our approach consists of two key components: Topology-Guided Synonymous Semantic Generation (TGSSG) and Category-Agnostic Discriminative Region Selection (CADRS)
arXiv Detail & Related papers (2025-08-24T15:45:22Z) - DPSeg: Dual-Prompt Cost Volume Learning for Open-Vocabulary Semantic Segmentation [16.64056234334767]
Open-vocabulary semantic segmentation aims to segment images into distinct semantic regions at the pixel level.<n>Current methods utilize text embeddings from pre-trained vision-language models like CLIP.<n>We propose a dual prompting framework, DPSeg, for this task.
arXiv Detail & Related papers (2025-05-16T20:25:42Z) - VSC: Visual Search Compositional Text-to-Image Diffusion Model [15.682990658945682]
We introduce a novel compositional generation method that leverages pairwise image embeddings to improve attribute-object binding.<n>Our approach decomposes complex prompts into sub-prompts, generates corresponding images, and computes visual prototypes that fuse with text embeddings to enhance representation.<n>Our approaches outperform existing compositional text-to-image diffusion models on the benchmark T2I CompBench, achieving better image quality, evaluated by humans, and emerging robustness under scaling number of binding pairs in the prompt.
arXiv Detail & Related papers (2025-05-02T08:31:43Z) - Bayesian Prompt Flow Learning for Zero-Shot Anomaly Detection [17.590853105242864]
vision-language models (e.g. CLIP) have demonstrated remarkable performance in zero-shot anomaly detection (ZSAD)<n>Bayes-PFL is designed to learn both image-specific and image-agnostic distributions, which are jointly utilized to regularize the text prompt space and improve the model's generalization on unseen categories.<n>Experiments on 15 industrial and medical datasets demonstrate our method's superior performance.
arXiv Detail & Related papers (2025-03-13T06:05:35Z) - "Principal Components" Enable A New Language of Images [79.45806370905775]
We introduce a novel visual tokenization framework that embeds a provable PCA-like structure into the latent token space.<n>Our approach achieves state-of-the-art reconstruction performance and enables better interpretability to align with the human vision system.
arXiv Detail & Related papers (2025-03-11T17:59:41Z) - InvSeg: Test-Time Prompt Inversion for Semantic Segmentation [33.60580908728705]
InvSeg is a test-time prompt inversion method that tackles open-vocabulary semantic segmentation.<n>We introduce Contrastive Soft Clustering (CSC) to align derived masks with the image's structure information.<n>InvSeg learns context-rich text prompts in embedding space and achieves accurate semantic alignment across modalities.
arXiv Detail & Related papers (2024-10-15T10:20:31Z) - Unleashing Text-to-Image Diffusion Models for Visual Perception [84.41514649568094]
VPD (Visual Perception with a pre-trained diffusion model) is a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks.
We show that VPD can be faster adapted to downstream visual perception tasks using the proposed VPD.
arXiv Detail & Related papers (2023-03-03T18:59:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.