PVPUFormer: Probabilistic Visual Prompt Unified Transformer for Interactive Image Segmentation
- URL: http://arxiv.org/abs/2306.06656v2
- Date: Sun, 03 Nov 2024 11:46:02 GMT
- Title: PVPUFormer: Probabilistic Visual Prompt Unified Transformer for Interactive Image Segmentation
- Authors: Xu Zhang, Kailun Yang, Jiacheng Lin, Jin Yuan, Zhiyong Li, Shutao Li
- Abstract summary: This paper proposes a simple yet effective Probabilistic Visual Prompt Unified Transformer (PVPUFormer) for interactive image segmentation.
We first propose a Probabilistic Prompt-unified Encoder (PPuE) to generate a unified one-dimensional vector by exploring both prompt and non-prompt contextual information.
We then present a Prompt-to-Pixel Contrastive (P$^2$C) loss to accurately align both prompt and pixel features, bridging the representation gap between them.
- Score: 28.033243651780214
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Integration of diverse visual prompts like clicks, scribbles, and boxes in interactive image segmentation significantly facilitates users' interaction as well as improves interaction efficiency. However, existing studies primarily encode the position or pixel regions of prompts without considering the contextual areas around them, resulting in insufficient prompt feedback, which is not conducive to performance acceleration. To tackle this problem, this paper proposes a simple yet effective Probabilistic Visual Prompt Unified Transformer (PVPUFormer) for interactive image segmentation, which allows users to flexibly input diverse visual prompts with the probabilistic prompt encoding and feature post-processing to excavate sufficient and robust prompt features for performance boosting. Specifically, we first propose a Probabilistic Prompt-unified Encoder (PPuE) to generate a unified one-dimensional vector by exploring both prompt and non-prompt contextual information, offering richer feedback cues to accelerate performance improvement. On this basis, we further present a Prompt-to-Pixel Contrastive (P$^2$C) loss to accurately align both prompt and pixel features, bridging the representation gap between them to offer consistent feature representations for mask prediction. Moreover, our approach designs a Dual-cross Merging Attention (DMA) module to implement bidirectional feature interaction between image and prompt features, generating notable features for performance improvement. A comprehensive variety of experiments on several challenging datasets demonstrates that the proposed components achieve consistent improvements, yielding state-of-the-art interactive segmentation performance. Our code is available at https://github.com/XuZhang1211/PVPUFormer.
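The abstract describes the P$^2$C loss and the DMA module only at a high level; the authors' actual implementation is in the linked repository. As a rough, non-authoritative illustration of how such components are commonly built, the PyTorch sketch below pairs a prompt-to-pixel InfoNCE-style contrastive term with a bidirectional cross-attention block. All names, tensor shapes, and the choice of positives (pixels inside each prompt's target region) are assumptions, not the paper's definitions.
```python
# Hypothetical sketch (not the authors' code): a prompt-to-pixel contrastive
# loss and a bidirectional ("dual-cross") attention block, inferred only from
# the abstract-level description above.
import torch
import torch.nn as nn
import torch.nn.functional as F


def p2c_loss(prompt_feats, pixel_feats, pos_mask, temperature=0.07):
    """Pull each prompt embedding toward the pixels it is assumed to refer to.

    prompt_feats: (B, P, C)  one embedding per visual prompt
    pixel_feats:  (B, N, C)  flattened image features (N = H * W)
    pos_mask:     (B, P, N)  1 where a pixel belongs to the prompt's target
    """
    prompt_feats = F.normalize(prompt_feats, dim=-1)
    pixel_feats = F.normalize(pixel_feats, dim=-1)
    logits = torch.einsum("bpc,bnc->bpn", prompt_feats, pixel_feats) / temperature
    log_prob = F.log_softmax(logits, dim=-1)           # softmax over pixels
    pos = pos_mask.float()
    # average log-likelihood of the positive pixels for each prompt
    loss = -(log_prob * pos).sum(-1) / pos.sum(-1).clamp(min=1)
    return loss.mean()


class DualCrossAttention(nn.Module):
    """Bidirectional feature interaction between image and prompt tokens."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.img_to_prompt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.prompt_to_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_img = nn.LayerNorm(dim)
        self.norm_prompt = nn.LayerNorm(dim)

    def forward(self, img_tokens, prompt_tokens):
        # image tokens attend to prompts, prompts attend to image tokens
        img_upd, _ = self.img_to_prompt(img_tokens, prompt_tokens, prompt_tokens)
        prm_upd, _ = self.prompt_to_img(prompt_tokens, img_tokens, img_tokens)
        return (self.norm_img(img_tokens + img_upd),
                self.norm_prompt(prompt_tokens + prm_upd))
```
The residual connections and layer norms are a standard transformer choice added here for runnability; whether PVPUFormer uses exactly this arrangement is not stated in the abstract.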
Related papers
- LoR-VP: Low-Rank Visual Prompting for Efficient Vision Model Adaptation [41.77434289193232]
We propose a novel visual prompt design, introducing Low-Rank matrix multiplication for Visual Prompting (LoR-VP); a rough sketch of this low-rank prompting idea is given after this related-papers list.
LoR-VP enables shared and patch-specific information across rows and columns of image pixels.
Experiments demonstrate significant improvements in both performance and efficiency compared to state-of-the-art visual prompting methods.
arXiv Detail & Related papers (2025-02-02T20:10:48Z)
- FRAP: Faithful and Realistic Text-to-Image Generation with Adaptive Prompt Weighting [18.708185548091716]
FRAP is a simple, yet effective approach based on adaptively adjusting the per-token prompt weights to improve prompt-image alignment and authenticity of the generated images.
We show FRAP generates images with significantly higher prompt-image alignment to prompts from complex datasets.
We also explore combining FRAP with prompt rewriting LLM to recover their degraded prompt-image alignment.
arXiv Detail & Related papers (2024-08-21T15:30:35Z)
- LVLM-empowered Multi-modal Representation Learning for Visual Place Recognition [17.388776062997813]
We try to build discriminative global representations by fusing image data and text descriptions of the visual scene.
The motivation is twofold: (1) Current Large Vision-Language Models (LVLMs) demonstrate extraordinary emergent capability in visual instruction following, and thus provide an efficient and flexible means of generating text descriptions of images.
Although promising, leveraging LVLMs to build multi-modal VPR solutions remains challenging in efficient multi-modal fusion.
arXiv Detail & Related papers (2024-07-09T10:15:31Z)
- iVPT: Improving Task-relevant Information Sharing in Visual Prompt Tuning by Cross-layer Dynamic Connection [34.20778042463112]
We propose a novel visual prompt tuning (VPT) approach, iVPT.
It incorporates a cross-layer dynamic connection (CDC) for input prompt tokens from adjacent layers, enabling effective sharing of task-relevant information.
Building upon these foundations, iVPT introduces an attentive reinforcement (AR) mechanism, by automatically identifying salient image tokens.
arXiv Detail & Related papers (2024-04-08T05:23:12Z)
- ESTextSpotter: Towards Better Scene Text Spotting with Explicit Synergy in Transformer [88.61312640540902]
We introduce an Explicit Synergy-based Text Spotting Transformer framework (ESTextSpotter).
Our model achieves explicit synergy by modeling discriminative and interactive features for text detection and recognition within a single decoder.
Experimental results demonstrate that our model significantly outperforms previous state-of-the-art methods.
arXiv Detail & Related papers (2023-08-20T03:22:23Z)
- DualCoOp++: Fast and Effective Adaptation to Multi-Label Recognition with Limited Annotations [79.433122872973]
Multi-label image recognition in the low-label regime is a task of great challenge and practical significance.
We leverage the powerful alignment between textual and visual features pretrained with millions of auxiliary image-text pairs.
We introduce an efficient and effective framework called Evidence-guided Dual Context Optimization (DualCoOp++).
arXiv Detail & Related papers (2023-08-03T17:33:20Z)
- InterFormer: Real-time Interactive Image Segmentation [80.45763765116175]
Interactive image segmentation enables annotators to efficiently perform pixel-level annotation for segmentation tasks.
The existing interactive segmentation pipeline suffers from inefficient computations of interactive models.
We propose a method named InterFormer that follows a new pipeline to address these issues.
arXiv Detail & Related papers (2023-04-06T08:57:00Z)
- Task-Oriented Multi-Modal Mutual Learning for Vision-Language Models [52.3032592038514]
We propose a class-aware text prompt to enrich generated prompts with label-related image information.
We achieve an average improvement of 4.03% on new classes and 3.19% on harmonic-mean over eleven classification benchmarks.
arXiv Detail & Related papers (2023-03-30T06:02:40Z)
- Interactive Face Video Coding: A Generative Compression Framework [18.26476468644723]
We propose a novel framework for Interactive Face Video Coding (IFVC), which allows humans to interact with the intrinsic visual representations instead of the signals.
The proposed solution enjoys several distinct advantages, including ultra-compact representation, low delay interaction, and vivid expression and headpose animation.
arXiv Detail & Related papers (2023-02-20T11:24:23Z)
- Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model [39.722927180264584]
We propose a novel Dual-modality Prompt Tuning (DPT) paradigm through learning text and visual prompts simultaneously.
To make the final image feature concentrate more on the target visual concept, a Class-Aware Visual Prompt Tuning scheme is proposed.
arXiv Detail & Related papers (2022-08-17T15:06:36Z)
- Disentangled Representation Learning for Text-Video Retrieval [51.861423831566626]
Cross-modality interaction is a critical component in Text-Video Retrieval (TVR).
We study the interaction paradigm in depth, where we find that its computation can be split into two terms.
We propose a disentangled framework to capture a sequential and hierarchical representation.
arXiv Detail & Related papers (2022-03-14T13:55:33Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms the state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
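For the LoR-VP entry above, the summary only mentions low-rank matrix multiplication shared across rows and columns of image pixels. The sketch below shows one plausible reading of that idea: a visual prompt formed as the product of learnable column and row factors and added to the input image. It is not the LoR-VP implementation; the rank, per-channel factorization, and additive application are all assumptions made here for illustration.
```python
# Hypothetical low-rank visual prompt (a plausible reading of the LoR-VP
# summary, not its actual code): prompt = col @ row, added to the image.
import torch
import torch.nn as nn


class LowRankVisualPrompt(nn.Module):
    def __init__(self, channels=3, height=224, width=224, rank=4):
        super().__init__()
        # column factor per channel: (C, H, r); row factor per channel: (C, r, W)
        self.col = nn.Parameter(torch.randn(channels, height, rank) * 0.02)
        self.row = nn.Parameter(torch.randn(channels, rank, width) * 0.02)

    def forward(self, images):                        # images: (B, C, H, W)
        prompt = torch.matmul(self.col, self.row)     # (C, H, W), rank <= r
        return images + prompt.unsqueeze(0)           # broadcast over the batch
```
In use, such a module would simply wrap the input of a frozen backbone, e.g. `logits = backbone(LowRankVisualPrompt()(images))`, so that only the two small factor matrices are trained.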
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.