UGround: Towards Unified Visual Grounding with Unrolled Transformers
- URL: http://arxiv.org/abs/2510.03853v1
- Date: Sat, 04 Oct 2025 15:56:52 GMT
- Title: UGround: Towards Unified Visual Grounding with Unrolled Transformers
- Authors: Rui Qian, Xin Yin, Chuanhang Deng, Zhiyuan Peng, Jian Xiong, Wei Zhai, Dejing Dou,
- Abstract summary: We present UGround, a Unified visual Grounding paradigm that dynamically selects intermediate layers across unrolled transformers as ``mask as prompt''. Central to UGround is Policy-Prompted Masking, which comprises two key components: Stochastic Skip Connection (SSC) and Mask as Prompt (MasP).
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present UGround, a \textbf{U}nified visual \textbf{Ground}ing paradigm that dynamically selects intermediate layers across \textbf{U}nrolled transformers as ``mask as prompt'', diverging from the prevailing pipeline that leverages the fixed last hidden layer as ``\texttt{<SEG>} as prompt''. UGround addresses two primary challenges posed by the prevailing paradigm: (1) its reliance on the fixed last hidden layer, which sequentially amplifies cumulative errors arising from layer-by-layer propagation without intermediate correction, and (2) its use of \texttt{<SEG>} as a prompt, which implicitly projects textual embeddings into visual space without explicit spatial cues (\eg, coordinates). Central to UGround is Policy-Prompted Masking, which comprises two key components: Stochastic Skip Connection (SSC) and Mask as Prompt (MasP). SSC is a reinforcement learning policy that, via stochastic sampling, allows each \texttt{<SEG>} token to slide across unrolled transformer layers, enabling dynamic layer selection at which it connects to the vision model (\eg, SAM) in a skip-connection fashion. Given the selected hidden layer, MasP uses the similarity map derived from the \texttt{<SEG>} token and image tokens as a soft logit mask to prompt SAM for mask generation, offering explicit spatial cues through its activation regions. To validate the effectiveness of UGround, we, for the first time, have unified visual grounding within a single framework from an attribute perspective, spanning from traditional refer expression segmentation to newly proposed reasoning segmentation, single-target to multi-target, positive query to false premise (empty target). All codes and models are publicly available at \href{https://github.com/rui-qian/UGround}{https://github.com/rui-qian/UGround}.
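A minimal toy sketch of the Policy-Prompted Masking idea described in the abstract, using random arrays in place of real LLM hidden states and treating the resulting logit map as what would be handed to a promptable segmenter such as SAM. All function and parameter names here are illustrative assumptions, not taken from the UGround codebase:

```python
import numpy as np

def policy_prompted_mask(hidden_states, image_tokens, policy_logits, grid_hw, rng):
    """Toy sketch of Policy-Prompted Masking (PPM).

    hidden_states: (num_layers, d) -- the <SEG> token embedding at each
        unrolled transformer layer (stand-in for real LLM hidden states).
    image_tokens: (h*w, d) -- visual token embeddings on a token grid.
    policy_logits: (num_layers,) -- SSC policy scores over layers.
    """
    # Stochastic Skip Connection (SSC): sample a layer index from the
    # policy's categorical distribution instead of always taking the last
    # hidden layer.
    probs = np.exp(policy_logits - policy_logits.max())
    probs /= probs.sum()
    layer = rng.choice(len(policy_logits), p=probs)
    seg = hidden_states[layer]

    # Mask as Prompt (MasP): cosine similarity between the selected <SEG>
    # embedding and every image token yields a soft logit mask whose
    # activation regions carry explicit spatial cues.
    seg = seg / (np.linalg.norm(seg) + 1e-8)
    img = image_tokens / (np.linalg.norm(image_tokens, axis=-1, keepdims=True) + 1e-8)
    h, w = grid_hw
    soft_mask = (img @ seg).reshape(h, w)  # cosine values in [-1, 1]
    return layer, soft_mask

rng = np.random.default_rng(0)
num_layers, d, h, w = 6, 16, 4, 4
layer, mask = policy_prompted_mask(
    rng.normal(size=(num_layers, d)), rng.normal(size=(h * w, d)),
    rng.normal(size=num_layers), (h, w), rng)
print(layer, mask.shape)
```

In the paper's setting the sampled layer would feed the vision model in a skip-connection fashion and the policy would be trained with reinforcement learning; here the sampling step only illustrates the mechanism.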
Related papers
- Seg-VAR: Image Segmentation with Visual Autoregressive Modeling [60.79579744943664]
We propose a novel framework that rethinks segmentation as a conditional autoregressive mask generation problem. This is achieved by replacing discriminative learning with a latent learning process. Our method incorporates three core components: (1) an image encoder generating latent priors from input images, (2) a spatial-aware seglat (a latent expression of the segmentation mask) encoder that maps segmentation masks into discrete latent tokens, and (3) a decoder reconstructing masks from these latents.
arXiv Detail & Related papers (2025-11-16T13:36:19Z)
- A Simple yet Powerful Instance-Aware Prompting Framework for Training-free Camouflaged Object Segmentation [6.712332323439369]
We propose a training-free camouflaged object segmentation pipeline that explicitly converts a task-generic prompt into fine-grained instance masks. The proposed Instance-Aware Prompting Framework (IAPF) significantly surpasses existing state-of-the-art training-free COS methods.
arXiv Detail & Related papers (2025-08-09T09:35:32Z)
- Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder [5.57393627015653]
Video segmenter and grounder approaches, exemplified by Sa2VA, directly fuse features within segmentation models. This often results in an undesirable entanglement of dynamic visual information and static semantics, thereby degrading segmentation accuracy. We propose DeSa2VA, a decoupling-enhanced prompting scheme that integrates text pre-training and a linear decoupling module to address the information-processing limitations inherent in SAM-2.
arXiv Detail & Related papers (2025-06-28T13:30:36Z)
- Stepwise Decomposition and Dual-stream Focus: A Novel Approach for Training-free Camouflaged Object Segmentation [9.862714096455175]
We propose a novel training-free test-time adaptation framework that synergizes Region-constrained Dual-stream Visual Prompting (RDVP) with a Multimodal Stepwise Decomposition Chain of Thought (MSD-CoT). RDVP injects spatial constraints into visual prompting and independently samples visual prompts for foreground and background points, effectively mitigating semantic discrepancy.
arXiv Detail & Related papers (2025-06-07T14:50:26Z)
- Pro2SAM: Mask Prompt to SAM with Grid Points for Weakly Supervised Object Localization [54.91271106816616]
We propose an innovative mask-prompt-to-SAM (Pro2SAM) network with grid points for the WSOL task. First, we devise a Global Token Transformer (GTFormer) to generate a coarse-grained foreground map as a flexible mask prompt. Second, we deliver grid points as dense prompts into SAM to maximize the probability of the foreground mask.
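The two-step recipe above (a coarse foreground map, then grid points as dense prompts) can be sketched roughly as follows. This is a generic illustration, not the Pro2SAM implementation; the thresholding heuristic and all names are assumptions:

```python
import numpy as np

def grid_point_prompts(foreground_map, stride, threshold=0.5):
    """Hypothetical sketch: turn a coarse foreground probability map
    (e.g., from a transformer like GTFormer) into a grid of point
    prompts, keeping only points that land on likely-foreground pixels.
    The kept (y, x) points would then be fed to a promptable segmenter
    such as SAM as dense point prompts."""
    h, w = foreground_map.shape
    ys = np.arange(stride // 2, h, stride)  # regular grid rows
    xs = np.arange(stride // 2, w, stride)  # regular grid columns
    return [(int(y), int(x))
            for y in ys for x in xs
            if foreground_map[y, x] >= threshold]

# Toy 8x8 map with a foreground square in the middle.
fg = np.zeros((8, 8))
fg[2:6, 2:6] = 1.0
pts = grid_point_prompts(fg, stride=2)
print(len(pts))  # prints 4
```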
arXiv Detail & Related papers (2025-05-08T02:44:53Z)
- High-Quality Mask Tuning Matters for Open-Vocabulary Segmentation [109.19165503929992]
We present MaskCLIP++, which uses ground-truth masks instead of generated masks to enhance the mask classification capability of CLIP. After low-cost fine-tuning, MaskCLIP++ significantly improves mask classification performance on multi-domain datasets. We achieve performance improvements of +1.7, +2.3, +2.1, +3.1, and +0.3 mIoU on the A-847, PC-459, A-150, PC-59, and PAS-20 datasets, respectively.
arXiv Detail & Related papers (2024-12-16T05:44:45Z)
- Automatic Generation of Semantic Parts for Face Image Synthesis [7.728916126705043]
We describe a network architecture to address the problem of automatically manipulating or generating the shape of object classes in semantic segmentation masks.
Our proposed model allows embedding the mask class-wise into a latent space where each class embedding can be independently edited.
We report quantitative and qualitative results on the Celeb-MaskHQ dataset, which show our model can both faithfully reconstruct and modify a segmentation mask at the class level.
arXiv Detail & Related papers (2023-07-11T15:01:42Z)
- Dynamic Prototype Mask for Occluded Person Re-Identification [88.7782299372656]
Existing methods mainly address this issue by employing body clues provided by an extra network to distinguish the visible part.
We propose a novel Dynamic Prototype Mask (DPM) based on two self-evident priors.
Under this condition, the occluded representation could be well aligned in a selected subspace spontaneously.
arXiv Detail & Related papers (2022-07-19T03:31:13Z)
- SeCGAN: Parallel Conditional Generative Adversarial Networks for Face Editing via Semantic Consistency [50.04141606856168]
We propose a label-guided cGAN for editing face images utilising semantic information without the need to specify target semantic masks.
SeCGAN has two branches of generators and discriminators operating in parallel, with one trained to translate RGB images and the other for semantic masks.
Our results on CelebA and CelebA-HQ demonstrate that our approach is able to generate facial images with more accurate attributes.
arXiv Detail & Related papers (2021-11-17T18:54:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.