Related papers: R2SM: Referring and Reasoning for Selective Masks

URL: http://arxiv.org/abs/2506.01795v1
Date: Mon, 02 Jun 2025 15:36:31 GMT
Title: R2SM: Referring and Reasoning for Selective Masks
Authors: Yu-Lin Shih, Wei-En Tai, Cheng Sun, Yu-Chiang Frank Wang, Hwann-Tzong Chen
Abstract summary: We introduce a new task, Referring and Reasoning for Selective Masks (R2SM). This task extends text-guided segmentation by incorporating mask-type selection driven by user intent. We present the R2SM dataset, constructed by augmenting annotations of COCOA-cls, D2SA, and MUVA.
Score: 35.150696061791805
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: We introduce a new task, Referring and Reasoning for Selective Masks (R2SM), which extends text-guided segmentation by incorporating mask-type selection driven by user intent. This task challenges vision-language models to determine whether to generate a modal (visible) or amodal (complete) segmentation mask based solely on natural language prompts. To support the R2SM task, we present the R2SM dataset, constructed by augmenting annotations of COCOA-cls, D2SA, and MUVA. The R2SM dataset consists of both modal and amodal text queries, each paired with the corresponding ground-truth mask, enabling model finetuning and evaluation for the ability to segment images as per user intent. Specifically, the task requires the model to interpret whether a given prompt refers to only the visible part of an object or to its complete shape, including occluded regions, and then produce the appropriate segmentation. For example, if a prompt explicitly requests the whole shape of a partially hidden object, the model is expected to output an amodal mask that completes the occluded parts. In contrast, prompts without explicit mention of hidden regions should generate standard modal masks. The R2SM benchmark provides a challenging and insightful testbed for advancing research in multimodal reasoning and intent-aware segmentation.
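The modal/amodal distinction described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not taken from the R2SM paper or dataset: the toy 1x6 masks, the helper names `amodal_from_parts`, `select_target`, and `iou`, and the use of IoU as the score (a common segmentation metric, not necessarily the one the authors report).

```python
from typing import List

Mask = List[List[int]]  # binary mask as nested lists of 0/1 (illustrative format)


def amodal_from_parts(visible: Mask, occluded: Mask) -> Mask:
    """Amodal (complete) mask = visible pixels plus occluded pixels."""
    return [[v | o for v, o in zip(v_row, o_row)]
            for v_row, o_row in zip(visible, occluded)]


def select_target(query_type: str, modal: Mask, amodal: Mask) -> Mask:
    """Pick the ground-truth mask matching the intent of the text query.

    `query_type` is a hypothetical label ("modal" or "amodal"); in the task
    itself the model must infer this intent from the prompt alone.
    """
    if query_type not in ("modal", "amodal"):
        raise ValueError(f"unknown query type: {query_type!r}")
    return amodal if query_type == "amodal" else modal


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two binary masks of equal shape."""
    pairs = [(p, t) for p_row, t_row in zip(pred, target)
             for p, t in zip(p_row, t_row)]
    inter = sum(p & t for p, t in pairs)
    union = sum(p | t for p, t in pairs)
    return inter / union if union else 1.0


# Toy example: an object spanning columns 1-4, with columns 3-4 hidden.
modal = [[0, 1, 1, 0, 0, 0]]      # visible part only
occluded = [[0, 0, 0, 1, 1, 0]]   # part hidden behind an occluder
amodal = amodal_from_parts(modal, occluded)

# A model that answers an amodal query with the modal mask is penalised.
target = select_target("amodal", modal, amodal)
print(iou(modal, target))
```

In this toy case, returning the visible-only mask for a query that asks for the whole shape covers two of the four target pixels, which is why intent-aware mask selection affects the score.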

Related papers

Evaluating SAM2 for Video Semantic Segmentation [60.157605818225186]
The Segment Anything Model 2 (SAM2) has proven to be a powerful foundation model for promptable visual object segmentation in both images and videos. This paper explores the extension of SAM2 to dense Video Semantic Segmentation (VSS). Our experiments suggest that leveraging SAM2 enhances overall performance in VSS, primarily due to its precise predictions of object boundaries.
arXiv Detail & Related papers (2025-12-01T15:15:16Z)
Text4Seg++: Advancing Image Segmentation via Generative Language Modeling [52.07442359419673]
We propose a novel text-as-mask paradigm that casts image segmentation as a text generation problem. The key innovation is semantic descriptors, a new textual representation of segmentation masks. Experiments on natural and remote sensing datasets show that Text4Seg++ consistently outperforms state-of-the-art models.
arXiv Detail & Related papers (2025-09-08T04:07:14Z)
Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval [13.296362770269452]
Mask-aware TIR (MaTIR) aims to find relevant images based on a textual query. We propose a two-stage framework, comprising a first stage for segmentation-aware image retrieval and a second stage for reranking and object grounding. We evaluate our approach on COCO and D$^3$ datasets, demonstrating significant improvements in both retrieval accuracy and segmentation quality over previous methods.
arXiv Detail & Related papers (2025-06-28T12:19:49Z)
Refer to Anything with Vision-Language Prompts [43.00233077605867]
"Refer to Any Segmentation Mask Group" (RAS) augments segmentation models with complex multimodal interactions and comprehension. We demonstrate superior performance of RAS on our new ORES task, as well as classic referring expression segmentation (RES) and generalized referring expression segmentation (GRES) tasks.
arXiv Detail & Related papers (2025-06-05T17:59:51Z)
LlamaSeg: Image Segmentation via Autoregressive Mask Generation [46.17509085054758]
We present LlamaSeg, a visual autoregressive framework that unifies multiple image segmentation tasks via natural language instructions. We reformulate image segmentation as a visual generation problem, representing masks as "visual" tokens and employing a LLaMA-style Transformer to predict them directly from image inputs.
arXiv Detail & Related papers (2025-05-26T02:22:41Z)
Prompt-Guided Mask Proposal for Two-Stage Open-Vocabulary Segmentation [21.30568336073013]
We tackle the challenge of open-vocabulary segmentation, where we need to identify objects from a wide range of categories in different environments. Existing methods often use multi-modal models like CLIP, which combine image and text features in a shared embedding space. We propose Prompt-guided Mask Proposal (PMP) where the mask generator takes the input text prompts and generates masks guided by these prompts.
arXiv Detail & Related papers (2024-12-13T17:22:50Z)
Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation [50.433911327489554]
The goal of referring remote sensing image segmentation (RRSIS) is to generate a pixel-level mask of the target object identified by the referring expression. To address the challenges of this task, a novel RRSIS framework is proposed, termed the cross-modal bidirectional interaction model (CroBIM). To further foster the research of RRSIS, we also construct RISBench, a new large-scale benchmark dataset comprising 52,472 image-language-label triplets.
arXiv Detail & Related papers (2024-10-11T08:28:04Z)
OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring Modeling [80.85164509232261]
We propose OneRef, a minimalist referring framework built on the modality-shared one-tower transformer. To model the referential relationship, we introduce a novel MVLM paradigm called Mask Referring Modeling (MRefM). Within MRefM, we propose a referring-aware dynamic image masking strategy that is aware of the referred region.
arXiv Detail & Related papers (2024-10-10T15:18:19Z)
Pluralistic Salient Object Detection [108.74650817891984]
We introduce pluralistic salient object detection (PSOD), a novel task aimed at generating multiple plausible salient segmentation results for a given input image. We present two new SOD datasets, "DUTS-MM" and "DUTS-MQ", along with newly designed evaluation metrics.
arXiv Detail & Related papers (2024-09-04T01:38:37Z)
DQFormer: Towards Unified LiDAR Panoptic Segmentation with Decoupled Queries [14.435906383301555]
We propose a novel framework dubbed DQFormer to implement semantic and instance segmentation in a unified workflow. Specifically, we design a decoupled query generator to propose informative queries with semantics by localizing things/stuff positions. We also introduce a query-oriented mask decoder to decode corresponding segmentation masks.
arXiv Detail & Related papers (2024-08-28T14:14:33Z)
ColorMAE: Exploring data-independent masking strategies in Masked AutoEncoders [53.3185750528969]
Masked AutoEncoders (MAE) have emerged as a robust self-supervised framework. We introduce a data-independent method, termed ColorMAE, which generates different binary mask patterns by filtering random noise. We demonstrate our strategy's superiority in downstream tasks compared to random masking.
arXiv Detail & Related papers (2024-07-17T22:04:00Z)
GLaMM: Pixel Grounding Large Multimodal Model [57.91763410032292]
We present Grounding LMM (GLaMM), the first model that can generate natural language responses seamlessly intertwined with corresponding object segmentation masks. GLaMM is flexible enough to accept both textual and optional visual prompts (region of interest) as input. Our proposed GCG task requires densely grounded concepts in natural scenes at a large scale.
arXiv Detail & Related papers (2023-11-06T18:59:57Z)
Multimodal Diffusion Segmentation Model for Object Segmentation from Manipulation Instructions [0.0]
We develop a model that comprehends a natural language instruction and generates a segmentation mask for the target everyday object. We build a new dataset based on the well-known Matterport3D and REVERIE datasets. The performance of MDSM surpassed that of the baseline method by a large margin of +10.13 mean IoU.
arXiv Detail & Related papers (2023-07-17T16:07:07Z)
Mask to reconstruct: Cooperative Semantics Completion for Video-text Retrieval [19.61947785487129]
We propose Mask for Semantics Completion (MASCOT), based on semantic-based masked modeling. MASCOT achieves state-of-the-art performance on four major text-video retrieval benchmarks.
arXiv Detail & Related papers (2023-05-13T12:31:37Z)
A Simple Framework for Open-Vocabulary Segmentation and Detection [85.21641508535679]
We present OpenSeeD, a simple Open-vocabulary Segmentation and Detection framework that jointly learns from different segmentation and detection datasets. We first introduce a pre-trained text encoder to encode all the visual concepts in two tasks and learn a common semantic space for them. After pre-training, our model exhibits competitive or stronger zero-shot transferability for both segmentation and detection.
arXiv Detail & Related papers (2023-03-14T17:58:34Z)

This list is automatically generated from the titles and abstracts of the papers on this site.