Towards Generalizable Referring Image Segmentation via Target Prompt and
Visual Coherence
- URL: http://arxiv.org/abs/2312.00452v1
- Date: Fri, 1 Dec 2023 09:31:24 GMT
- Title: Towards Generalizable Referring Image Segmentation via Target Prompt and
Visual Coherence
- Authors: Yajie Liu, Pu Ge, Haoxiang Ma, Shichao Fan, Qingjie Liu, Di Huang,
Yunhong Wang
- Abstract summary: Referring image segmentation (RIS) aims to segment objects in an image conditioned on free-form text descriptions.
We present a novel RIS approach, which substantially improves generalization by addressing two dilemmas: varied text expressions and unseen visual entities.
Specifically, to deal with unconstrained texts, we propose to boost a given expression with an explicit and crucial prompt, which complements the expression in a unified context.
- Score: 48.659338080020746
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Referring image segmentation (RIS) aims to segment objects in an image
conditioned on free-form text descriptions. Despite the overwhelming progress, it
remains challenging for current approaches to perform well on cases with various
text expressions or with unseen visual entities, limiting their further
application. In this paper, we present a novel RIS approach, which substantially
improves generalization by addressing the two dilemmas mentioned above.
Specifically, to deal with unconstrained texts, we
propose to boost a given expression with an explicit and crucial prompt, which
complements the expression in a unified context, facilitating target capturing
in the presence of linguistic style changes. Furthermore, we introduce a
multi-modal fusion aggregation module with visual guidance from a powerful
pretrained model, leveraging spatial relations and pixel coherence to handle
incomplete target masks and false-positive irregular clumps that often
appear on unseen visual entities. Extensive experiments are conducted in the
zero-shot cross-dataset settings and the proposed approach achieves consistent
gains compared to the state-of-the-art, e.g., 4.15%, 5.45%, and 4.64% mIoU
increases on RefCOCO, RefCOCO+, and ReferIt respectively, demonstrating its
effectiveness. Additionally, the results on GraspNet-RIS show that our approach
also generalizes well to new scenarios with large domain shifts.
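To make the two mechanisms above concrete, here is a minimal, illustrative sketch: a hypothetical prompt template that complements an expression with an explicit target, and an affinity-based refinement that propagates mask scores through pretrained pixel features, one plausible reading of "leveraging spatial relations and pixel coherence". The template, function names, and tensor shapes are assumptions, not the paper's actual implementation.

```python
# Illustrative sketch only; names, the prompt template, and shapes are assumed.
import torch
import torch.nn.functional as F

def boost_expression(expression: str, target: str) -> str:
    # Complement a free-form expression with an explicit target prompt so that
    # differently styled expressions share a unified context (hypothetical template).
    return f"the target is {target}. {expression}"

def coherence_refine(mask_logits: torch.Tensor, pixel_feats: torch.Tensor,
                     iters: int = 3) -> torch.Tensor:
    # mask_logits: (HW,) raw per-pixel scores; pixel_feats: (HW, C) features
    # from a pretrained visual model. Propagating scores along feature
    # affinities fills holes in the target mask and suppresses isolated
    # false-positive clumps, since visually coherent pixels end up with
    # similar scores.
    affinity = F.softmax(pixel_feats @ pixel_feats.t()
                         / pixel_feats.shape[-1] ** 0.5, dim=-1)  # (HW, HW)
    for _ in range(iters):
        mask_logits = affinity @ mask_logits
    return mask_logits
```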
Related papers
- Pseudo-RIS: Distinctive Pseudo-supervision Generation for Referring Image Segmentation [10.958014189747356]
We propose a new framework that automatically generates high-quality segmentation masks with their referring expressions as pseudo-supervision for referring image segmentation (RIS).
Our method significantly outperforms both weakly and zero-shot SoTA methods on the RIS benchmark datasets.
It also surpasses fully supervised methods in unseen domains, proving its capability to tackle the open-world challenge within RIS.
arXiv Detail & Related papers (2024-07-10T07:14:48Z)
- Language-Driven Visual Consensus for Zero-Shot Semantic Segmentation [114.72734384299476]
We propose a Language-Driven Visual Consensus (LDVC) approach, fostering improved alignment of semantic and visual information.
We leverage class embeddings as anchors due to their discrete and abstract nature, steering vision features toward class embeddings.
Our approach significantly boosts the capacity of segmentation models for unseen classes.
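The anchoring idea lends itself to a short sketch: pull vision features toward their matched class embedding with a similarity-based classification loss. The loss form, temperature, and names below are assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def anchor_alignment_loss(patch_feats: torch.Tensor, class_embeds: torch.Tensor,
                          labels: torch.Tensor, temp: float = 0.07) -> torch.Tensor:
    # patch_feats: (N, D) vision features; class_embeds: (K, D) text anchors;
    # labels: (N,) class index per patch. Cosine similarity to every anchor,
    # then cross-entropy steers each feature toward its class embedding.
    sim = F.normalize(patch_feats, dim=-1) @ F.normalize(class_embeds, dim=-1).t()
    return F.cross_entropy(sim / temp, labels)
```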
arXiv Detail & Related papers (2024-03-13T11:23:55Z)
- IDRNet: Intervention-Driven Relation Network for Semantic Segmentation [34.09179171102469]
Co-occurrent visual patterns suggest that pixel relation modeling facilitates dense prediction tasks.
Despite the impressive results, existing paradigms often suffer from inadequate or ineffective contextual information aggregation.
We propose a novel Intervention-Driven Relation Network (IDRNet).
arXiv Detail & Related papers (2023-10-16T18:37:33Z)
- Prompting Diffusion Representations for Cross-Domain Semantic Segmentation [101.04326113360342]
Diffusion pretraining achieves extraordinary domain generalization results for semantic segmentation.
We introduce a scene prompt and a prompt randomization strategy to help further disentangle the domain-invariant information when training the segmentation head.
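A prompt randomization strategy can be sketched in a few lines: draw a scene prompt at random per training step so the segmentation head cannot latch onto one fixed, domain-specific context. The prompt pool and template below are hypothetical; the paper's actual prompts and wiring are not reproduced here.

```python
import random

# Hypothetical scene-prompt pool; the paper's actual prompts are not given here.
SCENE_PROMPTS = [
    "a photo of a city street",
    "a photo of a driving scene at night",
    "a photo of an urban scene in fog",
]

def randomized_prompt(class_name: str) -> str:
    # Re-sampled every training step, so the head sees the class under many
    # scene contexts and keeps only the domain-invariant part.
    return f"{random.choice(SCENE_PROMPTS)}, with a {class_name}"
```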
arXiv Detail & Related papers (2023-07-05T09:28:25Z)
- Towards Effective Image Manipulation Detection with Proposal Contrastive Learning [61.5469708038966]
We propose Proposal Contrastive Learning (PCL) for effective image manipulation detection.
Our PCL uses a two-stream architecture, extracting two types of global features from RGB and noise views respectively.
Our PCL can be easily adapted to unlabeled data in practice, which can reduce manual labeling costs and promote more generalizable features.
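A rough sketch of the two-stream contrastive idea, assuming a simple high-pass residual as the noise view and matched proposals across views as positives (both assumptions; PCL's actual noise extraction and pairing may differ):

```python
import torch
import torch.nn.functional as F

def noise_view(image: torch.Tensor) -> torch.Tensor:
    # image: (B, 3, H, W). A simple high-pass residual stands in for the
    # "noise view"; PCL itself may extract noise differently.
    blurred = F.avg_pool2d(image, 3, stride=1, padding=1)
    return image - blurred

def proposal_contrastive_loss(rgb_feats: torch.Tensor,
                              noise_feats: torch.Tensor,
                              temp: float = 0.1) -> torch.Tensor:
    # rgb_feats, noise_feats: (N, D) global features of the same N proposals
    # from the two streams; the matching proposal across views is the positive,
    # all other proposals are negatives (standard InfoNCE).
    rgb = F.normalize(rgb_feats, dim=-1)
    noise = F.normalize(noise_feats, dim=-1)
    logits = rgb @ noise.t() / temp                       # (N, N)
    targets = torch.arange(rgb.size(0), device=rgb.device)
    return F.cross_entropy(logits, targets)
```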
arXiv Detail & Related papers (2022-10-16T13:30:13Z)
- Two-stage Visual Cues Enhancement Network for Referring Image Segmentation [89.49412325699537]
Referring Image Segmentation (RIS) aims at segmenting the target object from an image referred to by a given natural language expression.
In this paper, we tackle this problem by devising a Two-stage Visual cues enhancement Network (TV-Net).
Through the two-stage enhancement, our proposed TV-Net achieves better performance in learning fine-grained matching behaviors between the natural language expression and the image.
arXiv Detail & Related papers (2021-10-09T02:53:39Z)
- Exploring Explicit and Implicit Visual Relationships for Image Captioning [11.82805641934772]
In this paper, we explore explicit and implicit visual relationships to enrich region-level representations for image captioning.
Explicitly, we build semantic graph over object pairs and exploit gated graph convolutional networks (Gated GCN) to selectively aggregate local neighbors' information.
Implicitly, we draw global interactions among the detected objects through region-based bidirectional encoder representations from transformers.
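As one concrete reading of the gated aggregation over the semantic graph, a generic gated GCN step over detected regions might look like the sketch below; the exact gating and message functions are assumptions, not the paper's formulation.

```python
import torch
import torch.nn as nn

class GatedGraphAgg(nn.Module):
    # One gated aggregation step over a semantic graph of detected objects:
    # each region mixes in mean neighbor messages, scaled by a learned gate.
    def __init__(self, dim: int):
        super().__init__()
        self.msg = nn.Linear(dim, dim)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (N, D) region features; adj: (N, N) float 0/1 semantic-graph edges.
        neigh = adj @ self.msg(x) / adj.sum(-1, keepdim=True).clamp(min=1)
        g = self.gate(torch.cat([x, neigh], dim=-1))  # per-dim gate in [0, 1]
        return x + g * neigh                          # gated residual update
```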
arXiv Detail & Related papers (2021-05-06T01:47:51Z)
- MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding [74.33171794972688]
We present algorithms to model phrase-object relevance by leveraging fine-grained visual representations and visually-aware language representations.
Experiments conducted on the widely-adopted Flickr30k dataset show a significant improvement over existing weakly-supervised methods.
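Phrase-object relevance reduces to a similarity table between phrase and region embeddings; a minimal sketch under assumed shapes (the temperature and the downstream weakly-supervised matching objective are assumptions, not MAF's exact formulation):

```python
import torch
import torch.nn.functional as F

def phrase_object_relevance(phrase_feats: torch.Tensor,
                            object_feats: torch.Tensor) -> torch.Tensor:
    # phrase_feats: (P, D) visually-aware phrase embeddings;
    # object_feats: (O, D) fine-grained region features.
    # Softmax over objects gives, per phrase, a distribution over which region
    # it most plausibly grounds to; the weak supervision then comes from an
    # image-sentence matching objective, not shown here.
    sim = F.normalize(phrase_feats, dim=-1) @ F.normalize(object_feats, dim=-1).t()
    return F.softmax(sim / 0.1, dim=-1)  # (P, O) relevance scores
```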
arXiv Detail & Related papers (2020-10-12T00:43:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.