Extending CLIP's Image-Text Alignment to Referring Image Segmentation
- URL: http://arxiv.org/abs/2306.08498v2
- Date: Sun, 7 Apr 2024 07:50:37 GMT
- Title: Extending CLIP's Image-Text Alignment to Referring Image Segmentation
- Authors: Seoyeon Kim, Minguk Kang, Dongwon Kim, Jaesik Park, Suha Kwak
- Abstract summary: Referring Image Segmentation (RIS) is a cross-modal task that aims to segment an instance described by a natural language expression.
We propose RISCLIP, a novel framework that effectively leverages the cross-modal nature of CLIP for RIS.
- Score: 48.26552693472177
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring Image Segmentation (RIS) is a cross-modal task that aims to segment an instance described by a natural language expression. Recent methods leverage large-scale pretrained unimodal models as backbones along with fusion techniques for joint reasoning across modalities. However, the inherent cross-modal nature of RIS raises questions about the effectiveness of unimodal backbones. We propose RISCLIP, a novel framework that effectively leverages the cross-modal nature of CLIP for RIS. Observing CLIP's inherent alignment between image and text features, we capitalize on this starting point and introduce simple but strong modules that enhance unimodal feature extraction and leverage rich alignment knowledge in CLIP's image-text shared-embedding space. RISCLIP exhibits outstanding results on all three major RIS benchmarks and also outperforms previous CLIP-based methods, demonstrating the efficacy of our strategy in extending CLIP's image-text alignment to RIS.
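To make the starting point concrete, the sketch below shows, under assumed shapes and hypothetical projection heads (`patch_proj`, `text_proj`), how per-patch features from a CLIP-style image encoder can be scored against a referring expression in the shared image-text embedding space to produce a coarse mask logit map. It illustrates the general idea of dense image-text alignment, not RISCLIP's actual modules.

```python
# Minimal sketch: score CLIP-style image patches against a referring expression
# in the shared embedding space. Shapes, the projection heads, and the patch-grid
# size are illustrative assumptions, not RISCLIP's implementation.
import torch
import torch.nn.functional as F

def patch_text_alignment(patch_feats, text_feat, patch_proj, text_proj, grid_hw):
    """Return a coarse mask logit map from patch-text cosine similarity.

    patch_feats: (B, N, C_img) per-patch features from a CLIP-style image encoder
    text_feat:   (B, C_txt)    pooled sentence embedding from the text encoder
    grid_hw:     (H, W) with H * W == N, the encoder's patch grid
    """
    # Project both modalities into the shared space and L2-normalize,
    # mirroring how CLIP aligns whole images with captions.
    v = F.normalize(patch_proj(patch_feats), dim=-1)   # (B, N, D)
    t = F.normalize(text_proj(text_feat), dim=-1)      # (B, D)
    # Cosine similarity of every patch with the expression gives a coarse
    # "which patches match the text" map; a real model would refine and upsample it.
    sim = torch.einsum("bnd,bd->bn", v, t)             # (B, N)
    H, W = grid_hw
    return sim.view(-1, 1, H, W)                       # (B, 1, H, W) mask logits

if __name__ == "__main__":
    B, N, C_img, C_txt, D = 2, 196, 768, 512, 512      # ViT-B/16-like sizes (assumed)
    logits = patch_text_alignment(
        torch.randn(B, N, C_img), torch.randn(B, C_txt),
        torch.nn.Linear(C_img, D), torch.nn.Linear(C_txt, D),
        grid_hw=(14, 14),
    )
    print(logits.shape)  # torch.Size([2, 1, 14, 14])
```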
Related papers
- Fully Aligned Network for Referring Image Segmentation [22.40918154209717]
This paper focuses on the Referring Image Segmentation (RIS) task, which aims to segment objects from an image based on a given language description.
The critical problem of RIS is achieving fine-grained alignment between different modalities to recognize and segment the target object.
We present a Fully Aligned Network (FAN) that follows four cross-modal interaction principles.
arXiv Detail & Related papers (2024-09-29T06:13:34Z)
- Elevating All Zero-Shot Sketch-Based Image Retrieval Through Multimodal Prompt Learning [11.033050922826934]
We introduce SpLIP, a novel multi-modal prompt learning scheme designed to operate with frozen CLIP backbones.
SpLIP implements a bi-directional prompt-sharing strategy that enables mutual knowledge exchange between CLIP's visual and textual encoders.
We propose two innovative strategies for further refining the embedding space.
arXiv Detail & Related papers (2024-07-05T01:30:42Z)
- Curriculum Point Prompting for Weakly-Supervised Referring Image Segmentation [13.924553294859315]
Point PrompTing (PPT) is a point generator that harnesses CLIP's text-image alignment capability and SAM's powerful mask generation ability.
PPT significantly and consistently outperforms prior weakly supervised techniques on mIoU.
arXiv Detail & Related papers (2024-04-18T08:46:12Z)
- Symmetrical Linguistic Feature Distillation with CLIP for Scene Text Recognition [77.93678598476149]
We establish a novel Symmetrical Linguistic Feature Distillation framework (named CLIP-OCR).
By cascading the CLIP image encoder with the reversed CLIP text encoder, a symmetrical structure is built with an image-to-text feature flow.
Extensive experiments demonstrate the effectiveness of CLIP-OCR with 93.8% average accuracy on six popular STR benchmarks.
arXiv Detail & Related papers (2023-10-08T04:00:20Z)
- Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP [84.90129481336659]
We study the transferable representation learning underlying CLIP and demonstrate how features from different modalities become aligned.
Inspired by our analysis, we propose a new CLIP-type approach, which achieves better performance than CLIP and other state-of-the-art methods on benchmark datasets.
arXiv Detail & Related papers (2023-10-02T06:41:30Z)
- CM-MaskSD: Cross-Modality Masked Self-Distillation for Referring Image Segmentation [29.885991324519463]
We propose a novel cross-modality masked self-distillation framework named CM-MaskSD.
Our method inherits the transferred knowledge of image-text semantic alignment from the CLIP model to realize fine-grained patch-word feature alignment.
Our framework can considerably boost model performance in a nearly parameter-free manner.
arXiv Detail & Related papers (2023-05-19T07:17:27Z)
- CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation [56.58365347854647]
We introduce a novel cost-based approach to adapt vision-language foundation models, notably CLIP.
Our method potently adapts CLIP for segmenting seen and unseen classes by fine-tuning its encoders.
arXiv Detail & Related papers (2023-03-21T12:28:21Z)
- CLIP-Driven Fine-grained Text-Image Person Re-identification [50.94827165464813]
Text-Image Person Re-identification (TIReID) aims to retrieve the image corresponding to the given text query from a pool of candidate images.
We propose a CLIP-driven Fine-grained information excavation framework (CFine) to fully utilize the powerful knowledge of CLIP for TIReID.
arXiv Detail & Related papers (2022-10-19T03:43:12Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose CRIS, an end-to-end CLIP-Driven Referring Image Segmentation framework.
CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment; a minimal sketch of such an objective appears after this list.
Our proposed framework significantly outperforms state-of-the-art methods without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
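As a rough illustration of the text-to-pixel contrastive idea mentioned for CRIS above, the sketch below treats pixels inside the referred instance as positives and all other pixels as negatives for the sentence embedding, using a binary cross-entropy formulation. The shapes, temperature, and loss form are assumptions for illustration; the paper's actual objective may differ.

```python
# Rough sketch of a text-to-pixel contrastive objective in the spirit of CRIS.
# Shapes, the temperature, and the BCE formulation are illustrative assumptions.
import torch
import torch.nn.functional as F

def text_to_pixel_contrastive_loss(pixel_feats, text_feat, gt_mask, temperature=0.07):
    """Binary cross-entropy over pixel-text cosine similarities.

    pixel_feats: (B, D, H, W) per-pixel embeddings in the shared space
    text_feat:   (B, D)       sentence embedding in the shared space
    gt_mask:     (B, H, W)    1 for pixels of the referred instance, 0 elsewhere
    """
    v = F.normalize(pixel_feats, dim=1)                      # normalize channel dim
    t = F.normalize(text_feat, dim=-1)                       # (B, D)
    # Similarity of every pixel to the expression, scaled by a temperature.
    sim = torch.einsum("bdhw,bd->bhw", v, t) / temperature   # (B, H, W) logits
    # Pixels on the target act as positives, all remaining pixels as negatives.
    return F.binary_cross_entropy_with_logits(sim, gt_mask.float())

if __name__ == "__main__":
    B, D, H, W = 2, 512, 28, 28
    loss = text_to_pixel_contrastive_loss(
        torch.randn(B, D, H, W), torch.randn(B, D),
        (torch.rand(B, H, W) > 0.5).long(),
    )
    print(loss.item())
```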