Image-to-Image Matching via Foundation Models: A New Perspective for Open-Vocabulary Semantic Segmentation
- URL: http://arxiv.org/abs/2404.00262v1
- Date: Sat, 30 Mar 2024 06:29:59 GMT
- Title: Image-to-Image Matching via Foundation Models: A New Perspective for Open-Vocabulary Semantic Segmentation
- Authors: Yuan Wang, Rui Sun, Naisong Luo, Yuwen Pan, Tianzhu Zhang
- Abstract summary: Open-vocabulary semantic segmentation (OVS) aims to segment images of arbitrary categories specified by class labels or captions.
Previous best-performing methods suffer from false matches between image features and category labels.
We propose a novel relation-aware intra-modal matching framework for OVS based on visual foundation models.
- Score: 36.992698016947486
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-vocabulary semantic segmentation (OVS) aims to segment images of arbitrary categories specified by class labels or captions. However, most previous best-performing methods, whether pixel grouping methods or region recognition methods, suffer from false matches between image features and category labels. We attribute this to the natural gap between the textual features and visual features. In this work, we rethink how to mitigate false matches from the perspective of image-to-image matching and propose a novel relation-aware intra-modal matching (RIM) framework for OVS based on visual foundation models. RIM achieves robust region classification by first constructing diverse image-modal reference features and then matching them with region features based on relation-aware ranking distribution. The proposed RIM enjoys several merits. First, the intra-modal reference features are better aligned, circumventing potential ambiguities that may arise in cross-modal matching. Second, the ranking-based matching process harnesses the structure information implicit in the inter-class relationships, making it more robust than comparing features individually. Extensive experiments on three benchmarks demonstrate that RIM outperforms previous state-of-the-art methods by large margins, obtaining a lead of more than 10% in mIoU on the PASCAL VOC benchmark.
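To make the matching recipe concrete, here is a minimal sketch of ranking-distribution matching between region features and image-modal reference features; the softmax temperature, the KL-based comparison, and all shapes are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def rank_distribution(feats: torch.Tensor, refs: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Cosine similarities to all reference features, softened into a distribution."""
    sims = F.normalize(feats, dim=-1) @ F.normalize(refs, dim=-1).T  # (N, C)
    return F.softmax(sims / tau, dim=-1)

def classify_regions(region_feats: torch.Tensor, ref_feats: torch.Tensor) -> torch.Tensor:
    """Assign each region to the class whose relation profile it matches best.

    region_feats: (N, D) mask/region embeddings from a visual foundation model.
    ref_feats:    (C, D) one image-modal reference feature per candidate class.
    """
    region_dist = rank_distribution(region_feats, ref_feats)  # (N, C)
    ref_dist = rank_distribution(ref_feats, ref_feats)        # (C, C) inter-class relations
    # KL divergence between each region's distribution and each class profile,
    # so a region is matched by its whole relation pattern, not a single score.
    kl = (region_dist.unsqueeze(1) *
          (region_dist.unsqueeze(1).log() - ref_dist.unsqueeze(0).log())).sum(-1)  # (N, C)
    return kl.argmin(dim=-1)  # predicted class index per region
```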
Related papers
- Visual-Text Cross Alignment: Refining the Similarity Score in Vision-Language Models [21.17975741743583]
It has recently been discovered that using a pre-trained vision-language model (VLM), e.g., CLIP, to align a whole query image with several finer text descriptions can significantly enhance zero-shot performance.
In this paper, we empirically find that the finer descriptions tend to align more effectively with local areas of the query image rather than the whole image.
arXiv Detail & Related papers (2024-06-05T04:08:41Z)
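A minimal sketch of the local-alignment idea from this summary, assuming crop and description embeddings come from a CLIP-like encoder; the max-then-mean aggregation is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def local_alignment_score(crop_embs: torch.Tensor, desc_embs: torch.Tensor) -> torch.Tensor:
    """Score a query image against one class via its finer text descriptions.

    crop_embs: (R, D) embeddings of local crops of the query image.
    desc_embs: (K, D) embeddings of K finer descriptions of the class.
    Each description votes with its best-matching local crop, then votes are averaged.
    """
    crops = F.normalize(crop_embs, dim=-1)
    descs = F.normalize(desc_embs, dim=-1)
    sims = descs @ crops.T                 # (K, R) description-to-crop similarities
    return sims.max(dim=1).values.mean()   # best crop per description, averaged
```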
- Vocabulary-free Image Classification and Semantic Segmentation [71.78089106671581]
We introduce the Vocabulary-free Image Classification (VIC) task, which aims to assign a class from an unconstrained language-induced semantic space to an input image without needing a known vocabulary.
VIC is challenging due to the vastness of the semantic space, which contains millions of concepts, including fine-grained categories.
We propose Category Search from External Databases (CaSED), a training-free method that leverages a pre-trained vision-language model and an external database.
arXiv Detail & Related papers (2024-04-16T19:27:21Z)
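A rough sketch of the training-free retrieve-then-score recipe described above; the top-k value, the word-level candidate extraction, and the `encode_text` callable are hypothetical stand-ins (CaSED itself filters candidates more carefully, e.g. keeping nouns).

```python
import torch
import torch.nn.functional as F

def cased_classify(img_emb: torch.Tensor,
                   caption_embs: torch.Tensor,
                   captions: list[str],
                   encode_text,
                   k: int = 10) -> str:
    """Training-free classification in the spirit of CaSED.

    img_emb:      (D,) VLM embedding of the input image.
    caption_embs: (M, D) precomputed embeddings of an external caption database.
    encode_text:  hypothetical callable mapping a list of strings to (N, D) embeddings.
    """
    img = F.normalize(img_emb, dim=-1)
    sims = F.normalize(caption_embs, dim=-1) @ img        # (M,) image-caption similarity
    top = sims.topk(min(k, len(captions))).indices        # retrieve nearest captions
    # Words from the retrieved captions become candidate category names.
    cand = sorted({w for i in top.tolist() for w in captions[i].lower().split()})
    scores = F.normalize(encode_text(cand), dim=-1) @ img  # re-score candidates with the VLM
    return cand[scores.argmax().item()]
```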
- Hierarchical Matching and Reasoning for Multi-Query Image Retrieval [113.44470784756308]
We propose a novel Hierarchical Matching and Reasoning Network (HMRN) for Multi-Query Image Retrieval (MQIR).
It disentangles MQIR into three hierarchical semantic representations, which are responsible for capturing fine-grained local details, contextual global scopes, and high-level inherent correlations.
Our HMRN substantially surpasses the current state-of-the-art methods.
arXiv Detail & Related papers (2023-06-26T07:03:56Z)
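The summary above names only the three levels, so this sketch simply shows how per-level query-image similarities could be fused into a retrieval score; the fixed weights are placeholders, not HMRN's learned reasoning module.

```python
import torch

def hmrn_score(local_sim: torch.Tensor,
               global_sim: torch.Tensor,
               relation_sim: torch.Tensor,
               weights: tuple[float, float, float] = (0.4, 0.3, 0.3)) -> torch.Tensor:
    """Fuse the three hierarchical similarity levels into one ranking score.

    Each *_sim tensor holds (num_images,) query-to-image similarities at one
    level: fine-grained local details, contextual global scope, and
    high-level correlations. Higher fused score = better retrieval candidate.
    """
    w1, w2, w3 = weights
    return w1 * local_sim + w2 * global_sim + w3 * relation_sim
```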
- Target-oriented Sentiment Classification with Sequential Cross-modal Semantic Graph [27.77392307623526]
Multi-modal aspect-based sentiment classification (MABSC) is the task of classifying the sentiment of a target entity mentioned in a sentence and an image.
Previous methods failed to account for the fine-grained semantic association between the image and the text.
We propose a new approach called SeqCSG, which enhances the encoder-decoder sentiment classification framework using sequential cross-modal semantic graphs.
arXiv Detail & Related papers (2022-08-19T16:04:29Z)
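One way to read "sequential cross-modal semantic graph" is a flattened token sequence with a graph-shaped attention mask; the sketch below takes that reading, which is an assumption, not the paper's exact construction.

```python
import torch

def build_sequence_and_mask(text_tokens: list[str],
                            region_tags: list[str],
                            edges: set[tuple[int, int]]) -> tuple[list[str], torch.Tensor]:
    """Flatten a cross-modal semantic graph into an encoder input sequence.

    Nodes are text tokens followed by image-region tags; `edges` lists
    fine-grained text-region associations as index pairs over the flattened
    sequence (text positions first, then region positions). The boolean
    attention mask keeps every node visible to itself and its graph neighbours.
    """
    seq = text_tokens + region_tags
    n = len(seq)
    mask = torch.eye(n, dtype=torch.bool)
    for i, j in edges:
        mask[i, j] = mask[j, i] = True
    return seq, mask
```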
- DGSS: Domain Generalized Semantic Segmentation using Iterative Style Mining and Latent Representation Alignment [38.05196030226661]
Current state-of-the-art (SoTA) methods have proposed different mechanisms to bridge the domain gap, but they still perform poorly in low-illumination conditions.
We propose a two-step framework wherein we first identify an adversarial style that maximizes the domain gap between stylized and source images.
We then propose a style mixing mechanism wherein the same objects from different styles are mixed to construct a new training image.
arXiv Detail & Related papers (2022-02-26T13:54:57Z)
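A minimal sketch of the style-mixing step, assuming the stylized view is pixel-aligned with the source image and a shared semantic label map selects which objects to swap; the hard mask and the class choice are illustrative assumptions.

```python
import torch

def style_mix(src_img: torch.Tensor,
              stylized_img: torch.Tensor,
              label_map: torch.Tensor,
              mix_classes: set[int]) -> torch.Tensor:
    """Compose a new training image by mixing the same objects across styles.

    src_img, stylized_img: (3, H, W) aligned source and adversarially stylized views.
    label_map: (H, W) semantic labels shared by both views.
    Pixels of the chosen classes are taken from the stylized view, the rest
    from the source, so the same objects appear under different styles.
    """
    mask = torch.zeros_like(label_map, dtype=torch.bool)
    for c in mix_classes:
        mask |= label_map == c
    return torch.where(mask.unsqueeze(0), stylized_img, src_img)
```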
- Learning Contrastive Representation for Semantic Correspondence [150.29135856909477]
We propose a multi-level contrastive learning approach for semantic matching.
We show that image-level contrastive learning is a key component to encourage the convolutional features to find correspondence between similar objects.
arXiv Detail & Related papers (2021-09-22T18:34:14Z)
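Image-level contrastive learning of this kind is typically an InfoNCE objective over two views of each image; a minimal sketch follows, where the temperature and the two-view batch layout are standard assumptions rather than details from the paper.

```python
import torch
import torch.nn.functional as F

def info_nce(anchors: torch.Tensor, positives: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Standard image-level InfoNCE loss.

    anchors, positives: (B, D) embeddings of two views of the same B images.
    Each anchor should score its own positive above every other sample in the batch.
    """
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.T / tau                                # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)    # diagonal entries are positives
    return F.cross_entropy(logits, targets)
```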
- Semantic Distribution-aware Contrastive Adaptation for Semantic Segmentation [50.621269117524925]
Domain adaptive semantic segmentation refers to making predictions on a certain target domain with only annotations of a specific source domain.
We present a semantic distribution-aware contrastive adaptation algorithm that enables pixel-wise representation alignment.
We evaluate SDCA on multiple benchmarks, achieving considerable improvements over existing algorithms.
arXiv Detail & Related papers (2021-05-11T13:21:25Z)
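Pixel-wise representation alignment is often realized by contrasting pixel features against per-class prototypes; the sketch below takes that common reading, though SDCA's actual distribution-aware objective may differ.

```python
import torch
import torch.nn.functional as F

def pixel_prototype_contrast(pix_feats: torch.Tensor,
                             pix_labels: torch.Tensor,
                             prototypes: torch.Tensor,
                             tau: float = 0.1) -> torch.Tensor:
    """Pull each pixel toward its class prototype, away from other classes.

    pix_feats:  (N, D) features of sampled target-domain pixels.
    pix_labels: (N,) long tensor of class indices (e.g. pseudo-labels).
    prototypes: (C, D) per-class feature means, e.g. estimated on the source domain.
    """
    feats = F.normalize(pix_feats, dim=-1)
    protos = F.normalize(prototypes, dim=-1)
    logits = feats @ protos.T / tau    # (N, C) pixel-to-prototype similarities
    return F.cross_entropy(logits, pix_labels)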
- Seed the Views: Hierarchical Semantic Alignment for Contrastive Representation Learning [116.91819311885166]
We propose a hierarchical semantic alignment strategy by expanding the views generated from a single image to cross-samples and multi-level representations.
Our method, termed CsMl, can robustly integrate multi-level visual representations across samples.
arXiv Detail & Related papers (2020-12-04T17:26:24Z)