Question-Answer Cross Language Image Matching for Weakly Supervised
Semantic Segmentation
- URL: http://arxiv.org/abs/2401.09883v1
- Date: Thu, 18 Jan 2024 10:55:13 GMT
- Title: Question-Answer Cross Language Image Matching for Weakly Supervised
Semantic Segmentation
- Authors: Songhe Deng, Wei Zhuo, Jinheng Xie, Linlin Shen
- Abstract summary: Class Activation Map (CAM) has emerged as a popular tool for weakly supervised semantic segmentation.
We propose a novel Question-Answer Cross Language Image Matching framework for WSSS (QA-CLIMS).
- Score: 37.15828464616587
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Class Activation Map (CAM) has emerged as a popular tool for weakly
supervised semantic segmentation (WSSS), allowing the localization of object
regions in an image using only image-level labels. However, existing CAM
methods suffer from under-activation of target object regions and
false-activation of background regions, because the lack of detailed
supervision hinders the model's ability to understand the image as a whole.
In this paper, we propose a novel Question-Answer Cross Language Image Matching
framework for WSSS (QA-CLIMS), leveraging a vision-language foundation model
to maximize the text-based understanding of images and guide the generation of
activation maps. First, a series of carefully designed questions is posed to a
VQA (Visual Question Answering) model with Question-Answer Prompt
Engineering (QAPE) to generate a corpus of both foreground target objects and
backgrounds that are adaptive to query images. We then employ contrastive
learning in a Region Image Text Contrastive (RITC) network to compare the
obtained foreground and background regions with the generated corpus. Our
approach exploits the rich textual information from the open vocabulary as
additional supervision, enabling the model to generate high-quality CAMs with
more complete object regions and reduced false-activation of background regions.
We conduct extensive analysis to validate the proposed method and show that our
approach achieves state-of-the-art performance on both the PASCAL VOC 2012 and
MS COCO datasets. Code is available at: https://github.com/CVI-SZU/QA-CLIMS
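To make the two steps above concrete, here is a rough, unofficial sketch (not the authors' released code, which is at the link above). It queries an off-the-shelf VQA model with hand-designed questions to build foreground and background text corpora in the spirit of QAPE (BLIP is an assumed choice; the abstract does not name the VQA model), then contrasts region embeddings against that corpus with an InfoNCE-style loss in the spirit of RITC. All names, questions, and the exact loss form are illustrative.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

# QAPE-style corpus generation; BLIP stands in for the (unnamed) VQA model.
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
vqa = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

def ask(image, question):
    inputs = processor(image, question, return_tensors="pt")
    out = vqa.generate(**inputs, max_new_tokens=10)
    return processor.decode(out[0], skip_special_tokens=True)

image = Image.open("query.jpg").convert("RGB")                     # hypothetical query image
fg_corpus = [ask(image, "What is the main object in the photo?")]  # foreground texts
bg_corpus = [ask(image, "What is the background of the photo?")]   # background texts

# RITC-style contrast (illustrative loss, not the paper's exact formulation):
# pull each region embedding toward its matched foreground text and push it
# away from every background text (InfoNCE with the positive at index 0).
def ritc_loss(region_feats, fg_text_feats, bg_text_feats, tau=0.07):
    r = F.normalize(region_feats, dim=-1)    # (N, D) masked-region embeddings
    fg = F.normalize(fg_text_feats, dim=-1)  # (N, D) matched foreground text embeddings
    bg = F.normalize(bg_text_feats, dim=-1)  # (M, D) background text embeddings
    logits = torch.cat([(r * fg).sum(-1, keepdim=True), r @ bg.t()], dim=1) / tau
    return F.cross_entropy(logits, torch.zeros(len(r), dtype=torch.long))
```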
Related papers
- DIAL: Dense Image-text ALignment for Weakly Supervised Semantic Segmentation [8.422110274212503]
Weakly supervised semantic segmentation approaches typically rely on class activation maps (CAMs) for initial seed generation.
We introduce DALNet, which leverages text embeddings to enhance the comprehensive understanding and precise localization of objects across different levels of granularity.
In particular, our approach enables a more efficient end-to-end process as a single-stage method.
arXiv Detail & Related papers (2024-09-24T06:51:49Z)
- Leveraging Open-Vocabulary Diffusion to Camouflaged Instance Segmentation [59.78520153338878]
Text-to-image diffusion techniques have shown exceptional capability of producing high-quality images from text descriptions.
We propose a method built upon a state-of-the-art diffusion model, empowered by an open vocabulary to learn multi-scale textual-visual features for camouflaged object representations.
arXiv Detail & Related papers (2023-12-29T07:59:07Z)
- CLIM: Contrastive Language-Image Mosaic for Region Representation [58.05870131126816]
Contrastive Language-Image Mosaic (CLIM) is a novel approach for aligning region and text representations.
CLIM consistently improves different open-vocabulary object detection methods.
It can effectively enhance the region representation of vision-language models.
arXiv Detail & Related papers (2023-12-18T17:39:47Z)
- IFSeg: Image-free Semantic Segmentation via Vision-Language Model [67.62922228676273]
We introduce a novel image-free segmentation task whose goal is to perform semantic segmentation given only a set of target semantic categories.
We construct artificial training data by creating a 2D map of random semantic categories and another map of their corresponding word tokens.
Our model not only establishes an effective baseline for this novel task but also demonstrates strong performance compared to existing methods.
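A tiny sketch of the artificial-data recipe described above, with made-up shapes: a random 2D map of category indices is the segmentation target, and the same grid filled with each category's word-token embedding plays the role of the input image.

```python
import torch

num_classes, H, W, D = 5, 16, 16, 512              # hypothetical sizes
token_embeds = torch.randn(num_classes, D)         # stand-in word-token embeddings
label_map = torch.randint(0, num_classes, (H, W))  # random semantic-category map (target)
pseudo_image = token_embeds[label_map]             # (H, W, D) word-token map used as input
```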
arXiv Detail & Related papers (2023-03-25T08:19:31Z)
- BOSS: Bottom-up Cross-modal Semantic Composition with Hybrid Counterfactual Training for Robust Content-based Image Retrieval [61.803481264081036]
Content-Based Image Retrieval (CIR) aims to search for a target image by concurrently comprehending the composition of an example image and a complementary text.
We tackle this task with a novel Bottom-up crOss-modal Semantic compoSition (BOSS) with Hybrid Counterfactual Training framework.
arXiv Detail & Related papers (2022-07-09T07:14:44Z)
- Contrastive learning of Class-agnostic Activation Map for Weakly Supervised Object Localization and Semantic Segmentation [32.76127086403596]
We propose Contrastive learning for Class-agnostic Activation Map (C²AM) generation using unlabeled image data.
We form positive pairs from foreground-foreground and background-background feature relations, and negative pairs from foreground-background relations, forcing the network to disentangle foreground and background.
As the network is guided to discriminate cross-image foreground-background, the class-agnostic activation maps learned by our approach generate more complete object regions.
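As a loose, toy illustration of that pairing scheme (not C²AM's actual objective): foreground features attract one another, background features attract one another, and foreground-background pairs repel; `fg` and `bg` are hypothetical pooled features.

```python
import torch
import torch.nn.functional as F

def disentangle_loss(fg, bg):
    fg = F.normalize(fg, dim=-1)       # (N, D) foreground features
    bg = F.normalize(bg, dim=-1)       # (N, D) background features
    attract = (1 - fg @ fg.t()).mean() + (1 - bg @ bg.t()).mean()  # positive pairs
    repel = F.relu(fg @ bg.t()).mean() # negative (cross) pairs
    return attract + repel

loss = disentangle_loss(torch.randn(4, 256), torch.randn(4, 256))  # toy features
```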
arXiv Detail & Related papers (2022-03-25T08:46:24Z)
- Cross Language Image Matching for Weakly Supervised Semantic Segmentation [26.04918485403939]
We propose a novel Cross Language Image Matching (CLIMS) framework, based on the Contrastive Language-Image Pre-training (CLIP) model.
The core idea of our framework is to introduce natural language supervision to activate more complete object regions and suppress closely-related open background regions.
In addition, we design a co-occurring background suppression loss to prevent the model from activating closely-related background regions.
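One hedged guess at what such a suppression term could look like (the paper's exact formulation may differ): penalize CLIP-style similarity between the CAM-masked image embedding and texts of backgrounds that co-occur with the class, e.g. 'railroad' for 'train'.

```python
import torch
import torch.nn.functional as F

def bg_suppression_loss(masked_img_feat, bg_text_feats):
    # Discourage the activated region from matching co-occurring background texts.
    sim = F.cosine_similarity(masked_img_feat.unsqueeze(0), bg_text_feats, dim=-1)
    return sim.clamp(min=0).mean()

loss = bg_suppression_loss(torch.randn(512), torch.randn(4, 512))  # toy features
```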
arXiv Detail & Related papers (2022-03-05T06:39:48Z)
- RegionCLIP: Region-based Language-Image Pretraining [94.29924084715316]
Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive results on image classification.
We propose a new method called RegionCLIP that significantly extends CLIP to learn region-level visual representations.
Our method significantly outperforms the state of the art by 3.8 AP50 and 2.2 AP for novel categories on COCO and LVIS datasets.
arXiv Detail & Related papers (2021-12-16T18:39:36Z)
- Cross-Image Region Mining with Region Prototypical Network for Weakly Supervised Segmentation [45.39679291105364]
We propose a Region Prototypical Network (RPNet) to explore the cross-image object diversity of the training set.
Similar object parts across images are identified via region feature comparison.
Experiments show that the proposed method generates more complete and accurate pseudo object masks.
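A minimal sketch of that region-feature comparison, with random stand-in tensors: cosine similarity between region features of two images, where high-scoring pairs are treated as the same object part across images.

```python
import torch
import torch.nn.functional as F

regions_a = F.normalize(torch.randn(6, 256), dim=-1)  # region features from image A
regions_b = F.normalize(torch.randn(8, 256), dim=-1)  # region features from image B
sim = regions_a @ regions_b.t()                       # (6, 8) pairwise cosine similarity
matches = sim.argmax(dim=1)                           # best match in B for each region of A
```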
arXiv Detail & Related papers (2021-08-17T02:51:02Z)
- Mining Cross-Image Semantics for Weakly Supervised Semantic Segmentation [128.03739769844736]
Two neural co-attentions are incorporated into the classifier to capture cross-image semantic similarities and differences.
In addition to boosting object pattern learning, the co-attention can leverage context from other related images to improve localization map inference.
Our algorithm sets new state-of-the-art results in all these settings, demonstrating its efficacy and generalizability.
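A bare-bones co-attention sketch under assumed shapes (not the paper's exact design): an affinity matrix between two images' feature maps re-weights one image's features by the semantics it shares with the other.

```python
import torch

fa = torch.randn(64, 128)  # image A: 64 spatial locations x 128-dim features
fb = torch.randn(64, 128)  # image B features
affinity = torch.softmax(fa @ fb.t() / 128 ** 0.5, dim=-1)  # (64, 64) cross-image affinity
fa_co = affinity @ fb      # A's features re-weighted by what it shares with B
```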
arXiv Detail & Related papers (2020-07-03T21:53:46Z)