Test-time Contrastive Concepts for Open-world Semantic Segmentation
- URL: http://arxiv.org/abs/2407.05061v2
- Date: Fri, 24 Jan 2025 21:51:57 GMT
- Title: Test-time Contrastive Concepts for Open-world Semantic Segmentation
- Authors: Monika Wysoczańska, Antonin Vobecky, Amaia Cardiel, Tomasz Trzciński, Renaud Marlet, Andrei Bursuc, Oriane Siméoni
- Abstract summary: Recent CLIP-like Vision-Language Models (VLMs), pre-trained on large amounts of image-text pairs, have paved the way to open-vocabulary semantic segmentation.
We propose two different approaches to automatically generate, at test time, textual contrastive concepts that are query-specific.
- Score: 14.899741072838994
- License:
- Abstract: Recent CLIP-like Vision-Language Models (VLMs), pre-trained on large amounts of image-text pairs to align both modalities with a simple contrastive objective, have paved the way to open-vocabulary semantic segmentation. Given an arbitrary set of textual queries, image pixels are assigned the closest query in feature space. However, this works well only when the user exhaustively lists all possible visual concepts in the image, which then contrast against each other for the assignment. This corresponds to the current evaluation setup in the literature, which relies on having access to a list of in-domain relevant concepts, typically the classes of a benchmark dataset. Here, we consider the more challenging (and realistic) scenario of segmenting a single concept, given a textual prompt and nothing else. To achieve good results, besides contrasting with the generic $\textit{background}$ text, we propose two different approaches to automatically generate, at test time, textual contrastive concepts that are query-specific. We do so by leveraging the distribution of text in the VLM's training set or crafted LLM prompts. We also propose a metric designed to evaluate this scenario and show the relevance of our approach on commonly used datasets.
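The setting above boils down to comparing dense image features from a CLIP-like backbone against text embeddings of the query, a generic "background" prompt, and the automatically generated contrastive concepts, then keeping the pixels for which the query wins. Below is a minimal sketch of that assignment step only; the dense feature extractor, the text encoder, and the list of contrastive concepts (e.g., produced by an LLM or mined from the VLM's training text) are assumed inputs, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def segment_single_query(patch_feats, query, contrastive_concepts, encode_text):
    """Assign each patch to the query iff the query beats "background" and every
    test-time contrastive concept in cosine similarity (illustrative sketch only).

    patch_feats: (H, W, D) dense features from a CLIP-like backbone, assumed to
                 live in the same space as the text embeddings.
    encode_text: callable mapping a list of prompts to a (P, D) tensor.
    """
    prompts = [query, "background"] + list(contrastive_concepts)
    text_emb = F.normalize(encode_text(prompts), dim=-1)      # (P, D)
    img_emb = F.normalize(patch_feats, dim=-1)                # (H, W, D)
    sims = torch.einsum("hwd,pd->hwp", img_emb, text_emb)     # cosine similarities
    return sims.argmax(dim=-1) == 0                           # (H, W) boolean mask for the query
```

Upsampling the patch-level mask to full image resolution and any thresholding of low-confidence pixels are omitted here for brevity.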
Related papers
- InvSeg: Test-Time Prompt Inversion for Semantic Segmentation [33.60580908728705]
InvSeg is a test-time prompt inversion method that tackles open-vocabulary semantic segmentation.
We introduce Contrastive Soft Clustering (CSC) to align derived masks with the image's structural information.
InvSeg learns context-rich text prompts in embedding space and achieves accurate semantic alignment across modalities.
arXiv Detail & Related papers (2024-10-15T10:20:31Z)
- Visual-Text Cross Alignment: Refining the Similarity Score in Vision-Language Models [21.17975741743583]
It has recently been discovered that using a pre-trained vision-language model (VLM), e.g., CLIP, to align a whole query image with several finer text descriptions can significantly enhance zero-shot performance.
In this paper, we empirically find that the finer descriptions tend to align more effectively with local areas of the query image rather than the whole image.
arXiv Detail & Related papers (2024-06-05T04:08:41Z)
- VLLMs Provide Better Context for Emotion Understanding Through Common Sense Reasoning [66.23296689828152]
We leverage the capabilities of Vision-and-Large-Language Models to enhance in-context emotion classification.
In the first stage, we propose prompting VLLMs to generate descriptions in natural language of the subject's apparent emotion.
In the second stage, the descriptions serve as contextual information and, together with the image input, are used to train a transformer-based architecture.
arXiv Detail & Related papers (2024-04-10T15:09:15Z)
- Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of Scene Knowledge-guided Visual Grounding.
arXiv Detail & Related papers (2023-07-21T13:06:02Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- Text-based Person Search without Parallel Image-Text Data [52.63433741872629]
Text-based person search (TBPS) aims to retrieve the images of the target person from a large image gallery based on a given natural language description.
Existing methods are dominated by training models with parallel image-text pairs, which are very costly to collect.
In this paper, we make the first attempt to explore TBPS without parallel image-text data.
arXiv Detail & Related papers (2023-05-22T12:13:08Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms the state of the art without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
- Is An Image Worth Five Sentences? A New Look into Semantics for Image-Text Matching [10.992151305603267]
We propose two metrics that evaluate the degree of semantic relevance of retrieved items, independently of their annotated binary relevance.
We incorporate a novel strategy that uses an image captioning metric, CIDEr, to define a Semantic Adaptive Margin (SAM) to be optimized in a standard triplet loss.
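For illustration, here is a rough sketch of that idea under stated assumptions: the gap in CIDEr similarity between the positive and negative captions is precomputed and simply shifts the margin of an otherwise standard triplet ranking loss. The exact SAM formulation, weighting, and CIDEr computation follow that paper, not this sketch.

```python
import torch
import torch.nn.functional as F

def triplet_loss_semantic_margin(img_emb, pos_txt, neg_txt, cider_gap,
                                 base_margin=0.2, alpha=0.1):
    """Triplet ranking loss with a per-triplet adaptive margin (illustrative only).

    cider_gap: (B,) tensor, precomputed as how much less similar (in CIDEr terms)
               the negative caption is to the image's reference captions than the
               positive one; alpha and base_margin are made-up hyperparameters.
    """
    margin = base_margin + alpha * cider_gap.clamp(min=0)      # larger margin for clearly unrelated negatives
    d_pos = 1 - F.cosine_similarity(img_emb, pos_txt, dim=-1)  # distance to the matching caption
    d_neg = 1 - F.cosine_similarity(img_emb, neg_txt, dim=-1)  # distance to the non-matching caption
    return F.relu(d_pos - d_neg + margin).mean()
```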
arXiv Detail & Related papers (2021-10-06T09:54:28Z)
- MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding [40.24656027709833]
We propose MDETR, an end-to-end modulated detector that detects objects in an image conditioned on a raw text query.
We use a transformer-based architecture to reason jointly over text and image by fusing the two modalities at an early stage of the model.
Our approach can be easily extended for visual question answering, achieving competitive performance on GQA and CLEVR.
arXiv Detail & Related papers (2021-04-26T17:55:33Z)
- Scene Text Retrieval via Joint Text Detection and Similarity Learning [68.24531728554892]
Scene text retrieval aims to localize and retrieve, from an image gallery, all text instances that are the same as or similar to a given query text.
We address this problem by directly learning a cross-modal similarity between a query text and each text instance from natural images.
In this way, scene text retrieval can be simply performed by ranking the detected text instances with the learned similarity.
arXiv Detail & Related papers (2021-04-04T07:18:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.