CREPE: Learnable Prompting With CLIP Improves Visual Relationship
Prediction
- URL: http://arxiv.org/abs/2307.04838v2
- Date: Wed, 19 Jul 2023 15:59:03 GMT
- Title: CREPE: Learnable Prompting With CLIP Improves Visual Relationship
Prediction
- Authors: Rakshith Subramanyam, T. S. Jayram, Rushil Anirudh and Jayaraman J.
Thiagarajan
- Abstract summary: We explore the potential of Vision-Language Models (VLMs), specifically CLIP, in predicting visual object relationships.
Current state-of-the-art methods use complex graphical models that utilize language cues and visual features to address this challenge.
We adopt the UVTransE relation prediction framework, which learns the relation as a translational embedding with subject, object, and union box embeddings from a scene.
- Score: 30.921126445357118
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we explore the potential of Vision-Language Models (VLMs),
specifically CLIP, in predicting visual object relationships, which involves
interpreting visual features from images into language-based relations. Current
state-of-the-art methods use complex graphical models that utilize language
cues and visual features to address this challenge. We hypothesize that the
strong language priors in CLIP embeddings can simplify these graphical models,
paving the way for a simpler approach. We adopt the UVTransE relation prediction
framework, which learns the relation as a translational embedding with subject,
object, and union box embeddings from a scene. We systematically explore the
design of CLIP-based subject, object, and union-box representations within the
UVTransE framework and propose CREPE (CLIP Representation Enhanced Predicate
Estimation). CREPE utilizes text-based representations for all three bounding
boxes and introduces a novel contrastive training strategy to automatically
infer the text prompt for union-box. Our approach achieves state-of-the-art
performance in predicate estimation, with mR@5 of 27.79 and mR@20 of 31.95 on the
Visual Genome benchmark, a 15.3% gain at mR@20 over the recent
state-of-the-art. This work demonstrates CLIP's effectiveness in
object relation prediction and encourages further research on VLMs in this
challenging domain.
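As a rough illustration of the translational formulation described in the abstract, the sketch below scores candidate predicates from CLIP subject, object, and union-box embeddings. This is a minimal sketch only: the function and variable names (e.g. `uvtranse_predicate_scores`, `predicate_text_embs`) are illustrative assumptions rather than the authors' released code, random tensors stand in for real CLIP features, and CREPE's contrastive learning of the union-box text prompt is not shown.

```python
# Hypothetical sketch of UVTransE-style translational predicate scoring with
# CLIP embeddings, based on the description in the abstract. Names and shapes
# are illustrative assumptions, not the authors' implementation.
import torch
import torch.nn.functional as F

def uvtranse_predicate_scores(
    subj_emb: torch.Tensor,             # CLIP embedding of the subject box, shape (d,)
    obj_emb: torch.Tensor,              # CLIP embedding of the object box, shape (d,)
    union_emb: torch.Tensor,            # CLIP embedding of the union box, shape (d,)
    predicate_text_embs: torch.Tensor,  # CLIP text embeddings of predicate prompts, shape (P, d)
) -> torch.Tensor:
    """Score each candidate predicate for one (subject, object) pair.

    UVTransE models the relation as a translation in embedding space:
        relation ~ union - subject - object
    The resulting vector is compared against predicate text embeddings.
    """
    relation = union_emb - subj_emb - obj_emb
    relation = F.normalize(relation, dim=-1)
    predicate_text_embs = F.normalize(predicate_text_embs, dim=-1)
    # Cosine similarity of the translational relation vector against every
    # predicate prompt embedding; higher means a more likely predicate.
    return relation @ predicate_text_embs.t()

# Toy usage with random stand-ins for CLIP features (d = 512, as in ViT-B/32).
d, num_predicates = 512, 50
scores = uvtranse_predicate_scores(
    torch.randn(d), torch.randn(d), torch.randn(d), torch.randn(num_predicates, d)
)
top5 = scores.topk(5).indices  # candidate predicates for an mR@5-style ranking
```

In CREPE itself, per the abstract, the subject, object, and union-box representations are text-based, and the union-box text prompt is inferred automatically through the proposed contrastive training strategy rather than fixed by hand.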
Related papers
- Object-centric Binding in Contrastive Language-Image Pretraining [9.376583779399834]
We propose a novel approach that diverges from commonly used strategies, which rely on the design of hard-negative augmentations.
Our work focuses on integrating inductive biases into pre-trained CLIP-like models to improve their compositional understanding without using any additional hard-negatives.
Our resulting model paves the way towards more accurate and sample-efficient image-text matching of complex scenes.
arXiv Detail & Related papers (2025-02-19T21:30:51Z) - Toward Modality Gap: Vision Prototype Learning for Weakly-supervised Semantic Segmentation with CLIP [19.697857943845012]
We propose a framework to learn class-specific vision prototypes in vision space with the help of text prototypes.
We also propose a regional semantic contrast module that contrasts region embeddings with their corresponding prototypes.
Our proposed framework achieves state-of-the-art performance on two benchmark datasets.
arXiv Detail & Related papers (2024-12-27T13:55:11Z) - Relation-aware Hierarchical Prompt for Open-vocabulary Scene Graph Generation [14.82606425343802]
Open-vocabulary Scene Graph Generation (OV-SGG) overcomes the limitations of the closed-set assumption by aligning visual relationship representations with open-vocabulary textual representations.
Existing OV-SGG methods are constrained by fixed text representations, limiting diversity and accuracy in image-text alignment.
We propose the Relation-Aware Hierarchical Prompting (RAHP) framework, which enhances text representation by integrating subject-object and region-specific relation information.
arXiv Detail & Related papers (2024-12-26T02:12:37Z) - Relational Contrastive Learning and Masked Image Modeling for Scene Text Recognition [36.59116507158687]
We introduce a unified framework of Contrastive Learning and Masked Image Modeling for STR (RCMSTR).
The proposed RCMSTR demonstrates superior performance in various STR-related downstream tasks, outperforming the existing state-of-the-art self-supervised STR techniques.
arXiv Detail & Related papers (2024-11-18T01:11:47Z) - Concept-Guided Prompt Learning for Generalization in Vision-Language
Models [33.361744437967126]
We propose Concept-Guided Prompt Learning for vision-language models.
We leverage the well-learned knowledge of Contrastive Language-Image Pretraining to create a visual concept cache.
In order to refine the text features, we develop a projector that transforms multi-level visual features into text features.
arXiv Detail & Related papers (2024-01-15T04:04:47Z) - Towards More Unified In-context Visual Understanding [74.55332581979292]
We present a new ICL framework for visual understanding with multi-modal output enabled.
First, we quantize and embed both text and visual prompts into a unified representational space.
Then a decoder-only sparse transformer architecture is employed to perform generative modeling on them.
arXiv Detail & Related papers (2023-12-05T06:02:21Z) - CLIP-based Synergistic Knowledge Transfer for Text-based Person
Retrieval [66.93563107820687]
We introduce a CLIP-based Synergistic Knowledge Transfer (CSKT) approach for Text-based Person Retrieval (TPR).
To explore CLIP's knowledge on the input side, we first propose a Bidirectional Prompts Transferring (BPT) module constructed by text-to-image and image-to-text bidirectional prompts and coupling projections.
CSKT outperforms the state-of-the-art approaches across three benchmark datasets while the trainable parameters account for merely 7.4% of the entire model.
arXiv Detail & Related papers (2023-09-18T05:38:49Z) - CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model [55.321010757641524]
We introduce CLIP4STR, a simple yet effective STR method built upon image and text encoders of CLIP.
We scale CLIP4STR in terms of the model size, pre-training data, and training data, achieving state-of-the-art performance on 13 STR benchmarks.
arXiv Detail & Related papers (2023-05-23T12:51:20Z) - Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal
Structured Representations [70.41385310930846]
We present an end-to-end framework Structure-CLIP to enhance multi-modal structured representations.
We use scene graphs to guide the construction of semantic negative examples, which results in an increased emphasis on learning structured representations.
A Knowledge-Enhanced Encoder (KEE) is proposed to leverage scene graph knowledge (SGK) as input to further enhance structured representations.
arXiv Detail & Related papers (2023-05-06T03:57:05Z) - Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z) - Language and Visual Entity Relationship Graph for Agent Navigation [54.059606864535304]
Vision-and-Language Navigation (VLN) requires an agent to navigate in a real-world environment following natural language instructions.
We propose a novel Language and Visual Entity Relationship Graph for modelling the inter-modal relationships between text and vision.
Experiments show that by taking advantage of these relationships, we are able to improve over the state-of-the-art.
arXiv Detail & Related papers (2020-10-19T08:25:55Z)