CREPE: Learnable Prompting With CLIP Improves Visual Relationship
  Prediction
- URL: http://arxiv.org/abs/2307.04838v2
- Date: Wed, 19 Jul 2023 15:59:03 GMT
- Title: CREPE: Learnable Prompting With CLIP Improves Visual Relationship
  Prediction
- Authors: Rakshith Subramanyam, T. S. Jayram, Rushil Anirudh and Jayaraman J.
  Thiagarajan
- Abstract summary: We explore the potential of Vision-Language Models (VLMs), specifically CLIP, in predicting visual object relationships.
Current state-of-the-art methods use complex graphical models that utilize language cues and visual features to address this challenge.
We adopt the UVTransE relation prediction framework, which learns the relation as a translational embedding with subject, object, and union box embeddings from a scene.
- Score: 30.921126445357118
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract:   In this paper, we explore the potential of Vision-Language Models (VLMs),
specifically CLIP, in predicting visual object relationships, which involves
interpreting visual features from images into language-based relations. Current
state-of-the-art methods use complex graphical models that utilize language
cues and visual features to address this challenge. We hypothesize that the
strong language priors in CLIP embeddings can simplify these graphical models,
paving the way for a simpler approach. We adopt the UVTransE relation prediction
framework, which learns the relation as a translational embedding with subject,
object, and union box embeddings from a scene. We systematically explore the
design of CLIP-based subject, object, and union-box representations within the
UVTransE framework and propose CREPE (CLIP Representation Enhanced Predicate
Estimation). CREPE utilizes text-based representations for all three bounding
boxes and introduces a novel contrastive training strategy to automatically
infer the text prompt for the union box. Our approach achieves state-of-the-art
performance in predicate estimation, with mR@5 of 27.79 and mR@20 of 31.95 on the Visual
Genome benchmark, a 15.3% gain in performance over the recent
state of the art at mR@20. This work demonstrates CLIP's effectiveness in
object relation prediction and encourages further research on VLMs in this
challenging domain.
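
The translational view of a relation described in the abstract can be illustrated with a small sketch. In UVTransE-style models the union-box embedding is treated as approximately subject + predicate + object, so the predicate embedding is recovered by subtraction. The code below is not the CREPE implementation: the random vectors stand in for CLIP-derived box embeddings, and `rank_predicates` with its `prototypes` matrix is a hypothetical cosine-similarity scorer introduced only for illustration.

```python
import numpy as np


def predicate_embedding(subject, obj, union):
    """Translation view of a relation: union ≈ subject + predicate + object."""
    return union - subject - obj


def rank_predicates(subject, obj, union, prototypes):
    """Score hypothetical predicate prototypes by cosine similarity.

    `prototypes` has shape (num_predicates, dim). This scoring rule is an
    assumption for illustration, not the classifier used in the paper.
    """
    p = predicate_embedding(subject, obj, union)
    p = p / (np.linalg.norm(p) + 1e-12)
    protos = prototypes / (np.linalg.norm(prototypes, axis=1, keepdims=True) + 1e-12)
    scores = protos @ p
    return int(np.argmax(scores)), scores


# Random stand-ins for subject, object, and union-box embeddings.
rng = np.random.default_rng(0)
dim, num_predicates = 16, 5
prototypes = rng.normal(size=(num_predicates, dim))
subject = rng.normal(size=dim)
obj = rng.normal(size=dim)

# Build a union embedding that encodes predicate 2 under the translation model.
union = subject + obj + prototypes[2]

best, scores = rank_predicates(subject, obj, union, prototypes)
print(best)
```

Because the union embedding is constructed from predicate 2, the sketch recovers index 2. With real embeddings the subtraction would only approximate a predicate direction, which is why a learned model is needed on top of it.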

Related papers
- Unify Graph Learning with Text: Unleashing LLM Potentials for Session Search [35.20525123189316]
 Session search involves a series of interactive queries and actions to fulfill a user's complex information need.
Current strategies typically prioritize sequential modeling for deep semantic understanding, overlooking the graph structure in interactions.
We propose Symbolic Graph Ranker (SGR), which aims to take advantage of both text-based and graph-based approaches.
 arXiv  Detail & Related papers  (2025-05-20T10:05:06Z)
- METOR: A Unified Framework for Mutual Enhancement of Objects and Relationships in Open-vocabulary Video Visual Relationship Detection [25.542175004831844]
 Open-vocabulary video visual relationship detection aims to detect objects and their relationships in videos without being restricted by predefined object or relationship categories.
Existing methods leverage the rich semantic knowledge of pre-trained vision-language models such as CLIP to identify novel categories.
We propose Mutual EnhancemenT of Objects and Relationships (METOR) to jointly model and mutually enhance object detection and relationship classification in open-vocabulary scenarios.
 arXiv  Detail & Related papers  (2025-05-10T14:45:43Z)
- Dynamic Relation Inference via Verb Embeddings [2.8436327410529483]
 We offer insights and practical methods to advance the field of relation inference from images.
We propose Dynamic Relation Inference via Verb Embeddings (DRIVE), which augments the COCO dataset, fine-tunes CLIP with hard-negative subject-relation-object triples and corresponding images, and introduces a novel loss function to improve relation detection.
 arXiv  Detail & Related papers  (2025-03-17T10:24:27Z)
- Object-centric Binding in Contrastive Language-Image Pretraining [9.376583779399834]
 We propose a novel approach that diverges from commonly used strategies, which rely on the design of hard-negative augmentations.
Our work focuses on integrating inductive biases into pre-trained CLIP-like models to improve their compositional understanding without using any additional hard-negatives.
Our resulting model paves the way towards more accurate and sample-efficient image-text matching of complex scenes.
 arXiv  Detail & Related papers  (2025-02-19T21:30:51Z)
- Toward Modality Gap: Vision Prototype Learning for Weakly-supervised Semantic Segmentation with CLIP [19.697857943845012]
 We propose a framework to learn class-specific vision prototypes in vision space with the help of text prototypes.
We also propose a regional semantic contrast module that contrasts region embeddings with their corresponding prototypes.
Our proposed framework achieves state-of-the-art performance on two benchmark datasets.
 arXiv  Detail & Related papers  (2024-12-27T13:55:11Z)
- Relation-aware Hierarchical Prompt for Open-vocabulary Scene Graph Generation [14.82606425343802]
 Open-vocabulary Scene Graph Generation (OV-SGG) overcomes the limitations of the closed-set assumption by aligning visual relationship representations with open-vocabulary textual representations.
Existing OV-SGG methods are constrained by fixed text representations, limiting diversity and accuracy in image-text alignment.
We propose the Relation-Aware Hierarchical Prompting (RAHP) framework, which enhances text representation by integrating subject-object and region-specific relation information.
 arXiv  Detail & Related papers  (2024-12-26T02:12:37Z)
- Relational Contrastive Learning and Masked Image Modeling for Scene Text Recognition [36.59116507158687]
 We introduce a unified framework of Contrastive Learning and Masked Image Modeling for STR (RCMSTR).
The proposed RCMSTR demonstrates superior performance in various STR-related downstream tasks, outperforming the existing state-of-the-art self-supervised STR techniques.
 arXiv  Detail & Related papers  (2024-11-18T01:11:47Z)
- Exploring Interactive Semantic Alignment for Efficient HOI Detection with Vision-language Model [3.3772986620114387]
 We introduce ISA-HOI, which extensively leverages knowledge from CLIP, aligning interactive semantics between visual and textual features.
Our method achieves competitive results on the HICO-DET and V-COCO benchmarks with much fewer training epochs, and outperforms the state-of-the-art under zero-shot settings.
 arXiv  Detail & Related papers  (2024-04-19T07:24:32Z)
- Concept-Guided Prompt Learning for Generalization in Vision-Language Models [33.361744437967126]
 We propose Concept-Guided Prompt Learning for vision-language models.
We leverage the well-learned knowledge of Contrastive Language-Image Pretraining to create a visual concept cache.
In order to refine the text features, we develop a projector that transforms multi-level visual features into text features.
 arXiv  Detail & Related papers  (2024-01-15T04:04:47Z)
- Towards More Unified In-context Visual Understanding [74.55332581979292]
 We present a new ICL framework for visual understanding with multi-modal output enabled.
First, we quantize and embed both text and visual prompt into a unified representational space.
Then a decoder-only sparse transformer architecture is employed to perform generative modeling on them.
 arXiv  Detail & Related papers  (2023-12-05T06:02:21Z)
- CLIP-based Synergistic Knowledge Transfer for Text-based Person Retrieval [66.93563107820687]
 We introduce a CLIP-based Synergistic Knowledge Transfer (CSKT) approach for Text-based Person Retrieval (TPR).
To explore CLIP's knowledge on the input side, we first propose a Bidirectional Prompts Transferring (BPT) module constructed by text-to-image and image-to-text bidirectional prompts and coupling projections.
CSKT outperforms the state-of-the-art approaches across three benchmark datasets when the training parameters merely account for 7.4% of the entire model.
 arXiv  Detail & Related papers  (2023-09-18T05:38:49Z)
- CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model [55.321010757641524]
 We introduce CLIP4STR, a simple yet effective STR method built upon image and text encoders of CLIP.
We scale CLIP4STR in terms of the model size, pre-training data, and training data, achieving state-of-the-art performance on 11 STR benchmarks.
 arXiv  Detail & Related papers  (2023-05-23T12:51:20Z)
- Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal Structured Representations [70.41385310930846]
 We present an end-to-end framework Structure-CLIP to enhance multi-modal structured representations.
We use scene graphs to guide the construction of semantic negative examples, which results in an increased emphasis on learning structured representations.
A Knowledge-Enhance Encoder (KEE) is proposed to leverage scene graph knowledge (SGK) as input to further enhance structured representations.
 arXiv  Detail & Related papers  (2023-05-06T03:57:05Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
 We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
 arXiv  Detail & Related papers  (2022-10-17T17:57:46Z)
- Language and Visual Entity Relationship Graph for Agent Navigation [54.059606864535304]
 Vision-and-Language Navigation (VLN) requires an agent to navigate in a real-world environment following natural language instructions.
We propose a novel Language and Visual Entity Relationship Graph for modelling the inter-modal relationships between text and vision.
Experiments show that by taking advantage of these relationships we are able to improve over the state of the art.
 arXiv  Detail & Related papers  (2020-10-19T08:25:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.