Related papers: Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions

URL: http://arxiv.org/abs/2311.17048v3
Date: Tue, 9 Apr 2024 17:54:12 GMT
Title: Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions
Authors: Zeyu Han, Fangrui Zhu, Qianru Lao, Huaizu Jiang
Abstract summary: Zero-shot referring expression comprehension aims at localizing bounding boxes in an image corresponding to provided textual prompts. Existing vision-language alignment models, e.g., CLIP, struggle with the fine-grained disentanglement and relationship understanding this requires, so they cannot be directly used for this task. We leverage large foundation models to disentangle both images and texts into triplets in the format of (subject, predicate, object).
Score: 6.231370972617915
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Zero-shot referring expression comprehension aims at localizing bounding boxes in an image corresponding to provided textual prompts, which requires: (i) a fine-grained disentanglement of complex visual scenes and textual context, and (ii) a capacity to understand relationships among the disentangled entities. Unfortunately, existing large vision-language alignment (VLA) models, e.g., CLIP, struggle with both aspects, so they cannot be directly used for this task. To mitigate this gap, we leverage large foundation models to disentangle both images and texts into triplets in the format of (subject, predicate, object). After that, grounding is accomplished by calculating the structural similarity matrix between visual and textual triplets with a VLA model, and subsequently propagating it to an instance-level similarity matrix. Furthermore, to equip VLA models with the ability of relationship understanding, we design a triplet-matching objective to fine-tune the VLA models on a curated collection of datasets containing abundant entity relationships. Experiments demonstrate a visual grounding performance increase of up to 19.5% over the SOTA zero-shot model on RefCOCO/+/g. On the more challenging Who's Waldo dataset, our zero-shot approach achieves comparable accuracy to the fully supervised model. Code is available at https://github.com/Show-han/Zeroshot_REC.
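Illustration: the triplet-matching step described in the abstract can be pictured with a toy sketch. This is not the authors' implementation. It assumes hand-made 2-D vectors in place of VLA (e.g., CLIP) embeddings, assumes a triplet score that averages the cosine similarity of the subject, predicate, and object slots, and assumes a propagation rule that gives each candidate box the best score among the visual triplets in which it is the subject. All function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def triplet_similarity(visual, textual):
    """Assumed scoring rule: mean cosine similarity over the three slots."""
    slots = ("subject", "predicate", "object")
    return sum(cosine(visual[s], textual[s]) for s in slots) / len(slots)

def structural_similarity(visual_triplets, textual_triplets):
    """Matrix with one row per visual triplet and one column per textual triplet."""
    return [[triplet_similarity(v, t) for t in textual_triplets]
            for v in visual_triplets]

def propagate_to_instances(sim, subject_ids, num_instances):
    """Assumed propagation rule: for each textual triplet, a box receives the
    best score among visual triplets whose subject is that box; scores are
    summed over textual triplets. Boxes that are never a subject keep 0.0."""
    scores = [0.0] * num_instances
    if not sim:
        return scores
    for j in range(len(sim[0])):
        best = {}
        for row, sid in zip(sim, subject_ids):
            best[sid] = max(best.get(sid, float("-inf")), row[j])
        for sid, value in best.items():
            scores[sid] += value
    return scores

# Toy stand-ins for embeddings (hypothetical values, not model outputs).
DOG, PERSON = [1.0, 0.0], [0.0, 1.0]
LEFT_OF, RIGHT_OF = [1.0, 0.0], [0.0, 1.0]

# Boxes: 0 = dog A, 1 = dog B, 2 = person.
visual_triplets = [
    {"subject": DOG, "predicate": LEFT_OF, "object": PERSON},   # dog A left of person
    {"subject": DOG, "predicate": RIGHT_OF, "object": PERSON},  # dog B right of person
]
subject_ids = [0, 1]

# Caption "the dog left of the person" as a single textual triplet.
textual_triplets = [{"subject": DOG, "predicate": LEFT_OF, "object": PERSON}]

sim = structural_similarity(visual_triplets, textual_triplets)
scores = propagate_to_instances(sim, subject_ids, num_instances=3)
predicted_box = max(range(len(scores)), key=scores.__getitem__)
print(scores, predicted_box)
```

In this toy scene both dogs match the subject and object slots equally, so only the predicate separates them, which is the kind of relational cue the abstract says plain image-text matching tends to miss.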

Related papers

Hallucinate, Ground, Repeat: A Framework for Generalized Visual Relationship Detection [6.253919624802853]
This work introduces an iterative visual grounding framework that leverages large language models (LLMs) as structured relational priors. Inspired by expectation-maximization (EM), our method alternates between generating candidate scene graphs from detected objects using an LLM and training a visual model to align these hypotheses with perceptual evidence (maximization). We introduce a new benchmark for open-world VRD on Visual Genome with 21 held-out predicates and evaluate under three settings: seen, unseen, and mixed. Our model outperforms LLM-only, few-shot, and debiased baselines, achieving mean recall (mR@50)
arXiv Detail & Related papers (2025-06-06T00:43:15Z)
DIPO: Dual-State Images Controlled Articulated Object Generation Powered by Diverse Data [67.99373622902827]
DIPO is a framework for controllable generation of articulated 3D objects from a pair of images. We propose a dual-image diffusion model that captures relationships between the image pair to generate part layouts and joint parameters. We propose PM-X, a large-scale dataset of complex articulated 3D objects, accompanied by rendered images, URDF annotations, and textual descriptions.
arXiv Detail & Related papers (2025-05-26T18:55:14Z)
Compositional Image-Text Matching and Retrieval by Grounding Entities [1.962396488631213]
We propose a novel learning-free zero-shot augmentation of CLIP embeddings that has favorable compositional properties. We compute separate embeddings of sub-images of object entities and relations that are localized by state-of-the-art open-vocabulary detectors. The resulting embedding is then used for similarity computation with the text embedding, resulting in an average 1.5% improvement in image-text matching accuracy.
arXiv Detail & Related papers (2025-05-04T22:18:14Z)
Generalized Visual Relation Detection with Diffusion Models [94.62313788626128]
Visual relation detection (VRD) aims to identify relationships (or interactions) between object pairs in an image. We propose to model visual relations as continuous embeddings, and design diffusion models to achieve generalized VRD in a conditional generative manner. Our Diff-VRD is able to generate visual relations beyond the pre-defined category labels of datasets.
arXiv Detail & Related papers (2025-04-16T14:03:24Z)
Efficient Relational Context Perception for Knowledge Graph Completion [25.903926643251076]
Knowledge Graphs (KGs) provide a structured representation of knowledge but often suffer from challenges of incompleteness. Previous knowledge graph embedding models are limited in their ability to capture expressive features. We propose Triple Receptance Perception architecture to model sequential information, enabling the learning of dynamic context.
arXiv Detail & Related papers (2024-12-31T11:25:58Z)
Towards Flexible Visual Relationship Segmentation [25.890273232954055]
Visual relationship understanding has been studied separately in human-object interaction detection, scene graph generation, and relationships referring tasks. We propose FleVRS, a single model that seamlessly integrates the above three aspects in standard and promptable visual relationship segmentation. Our framework outperforms existing models in standard, promptable, and open-vocabulary tasks.
arXiv Detail & Related papers (2024-08-15T17:57:38Z)
VisMin: Visual Minimal-Change Understanding [7.226130826257802]
We introduce a new, challenging benchmark termed Visual Minimal-Change Understanding (VisMin). VisMin requires models to predict the correct image-caption match given two images and two captions. We generate a large-scale training dataset to finetune CLIP and Idefics2, showing significant improvements in fine-grained understanding across benchmarks.
arXiv Detail & Related papers (2024-07-23T18:10:43Z)
Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval [50.72924579220149]
Composed Image Retrieval (CIR) is a task that retrieves images similar to a query, based on a provided textual modification. Current techniques rely on supervised learning for CIR models using labeled triplets of the reference image, text, target image. We propose a new semi-supervised CIR approach where we search for a reference and its related target images in auxiliary data.
arXiv Detail & Related papers (2024-04-23T21:00:22Z)
Relation Rectification in Diffusion Model [64.84686527988809]
We introduce a novel task termed Relation Rectification, aiming to refine the model to accurately represent a given relationship it initially fails to generate. We propose an innovative solution utilizing a Heterogeneous Graph Convolutional Network (HGCN). The lightweight HGCN adjusts the text embeddings generated by the text encoder, ensuring the accurate reflection of the textual relation in the embedding space.
arXiv Detail & Related papers (2024-03-29T15:54:36Z)
Image2Sentence based Asymmetrical Zero-shot Composed Image Retrieval [92.13664084464514]
The task of composed image retrieval (CIR) aims to retrieve images based on the query image and the text describing the users' intent. Existing methods have made great progress with advanced large vision-language (VL) models in the CIR task; however, they generally suffer from two main issues: a lack of labeled triplets for model training and difficulty of deployment in resource-restricted environments. We propose Image2Sentence based Asymmetric zero-shot composed image retrieval (ISA), which takes advantage of the VL model and only relies on unlabeled images for composition learning.
arXiv Detail & Related papers (2024-03-03T07:58:03Z)
GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection [24.48128633414131]
We propose a zero-shot method that harnesses visual grounding ability from existing models trained from image-text pairs and pure object detection data. We demonstrate that the proposed method significantly outperforms other zero-shot methods on RefCOCO/+/g datasets.
arXiv Detail & Related papers (2023-12-22T20:14:55Z)
Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language. Most existing VG datasets are constructed using simple description texts. We propose a novel benchmark of Scene Knowledge-guided Visual Grounding.
arXiv Detail & Related papers (2023-07-21T13:06:02Z)
Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets. We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models. Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z)
Semantic Compositional Learning for Low-shot Scene Graph Generation [122.51930904132685]
Many scene graph generation (SGG) models solely use the limited annotated relation triples for training. We propose a novel semantic compositional learning strategy that makes it possible to construct additional, realistic relation triples. For three recent SGG models, adding our strategy improves their performance by close to 50%, and all of them substantially exceed the current state-of-the-art.
arXiv Detail & Related papers (2021-08-19T10:13:55Z)

This list is automatically generated from the titles and abstracts of the papers on this site.