Hallucinate, Ground, Repeat: A Framework for Generalized Visual Relationship Detection
- URL: http://arxiv.org/abs/2506.05651v1
- Date: Fri, 06 Jun 2025 00:43:15 GMT
- Title: Hallucinate, Ground, Repeat: A Framework for Generalized Visual Relationship Detection
- Authors: Shanmukha Vellamcheti, Sanjoy Kundu, Sathyanarayanan N. Aakur
- Abstract summary: This work introduces an iterative visual grounding framework that leverages large language models (LLMs) as structured relational priors. Inspired by expectation-maximization (EM), our method alternates between generating candidate scene graphs from detected objects using an LLM (expectation) and training a visual model to align these hypotheses with perceptual evidence (maximization). We introduce a new benchmark for open-world VRD on Visual Genome with 21 held-out predicates and evaluate under three settings: seen, unseen, and mixed. Our model outperforms LLM-only, few-shot, and debiased baselines, achieving mean recall (mR@50) of 15.9, 13.1, and 11.7 on predicate classification on these three sets.
- Score: 6.253919624802853
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding relationships between objects is central to visual intelligence, with applications in embodied AI, assistive systems, and scene understanding. Yet, most visual relationship detection (VRD) models rely on a fixed predicate set, limiting their generalization to novel interactions. A key challenge is the inability to visually ground semantically plausible, but unannotated, relationships hypothesized from external knowledge. This work introduces an iterative visual grounding framework that leverages large language models (LLMs) as structured relational priors. Inspired by expectation-maximization (EM), our method alternates between generating candidate scene graphs from detected objects using an LLM (expectation) and training a visual model to align these hypotheses with perceptual evidence (maximization). This process bootstraps relational understanding beyond annotated data and enables generalization to unseen predicates. Additionally, we introduce a new benchmark for open-world VRD on Visual Genome with 21 held-out predicates and evaluate under three settings: seen, unseen, and mixed. Our model outperforms LLM-only, few-shot, and debiased baselines, achieving mean recall (mR@50) of 15.9, 13.1, and 11.7 on predicate classification on these three sets. These results highlight the promise of grounded LLM priors for scalable open-world visual understanding.
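The abstract describes an EM-style loop that alternates between LLM-generated relational hypotheses and visual grounding. The sketch below is a minimal, hedged reading of that loop, not the authors' implementation: every name in it (`detector`, `prompt_llm`, `visual_model`, `alignment_loss`) is a hypothetical stand-in, and the object detector, LLM, and visual model are assumed to be supplied by the caller.
```python
# Minimal sketch of the "hallucinate, ground, repeat" loop described in the abstract.
# All names here are hypothetical stand-ins (assumptions), not the authors' actual API.


def llm_propose_relations(objects, prompt_llm):
    """Expectation step: hallucinate candidate (subject, predicate, object) triples
    for the detected objects. `prompt_llm` is any callable mapping a text prompt
    to a list of predicate strings."""
    candidates = []
    for i, subj in enumerate(objects):
        for j, obj in enumerate(objects):
            if i == j:
                continue
            prompt = f"List plausible relationships between a '{subj}' and a '{obj}'."
            candidates += [(subj, pred, obj) for pred in prompt_llm(prompt)]
    return candidates


def em_round(images, detector, prompt_llm, visual_model, optimizer):
    """One EM round: propose relations with the LLM, then ground them visually."""
    for image in images:
        objects = detector(image)                             # perceptual evidence
        triples = llm_propose_relations(objects, prompt_llm)  # E-step: hypotheses
        # M-step: update the visual model to align LLM hypotheses with the image
        # (assumes a torch-style model exposing an alignment loss and optimizer).
        loss = visual_model.alignment_loss(image, triples)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


def train(images, detector, prompt_llm, visual_model, optimizer, rounds=5):
    """Iterate hallucinate-then-ground rounds to bootstrap the relational
    vocabulary beyond the annotated predicates."""
    for _ in range(rounds):
        em_round(images, detector, prompt_llm, visual_model, optimizer)
```
In this reading, the expectation step is purely generative (no gradients flow through the LLM), while the maximization step updates only the visual model, which is what would let the system ground predicates that never appear in the annotations.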
Related papers
- Open World Scene Graph Generation using Vision Language Models [7.024230124913843]
Scene-Graph Generation (SGG) seeks to recognize objects in an image and distill their salient pairwise relationships.
We introduce Open-World SGG, a training-free, efficient, model-agnostic framework that taps directly into the pretrained knowledge of Vision Language Models (VLMs).
Our method combines multimodal prompting, embedding alignment, and a lightweight pair-refinement strategy, enabling inference over unseen object vocabularies and relation sets (a minimal embedding-alignment sketch appears after this list).
arXiv Detail & Related papers (2025-06-09T19:59:05Z)
- Generalized Visual Relation Detection with Diffusion Models [94.62313788626128]
Visual relation detection (VRD) aims to identify relationships (or interactions) between object pairs in an image.
We propose to model visual relations as continuous embeddings, and design diffusion models to achieve generalized VRD in a conditional generative manner.
Our Diff-VRD is able to generate visual relations beyond the pre-defined category labels of datasets.
arXiv Detail & Related papers (2025-04-16T14:03:24Z)
- PRISM-0: A Predicate-Rich Scene Graph Generation Framework for Zero-Shot Open-Vocabulary Tasks [51.31903029903904]
In Scene Graph Generation (SGG), one extracts a structured representation from visual inputs in the form of object nodes and the predicates connecting them.
PRISM-0 is a framework for zero-shot open-vocabulary SGG that bootstraps foundation models in a bottom-up approach.
PRISM-0 generates semantically meaningful graphs that improve downstream tasks such as Image Captioning and Sentence-to-Graph Retrieval.
arXiv Detail & Related papers (2025-04-01T14:29:51Z)
- FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension [10.482908189805872]
Referring Expression Comprehension (REC) is a crucial cross-modal task that objectively evaluates the capabilities of language understanding, image comprehension, and language-to-image grounding.
We have established a new REC dataset characterized by two key features.
It includes negative text and images created through fine-grained editing and generation based on existing data.
arXiv Detail & Related papers (2024-09-23T06:56:51Z)
- A Modern Take on Visual Relationship Reasoning for Grasp Planning [10.543168383800532]
We present a modern take on visual relational reasoning for grasp planning.
We introduce D3GD, a novel testbed that includes bin picking scenes with up to 35 objects from 97 distinct categories.
We also propose D3G, a new end-to-end transformer-based dependency graph generation model.
arXiv Detail & Related papers (2024-09-03T16:30:48Z)
- RelVAE: Generative Pretraining for few-shot Visual Relationship Detection [2.2230760534775915]
We present the first pretraining method for few-shot predicate classification that does not require any annotated relations.
We construct few-shot training splits and show quantitative experiments on VG200 and VRD datasets.
arXiv Detail & Related papers (2023-11-27T19:08:08Z)
- Detecting Any Human-Object Interaction Relationship: Universal HOI Detector with Spatial Prompt Learning on Foundation Models [55.20626448358655]
This study explores universal interaction recognition in an open-world setting through the use of Vision-Language (VL) foundation models and large language models (LLMs).
Our design includes an HO Prompt-guided Decoder (HOPD), which facilitates the association of high-level relation representations in the foundation model with various HO pairs within the image.
For open-category interaction recognition, our method supports either of two input types: interaction phrase or interpretive sentence.
arXiv Detail & Related papers (2023-11-07T08:27:32Z)
- Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z)
- SrTR: Self-reasoning Transformer with Visual-linguistic Knowledge for Scene Graph Generation [12.977857322594206]
One-stage scene graph generation approaches infer effective relations between entity pairs using sparse proposal sets and a few queries.
A Self-reasoning Transformer with Visual-linguistic Knowledge (SrTR) is proposed to add flexible self-reasoning ability to the model.
Inspired by large-scale pre-trained image-text foundation models, visual-linguistic prior knowledge is introduced.
arXiv Detail & Related papers (2022-12-19T09:47:27Z)
- RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning [139.0548263507796]
We use vision transformers (ViTs) as our base model for visual reasoning.
We make better use of concepts defined as object entities and their relations to improve the reasoning ability of ViTs.
We show the resulting model, Concept-guided Vision Transformer (or RelViT for short), significantly outperforms prior approaches on HICO and GQA benchmarks.
arXiv Detail & Related papers (2022-04-24T02:46:43Z)
- A Minimalist Dataset for Systematic Generalization of Perception, Syntax, and Semantics [131.93113552146195]
We present a new dataset, Handwritten arithmetic with INTegers (HINT), to examine machines' capability of learning generalizable concepts.
In HINT, machines are tasked with learning how concepts are perceived from raw signals such as images.
We undertake extensive experiments with various sequence-to-sequence models, including RNNs, Transformers, and GPT-3.
arXiv Detail & Related papers (2021-03-02T01:32:54Z)
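Several of the entries above (the Open-World SGG, UniVRD, and universal HOI papers) rely on aligning visual embeddings with text embeddings of candidate predicates so that an open vocabulary can be ranked without predicate-specific training. The self-contained sketch below illustrates only that general idea: the two encoders are random stand-ins (assumptions) for a real vision-language model such as CLIP, and none of the names correspond to any paper's actual code.
```python
# Toy illustration of open-vocabulary predicate ranking via embedding alignment.
# Both encoders are random placeholders for a pretrained VLM (assumption).
import torch
import torch.nn.functional as F

DIM = 512
torch.manual_seed(0)


def encode_pair_region(image_crop: torch.Tensor) -> torch.Tensor:
    """Placeholder visual encoder: a random projection of the flattened crop.
    A real system would use a pretrained VLM image encoder here."""
    proj = torch.randn(image_crop.numel(), DIM)
    return image_crop.flatten() @ proj


def encode_text(phrase: str) -> torch.Tensor:
    """Placeholder text encoder: hash the phrase into a deterministic random vector.
    A real system would use the matching VLM text encoder."""
    gen = torch.Generator().manual_seed(abs(hash(phrase)) % (2**31))
    return torch.randn(DIM, generator=gen)


def rank_predicates(union_crop, subject, obj, predicates):
    """Rank an open predicate vocabulary by cosine similarity between the
    visual embedding of the subject-object union box and each phrase embedding."""
    v = F.normalize(encode_pair_region(union_crop), dim=0)
    scores = {
        p: float(v @ F.normalize(encode_text(f"a {subject} {p} a {obj}"), dim=0))
        for p in predicates
    }
    return sorted(scores.items(), key=lambda kv: -kv[1])


if __name__ == "__main__":
    crop = torch.rand(3, 8, 8)  # toy union box for a (person, horse) pair
    print(rank_predicates(crop, "person", "horse", ["riding", "holding", "near", "feeding"]))
```
With the random stand-ins the ranking is arbitrary; the point is only the structure: one visual embedding per object pair, one text embedding per candidate phrase, and cosine similarity as the shared scoring rule.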