Related papers: Open World Scene Graph Generation using Vision Language Models

URL: http://arxiv.org/abs/2506.08189v1
Date: Mon, 09 Jun 2025 19:59:05 GMT
Title: Open World Scene Graph Generation using Vision Language Models
Authors: Amartya Dutta, Kazi Sajeed Mehrab, Medha Sawhney, Abhilash Neog, Mridul Khurana, Sepideh Fatemi, Aanish Pradhan, M. Maruf, Ismini Lourentzou, Arka Daw, Anuj Karpatne
Abstract summary: Scene-Graph Generation (SGG) seeks to recognize objects in an image and distill their salient pairwise relationships. We introduce Open-World SGG, a training-free, efficient, model-agnostic framework that taps directly into the pretrained knowledge of Vision Language Models (VLMs). Our method combines multimodal prompting, embedding alignment, and a lightweight pair-refinement strategy, enabling inference over unseen object vocabularies and relation sets.
Score: 7.024230124913843
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Scene-Graph Generation (SGG) seeks to recognize objects in an image and distill their salient pairwise relationships. Most methods depend on dataset-specific supervision to learn the variety of interactions, restricting their usefulness in open-world settings that involve novel objects and/or relations. Even methods that leverage large Vision Language Models (VLMs) typically require benchmark-specific fine-tuning. We introduce Open-World SGG, a training-free, efficient, model-agnostic framework that taps directly into the pretrained knowledge of VLMs to produce scene graphs with zero additional learning. Casting SGG as a zero-shot structured-reasoning problem, our method combines multimodal prompting, embedding alignment, and a lightweight pair-refinement strategy, enabling inference over unseen object vocabularies and relation sets. To assess this setting, we formalize an Open-World evaluation protocol that measures performance when no SGG-specific data have been observed, in terms of either objects or relations. Experiments on Visual Genome, Open Images V6, and the Panoptic Scene Graph (PSG) dataset demonstrate the capacity of pretrained VLMs to perform relational understanding without task-level training.

Related papers

PRISM-0: A Predicate-Rich Scene Graph Generation Framework for Zero-Shot Open-Vocabulary Tasks [51.31903029903904]
In Scene Graph Generation (SGG), one extracts a structured representation from visual inputs in the form of object nodes and predicates connecting them. PRISM-0 is a framework for zero-shot open-vocabulary SGG that bootstraps foundation models in a bottom-up approach. PRISM-0 generates semantically meaningful graphs that improve downstream tasks such as Image Captioning and Sentence-to-Graph Retrieval.
arXiv Detail & Related papers (2025-04-01T14:29:51Z)
Scene Graph Generation with Role-Playing Large Language Models [50.252588437973245]
Current approaches for open-vocabulary scene graph generation (OVSGG) use vision-language models such as CLIP. We propose SDSGG, a scene-specific description-based OVSGG framework. To capture the complicated interplay between subjects and objects, we propose a new lightweight module called the mutual visual adapter.
arXiv Detail & Related papers (2024-10-20T11:40:31Z)
From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models [81.92098140232638]
Scene graph generation (SGG) aims to parse a visual scene into an intermediate graph representation for downstream reasoning tasks. Existing methods struggle to generate scene graphs with novel visual relation concepts. We introduce a new open-vocabulary SGG framework based on sequence generation.
arXiv Detail & Related papers (2024-04-01T04:21:01Z)
GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection [24.48128633414131]
We propose a zero-shot method that harnesses visual grounding ability from existing models trained from image-text pairs and pure object detection data. We demonstrate that the proposed method significantly outperforms other zero-shot methods on RefCOCO/+/g datasets.
arXiv Detail & Related papers (2023-12-22T20:14:55Z)
Expanding Scene Graph Boundaries: Fully Open-vocabulary Scene Graph Generation via Visual-Concept Alignment and Retention [69.36723767339001]
Scene Graph Generation (SGG) offers a structured representation critical in many computer vision applications. We propose a unified framework named OvSGTR towards fully open-vocabulary SGG from a holistic view. For the more challenging settings of relation-involved open-vocabulary SGG, the proposed approach integrates relation-aware pretraining.
arXiv Detail & Related papers (2023-11-18T06:49:17Z)
Adaptive Visual Scene Understanding: Incremental Scene Graph Generation [18.541428517746034]
Scene graph generation (SGG) analyzes images to extract meaningful information about objects and their relationships. We present a benchmark comprising three learning regimes: relationship incremental, scene incremental, and relationship generalization. We also introduce a "Replays via Analysis by Synthesis" method named RAS.
arXiv Detail & Related papers (2023-10-02T21:02:23Z)
Local-Global Information Interaction Debiasing for Dynamic Scene Graph Generation [51.92419880088668]
We propose a novel DynSGG model based on multi-task learning, DynSGG-MTL, which introduces local interaction information and global human-action interaction information. Long-temporal human actions supervise the model to generate multiple scene graphs that conform to the global constraints and keep the model from failing to learn the tail predicates.
arXiv Detail & Related papers (2023-08-10T01:24:25Z)
LANDMARK: Language-guided Representation Enhancement Framework for Scene Graph Generation [34.40862385518366]
Scene graph generation (SGG) is a sophisticated task that suffers from both complex visual features and the dataset long-tail problem. We propose LANDMARK (LANguage-guiDed representation enhanceMent frAmewoRK), which learns predicate-relevant representations from language-vision interactive patterns. This framework is model-agnostic and consistently improves performance on existing SGG models.
arXiv Detail & Related papers (2023-03-02T09:03:11Z)
Towards Open-vocabulary Scene Graph Generation with Prompt-based Finetuning [84.39787427288525]
Scene graph generation (SGG) is a fundamental task aimed at detecting visual relations between objects in an image. We introduce open-vocabulary scene graph generation, a novel, realistic and challenging setting in which a model is trained on a set of base object classes. Our method can support inference over completely unseen object classes, which existing methods are incapable of handling.
arXiv Detail & Related papers (2022-08-17T09:05:38Z)

This list is automatically generated from the titles and abstracts of the papers on this site.