Measuring Image-Relation Alignment: Reference-Free Evaluation of VLMs and Synthetic Pre-training for Open-Vocabulary Scene Graph Generation
- URL: http://arxiv.org/abs/2509.01209v1
- Date: Mon, 01 Sep 2025 07:46:58 GMT
- Title: Measuring Image-Relation Alignment: Reference-Free Evaluation of VLMs and Synthetic Pre-training for Open-Vocabulary Scene Graph Generation
- Authors: Maëlic Neau, Zoe Falomir, Cédric Buche, Akihiro Sugimoto
- Abstract summary: Scene Graph Generation (SGG) encodes visual relationships between objects in images as graph structures. Current benchmarks in SGG possess a very limited vocabulary. We propose a new reference-free metric to fairly evaluate the open-vocabulary capabilities of VLMs for relation prediction.
- Score: 4.633828400918887
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Scene Graph Generation (SGG) encodes visual relationships between objects in images as graph structures. Thanks to the advances of Vision-Language Models (VLMs), the task of Open-Vocabulary SGG has been recently proposed where models are evaluated on their functionality to learn a wide and diverse range of relations. Current benchmarks in SGG, however, possess a very limited vocabulary, making the evaluation of open-source models inefficient. In this paper, we propose a new reference-free metric to fairly evaluate the open-vocabulary capabilities of VLMs for relation prediction. Another limitation of Open-Vocabulary SGG is the reliance on weakly supervised data of poor quality for pre-training. We also propose a new solution for quickly generating high-quality synthetic data through region-specific prompt tuning of VLMs. Experimental results show that pre-training with this new data split can benefit the generalization capabilities of Open-Voc SGG models.
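As a rough illustration of what a reference-free, VLM-based relation score can look like (the paper's actual metric is not reproduced here), the sketch below ranks candidate relation triplets by CLIP image-text similarity; the model choice and prompt template are assumptions.

```python
# Hypothetical sketch of a reference-free relation-alignment score:
# rank candidate relation phrases by CLIP image-text similarity,
# with no ground-truth references involved.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def relation_alignment_scores(image: Image.Image,
                              triplets: list[tuple[str, str, str]]) -> list[float]:
    """Score <subject, predicate, object> triplets against an image."""
    # Prompt template is an assumption, not the paper's.
    phrases = [f"a photo of a {s} {p} a {o}" for s, p, o in triplets]
    inputs = processor(text=phrases, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Cosine similarity between the image embedding and each phrase embedding.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (txt @ img.T).squeeze(-1).tolist()
```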
Related papers
- VOST-SGG: VLM-Aided One-Stage Spatio-Temporal Scene Graph Generation [18.15310805625469]
VOST-SGG is a VLM-aided one-stage ST-SGG framework that integrates the common sense reasoning capabilities of vision-language models. We propose a multi-modal feature bank that fuses visual, textual, and spatial cues for improved predicate classification. Our approach achieves state-of-the-art performance, validating the effectiveness of integrating VLM-aided semantic priors and multi-modal features for ST-SGG.
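A minimal sketch of what fusing visual, textual, and spatial cues for predicate classification might look like; the dimensions, the concatenate-then-project fusion, and the class count are illustrative assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class FeatureBankFusion(nn.Module):
    """Hypothetical multi-modal fusion head: concatenate visual, textual,
    and spatial features, project, and classify the predicate."""
    def __init__(self, d_vis=512, d_txt=512, d_spa=8, d_hid=256, n_predicates=50):
        super().__init__()
        self.proj = nn.Linear(d_vis + d_txt + d_spa, d_hid)
        self.cls = nn.Linear(d_hid, n_predicates)

    def forward(self, vis, txt, spa):
        fused = torch.cat([vis, txt, spa], dim=-1)
        return self.cls(torch.relu(self.proj(fused)))  # predicate logits
```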
arXiv Detail & Related papers (2025-12-05T08:34:06Z) - Open World Scene Graph Generation using Vision Language Models [7.024230124913843]
Scene-Graph Generation (SGG) seeks to recognize objects in an image and distill their salient pairwise relationships. We introduce Open-World SGG, a training-free, efficient, model-agnostic framework that taps directly into the pretrained knowledge of Vision Language Models (VLMs). Our method combines multimodal prompting, embedding alignment, and a lightweight pair-refinement strategy, enabling inference over unseen object vocabularies and relation sets.
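The embedding-alignment step can be pictured as a nearest-neighbour lookup in text-embedding space, mapping a VLM's free-form relation description onto an arbitrary (possibly unseen) relation vocabulary. A hedged sketch, assuming CLIP text embeddings stand in for the paper's actual alignment:

```python
import torch
from transformers import CLIPTextModelWithProjection, CLIPTokenizer

tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
enc = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

def align(description: str, vocabulary: list[str]) -> str:
    """Map a free-form relation description to its nearest vocabulary entry."""
    batch = tok([description] + vocabulary, padding=True, return_tensors="pt")
    with torch.no_grad():
        emb = enc(**batch).text_embeds
    emb = emb / emb.norm(dim=-1, keepdim=True)
    sims = emb[0] @ emb[1:].T  # cosine similarity to each vocabulary entry
    return vocabulary[int(sims.argmax())]

# e.g. align("a man is riding on top of a horse", ["riding", "next to", "holding"])
```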
arXiv Detail & Related papers (2025-06-09T19:59:05Z) - Compile Scene Graphs with Reinforcement Learning [69.36723767339001]
Next-token prediction is the fundamental principle for training large language models (LLMs). We introduce R1-SGG, a multimodal LLM (M-LLM) initially trained via supervised fine-tuning (SFT) on a scene graph dataset. We design a set of graph-centric rewards, including three recall-based variants: Hard Recall, Hard Recall+Relax, and Soft Recall.
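The abstract does not define the reward variants precisely; the sketch below gives one plausible reading of Hard Recall (exact triplet match) and Soft Recall (per-component partial credit) as reward signals over predicted versus ground-truth triplets.

```python
# Hedged sketch of graph-centric recall rewards; R1-SGG's exact
# definitions (including Hard Recall+Relax) may differ.
Triplet = tuple[str, str, str]

def hard_recall(pred: list[Triplet], gold: list[Triplet]) -> float:
    """Fraction of gold triplets matched exactly by some prediction."""
    pred_set = set(pred)
    return sum(t in pred_set for t in gold) / max(len(gold), 1)

def soft_recall(pred: list[Triplet], gold: list[Triplet]) -> float:
    """Per-gold-triplet best partial credit: matching components out of 3."""
    def credit(p: Triplet, g: Triplet) -> float:
        return sum(a == b for a, b in zip(p, g)) / 3.0
    return sum(max((credit(p, g) for p in pred), default=0.0)
               for g in gold) / max(len(gold), 1)
```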
arXiv Detail & Related papers (2025-04-18T10:46:22Z) - PRISM-0: A Predicate-Rich Scene Graph Generation Framework for Zero-Shot Open-Vocabulary Tasks [51.31903029903904]
In Scene Graph Generation (SGG), one extracts a structured representation from visual inputs in the form of object nodes and the predicates connecting them. PRISM-0 is a framework for zero-shot open-vocabulary SGG that bootstraps foundation models in a bottom-up approach. PRISM-0 generates semantically meaningful graphs that improve downstream tasks such as Image Captioning and Sentence-to-Graph Retrieval.
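One way to picture a bottom-up, zero-shot bootstrap of foundation models: caption the union region of each detected object pair and read a predicate out of the caption. The captioner choice and the naive predicate extraction below are assumptions, not PRISM-0's pipeline.

```python
from PIL import Image
from transformers import pipeline

# Captioner is an illustrative choice; object boxes come from any detector.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

def union_box(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def pair_predicate(image: Image.Image, subj, obj) -> str:
    """subj/obj: (name, (x1, y1, x2, y2)) pairs from an object detector."""
    crop = image.crop(union_box(subj[1], obj[1]))
    caption = captioner(crop)[0]["generated_text"]
    # Naive extraction: whatever the caption says between the two names.
    words = caption.lower().split()
    if subj[0] in words and obj[0] in words:
        i, j = words.index(subj[0]), words.index(obj[0])
        return " ".join(words[i + 1:j]) or "related to"
    return "related to"
```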
arXiv Detail & Related papers (2025-04-01T14:29:51Z) - From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models [81.92098140232638]
Scene graph generation (SGG) aims to parse a visual scene into an intermediate graph representation for downstream reasoning tasks.
Existing methods struggle to generate scene graphs with novel visual relation concepts.
We introduce a new open-vocabulary SGG framework based on sequence generation.
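Framing SGG as sequence generation means flattening the graph into a token string that an image-to-text model can emit, then parsing it back into triplets; the serialization format below is a minimal assumption, not the paper's exact scheme.

```python
# Hedged sketch: round-trip between a scene graph and a flat sequence.
def serialize(triplets: list[tuple[str, str, str]]) -> str:
    return " ; ".join(f"{s} , {p} , {o}" for s, p, o in triplets)

def parse(sequence: str) -> list[tuple[str, str, str]]:
    triplets = []
    for chunk in sequence.split(";"):
        parts = [x.strip() for x in chunk.split(",")]
        if len(parts) == 3 and all(parts):  # drop malformed chunks
            triplets.append(tuple(parts))
    return triplets

assert parse(serialize([("dog", "chasing", "ball")])) == [("dog", "chasing", "ball")]
```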
arXiv Detail & Related papers (2024-04-01T04:21:01Z) - Leveraging Large Language Models for Node Generation in Few-Shot Learning on Text-Attributed Graphs [5.587264586806575]
We propose a plug-and-play approach to empower text-attributed graphs through node generation using Large Language Models (LLMs). LLMs extract semantic information from labels and generate samples that belong to those categories as exemplars. We employ an edge predictor to capture structural information inherent in the raw dataset and integrate the newly generated samples into the original graph.
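A hedged sketch of the integration step: embed the LLM-generated exemplars and wire them into the original graph with a similarity-threshold "edge predictor". The cosine-similarity rule and threshold below stand in for the learned predictor described in the abstract.

```python
import numpy as np

def integrate_generated_nodes(node_emb: np.ndarray,
                              gen_emb: np.ndarray,
                              threshold: float = 0.8) -> list[tuple[int, int]]:
    """Return (i, j) edges linking original node i to generated node j
    whenever their cosine similarity clears the (assumed) threshold."""
    a = node_emb / np.linalg.norm(node_emb, axis=1, keepdims=True)
    b = gen_emb / np.linalg.norm(gen_emb, axis=1, keepdims=True)
    sims = a @ b.T  # cosine similarity matrix, shape (n_orig, n_generated)
    return list(zip(*np.nonzero(sims >= threshold)))
```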
arXiv Detail & Related papers (2023-10-15T16:04:28Z) - Zero-Shot Video Moment Retrieval from Frozen Vision-Language Models [58.17315970207874]
We propose a zero-shot method for adapting generalisable visual-textual priors from an arbitrary VLM to facilitate moment-text alignment.
Experiments conducted on three VMR benchmark datasets demonstrate the notable performance advantages of our zero-shot algorithm.
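The core intuition of moment-text alignment with a frozen VLM can be sketched as window selection over per-frame query similarities (obtained from any VLM); the paper's actual proposal mechanism is more involved than this exhaustive search.

```python
# Hedged sketch: pick the contiguous frame window with the highest mean
# similarity to the text query. frame_scores[i] is the (precomputed)
# VLM similarity between frame i and the query; bounds are assumptions.
def best_moment(frame_scores: list[float], min_len: int = 3, max_len: int = 30):
    best, span = float("-inf"), (0, 0)
    n = len(frame_scores)
    for i in range(n):
        for j in range(i + min_len, min(i + max_len, n) + 1):
            mean = sum(frame_scores[i:j]) / (j - i)
            if mean > best:
                best, span = mean, (i, j)
    return span  # [start_frame, end_frame)
```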
arXiv Detail & Related papers (2023-09-01T13:06:50Z) - Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs [79.64891686479213]
We show that it is possible to improve vision and language models (VLMs) when learning from scene graphs (SGs).
For the visual side, we incorporate a special "SG Component" in the image transformer trained to predict SG information, while for the textual side, we utilize SGs to generate fine-grained captions.
Our method improves the performance of several popular VLMs on multiple datasets with only a mild degradation in ZS capabilities.
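For the textual side, turning a scene graph into a fine-grained caption could look like the toy template below; the paper's generated captions are richer, so treat this purely as an illustration of the idea.

```python
# Hedged sketch: derive a caption from SG triplets with a fixed template.
def sg_to_caption(triplets: list[tuple[str, str, str]]) -> str:
    clauses = [f"the {s} is {p} the {o}" for s, p, o in triplets]
    return ", and ".join(clauses) + "."

print(sg_to_caption([("man", "holding", "cup"), ("cup", "on", "table")]))
# -> "the man is holding the cup, and the cup is on the table."
```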
arXiv Detail & Related papers (2023-05-10T17:52:26Z) - Visually-Prompted Language Model for Fine-Grained Scene Graph Generation in an Open World [67.03968403301143]
Scene Graph Generation (SGG) aims to extract <subject, predicate, object> relationships in images for vision understanding.
Existing re-balancing strategies try to handle it via prior rules but are still confined to pre-defined conditions.
We propose a Cross-modal prediCate boosting (CaCao) framework, where a visually-prompted language model is learned to generate diverse fine-grained predicates.
arXiv Detail & Related papers (2023-03-23T13:06:38Z) - LUKE-Graph: A Transformer-based Approach with Gated Relational Graph
Attention for Cloze-style Reading Comprehension [13.173307471333619]
We propose the LUKE-Graph, a model that builds a heterogeneous graph based on the intuitive relationships between entities in a document.
We then use the gated Relational Graph Attention (RGAT) network to fuse the graph's reasoning information and the contextual representation encoded by the pre-trained LUKE model.
Experimental results demonstrate that the LUKE-Graph achieves state-of-the-art performance with commonsense reasoning.
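The gated fusion of RGAT's graph representation with LUKE's contextual representation can be sketched with a standard sigmoid gate; this is a common design pattern, not necessarily the paper's exact equations.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Hypothetical gate: blend graph and contextual representations,
    letting the model decide per-dimension which source to trust."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, h_ctx: torch.Tensor, h_graph: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([h_ctx, h_graph], dim=-1)))
        return g * h_graph + (1 - g) * h_ctx
```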
arXiv Detail & Related papers (2023-03-12T14:31:44Z) - Fine-Grained Scene Graph Generation with Data Transfer [127.17675443137064]
Scene graph generation (SGG) aims to extract (subject, predicate, object) triplets in images.
Recent works have made steady progress on SGG and provide useful tools for high-level vision and language understanding.
We propose a novel Internal and External Data Transfer (IETrans) method, which can be applied in a plug-and-play fashion and expanded to large-scale SGG with 1,807 predicate classes.
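Internal transfer can be pictured as relabeling training triplets that carry a generic head predicate with a fine-grained child predicate once a scorer is confident enough; the predicate mapping, scorer, and threshold below are illustrative assumptions, not IETrans itself.

```python
# Hedged sketch of "internal transfer" relabeling. GENERIC_TO_FINE is a
# hypothetical hierarchy; score_fn is any model scoring a candidate triplet.
GENERIC_TO_FINE = {"on": ["standing on", "sitting on", "lying on"]}

def internal_transfer(samples, score_fn, threshold: float = 0.7):
    """samples: list of (subject, predicate, object) training triplets."""
    out = []
    for s, p, o in samples:
        fine = [(score_fn((s, q, o)), q) for q in GENERIC_TO_FINE.get(p, [])]
        best_score, best_pred = max(fine, default=(0.0, p))
        # Keep the generic label unless a fine-grained one is confident.
        out.append((s, best_pred if best_score >= threshold else p, o))
    return out
```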
arXiv Detail & Related papers (2022-03-22T12:26:56Z)