Related papers: From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models

From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models

URL: http://arxiv.org/abs/2404.00906v3
Date: Wed, 24 Apr 2024 07:15:16 GMT
Title: From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models
Authors: Rongjie Li, Songyang Zhang, Dahua Lin, Kai Chen, Xuming He,
Abstract summary: Scene graph generation (SGG) aims to parse a visual scene into an intermediate graph representation for downstream reasoning tasks. Existing methods struggle to generate scene graphs with novel visual relation concepts. We introduce a new open-vocabulary SGG framework based on sequence generation.
Score: 81.92098140232638
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Scene graph generation (SGG) aims to parse a visual scene into an intermediate graph representation for downstream reasoning tasks. Despite recent advancements, existing methods struggle to generate scene graphs with novel visual relation concepts. To address this challenge, we introduce a new open-vocabulary SGG framework based on sequence generation. Our framework leverages vision-language pre-trained models (VLM) by incorporating an image-to-graph generation paradigm. Specifically, we generate scene graph sequences via image-to-text generation with VLM and then construct scene graphs from these sequences. By doing so, we harness the strong capabilities of VLM for open-vocabulary SGG and seamlessly integrate explicit relational modeling for enhancing the VL tasks. Experimental results demonstrate that our design not only achieves superior performance with an open vocabulary but also enhances downstream vision-language task performance through explicit relation modeling knowledge.

Related papers

Open World Scene Graph Generation using Vision Language Models [7.024230124913843]
Scene-Graph Generation (SGG) seeks to recognize objects in an image and distill their salient pairwise relationships.<n>We introduce Open-World SGG, a training-free, efficient, model-agnostic framework that taps directly into the pretrained knowledge of Vision Language Models (VLMs)<n>Our method combines multimodal prompting, embedding alignment, and a lightweight pair-refinement strategy, enabling inference over unseen object vocabularies and relation sets.
arXiv Detail & Related papers (2025-06-09T19:59:05Z)
PRISM-0: A Predicate-Rich Scene Graph Generation Framework for Zero-Shot Open-Vocabulary Tasks [51.31903029903904]
In Scene Graphs Generation (SGG) one extracts structured representation from visual inputs in the form of objects nodes and predicates connecting them. PRISM-0 is a framework for zero-shot open-vocabulary SGG that bootstraps foundation models in a bottom-up approach. PRIMS-0 generates semantically meaningful graphs that improve downstream tasks such as Image Captioning and Sentence-to-Graph Retrieval.
arXiv Detail & Related papers (2025-04-01T14:29:51Z)
Fine-Grained Video Captioning through Scene Graph Consolidation [44.30028794237688]
We propose a novel zero-shot video captioning approach that combines frame-level scene graphs from a video to obtain intermediate representations for caption generation. Our method first generates frame-level captions using an image VLM, converts them into scene graphs, and consolidates these graphs to produce comprehensive video-level descriptions.
arXiv Detail & Related papers (2025-02-23T03:59:05Z)
Scene Graph Disentanglement and Composition for Generalizable Complex Image Generation [44.457347230146404]
We leverage the scene graph, a powerful structured representation, for complex image generation. We employ the generative capabilities of variational autoencoders and diffusion models in a generalizable manner. Our method outperforms recent competitors based on text, layout, or scene graph.
arXiv Detail & Related papers (2024-10-01T07:02:46Z)
Joint Generative Modeling of Scene Graphs and Images via Diffusion Models [37.788957749123725]
We present a novel generative task: joint scene graph - image generation. We introduce a novel diffusion model, DiffuseSG, that jointly models the adjacency matrix along with heterogeneous node and edge attributes. With a graph transformer being the denoiser, DiffuseSG successively denoises the scene graph representation in a continuous space and discretizes the final representation to generate the clean scene graph.
arXiv Detail & Related papers (2024-01-02T10:10:29Z)
SelfGraphVQA: A Self-Supervised Graph Neural Network for Scene-based Question Answering [0.0]
Scene graphs have emerged as a useful tool for multimodal image analysis. Current methods that utilize idealized annotated scene graphs struggle to generalize when using predicted scene graphs extracted from images. Our approach extracts a scene graph from an input image using a pre-trained scene graph generator.
arXiv Detail & Related papers (2023-10-03T07:14:53Z)
Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning. Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z)
Scene Graph Modification as Incremental Structure Expanding [61.84291817776118]
We focus on scene graph modification (SGM), where the system is required to learn how to update an existing scene graph based on a natural language query. We frame SGM as a graph expansion task by introducing the incremental structure expanding (ISE) We construct a challenging dataset that contains more complicated queries and larger scene graphs than existing datasets.
arXiv Detail & Related papers (2022-09-15T16:26:14Z)
SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning [61.57887011165744]
multimodal Transformers have made great progress in the task of Visual Commonsense Reasoning. We propose a Scene Graph Enhanced Image-Text Learning framework to incorporate visual scene graphs in commonsense reasoning.
arXiv Detail & Related papers (2021-12-16T03:16:30Z)
Learning to Generate Scene Graph from Natural Language Supervision [52.18175340725455]
We propose one of the first methods that learn from image-sentence pairs to extract a graphical representation of localized objects and their relationships within an image, known as scene graph. We leverage an off-the-shelf object detector to identify and localize object instances, match labels of detected regions to concepts parsed from captions, and thus create "pseudo" labels for learning scene graph.
arXiv Detail & Related papers (2021-09-06T03:38:52Z)
Unconditional Scene Graph Generation [72.53624470737712]
We develop a deep auto-regressive model called SceneGraphGen which can learn the probability distribution over labelled and directed graphs. We show that the scene graphs generated by SceneGraphGen are diverse and follow the semantic patterns of real-world scenes.
arXiv Detail & Related papers (2021-08-12T17:57:16Z)

This list is automatically generated from the titles and abstracts of the papers in this site.