Related papers: Expanding Scene Graph Boundaries: Fully Open-vocabulary Scene Graph Generation via Visual-Concept Alignment and Retention

URL: http://arxiv.org/abs/2311.10988v1
Date: Sat, 18 Nov 2023 06:49:17 GMT
Title: Expanding Scene Graph Boundaries: Fully Open-vocabulary Scene Graph Generation via Visual-Concept Alignment and Retention
Authors: Zuyao Chen, Jinlin Wu, Zhen Lei, Zhaoxiang Zhang and Changwen Chen
Abstract summary: Scene Graph Generation (SGG) offers a structured representation critical in many computer vision applications. We propose a unified framework named OvSGTR towards fully open vocabulary SGG from a holistic view. For the more challenging settings of relation-involved open vocabulary SGG, the proposed approach integrates relation-aware pre-training utilizing image-caption data.
Score: 74.42036028592705
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Scene Graph Generation (SGG) offers a structured representation critical in many computer vision applications. Traditional SGG approaches, however, are limited by a closed-set assumption, restricting their ability to recognize only predefined object and relation categories. To overcome this, we categorize SGG scenarios into four distinct settings based on the node and edge: Closed-set SGG, Open Vocabulary (object) Detection-based SGG (OvD-SGG), Open Vocabulary Relation-based SGG (OvR-SGG), and Open Vocabulary Detection + Relation-based SGG (OvD+R-SGG). While object-centric open vocabulary SGG has been studied recently, the more challenging problem of relation-involved open-vocabulary SGG remains relatively unexplored. To fill this gap, we propose a unified framework named OvSGTR towards fully open vocabulary SGG from a holistic view. The proposed framework is an end-to-end transformer architecture, which learns a visual-concept alignment for both nodes and edges, enabling the model to recognize unseen categories. For the more challenging settings of relation-involved open vocabulary SGG, the proposed approach integrates relation-aware pre-training utilizing image-caption data and retains visual-concept alignment through knowledge distillation. Comprehensive experimental results on the Visual Genome benchmark demonstrate the effectiveness and superiority of the proposed framework.
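
Illustrative sketch (not from the paper): the four settings named in the abstract can be read as the combinations of whether unseen categories appear on the nodes (objects) and on the edges (relations). The mapping below is one reading of the setting names, and the identifiers `SGGSetting` and `sgg_setting` are hypothetical, not part of OvSGTR.

```python
from enum import Enum


class SGGSetting(Enum):
    """The four settings listed in the abstract (values are the abstract's names)."""
    CLOSED_SET = "Closed-set SGG"
    OVD_SGG = "OvD-SGG"      # open vocabulary on objects (nodes) only
    OVR_SGG = "OvR-SGG"      # open vocabulary on relations (edges) only
    OVD_R_SGG = "OvD+R-SGG"  # open vocabulary on both nodes and edges


def sgg_setting(novel_objects: bool, novel_relations: bool) -> SGGSetting:
    """Map the two flags to a setting (assumed reading of the abstract's names)."""
    if novel_objects and novel_relations:
        return SGGSetting.OVD_R_SGG
    if novel_objects:
        return SGGSetting.OVD_SGG
    if novel_relations:
        return SGGSetting.OVR_SGG
    return SGGSetting.CLOSED_SET


if __name__ == "__main__":
    # Print the setting for every combination of the two flags.
    for objs in (False, True):
        for rels in (False, True):
            print(objs, rels, sgg_setting(objs, rels).value)
```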

Related papers

Open World Scene Graph Generation using Vision Language Models [7.024230124913843]
Scene-Graph Generation (SGG) seeks to recognize objects in an image and distill their salient pairwise relationships. We introduce Open-World SGG, a training-free, efficient, model-agnostic framework that taps directly into the pretrained knowledge of Vision Language Models (VLMs). Our method combines multimodal prompting, embedding alignment, and a lightweight pair-refinement strategy, enabling inference over unseen object vocabularies and relation sets.
arXiv Detail & Related papers (2025-06-09T19:59:05Z)
PRISM-0: A Predicate-Rich Scene Graph Generation Framework for Zero-Shot Open-Vocabulary Tasks [51.31903029903904]
In Scene Graph Generation (SGG), one extracts a structured representation from visual inputs in the form of object nodes and the predicates connecting them. PRISM-0 is a framework for zero-shot open-vocabulary SGG that bootstraps foundation models in a bottom-up approach. PRISM-0 generates semantically meaningful graphs that improve downstream tasks such as Image Captioning and Sentence-to-Graph Retrieval.
arXiv Detail & Related papers (2025-04-01T14:29:51Z)
LLaVA-SpaceSGG: Visual Instruct Tuning for Open-vocabulary Scene Graph Generation with Enhanced Spatial Relations [13.055077747280917]
Scene Graph Generation (SGG) converts visual scenes into structured graph representations. Existing SGG models often overlook essential spatial relationships and struggle with generalization in open-vocabulary contexts. We propose LLaVA-SpaceSGG, a multimodal large language model (MLLM) designed for open-vocabulary SGG with enhanced spatial relation modeling.
arXiv Detail & Related papers (2024-12-09T09:18:32Z)
Scene Graph Generation with Role-Playing Large Language Models [50.252588437973245]
Current approaches for open-vocabulary scene graph generation (OVSGG) use vision-language models such as CLIP. We propose SDSGG, a scene-specific description based OVSGG framework. To capture the complicated interplay between subjects and objects, we propose a new lightweight module called mutual visual adapter.
arXiv Detail & Related papers (2024-10-20T11:40:31Z)
HiKER-SGG: Hierarchical Knowledge Enhanced Robust Scene Graph Generation [13.929906773382752]
A common approach enabling the ability to reason over visual data is Scene Graph Generation (SGG). We propose a novel SGG benchmark containing procedurally generated weather corruptions and other transformations over the Visual Genome dataset. We show that HiKER-SGG not only demonstrates superior performance on corrupted images in a zero-shot manner, but also outperforms current state-of-the-art methods on uncorrupted SGG tasks.
arXiv Detail & Related papers (2024-03-18T17:59:10Z)
Adaptive Self-training Framework for Fine-grained Scene Graph Generation [29.37568710952893]
Scene graph generation (SGG) models have suffered from inherent problems regarding the benchmark datasets. We introduce a Self-Training framework for SGG (ST-SGG) that assigns pseudo-labels for unannotated triplets. Our experiments verify the effectiveness of ST-SGG on various SGG models.
arXiv Detail & Related papers (2024-01-18T08:10:34Z)
Adaptive Visual Scene Understanding: Incremental Scene Graph Generation [18.541428517746034]
Scene graph generation (SGG) analyzes images to extract meaningful information about objects and their relationships. We present a benchmark comprising three learning regimes: relationship incremental, scene incremental, and relationship generalization. We also introduce a "Replays via Analysis by Synthesis" method named RAS.
arXiv Detail & Related papers (2023-10-02T21:02:23Z)
Visually-Prompted Language Model for Fine-Grained Scene Graph Generation in an Open World [67.03968403301143]
Scene Graph Generation (SGG) aims to extract <subject, predicate, object> relationships in images for vision understanding. Existing re-balancing strategies try to handle it via prior rules but are still confined to pre-defined conditions. We propose a Cross-modal prediCate boosting (CaCao) framework, where a visually-prompted language model is learned to generate diverse fine-grained predicates.
arXiv Detail & Related papers (2023-03-23T13:06:38Z)
Towards Open-vocabulary Scene Graph Generation with Prompt-based Finetuning [84.39787427288525]
Scene graph generation (SGG) is a fundamental task aimed at detecting visual relations between objects in an image. We introduce open-vocabulary scene graph generation, a novel, realistic and challenging setting in which a model is trained on a set of base object classes. Our method can support inference over completely unseen object classes, which existing methods are incapable of handling.
arXiv Detail & Related papers (2022-08-17T09:05:38Z)
Stacked Hybrid-Attention and Group Collaborative Learning for Unbiased Scene Graph Generation [62.96628432641806]
Scene Graph Generation aims to first encode the visual contents within the given image and then parse them into a compact summary graph. We first present a novel Stacked Hybrid-Attention network, which facilitates the intra-modal refinement as well as the inter-modal interaction. We then devise an innovative Group Collaborative Learning strategy to optimize the decoder.
arXiv Detail & Related papers (2022-03-18T09:14:13Z)
Weakly Supervised Visual Semantic Parsing [49.69377653925448]
Scene Graph Generation (SGG) aims to extract entities, predicates and their semantic structure from images. Existing SGG methods require millions of manually annotated bounding boxes for training. We propose Visual Semantic Parsing, VSPNet, and a graph-based weakly supervised learning framework.
arXiv Detail & Related papers (2020-01-08T03:46:13Z)

This list is automatically generated from the titles and abstracts of the papers on this site.