Expanding Scene Graph Boundaries: Fully Open-vocabulary Scene Graph Generation via Visual-Concept Alignment and Retention
- URL: http://arxiv.org/abs/2311.10988v2
- Date: Mon, 07 Oct 2024 07:15:19 GMT
- Title: Expanding Scene Graph Boundaries: Fully Open-vocabulary Scene Graph Generation via Visual-Concept Alignment and Retention
- Authors: Zuyao Chen, Jinlin Wu, Zhen Lei, Zhaoxiang Zhang, Changwen Chen
- Abstract summary: Scene Graph Generation (SGG) offers a structured representation critical in many computer vision applications.
We propose a unified framework named OvSGTR towards fully open vocabulary SGG from a holistic view.
For the more challenging settings of relation-involved open vocabulary SGG, the proposed approach integrates relation-aware pretraining.
- Abstract: Scene Graph Generation (SGG) offers a structured representation critical in many computer vision applications. Traditional SGG approaches, however, are limited by a closed-set assumption, restricting them to recognizing only predefined object and relation categories. To overcome this, we categorize SGG scenarios into four distinct settings based on the openness of node and edge categories: Closed-set SGG, Open Vocabulary (object) Detection-based SGG (OvD-SGG), Open Vocabulary Relation-based SGG (OvR-SGG), and Open Vocabulary Detection + Relation-based SGG (OvD+R-SGG). While object-centric open vocabulary SGG has been studied recently, the more challenging problem of relation-involved open-vocabulary SGG remains relatively unexplored. To fill this gap, we propose a unified framework named OvSGTR towards fully open vocabulary SGG from a holistic view. The proposed framework is an end-to-end transformer architecture that learns a visual-concept alignment for both nodes and edges, enabling the model to recognize unseen categories. For the more challenging settings of relation-involved open vocabulary SGG, the proposed approach integrates relation-aware pretraining on image-caption data and retains visual-concept alignment through knowledge distillation. Comprehensive experimental results on the Visual Genome benchmark demonstrate the effectiveness and superiority of the proposed framework. Our code is available at https://github.com/gpt4vision/OvSGTR/.
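The central mechanism mentioned in the abstract, visual-concept alignment, can be pictured as a similarity score between visual features (for nodes or edges) and text embeddings of arbitrary category names, so classification is not tied to a fixed label set. The sketch below is a minimal illustration under assumed tensor shapes, a hypothetical `text_encoder`, and an arbitrary temperature; it is not the authors' actual implementation.

```python
import torch
import torch.nn.functional as F

def visual_concept_alignment(visual_feats, concept_embeds, temperature=0.07):
    """Score visual features (object nodes or relation edges) against text
    embeddings of candidate category names via cosine similarity.

    visual_feats:   (num_queries, dim) transformer decoder outputs
    concept_embeds: (num_classes, dim) text embeddings of category names
    returns:        (num_queries, num_classes) logits over an open vocabulary
    """
    v = F.normalize(visual_feats, dim=-1)
    t = F.normalize(concept_embeds, dim=-1)
    return v @ t.T / temperature

# Hypothetical usage: swapping in a new category list at inference time is
# what makes such a classifier open-vocabulary.
# object_names = ["person", "dog", "surfboard"]
# concept_embeds = text_encoder(object_names)   # assumed frozen text encoder
# node_logits = visual_concept_alignment(node_feats, concept_embeds)
```

Because the classifier is just a dot product with text embeddings, the same scoring can be reused for relation (edge) queries, which is what distinguishes the relation-involved open-vocabulary settings from object-only ones.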
Related papers
- Scene Graph Generation with Role-Playing Large Language Models [50.252588437973245]
Current approaches for open-vocabulary scene graph generation (OVSGG) use vision-language models such as CLIP.
We propose SDSGG, a scene-specific description based OVSGG framework.
To capture the complicated interplay between subjects and objects, we propose a new lightweight module called mutual visual adapter.
arXiv Detail & Related papers (2024-10-20T11:40:31Z)
- HiKER-SGG: Hierarchical Knowledge Enhanced Robust Scene Graph Generation [13.929906773382752]
A common approach to enable reasoning over visual data is Scene Graph Generation (SGG).
We propose a novel SGG benchmark containing procedurally generated weather corruptions and other transformations over the Visual Genome dataset.
We show that HiKER-SGG not only demonstrates superior performance on corrupted images in a zero-shot manner, but also outperforms current state-of-the-art methods on uncorrupted SGG tasks.
arXiv Detail & Related papers (2024-03-18T17:59:10Z)
- Adaptive Self-training Framework for Fine-grained Scene Graph Generation [29.37568710952893]
Scene graph generation (SGG) models have suffered from inherent problems regarding the benchmark datasets.
We introduce a Self-Training framework for SGG (ST-SGG) that assigns pseudo-labels for unannotated triplets.
Our experiments verify the effectiveness of ST-SGG on various SGG models.
arXiv Detail & Related papers (2024-01-18T08:10:34Z)
- Adaptive Visual Scene Understanding: Incremental Scene Graph Generation [18.541428517746034]
Scene graph generation (SGG) analyzes images to extract meaningful information about objects and their relationships.
We present a benchmark comprising three learning regimes: relationship incremental, scene incremental, and relationship generalization.
We also introduce a "Replays via Analysis by Synthesis" method named RAS.
arXiv Detail & Related papers (2023-10-02T21:02:23Z)
- Visually-Prompted Language Model for Fine-Grained Scene Graph Generation in an Open World [67.03968403301143]
Scene Graph Generation (SGG) aims to extract <subject, predicate, object> relationships in images for vision understanding.
Existing re-balancing strategies try to handle it via prior rules but are still confined to pre-defined conditions.
We propose a Cross-modal prediCate boosting (CaCao) framework, where a visually-prompted language model is learned to generate diverse fine-grained predicates.
arXiv Detail & Related papers (2023-03-23T13:06:38Z)
- Towards Open-vocabulary Scene Graph Generation with Prompt-based Finetuning [84.39787427288525]
Scene graph generation (SGG) is a fundamental task aimed at detecting visual relations between objects in an image.
We introduce open-vocabulary scene graph generation, a novel, realistic and challenging setting in which a model is trained on a set of base object classes.
Our method can support inference over completely unseen object classes, which existing methods are incapable of handling.
arXiv Detail & Related papers (2022-08-17T09:05:38Z)
- Stacked Hybrid-Attention and Group Collaborative Learning for Unbiased Scene Graph Generation [62.96628432641806]
Scene Graph Generation aims to first encode the visual contents within the given image and then parse them into a compact summary graph.
We first present a novel Stacked Hybrid-Attention network, which facilitates the intra-modal refinement as well as the inter-modal interaction.
We then devise an innovative Group Collaborative Learning strategy to optimize the decoder.
arXiv Detail & Related papers (2022-03-18T09:14:13Z)
- Weakly Supervised Visual Semantic Parsing [49.69377653925448]
Scene Graph Generation (SGG) aims to extract entities, predicates and their semantic structure from images.
Existing SGG methods require millions of manually annotated bounding boxes for training.
We propose VSPNet, a visual semantic parsing network, together with a graph-based weakly supervised learning framework.
arXiv Detail & Related papers (2020-01-08T03:46:13Z)