Expanding Scene Graph Boundaries: Fully Open-vocabulary Scene Graph
Generation via Visual-Concept Alignment and Retention
- URL: http://arxiv.org/abs/2311.10988v1
- Date: Sat, 18 Nov 2023 06:49:17 GMT
- Title: Expanding Scene Graph Boundaries: Fully Open-vocabulary Scene Graph
Generation via Visual-Concept Alignment and Retention
- Authors: Zuyao Chen, Jinlin Wu, Zhen Lei, Zhaoxiang Zhang and Changwen Chen
- Abstract summary: Scene Graph Generation (SGG) offers a structured representation critical in many computer vision applications.
We propose a unified framework named OvSGTR towards fully open vocabulary SGG from a holistic view.
For the more challenging settings of relation-involved open vocabulary SGG, the proposed approach integrates relation-aware pre-training utilizing image-caption data.
- Score: 74.42036028592705
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scene Graph Generation (SGG) offers a structured representation critical in
many computer vision applications. Traditional SGG approaches, however, are
limited by a closed-set assumption, restricting their ability to recognize only
predefined object and relation categories. To overcome this, we categorize SGG
scenarios into four distinct settings based on the node and edge: Closed-set
SGG, Open Vocabulary (object) Detection-based SGG (OvD-SGG), Open Vocabulary
Relation-based SGG (OvR-SGG), and Open Vocabulary Detection + Relation-based
SGG (OvD+R-SGG). While object-centric open vocabulary SGG has been studied
recently, the more challenging problem of relation-involved open-vocabulary SGG
remains relatively unexplored. To fill this gap, we propose a unified framework
named OvSGTR towards fully open vocabulary SGG from a holistic view. The
proposed framework is an end-to-end transformer architecture, which learns a
visual-concept alignment for both nodes and edges, enabling the model to
recognize unseen categories. For the more challenging settings of
relation-involved open vocabulary SGG, the proposed approach integrates
relation-aware pre-training utilizing image-caption data and retains
visual-concept alignment through knowledge distillation. Comprehensive
experimental results on the Visual Genome benchmark demonstrate the
effectiveness and superiority of the proposed framework.
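The four settings and the idea of visual-concept alignment can be illustrated with a minimal sketch. This is not the OvSGTR implementation: the function names (`sgg_setting`, `cosine`, `align`) and the toy two-dimensional embeddings are hypothetical, and cosine similarity is assumed here only as a simple stand-in for the alignment the paper learns inside a transformer.

```python
import math

def sgg_setting(open_vocab_objects: bool, open_vocab_relations: bool) -> str:
    """Map the two axes (nodes, edges) to the setting names listed in the abstract."""
    if open_vocab_objects and open_vocab_relations:
        return "OvD+R-SGG"
    if open_vocab_objects:
        return "OvD-SGG"
    if open_vocab_relations:
        return "OvR-SGG"
    return "Closed-set SGG"

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def align(visual_feature, concept_embeddings):
    """Return the concept name whose embedding is most similar to the visual feature.

    Assumption: concepts are scored by similarity to a text embedding, so a
    category name unseen in training can still be scored if it has an embedding.
    """
    return max(
        concept_embeddings,
        key=lambda name: cosine(visual_feature, concept_embeddings[name]),
    )

# Toy, made-up embeddings for two relation concepts (edges).
relations = {"riding": [0.9, 0.2], "holding": [0.0, 1.0]}
print(sgg_setting(True, True))
print(align([1.0, 0.1], relations))
```

The same `align` call works for nodes (object names) and edges (relation names), which mirrors the abstract's statement that alignment is learned for both.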
Related papers
- HiKER-SGG: Hierarchical Knowledge Enhanced Robust Scene Graph Generation [13.929906773382752]
A common approach enabling the ability to reason over visual data is Scene Graph Generation (SGG).
We propose a novel SGG benchmark containing procedurally generated weather corruptions and other transformations over the Visual Genome dataset.
We show that HiKER-SGG not only demonstrates superior performance on corrupted images in a zero-shot manner, but also outperforms current state-of-the-art methods on uncorrupted SGG tasks.
arXiv Detail & Related papers (2024-03-18T17:59:10Z)
- Adaptive Self-training Framework for Fine-grained Scene Graph Generation [31.12694282340842]
Scene graph generation (SGG) models have suffered from inherent problems regarding the benchmark datasets.
We introduce a Self-Training framework for SGG (ST-SGG) that assigns pseudo-labels for unannotated triplets.
Our experiments verify the effectiveness of ST-SGG on various SGG models.
arXiv Detail & Related papers (2024-01-18T08:10:34Z)
- Visually-Prompted Language Model for Fine-Grained Scene Graph Generation in an Open World [67.03968403301143]
Scene Graph Generation (SGG) aims to extract <subject, predicate, object> relationships in images for vision understanding.
Existing re-balancing strategies try to handle it via prior rules but are still confined to pre-defined conditions.
We propose a Cross-modal prediCate boosting (CaCao) framework, where a visually-prompted language model is learned to generate diverse fine-grained predicates.
arXiv Detail & Related papers (2023-03-23T13:06:38Z)
- Towards Open-vocabulary Scene Graph Generation with Prompt-based Finetuning [84.39787427288525]
Scene graph generation (SGG) is a fundamental task aimed at detecting visual relations between objects in an image.
We introduce open-vocabulary scene graph generation, a novel, realistic and challenging setting in which a model is trained on a set of base object classes.
Our method can support inference over completely unseen object classes, which existing methods are incapable of handling.
arXiv Detail & Related papers (2022-08-17T09:05:38Z)
- HL-Net: Heterophily Learning Network for Scene Graph Generation [90.2766568914452]
We propose a novel Heterophily Learning Network (HL-Net) to explore the homophily and heterophily between objects/relationships in scene graphs.
HL-Net comprises the following: 1) an adaptive reweighting transformer module, which adaptively integrates the information from different layers to exploit both the heterophily and homophily in objects.
We conducted extensive experiments on two public datasets: Visual Genome (VG) and Open Images (OI).
arXiv Detail & Related papers (2022-05-03T06:00:29Z)
- Stacked Hybrid-Attention and Group Collaborative Learning for Unbiased Scene Graph Generation [62.96628432641806]
Scene Graph Generation aims to first encode the visual contents within the given image and then parse them into a compact summary graph.
We first present a novel Stacked Hybrid-Attention network, which facilitates the intra-modal refinement as well as the inter-modal interaction.
We then devise an innovative Group Collaborative Learning strategy to optimize the decoder.
arXiv Detail & Related papers (2022-03-18T09:14:13Z)
- Zero-Shot Scene Graph Relation Prediction through Commonsense Knowledge Integration [9.203403318435486]
We propose CommOnsense-integrAted sCenegrapHrElation pRediction (COACHER), a framework to integrate commonsense knowledge for scene graph generation (SGG).
Specifically, we develop novel graph mining pipelines to model the neighborhoods and paths around entities in an external commonsense knowledge graph.
arXiv Detail & Related papers (2021-07-11T16:22:45Z)
- Weakly Supervised Visual Semantic Parsing [49.69377653925448]
Scene Graph Generation (SGG) aims to extract entities, predicates and their semantic structure from images.
Existing SGG methods require millions of manually annotated bounding boxes for training.
We propose Visual Semantic Parsing, VSPNet, and a graph-based weakly supervised learning framework.
arXiv Detail & Related papers (2020-01-08T03:46:13Z)