PRISM-0: A Predicate-Rich Scene Graph Generation Framework for Zero-Shot Open-Vocabulary Tasks
- URL: http://arxiv.org/abs/2504.00844v1
- Date: Tue, 01 Apr 2025 14:29:51 GMT
- Title: PRISM-0: A Predicate-Rich Scene Graph Generation Framework for Zero-Shot Open-Vocabulary Tasks
- Authors: Abdelrahman Elskhawy, Mengze Li, Nassir Navab, Benjamin Busam
- Abstract summary: In Scene Graph Generation (SGG), one extracts a structured representation from visual inputs in the form of object nodes and the predicates connecting them. PRISM-0 is a framework for zero-shot open-vocabulary SGG that bootstraps foundation models in a bottom-up approach. PRISM-0 generates semantically meaningful graphs that improve downstream tasks such as Image Captioning and Sentence-to-Graph Retrieval.
- Score: 51.31903029903904
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In Scene Graph Generation (SGG), one extracts a structured representation from visual inputs in the form of object nodes and the predicates connecting them. This facilitates image-based understanding and reasoning for various downstream tasks. Although fully supervised SGG approaches have shown steady performance improvements, they suffer from a severe training bias. This bias stems from the availability of only small subsets of curated data with a long-tail predicate distribution, and the resulting lack of predicate diversity adversely affects downstream tasks. To overcome this, we introduce PRISM-0, a framework for zero-shot open-vocabulary SGG that bootstraps foundation models in a bottom-up approach to capture the whole spectrum of diverse, open-vocabulary predicate prediction. Detected object pairs are filtered and passed to a Vision Language Model (VLM) that generates descriptive captions. These captions are used to prompt an LLM to generate fine- and coarse-grained predicates for the pair. The predicates are then validated using a VQA model to produce the final scene graph. Because PRISM-0 is modular and dataset-independent, it can enrich existing SG datasets such as Visual Genome (VG). Experiments illustrate that PRISM-0 generates semantically meaningful graphs that improve downstream tasks such as Image Captioning and Sentence-to-Graph Retrieval, with performance on par with the best fully supervised methods.
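To make the pipeline concrete, here is a minimal Python sketch of the bottom-up flow the abstract describes: caption each detected object pair with a VLM, prompt an LLM for fine- and coarse-grained predicates, and keep only the predicates a VQA model confirms. The function names and the toy lambda stubs are hypothetical stand-ins for the actual foundation models, not PRISM-0's real interfaces.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Triplet:
    subject: str
    predicate: str
    obj: str

def prism0_pipeline(
    object_pairs: List[Tuple[str, str]],
    caption_pair: Callable[[str, str], str],          # VLM: describe the pair
    propose_predicates: Callable[[str], List[str]],   # LLM: caption -> candidate predicates
    validate: Callable[[str, str, str], bool],        # VQA: keep only supported predicates
) -> List[Triplet]:
    """Bottom-up zero-shot SGG as described in the abstract: caption each
    detected pair, prompt an LLM for fine- and coarse-grained predicates,
    then filter the candidates with a VQA check."""
    graph: List[Triplet] = []
    for subj, obj in object_pairs:
        caption = caption_pair(subj, obj)
        for pred in propose_predicates(caption):
            if validate(subj, pred, obj):
                graph.append(Triplet(subj, pred, obj))
    return graph

# Toy stand-ins for the three foundation models, just to exercise the control flow.
graph = prism0_pipeline(
    [("man", "horse")],
    caption_pair=lambda s, o: f"a {s} riding a {o}",
    propose_predicates=lambda cap: ["riding", "on"],  # fine- and coarse-grained
    validate=lambda s, p, o: True,                    # pretend the VQA model agrees
)
print(graph)  # [Triplet(subject='man', predicate='riding', obj='horse'), ...]
```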
Related papers
- Ensemble Predicate Decoding for Unbiased Scene Graph Generation [40.01591739856469]
Scene Graph Generation (SGG) aims to generate a comprehensive graphical representation that captures semantic information of a given scenario.
The model's performance in predicting more fine-grained predicates is hindered by a significant predicate bias.
This paper proposes Ensemble Predicate Decoding (EPD), which employs multiple decoders to achieve unbiased scene graph generation; a toy sketch of one way to combine such decoders follows below.
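The summary above does not say how EPD combines its decoders. Purely as an illustration, the sketch below averages the softmax distributions of several hypothetical predicate decoders, a common way to ensemble classifier heads; it is not EPD's actual combination rule.

```python
import numpy as np

def ensemble_predicate_scores(decoder_logits: list) -> np.ndarray:
    """Combine per-decoder predicate logits by averaging their softmax
    distributions; each array has shape (num_predicate_classes,)."""
    probs = [np.exp(l - l.max()) / np.exp(l - l.max()).sum() for l in decoder_logits]
    return np.mean(probs, axis=0)

# Three hypothetical decoders scoring four predicate classes for one object pair.
logits = [np.array([2.0, 0.5, 0.1, -1.0]),
          np.array([0.3, 1.8, 0.2, -0.5]),
          np.array([1.0, 1.2, 0.8, 0.0])]
print(ensemble_predicate_scores(logits).round(3))
```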
arXiv Detail & Related papers (2024-08-26T11:24:13Z)
- From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models [81.92098140232638]
Scene graph generation (SGG) aims to parse a visual scene into an intermediate graph representation for downstream reasoning tasks.
Existing methods struggle to generate scene graphs with novel visual relation concepts.
We introduce a new open-vocabulary SGG framework based on sequence generation; a minimal linearization sketch follows below.
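As an illustration of the general recipe behind sequence-based SGG (not this paper's exact token vocabulary), the sketch below linearizes triplets into a flat token sequence that a seq2seq model could emit, and parses the sequence back into a graph:

```python
# Hypothetical special tokens; real systems define their own vocabulary.
def linearize(triplets):
    """Flatten (subject, predicate, object) triplets into one token sequence."""
    tokens = []
    for subj, pred, obj in triplets:
        tokens += ["<sub>", subj, "<rel>", pred, "<obj>", obj]
    return tokens

def parse(tokens):
    """Recover triplets from a sequence produced by linearize()."""
    return [(tokens[i + 1], tokens[i + 3], tokens[i + 5])
            for i in range(0, len(tokens), 6)]

seq = linearize([("man", "riding", "horse")])
print(seq)         # ['<sub>', 'man', '<rel>', 'riding', '<obj>', 'horse']
print(parse(seq))  # [('man', 'riding', 'horse')]
```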
arXiv Detail & Related papers (2024-04-01T04:21:01Z)
- Predicate Debiasing in Vision-Language Models Integration for Scene Graph Generation Enhancement [6.8754535229258975]
Scene Graph Generation (SGG) provides a basic language representation of visual scenes.
Some triplet labels are rare or even unseen during training, resulting in imprecise predictions.
We propose integrating pretrained Vision-Language Models to enhance the representation.
arXiv Detail & Related papers (2024-03-24T15:02:24Z)
- Expanding Scene Graph Boundaries: Fully Open-vocabulary Scene Graph Generation via Visual-Concept Alignment and Retention [69.36723767339001]
Scene Graph Generation (SGG) offers a structured representation critical in many computer vision applications.
We propose a unified framework named OvSGTR towards fully open vocabulary SGG from a holistic view.
For the more challenging settings of relation-involved open vocabulary SGG, the proposed approach integrates relation-aware pretraining.
arXiv Detail & Related papers (2023-11-18T06:49:17Z)
- Visually-Prompted Language Model for Fine-Grained Scene Graph Generation in an Open World [67.03968403301143]
Scene Graph Generation (SGG) aims to extract <subject, predicate, object> relationships in images for vision understanding.
Existing re-balancing strategies try to handle it via prior rules but are still confined to pre-defined conditions.
We propose a Cross-modal prediCate boosting (CaCao) framework, where a visually-prompted language model is learned to generate diverse fine-grained predicates.
arXiv Detail & Related papers (2023-03-23T13:06:38Z)
- Fine-Grained Scene Graph Generation with Data Transfer [127.17675443137064]
Scene graph generation (SGG) aims to extract (subject, predicate, object) triplets in images.
Recent works have made steady progress on SGG and provide useful tools for high-level vision and language understanding.
We propose a novel Internal and External Data Transfer (IETrans) method, which can be applied in a plug-and-play fashion and expanded to large-scale SGG with 1,807 predicate classes.
arXiv Detail & Related papers (2022-03-22T12:26:56Z)
- Weakly Supervised Visual Semantic Parsing [49.69377653925448]
Scene Graph Generation (SGG) aims to extract entities, predicates and their semantic structure from images.
Existing SGG methods require millions of manually annotated bounding boxes for training.
We propose Visual Semantic Parsing (VSPNet) and a graph-based weakly supervised learning framework.
arXiv Detail & Related papers (2020-01-08T03:46:13Z)