Visually-Prompted Language Model for Fine-Grained Scene Graph Generation in an Open World
- URL: http://arxiv.org/abs/2303.13233v2
- Date: Sat, 19 Aug 2023 14:41:36 GMT
- Title: Visually-Prompted Language Model for Fine-Grained Scene Graph Generation in an Open World
- Authors: Qifan Yu, Juncheng Li, Yu Wu, Siliang Tang, Wei Ji, Yueting Zhuang
- Abstract summary: Scene Graph Generation (SGG) aims to extract <subject, predicate, object> relationships in images for vision understanding.
Existing re-balancing strategies try to handle the long-tail problem via prior rules but remain confined to pre-defined conditions.
We propose a Cross-modal prediCate boosting (CaCao) framework, where a visually-prompted language model is learned to generate diverse fine-grained predicates.
- Score: 67.03968403301143
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scene Graph Generation (SGG) aims to extract <subject, predicate, object>
relationships in images for vision understanding. Although recent works have
made steady progress on SGG, they still suffer from a long-tail distribution
issue: tail predicates are more costly to train and harder to distinguish
because they have far less annotated data than frequent predicates. Existing
re-balancing strategies try to handle this via prior rules but remain confined
to pre-defined conditions, which do not scale across models and datasets. In
this paper, we propose a Cross-modal prediCate boosting (CaCao) framework,
where a visually-prompted language model is learned to generate diverse
fine-grained predicates in a low-resource way. The proposed CaCao can be
applied in a plug-and-play fashion and automatically strengthens existing SGG
models to tackle the long-tail problem. Building on this, we further introduce
a novel Entangled cross-modal prompt approach for open-world predicate scene
graph generation (Epic), where models can generalize to unseen predicates in a
zero-shot manner. Comprehensive experiments on three benchmark datasets show
that CaCao consistently boosts the performance of multiple scene graph
generation models in a model-agnostic way. Moreover, our Epic achieves
competitive performance on open-world predicate prediction. The data and code
for this paper are publicly available.
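To make the predicate-boosting idea concrete, the following is a minimal sketch of generating candidate predicates with a prompted language model. The prompt template, the bert-base-uncased model choice, and the propose_predicates helper are illustrative assumptions, not CaCao's actual implementation, which additionally conditions the language model on visual prompts.

```python
# Hypothetical sketch: ask a masked language model to propose predicate
# words for a (subject, object) pair. CaCao's visually-prompted LM also
# injects image features as soft prompts; this text-only version only
# illustrates the predicate-generation step.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

def propose_predicates(subject: str, obj: str, top_k: int = 10):
    """Return (predicate, score) candidates linking subject and object."""
    prompt = f"a {subject} is [MASK] a {obj}."
    return [(r["token_str"], r["score"]) for r in fill(prompt, top_k=top_k)]

# A plain text prompt tends to favor coarse predicates such as "on";
# visual prompting is what would steer generation toward fine-grained
# alternatives such as "riding".
print(propose_predicates("man", "horse"))
```

Predicates generated this way could then be used to enrich tail-predicate training data for an existing SGG model, in line with the plug-and-play boosting the abstract describes.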
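For the open-world setting, the zero-shot claim can be illustrated with a simple cross-modal matching baseline: score an image against textual relation prompts and rank candidate predicates, including ones never seen in training. The CLIP backbone, the fixed prompt template, the rank_predicates helper, and the example.jpg path are all assumptions for illustration; Epic instead learns entangled cross-modal prompts rather than using fixed templates.

```python
# Hypothetical zero-shot predicate ranking via image-text similarity.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rank_predicates(image: Image.Image, subject: str, obj: str, predicates):
    """Rank candidate predicates by CLIP similarity to the image."""
    texts = [f"a photo of a {subject} {p} a {obj}" for p in predicates]
    inputs = processor(text=texts, images=image, return_tensors="pt",
                       padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image[0]  # one score per text
    probs = logits.softmax(dim=-1)
    return sorted(zip(predicates, probs.tolist()), key=lambda x: -x[1])

img = Image.open("example.jpg")  # hypothetical image path
print(rank_predicates(img, "man", "horse", ["riding", "feeding", "standing on"]))
```

Because the predicate enters only through text, any new predicate string can be scored at inference time, which is the essence of zero-shot generalization to unseen predicates.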
Related papers
- Towards Lifelong Scene Graph Generation with Knowledge-ware In-context Prompt Learning [24.98058940030532]
Scene graph generation (SGG) endeavors to predict visual relationships between pairs of objects within an image.
This work seeks to address the pitfall inherent in a suite of prior relationship predictions.
Motivated by the achievements of in-context learning in pretrained language models, our approach imbues the model with the capability to predict relationships.
arXiv Detail & Related papers (2024-01-26T03:43:22Z)
- Local-Global Information Interaction Debiasing for Dynamic Scene Graph Generation [51.92419880088668]
We propose a novel DynSGG model based on multi-task learning, DynSGG-MTL, which introduces the local interaction information and global human-action interaction information.
Long-temporal human actions supervise the model to generate multiple scene graphs that conform to the global constraints and keep the model from failing to learn tail predicates.
arXiv Detail & Related papers (2023-08-10T01:24:25Z)
- Decomposed Prototype Learning for Few-Shot Scene Graph Generation [28.796734816086065]
We focus on a promising new task of scene graph generation (SGG): few-shot SGG (FSSGG).
FSSGG encourages models to be able to quickly transfer previous knowledge and recognize novel predicates with only a few examples.
We propose a novel Decomposed Prototype Learning (DPL) method.
arXiv Detail & Related papers (2023-03-20T04:54:26Z)
- LANDMARK: Language-guided Representation Enhancement Framework for Scene Graph Generation [34.40862385518366]
Scene graph generation (SGG) is a sophisticated task that suffers from both complex visual features and the dataset long-tail problem.
We propose LANDMARK (LANguage-guiDed representation enhanceMent frAmewoRK), which learns predicate-relevant representations from language-vision interactive patterns.
This framework is model-agnostic and consistently improves performance on existing SGG models.
arXiv Detail & Related papers (2023-03-02T09:03:11Z)
- Scene Graph Modification as Incremental Structure Expanding [61.84291817776118]
We focus on scene graph modification (SGM), where the system is required to learn how to update an existing scene graph based on a natural language query.
We frame SGM as a graph expansion task by introducing incremental structure expanding (ISE).
We construct a challenging dataset that contains more complicated queries and larger scene graphs than existing datasets.
arXiv Detail & Related papers (2022-09-15T16:26:14Z)
- Towards Open-vocabulary Scene Graph Generation with Prompt-based Finetuning [84.39787427288525]
Scene graph generation (SGG) is a fundamental task aimed at detecting visual relations between objects in an image.
We introduce open-vocabulary scene graph generation, a novel, realistic and challenging setting in which a model is trained on a set of base object classes.
Our method can support inference over completely unseen object classes, which existing methods are incapable of handling.
arXiv Detail & Related papers (2022-08-17T09:05:38Z)
- Fine-Grained Scene Graph Generation with Data Transfer [127.17675443137064]
Scene graph generation (SGG) aims to extract (subject, predicate, object) triplets in images.
Recent works have made steady progress on SGG, and provide useful tools for high-level vision and language understanding.
We propose a novel Internal and External Data Transfer (IETrans) method, which can be applied in a plug-and-play fashion and expanded to large SGG with 1,807 predicate classes.
arXiv Detail & Related papers (2022-03-22T12:26:56Z)
- Topic Adaptation and Prototype Encoding for Few-Shot Visual Storytelling [81.33107307509718]
We propose a topic adaptive storyteller to model the ability of inter-topic generalization.
We also propose a prototype encoding structure to model the ability of intra-topic derivation.
Experimental results show that topic adaptation and the prototype encoding structure mutually benefit the few-shot model.
arXiv Detail & Related papers (2020-08-11T03:55:11Z)
- Generative Compositional Augmentations for Scene Graph Prediction [27.535630110794855]
Inferring objects and their relationships from an image in the form of a scene graph is useful in many applications at the intersection of vision and language.
We consider the challenging problem of compositional generalization that emerges in this task due to a long-tailed data distribution.
We propose and empirically study a model based on conditional generative adversarial networks (GANs) that allows us to generate visual features of perturbed scene graphs.
arXiv Detail & Related papers (2020-07-11T12:11:53Z)