Towards Unseen Triples: Effective Text-Image-joint Learning for Scene Graph Generation
- URL: http://arxiv.org/abs/2306.13420v1
- Date: Fri, 23 Jun 2023 10:17:56 GMT
- Title: Towards Unseen Triples: Effective Text-Image-joint Learning for Scene Graph Generation
- Authors: Qianji Di, Wenxi Ma, Zhongang Qi, Tianxiang Hou, Ying Shan, Hanzi Wang
- Abstract summary: Scene Graph Generation (SGG) aims to structurally and comprehensively represent objects and their connections in images.
Existing SGG models often struggle to solve the long-tailed problem caused by biased datasets.
We propose a Text-Image-joint Scene Graph Generation (TISGG) model to resolve the unseen triples and improve the generalization capability of the SGG models.
- Score: 30.79358827005448
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scene Graph Generation (SGG) aims to structurally and comprehensively
represent objects and their connections in images, and it can significantly
benefit scene understanding and other related downstream tasks. Existing SGG
models often struggle to solve the long-tailed problem caused by biased
datasets. However, even if these models can fit specific datasets better, it
may be hard for them to resolve unseen triples that are not included in the
training set. Most methods tend to feed in a whole triple and learn its overall
features based on statistical machine learning. Such models have difficulty
predicting unseen triples because the objects and predicates from the training
set are combined differently, as novel triples, in the test set. In this work,
we propose a Text-Image-joint Scene Graph Generation (TISGG) model to resolve
unseen triples and improve the generalization capability of SGG models. We
propose a Joint Feature Learning (JFL) module and a Factual Knowledge based
Refinement (FKR) module to learn object and predicate categories separately at
the feature level and align them with the corresponding visual features, so
that the model is no longer limited to matching whole triples. Besides, since
we observe that the long-tailed problem also affects generalization ability, we
design a novel balanced learning strategy, including a Character Guided
Sampling (CGS) module and an Informative Re-weighting (IR) module, to provide
tailor-made learning methods for each predicate according to its
characteristics. Extensive experiments show that our model achieves
state-of-the-art performance. In particular, TISGG boosts zR@20 (zero-shot
Recall) by 11.7% on the PredCls sub-task on the Visual Genome dataset.
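For context on the headline number, zR@K (zero-shot Recall at K) counts only ground-truth triples whose exact (subject, predicate, object) combination never appears in the training set. Below is a minimal per-image sketch of how the metric is typically computed; all names are invented here for illustration.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

def zero_shot_recall_at_k(
    predicted: List[Triple],    # model predictions, ranked by confidence
    ground_truth: Set[Triple],  # annotated triples for one test image
    train_triples: Set[Triple], # all triple combinations seen in training
    k: int = 20,
) -> float:
    """zR@K for a single image: recall over ground-truth triples whose
    (subject, predicate, object) combination was never seen in training."""
    unseen = {t for t in ground_truth if t not in train_triples}
    if not unseen:
        return float("nan")  # image has no unseen triples; skip in averaging
    top_k = set(predicted[:k])
    return len(unseen & top_k) / len(unseen)
```

The dataset-level figure quoted in the abstract would then be this quantity averaged over all test images that contain at least one unseen triple.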
Related papers
- Enhancing Generalizability of Representation Learning for Data-Efficient 3D Scene Understanding [50.448520056844885]
We propose a generative Bayesian network to produce diverse synthetic scenes with real-world patterns.
A series of experiments consistently demonstrates our method's superiority over existing state-of-the-art pre-training approaches.
arXiv Detail & Related papers (2024-06-17T07:43:53Z)
- Leveraging Predicate and Triplet Learning for Scene Graph Generation [31.09787444957997]
Scene Graph Generation (SGG) aims to identify entities and predict the relationship triplets.
We propose a Dual-granularity Relation Modeling (DRM) network to leverage fine-grained triplet cues besides the coarse-grained predicate ones.
Our method establishes new state-of-the-art performance on Visual Genome, Open Image, and GQA datasets.
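One way to picture "dual-granularity" relation modeling is a coarse predicate classifier fused with a fine-grained triplet-level score. The sketch below illustrates only that general idea; the module layout, fusion rule, and all names are assumptions, not the published DRM architecture.

```python
import torch
import torch.nn as nn

class DualGranularityScorer(nn.Module):
    """Hypothetical fusion of coarse predicate logits with fine-grained
    triplet logits; illustrative only, not the published DRM design."""

    def __init__(self, feat_dim: int, num_predicates: int, alpha: float = 0.5):
        super().__init__()
        self.predicate_head = nn.Linear(feat_dim, num_predicates)    # coarse cue
        self.triplet_head = nn.Linear(3 * feat_dim, num_predicates)  # fine cue
        self.alpha = alpha

    def forward(self, subj_feat, union_feat, obj_feat):
        coarse = self.predicate_head(union_feat)
        fine = self.triplet_head(
            torch.cat([subj_feat, union_feat, obj_feat], dim=-1))
        # Blend the two granularities into one set of relation logits.
        return self.alpha * coarse + (1 - self.alpha) * fine
```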
arXiv Detail & Related papers (2024-06-04T07:23:41Z)
- Scene Graph Generation Strategy with Co-occurrence Knowledge and Learnable Term Frequency [3.351553095054309]
Scene graph generation (SGG) represents the relationships between objects in an image as a graph structure.
Previous studies have failed to reflect the co-occurrence of objects during scene graph generation.
We propose CooK, which reflects the Co-occurrence Knowledge between objects, and the learnable term frequency-inverse document frequency.
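The two ingredients named here, object co-occurrence statistics and TF-IDF weighting, can be sketched generically as follows. The function names are invented for illustration, and where the paper makes the frequency terms learnable, this sketch keeps them fixed.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple

Pair = Tuple[str, str]

def pair_document_frequency(images: List[List[str]]) -> Counter:
    """Number of training images in which each unordered object pair co-occurs."""
    df = Counter()
    for labels in images:
        for pair in combinations(sorted(set(labels)), 2):
            df[pair] += 1
    return df

def tf_idf_pair_weights(labels: List[str], df: Counter,
                        num_images: int) -> Dict[Pair, float]:
    """TF-IDF weight of each object pair in one image: pairs frequent here
    but rare across the corpus receive the largest weights."""
    tf = Counter(combinations(sorted(labels), 2))
    return {p: count * math.log(num_images / (1 + df[p]))
            for p, count in tf.items()}
```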
arXiv Detail & Related papers (2024-05-21T09:56:48Z)
- S^2Former-OR: Single-Stage Bi-Modal Transformer for Scene Graph Generation in OR [50.435592120607815]
Scene graph generation (SGG) of surgical procedures is crucial for enhancing holistic cognitive intelligence in the operating room (OR).
Previous works have primarily relied on multi-stage learning, where the generated semantic scene graphs depend on intermediate processes with pose estimation and object detection.
In this study, we introduce a novel single-stage bi-modal transformer framework for SGG in the OR, termed S2Former-OR.
arXiv Detail & Related papers (2024-02-22T11:40:49Z)
- Local-Global Information Interaction Debiasing for Dynamic Scene Graph Generation [51.92419880088668]
We propose a novel DynSGG model based on multi-task learning, DynSGG-MTL, which introduces local interaction information and global human-action interaction information.
Long-temporal human actions supervise the model to generate multiple scene graphs that conform to global constraints, preventing the model from failing to learn tail predicates.
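Multi-task learning of this kind typically means a shared backbone feeding a per-frame scene-graph head and a pooled clip-level action head, trained with a weighted sum of the two losses. The layout below is a hypothetical illustration of that pattern, not the published DynSGG-MTL design.

```python
import torch
import torch.nn as nn

class TwoHeadSketch(nn.Module):
    """Hypothetical multi-task layout: shared frame features feed a per-frame
    scene-graph head (local) and a pooled clip-level action head (global)."""

    def __init__(self, feat_dim: int, num_predicates: int, num_actions: int):
        super().__init__()
        self.sgg_head = nn.Linear(feat_dim, num_predicates)
        self.action_head = nn.Linear(feat_dim, num_actions)

    def forward(self, frame_feats: torch.Tensor):
        # frame_feats: (num_frames, feat_dim)
        predicate_logits = self.sgg_head(frame_feats)              # per frame
        action_logits = self.action_head(frame_feats.mean(dim=0))  # per clip
        return predicate_logits, action_logits
```

Training would then minimize something like `sgg_loss + lambda * action_loss`, so the long-term action signal acts as the global constraint the summary describes.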
arXiv Detail & Related papers (2023-08-10T01:24:25Z)
- Not All Relations are Equal: Mining Informative Labels for Scene Graph Generation [48.21846438269506]
Scene graph generation (SGG) aims to capture a wide variety of interactions between pairs of objects.
Existing SGG methods fail to acquire complex reasoning about visual and textual correlations due to various biases in training data.
We propose a novel framework for SGG training that exploits relation labels based on their informativeness.
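A simple proxy for a relation label's informativeness is its rarity: generic predicates such as "on" dominate the annotations, so weighting classes by negative log-frequency shifts training toward informative labels. This proxy is our illustration and not necessarily the measure the paper actually uses.

```python
import math
from collections import Counter
from typing import Dict, List

def informativeness_weights(relation_labels: List[str]) -> Dict[str, float]:
    """Weight each predicate class by the negative log of its empirical
    frequency, so rare (informative) labels contribute more to the loss."""
    counts = Counter(relation_labels)
    total = sum(counts.values())
    return {label: -math.log(c / total) for label, c in counts.items()}
```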
arXiv Detail & Related papers (2021-11-26T14:34:12Z)
- From General to Specific: Informative Scene Graph Generation via Balance Adjustment [113.04103371481067]
Current models get stuck on common predicates, e.g., "on" and "at", rather than informative ones.
We propose BA-SGG, a framework based on balance adjustment but not the conventional distribution fitting.
Our method achieves 14.3%, 8.0%, and 6.1% higher Mean Recall (mR) than the Transformer model on three scene graph generation sub-tasks on Visual Genome.
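Mean Recall averages per-predicate recalls rather than pooling all triples, so head predicates such as "on" cannot dominate the score. A minimal per-image sketch with illustrative names:

```python
from collections import defaultdict
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

def mean_recall_at_k(predicted: List[Triple],
                     ground_truth: Set[Triple],
                     k: int = 100) -> float:
    """mR@K for one image: recall computed per predicate class, then
    averaged uniformly over classes, so rare predicates count equally."""
    top_k = set(predicted[:k])
    per_predicate = defaultdict(lambda: [0, 0])  # predicate -> [hits, total]
    for triple in ground_truth:
        predicate = triple[1]
        per_predicate[predicate][1] += 1
        if triple in top_k:
            per_predicate[predicate][0] += 1
    recalls = [hits / total for hits, total in per_predicate.values()]
    return sum(recalls) / len(recalls) if recalls else float("nan")
```

In practice the per-predicate hit and total counts are accumulated over the whole test set before averaging; the per-image version above only shows the shape of the computation.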
arXiv Detail & Related papers (2021-08-30T11:39:43Z)
- Semantic Compositional Learning for Low-shot Scene Graph Generation [122.51930904132685]
Many scene graph generation (SGG) models solely use the limited annotated relation triples for training.
We propose a novel semantic compositional learning strategy that makes it possible to construct additional, realistic relation triples.
For three recent SGG models, adding our strategy improves their performance by close to 50%, and all of them substantially exceed the current state-of-the-art.
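Compositional construction of extra triples can be pictured as recombining the decomposed parts of annotated relations, substituting similar entities into existing triples. Everything below, including the similarity oracle, is an invented illustration of that idea rather than the paper's actual procedure.

```python
from typing import Callable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

def compose_triples(annotated: Set[Triple],
                    similar: Callable[[str], List[str]]) -> Set[Triple]:
    """Build extra triples by swapping the subject or object of an annotated
    relation for a similar entity, e.g. ('man', 'riding', 'horse') ->
    ('woman', 'riding', 'horse')."""
    extra = set()
    for subj, pred, obj in annotated:
        for s in similar(subj):
            extra.add((s, pred, obj))
        for o in similar(obj):
            extra.add((subj, pred, o))
    return extra - annotated  # keep only genuinely new combinations
```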
arXiv Detail & Related papers (2021-08-19T10:13:55Z)
- KGPT: Knowledge-Grounded Pre-Training for Data-to-Text Generation [100.79870384880333]
We propose knowledge-grounded pre-training (KGPT) to generate knowledge-enriched text.
We adopt three settings, namely fully-supervised, zero-shot, and few-shot, to evaluate its effectiveness.
Under the zero-shot setting, our model achieves over 30 ROUGE-L on WebNLG while all other baselines fail.
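ROUGE-L, the metric quoted for the zero-shot WebNLG result, scores a generated sentence by the longest common subsequence (LCS) it shares with the reference; reported numbers are the F-score scaled by 100. A self-contained sketch:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if x == y
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[len(a)][len(b)]

def rouge_l(candidate: str, reference: str, beta: float = 1.2) -> float:
    """ROUGE-L F-score: LCS-based precision and recall, combined with beta."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)
```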
arXiv Detail & Related papers (2020-10-05T19:59:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.