Improving Scene Graph Generation with Relation Words' Debiasing in Vision-Language Models
- URL: http://arxiv.org/abs/2403.16184v1
- Date: Sun, 24 Mar 2024 15:02:24 GMT
- Title: Improving Scene Graph Generation with Relation Words' Debiasing in Vision-Language Models
- Authors: Yuxuan Wang, Xiaoyuan Liu
- Abstract summary: Scene Graph Generation (SGG) provides a basic language representation of visual scenes.
Some test triplets are rare or even unseen during training, resulting in imprecise predictions.
We propose using SGG models together with pretrained vision-language models (VLMs) to enhance representation.
- Score: 6.8754535229258975
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scene Graph Generation (SGG) provides a basic language representation of visual scenes, requiring models to grasp complex and diverse semantics between various objects. However, this complexity and diversity in SGG also lead to underrepresentation, where some test triplets are rare or even unseen during training, resulting in imprecise predictions. To tackle this, we propose using SGG models together with pretrained vision-language models (VLMs) to enhance representation. However, due to the gap between the pretraining and SGG tasks, directly ensembling the pretrained VLMs leads to severe biases across relation words. Thus, we introduce LM Estimation to approximate the distribution of relation words underlying the pretraining language sets, and then use this distribution for debiasing. After that, we ensemble the VLMs with SGG models to enhance representation. Considering that each model may perform better on different samples, we use a certainty-aware indicator to score each sample and dynamically adjust the ensemble weights. Our method effectively addresses the relation-word biases, enhances SGG's representation, and achieves notable performance improvements. It is training-free and integrates well with existing SGG models.
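As a rough illustration of the two ingredients the abstract names, the sketch below removes an estimated relation-word prior from a VLM's logits and then forms a certainty-weighted, training-free ensemble with an SGG model. The paper's exact LM Estimation procedure and certainty indicator are not reproduced here; the prior `lm_prior`, the normalized negative-entropy indicator, and all function names are assumptions.

```python
import math

import torch
import torch.nn.functional as F

def debias_vlm_logits(vlm_logits, lm_prior, eps=1e-8):
    """Remove the estimated relation-word prior from the VLM's prediction
    by subtracting its log from the logits (i.e., dividing it out of the
    predictive distribution)."""
    return vlm_logits - torch.log(lm_prior + eps)

def certainty(probs):
    """Normalized negative-entropy score in [0, 1]: peaked (more certain)
    distributions score closer to 1, uniform ones closer to 0."""
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1)
    return 1.0 - entropy / math.log(probs.shape[-1])

def certainty_aware_ensemble(sgg_logits, vlm_logits, lm_prior):
    """Training-free ensemble: debias the VLM, score both models per
    sample, and mix their predicate distributions using the normalized
    certainty scores as per-sample weights."""
    p_sgg = F.softmax(sgg_logits, dim=-1)
    p_vlm = F.softmax(debias_vlm_logits(vlm_logits, lm_prior), dim=-1)
    w = torch.stack([certainty(p_sgg), certainty(p_vlm)], dim=-1)
    w = w / w.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    return w[..., :1] * p_sgg + w[..., 1:] * p_vlm
```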
Related papers
- Ensemble Predicate Decoding for Unbiased Scene Graph Generation [40.01591739856469]
Scene Graph Generation (SGG) aims to generate a comprehensive graphical representation that captures semantic information of a given scenario.
The model's performance in predicting more fine-grained predicates is hindered by a significant predicate bias.
This paper proposes Ensemble Predicate Decoding (EPD), which employs multiple decoders to attain unbiased scene graph generation.
arXiv Detail & Related papers (2024-08-26T11:24:13Z)
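The EPD abstract above states only that multiple decoders are employed; a minimal sketch of that idea, with every module name assumed, might average the predicate distributions produced by several independently parameterized decoders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EnsemblePredicateDecoder(nn.Module):
    """Hypothetical multi-decoder head: each decoder scores the predicate
    vocabulary independently and the resulting distributions are averaged."""
    def __init__(self, feat_dim, num_predicates, num_decoders=3):
        super().__init__()
        self.decoders = nn.ModuleList(
            nn.Linear(feat_dim, num_predicates) for _ in range(num_decoders)
        )

    def forward(self, pair_features):
        # average the per-decoder predicate distributions
        probs = [F.softmax(dec(pair_features), dim=-1) for dec in self.decoders]
        return torch.stack(probs).mean(dim=0)
```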
- Semantic Diversity-aware Prototype-based Learning for Unbiased Scene Graph Generation [21.772806350802203]
In scene graph generation (SGG) datasets, each subject-object pair is annotated with a single predicate.
Existing SGG models are trained to predict the one and only predicate for each pair.
This in turn causes SGG models to overlook the semantic diversity that may exist within a predicate.
arXiv Detail & Related papers (2024-07-22T05:53:46Z)
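The entry above names prototype-based learning for semantic diversity without giving details; one hedged reading, with all names assumed, is to give each predicate several prototypes and score a pair by its best-matching prototype:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiPrototypePredicateHead(nn.Module):
    """Hypothetical prototype-based head: each predicate owns several
    prototypes so one label can cover semantically diverse relations;
    a pair is scored by its best-matching prototype per predicate."""
    def __init__(self, feat_dim, num_predicates, protos_per_predicate=4):
        super().__init__()
        self.prototypes = nn.Parameter(
            torch.randn(num_predicates, protos_per_predicate, feat_dim)
        )

    def forward(self, pair_features):                    # [B, D]
        feats = F.normalize(pair_features, dim=-1)
        protos = F.normalize(self.prototypes, dim=-1)    # [P, K, D]
        sims = torch.einsum("bd,pkd->bpk", feats, protos)
        return sims.max(dim=-1).values                   # [B, P] scores
```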
- Informative Scene Graph Generation via Debiasing [124.71164256146342]
Scene graph generation aims to detect visual relationship triplets, i.e., (subject, predicate, object).
Due to biases in data, current models tend to predict common predicates.
We propose DB-SGG, an effective framework based on debiasing rather than conventional distribution fitting.
arXiv Detail & Related papers (2023-08-10T02:04:01Z)
- Panoptic Scene Graph Generation with Semantics-Prototype Learning [23.759498629378772]
Panoptic Scene Graph Generation (PSG) parses objects and predicts their relationships (predicate) to connect human language and visual scenes.
Different language preferences of annotators and semantic overlaps between predicates lead to biased predicate annotations.
We propose a novel framework named ADTrans to adaptively transfer biased predicate annotations to informative and unified ones.
arXiv Detail & Related papers (2023-07-28T14:04:06Z)
- LANDMARK: Language-guided Representation Enhancement Framework for Scene Graph Generation [34.40862385518366]
Scene graph generation (SGG) is a sophisticated task that suffers from both complex visual features and the dataset long-tail problem.
We propose LANDMARK (LANguage-guiDed representation enhanceMent frAmewoRK), which learns predicate-relevant representations from language-vision interactive patterns.
This framework is model-agnostic and consistently improves performance on existing SGG models.
arXiv Detail & Related papers (2023-03-02T09:03:11Z)
- CAME: Context-aware Mixture-of-Experts for Unbiased Scene Graph Generation [10.724516317292926]
We present a simple yet effective method called Context-Aware Mixture-of-Experts (CAME) to improve model diversity and alleviate bias in the scene graph generator.
We conduct extensive experiments on three tasks on the Visual Genome dataset to show that CAME achieves superior performance over previous methods.
arXiv Detail & Related papers (2022-08-15T10:39:55Z)
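The CAME summary names a context-aware mixture-of-experts but gives no specifics; a generic sketch of such a head, with the gating design and all names assumed, conditions a soft gate on context features and mixes the experts' predicate logits:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextAwareMoE(nn.Module):
    """Hypothetical mixture-of-experts predicate head: a gate conditioned
    on context features softly routes each pair through several expert
    classifiers, letting different experts specialize in different
    predicate groups (e.g., head vs. tail)."""
    def __init__(self, feat_dim, num_predicates, num_experts=3):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Linear(feat_dim, num_predicates) for _ in range(num_experts)
        )
        self.gate = nn.Linear(feat_dim, num_experts)

    def forward(self, context_features):                     # [B, D]
        gate = F.softmax(self.gate(context_features), dim=-1)  # [B, E]
        expert_logits = torch.stack(
            [e(context_features) for e in self.experts], dim=1  # [B, E, P]
        )
        return (gate.unsqueeze(-1) * expert_logits).sum(dim=1)  # [B, P]
```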
- NICEST: Noisy Label Correction and Training for Robust Scene Graph Generation [65.78472854070316]
We propose NICEST, a novel NoIsy label CorrEction and Sample Training strategy for SGG.
NICE first detects noisy samples and then reassigns higher-quality soft predicate labels to them.
NICEST can be seamlessly incorporated into any SGG architecture to boost its performance on different predicate categories.
arXiv Detail & Related papers (2022-07-27T06:25:47Z)
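The NICEST summary describes detecting noisy samples and reassigning soft predicate labels; a hedged sketch of that pattern follows, where the detection rule, threshold, and temperature are assumptions rather than the paper's actual procedure:

```python
import torch
import torch.nn.functional as F

def correct_noisy_labels(logits, labels, threshold=0.1, temperature=2.0):
    """Hypothetical label-correction step: when the model assigns very low
    probability to a sample's annotated predicate, treat the label as noisy
    and replace the one-hot target with a softened model distribution."""
    probs = F.softmax(logits, dim=-1)                       # [B, P]
    gt_prob = probs.gather(1, labels.unsqueeze(1)).squeeze(1)
    soft = F.softmax(logits / temperature, dim=-1)          # softened targets
    hard = F.one_hot(labels, probs.shape[-1]).float()       # original targets
    noisy = (gt_prob < threshold).unsqueeze(1)              # [B, 1] noise flags
    return torch.where(noisy, soft, hard)                   # per-sample targets
```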
- Adaptive Fine-Grained Predicates Learning for Scene Graph Generation [122.4588401267544]
General Scene Graph Generation (SGG) models tend to predict head predicates, while re-balancing strategies prefer tail categories.
We propose Adaptive Fine-Grained Predicates Learning (FGPL-A), which aims to differentiate hard-to-distinguish predicates for SGG.
Our proposed model-agnostic strategy significantly boosts the performance of benchmark models on the VG-SGG and GQA-SGG datasets, by up to 175% and 76% on Mean Recall@100, achieving new state-of-the-art performance.
arXiv Detail & Related papers (2022-07-11T03:37:57Z)
- Stacked Hybrid-Attention and Group Collaborative Learning for Unbiased Scene Graph Generation [62.96628432641806]
Scene Graph Generation aims to first encode the visual contents within a given image and then parse them into a compact summary graph.
We first present a novel Stacked Hybrid-Attention network, which facilitates intra-modal refinement as well as inter-modal interaction.
We then devise an innovative Group Collaborative Learning strategy to optimize the decoder.
arXiv Detail & Related papers (2022-03-18T09:14:13Z)
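The summary above describes intra-modal refinement plus inter-modal interaction; one plausible reading, with the layer design assumed rather than taken from the paper, stacks self-attention over one modality followed by cross-attention into the other:

```python
import torch.nn as nn

class HybridAttentionLayer(nn.Module):
    """Hypothetical hybrid-attention layer: self-attention refines a
    modality on its own (intra-modal), then cross-attention lets visual
    features attend to linguistic ones (inter-modal); stacking several
    such layers yields a 'stacked' hybrid-attention encoder."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, visual, linguistic):
        # intra-modal refinement of the visual stream
        refined, _ = self.self_attn(visual, visual, visual)
        visual = self.norm1(visual + refined)
        # inter-modal interaction: visual queries attend to linguistic keys
        fused, _ = self.cross_attn(visual, linguistic, linguistic)
        return self.norm2(visual + fused)
```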
- Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining, which replaces region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)
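The entry above replaces region regression and classification with cross-modality region contrastive learning; a minimal sketch of such an objective, assuming matched region-text pairs within a batch and a symmetric InfoNCE form:

```python
import torch
import torch.nn.functional as F

def region_contrastive_loss(region_feats, text_feats, temperature=0.07):
    """Hypothetical cross-modality region contrastive objective: matched
    region/text pairs (row i with row i) are pulled together while all
    other pairings in the batch are pushed apart (InfoNCE, both directions)."""
    v = F.normalize(region_feats, dim=-1)     # [N, D] region features
    t = F.normalize(text_feats, dim=-1)       # [N, D] aligned text features
    logits = v @ t.t() / temperature          # [N, N] similarity matrix
    targets = torch.arange(v.shape[0], device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```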
- From General to Specific: Informative Scene Graph Generation via Balance Adjustment [113.04103371481067]
Current models are stuck on common predicates, e.g., "on" and "at", rather than informative ones.
We propose BA-SGG, a framework based on balance adjustment rather than conventional distribution fitting.
Our method achieves 14.3%, 8.0%, and 6.1% higher Mean Recall (mR) than the Transformer model on three scene graph generation sub-tasks on Visual Genome.
arXiv Detail & Related papers (2021-08-30T11:39:43Z)