LANDMARK: Language-guided Representation Enhancement Framework for Scene
Graph Generation
- URL: http://arxiv.org/abs/2303.01080v1
- Date: Thu, 2 Mar 2023 09:03:11 GMT
- Title: LANDMARK: Language-guided Representation Enhancement Framework for Scene
Graph Generation
- Authors: Xiaoguang Chang, Teng Wang, Shaowei Cai and Changyin Sun
- Abstract summary: Scene graph generation (SGG) is a sophisticated task that suffers from both complex visual features and the dataset long-tail problem.
We propose LANDMARK (LANguage-guiDed representation enhanceMent frAmewoRK) that learns predicate-relevant representations from language-vision interactive patterns.
This framework is model-agnostic and consistently improves performance on existing SGG models.
- Score: 34.40862385518366
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scene graph generation (SGG) is a sophisticated task that suffers from both
complex visual features and the dataset long-tail problem. Recently, various
unbiased strategies have been proposed by designing novel loss functions and
data balancing strategies. Unfortunately, these unbiased methods fail to
emphasize language priors from the feature-refinement perspective. Inspired by the
fact that predicates are highly correlated with semantics hidden in
subject-object pair and global context, we propose LANDMARK (LANguage-guiDed
representation enhanceMent frAmewoRK) that learns predicate-relevant
representations from language-vision interactive patterns, global language
context and pair-predicate correlation. Specifically, we first project object
labels to three distinctive semantic embeddings for different representation
learning. Then, Language Attention Module (LAM) and Experience Estimation
Module (EEM) process subject-object word embeddings into an attention vector and a
predicate distribution, respectively. Language Context Module (LCM) encodes
global context from each word embedding, which avoids isolated learning from
local information. Finally, the modules' outputs are used to update the visual
representations and the SGG model's predictions. All language representations are
purely generated from object categories so that no extra knowledge is needed.
This framework is model-agnostic and consistently improves performance on
existing SGG models. Besides, representation-level unbiased strategies endow
LANDMARK with the advantage of compatibility with other methods. Code is available
at https://github.com/rafa-cxg/PySGG-cxg.
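The abstract outlines a three-module pipeline: object labels are projected to three separate semantic embeddings; LAM turns subject-object embeddings into an attention vector over visual features; EEM estimates a pair-conditioned predicate distribution; and LCM encodes global language context that is fused back into the visual representation. The PyTorch-style sketch below only illustrates that structure as described in the abstract: the module internals (sigmoid gating, a one-layer Transformer encoder, an MLP prior), dimensions, and tensor names are assumptions rather than the paper's actual design; see the linked repository for the real implementation.

```python
# Minimal sketch of the language-guided enhancement pipeline described in the abstract.
# Internals (sigmoid gating, one-layer Transformer, MLP prior) are illustrative placeholders.
import torch
import torch.nn as nn


class LanguageGuidedEnhancer(nn.Module):
    def __init__(self, num_obj_classes, num_predicates, embed_dim=200, visual_dim=512):
        super().__init__()
        # Three distinct semantic embeddings of the same object labels, one per module.
        self.emb_lam = nn.Embedding(num_obj_classes, embed_dim)
        self.emb_eem = nn.Embedding(num_obj_classes, embed_dim)
        self.emb_lcm = nn.Embedding(num_obj_classes, embed_dim)

        # Language Attention Module (LAM): subject-object embeddings -> attention vector.
        self.lam = nn.Sequential(nn.Linear(2 * embed_dim, visual_dim), nn.Sigmoid())
        # Experience Estimation Module (EEM): subject-object embeddings -> predicate distribution.
        self.eem = nn.Sequential(
            nn.Linear(2 * embed_dim, visual_dim), nn.ReLU(),
            nn.Linear(visual_dim, num_predicates),
        )
        # Language Context Module (LCM): global context over all object words in the image.
        enc_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True)
        self.lcm = nn.TransformerEncoder(enc_layer, num_layers=1)
        self.ctx_proj = nn.Linear(embed_dim, visual_dim)

    def forward(self, obj_labels, pair_idx, pair_visual, rel_logits):
        """
        obj_labels : (N,)  object category ids predicted in one image
        pair_idx   : (P, 2) subject/object indices into obj_labels
        pair_visual: (P, visual_dim) visual features of each subject-object pair
        rel_logits : (P, num_predicates) the base SGG model's predicate logits
        """
        subj, obj = obj_labels[pair_idx[:, 0]], obj_labels[pair_idx[:, 1]]

        # LAM: gate the pair's visual representation with a language attention vector.
        attn = self.lam(torch.cat([self.emb_lam(subj), self.emb_lam(obj)], dim=-1))
        enhanced_visual = pair_visual * attn

        # LCM: add global language context pooled over every object word in the image.
        ctx = self.lcm(self.emb_lcm(obj_labels).unsqueeze(0)).mean(dim=1)  # (1, embed_dim)
        enhanced_visual = enhanced_visual + self.ctx_proj(ctx)

        # EEM: a pair-conditioned predicate prior that updates the model's prediction.
        prior_logits = self.eem(torch.cat([self.emb_eem(subj), self.emb_eem(obj)], dim=-1))
        return enhanced_visual, rel_logits + prior_logits
```

Because a module like this consumes only predicted object categories and pair-level features, it could in principle be attached to any two-stage SGG model, which is consistent with the model-agnostic claim in the abstract.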
Related papers
- Scene Graph Generation with Role-Playing Large Language Models [50.252588437973245]
Current approaches for open-vocabulary scene graph generation (OVSGG) use vision-language models such as CLIP.
We propose SDSGG, a scene-specific description-based OVSGG framework.
To capture the complicated interplay between subjects and objects, we propose a new lightweight module called mutual visual adapter.
arXiv Detail & Related papers (2024-10-20T11:40:31Z) - Part-aware Unified Representation of Language and Skeleton for Zero-shot Action Recognition [57.97930719585095]
We introduce Part-aware Unified Representation between Language and Skeleton (PURLS) to explore visual-semantic alignment at both local and global scales.
Our approach is evaluated on various skeleton/language backbones and three large-scale datasets.
The results showcase the universality and superior performance of PURLS, surpassing prior skeleton-based solutions and standard baselines from other domains.
arXiv Detail & Related papers (2024-06-19T08:22:32Z) - UniGLM: Training One Unified Language Model for Text-Attributed Graphs [31.464021556351685]
Unified Graph Language Model (UniGLM) is a graph embedding model that generalizes well to both in-domain and cross-domain TAGs.
UniGLM includes an adaptive positive sample selection technique for identifying structurally similar nodes and a lazy contrastive module that is devised to accelerate training.
arXiv Detail & Related papers (2024-06-17T19:45:21Z) - Improving Scene Graph Generation with Relation Words' Debiasing in Vision-Language Models [6.8754535229258975]
Scene Graph Generation (SGG) provides basic language representation of visual scenes.
Some test triplets are rare or even unseen during training, resulting in biased predictions.
We propose equipping SGG models with pretrained vision-language models (VLMs) to enhance their representations.
arXiv Detail & Related papers (2024-03-24T15:02:24Z) - Grounding Everything: Emerging Localization Properties in
Vision-Language Transformers [51.260510447308306]
We show that pretrained vision-language (VL) models allow for zero-shot open-vocabulary object localization without any fine-tuning.
We propose a Grounding Everything Module (GEM) that generalizes the idea of value-value attention introduced by CLIPSurgery to a self-self attention path.
We evaluate the proposed GEM framework on various benchmark tasks and datasets for semantic segmentation.
arXiv Detail & Related papers (2023-12-01T19:06:12Z) - Visually-Prompted Language Model for Fine-Grained Scene Graph Generation
in an Open World [67.03968403301143]
Scene Graph Generation (SGG) aims to extract <subject, predicate, object> relationships in images for vision understanding.
Existing re-balancing strategies try to handle the long-tail predicate problem via prior rules but are still confined to pre-defined conditions.
We propose a Cross-modal prediCate boosting (CaCao) framework, where a visually-prompted language model is learned to generate diverse fine-grained predicates.
arXiv Detail & Related papers (2023-03-23T13:06:38Z) - Decomposed Prototype Learning for Few-Shot Scene Graph Generation [28.796734816086065]
We focus on a promising new task of scene graph generation (SGG): few-shot SGG (FSSGG).
FSSGG encourages models to quickly transfer previous knowledge and recognize novel predicates from only a few examples.
We propose a novel Decomposed Prototype Learning (DPL) approach.
arXiv Detail & Related papers (2023-03-20T04:54:26Z) - Semantics-Aware Dynamic Localization and Refinement for Referring Image
Segmentation [102.25240608024063]
Referring image segmentation segments the image region described by a language expression.
We develop an algorithm that shifts from being localization-centric to segmentation-centric.
Compared to its counterparts, our method is more versatile yet effective.
arXiv Detail & Related papers (2023-03-11T08:42:40Z) - Integrating Language Guidance into Vision-based Deep Metric Learning [78.18860829585182]
We propose to learn metric spaces which encode semantic similarities as embedding-space distances.
These spaces should be transferable to classes beyond those seen during training.
However, relying on visual similarity alone causes learned embedding spaces to encode incomplete semantic context and misrepresent the semantic relations between classes.
arXiv Detail & Related papers (2022-03-16T11:06:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.