SrTR: Self-reasoning Transformer with Visual-linguistic Knowledge for
Scene Graph Generation
- URL: http://arxiv.org/abs/2212.09329v1
- Date: Mon, 19 Dec 2022 09:47:27 GMT
- Title: SrTR: Self-reasoning Transformer with Visual-linguistic Knowledge for
Scene Graph Generation
- Authors: Yuxiang Zhang, Zhenbo Liu, Shuai Wang
- Abstract summary: One-stage scene graph generation approaches infer the effective relations between entity pairs using sparse proposal sets and a few queries.
A Self-reasoning Transformer with Visual-linguistic Knowledge (SrTR) is proposed to add flexible self-reasoning ability to the model.
Inspired by large-scale pre-trained image-text foundation models, visual-linguistic prior knowledge is introduced.
- Score: 12.977857322594206
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Objects in a scene are not always related. One-stage scene graph
generation approaches are highly efficient: they infer the effective relations
between entity pairs using sparse proposal sets and a few queries. However, they
focus only on the relation between subject and object within the triplet
(subject entity, predicate entity, object entity), ignoring the relations between
subject and predicate or between predicate and object, so the model lacks
self-reasoning ability. In addition, the linguistic modality has been neglected
in one-stage methods; mining linguistic knowledge is necessary to improve the
model's reasoning ability. To address these shortcomings, a Self-reasoning
Transformer with Visual-linguistic Knowledge (SrTR) is proposed to add flexible
self-reasoning ability to the model. SrTR adopts an encoder-decoder architecture,
and a self-reasoning decoder is developed to complete three inferences over the
triplet: s+o→p, s+p→o, and p+o→s. Inspired by large-scale pre-trained image-text
foundation models, visual-linguistic prior knowledge is introduced, and a
visual-linguistic alignment strategy is designed to project visual representations
into semantic spaces with prior knowledge to aid relational reasoning. Experiments
on the Visual Genome dataset demonstrate the superiority and fast inference of the
proposed method.
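To make the decoding scheme described above concrete, the sketch below shows one way a self-reasoning step could combine two known roles of a triplet to infer the third (s+o→p, s+p→o, p+o→s), and how visual features might be projected into a CLIP-like text embedding space for visual-linguistic alignment. This is a minimal, hypothetical sketch based only on the abstract: the module names, dimensions, and the use of a frozen text encoder are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfReasoningStep(nn.Module):
    """Hypothetical sketch of one self-reasoning step: given embeddings for two
    roles of a triplet, infer the embedding of the third role by attending over
    encoder features. Dimensions and module choices are assumptions, not SrTR's code."""

    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.fuse = nn.Linear(2 * d_model, d_model)  # combine the two known roles
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                 nn.Linear(d_model, d_model))

    def forward(self, known_a, known_b, memory):
        # known_a, known_b: [B, Q, D] embeddings of the two known triplet roles
        # memory:           [B, N, D] encoder features the fused query attends over
        query = self.fuse(torch.cat([known_a, known_b], dim=-1))
        attended, _ = self.attn(query, memory, memory)
        return self.ffn(attended)  # predicted embedding of the third role

class TripletSelfReasoner(nn.Module):
    """Runs the three inference directions s+o->p, s+p->o, p+o->s."""

    def __init__(self, d_model: int = 256):
        super().__init__()
        self.so_to_p = SelfReasoningStep(d_model)
        self.sp_to_o = SelfReasoningStep(d_model)
        self.po_to_s = SelfReasoningStep(d_model)

    def forward(self, subj, pred, obj, memory):
        pred_hat = self.so_to_p(subj, obj, memory)   # s + o -> p
        obj_hat  = self.sp_to_o(subj, pred, memory)  # s + p -> o
        subj_hat = self.po_to_s(pred, obj, memory)   # p + o -> s
        return subj_hat, pred_hat, obj_hat

def visual_linguistic_alignment(visual_feats, text_embeds, proj):
    """Project visual role embeddings into a text embedding space (e.g. obtained
    from a CLIP-like pre-trained model) and score them against class-name
    embeddings; an alignment loss could then pull matched pairs together.
    All names and shapes here are assumptions for illustration."""
    v = F.normalize(proj(visual_feats), dim=-1)  # [B, Q, D_text]
    t = F.normalize(text_embeds, dim=-1)         # [C, D_text] frozen prior knowledge
    return v @ t.t()                             # similarity logits over classes
```

Under this reading of the abstract, the three predicted role embeddings would feed back into the decoder to refine the triplet queries, while the similarity scores let the linguistic prior constrain relational reasoning; the exact losses and refinement rule are not specified here.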
Related papers
- InstructScene: Instruction-Driven 3D Indoor Scene Synthesis with Semantic Graph Prior [27.773451301040424]
InstructScene is a novel generative framework that integrates a semantic graph prior and a layout decoder.
We show that the proposed method surpasses existing state-of-the-art approaches by a large margin.
arXiv Detail & Related papers (2024-02-07T10:09:00Z)
- 3VL: using Trees to teach Vision & Language models compositional concepts [45.718319397947056]
We introduce the Tree-augmented Vision-Language (3VL) model architecture and training technique.
We show how Anchor, a simple technique for text unification, can be employed to filter nuisance factors.
We also exhibit how DiRe, which performs a differential relevancy comparison between VLM maps, enables us to generate compelling visualizations of a model's success or failure.
arXiv Detail & Related papers (2023-12-28T20:26:03Z)
- RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning [139.0548263507796]
We use vision transformers (ViTs) as our base model for visual reasoning.
We make better use of concepts defined as object entities and their relations to improve the reasoning ability of ViTs.
We show the resulting model, Concept-guided Vision Transformer (or RelViT for short), significantly outperforms prior approaches on HICO and GQA benchmarks.
arXiv Detail & Related papers (2022-04-24T02:46:43Z)
- RelTR: Relation Transformer for Scene Graph Generation [34.1193503312965]
We propose an end-to-end scene graph generation model RelTR with an encoder-decoder architecture.
The model infers a fixed-size set of subject-predicate-object triplets using different types of attention mechanisms.
Experiments on the Visual Genome and Open Images V6 datasets demonstrate the superior performance and fast inference of our model.
arXiv Detail & Related papers (2022-01-27T11:53:41Z)
- TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding [15.617150859765024]
We exploit the Transformer for its natural suitability on permutation-invariant 3D point cloud data.
We propose a TransRefer3D network to extract entity-and-relation aware multimodal context.
Our proposed model significantly outperforms existing approaches by up to 10.6%.
arXiv Detail & Related papers (2021-08-05T05:47:12Z)
- Unified Graph Structured Models for Video Understanding [93.72081456202672]
We propose a message passing graph neural network that explicitly models relational-temporal relations.
We show how our method is able to more effectively model relationships between relevant entities in the scene.
arXiv Detail & Related papers (2021-03-29T14:37:35Z)
- ORD: Object Relationship Discovery for Visual Dialogue Generation [60.471670447176656]
We propose an object relationship discovery (ORD) framework to preserve the object interactions for visual dialogue generation.
A hierarchical graph convolutional network (HierGCN) is proposed to retain the object nodes and neighbour relationships locally, and then refine the object-object connections globally.
Experiments show that the proposed method can significantly improve the quality of dialogue by utilising the contextual information of visual relationships.
arXiv Detail & Related papers (2020-06-15T12:25:40Z)
- Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models [65.19308052012858]
Recent Transformer-based large-scale pre-trained models have revolutionized vision-and-language (V+L) research.
We present VALUE, a set of meticulously designed probing tasks to decipher the inner workings of multimodal pre-training.
Key observation: pre-trained models exhibit a propensity for attending over text rather than images during inference.
arXiv Detail & Related papers (2020-05-15T01:06:54Z)
- Exploiting Structured Knowledge in Text via Graph-Guided Representation Learning [73.0598186896953]
We present two self-supervised tasks learning over raw text with the guidance from knowledge graphs.
Building upon entity-level masked language models, our first contribution is an entity masking scheme.
In contrast to existing paradigms, our approach uses knowledge graphs implicitly, only during pre-training.
arXiv Detail & Related papers (2020-04-29T14:22:42Z)
- Object Relational Graph with Teacher-Recommended Learning for Video Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
arXiv Detail & Related papers (2020-02-26T15:34:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.