SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense
Reasoning
- URL: http://arxiv.org/abs/2112.08587v1
- Date: Thu, 16 Dec 2021 03:16:30 GMT
- Title: SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense
Reasoning
- Authors: Zhecan Wang, Haoxuan You, Liunian Harold Li, Alireza Zareian, Suji
Park, Yiqing Liang, Kai-Wei Chang, Shih-Fu Chang
- Abstract summary: Multimodal Transformers have made great progress in the task of Visual Commonsense Reasoning.
We propose a Scene Graph Enhanced Image-Text Learning framework to incorporate visual scene graphs in commonsense reasoning.
- Score: 61.57887011165744
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Answering complex questions about images is an ambitious goal for machine
intelligence, which requires a joint understanding of images, text, and
commonsense knowledge, as well as a strong reasoning ability. Recently,
multimodal Transformers have made great progress in the task of Visual
Commonsense Reasoning (VCR), by jointly understanding visual objects and text
tokens through layers of cross-modality attention. However, these approaches do
not utilize the rich structure of the scene and the interactions between
objects, which are essential in answering complex commonsense questions. We
propose a Scene Graph Enhanced Image-Text Learning (SGEITL) framework to
incorporate visual scene graphs in commonsense reasoning. To exploit the scene
graph structure, at the model structure level, we propose a multihop graph
transformer for regularizing attention interaction among hops. As for
pre-training, a scene-graph-aware pre-training method is proposed to leverage
structure knowledge extracted in the visual scene graph. Moreover, we introduce
a method to train and generate domain-relevant visual scene graphs using
textual annotations in a weakly-supervised manner. Extensive experiments on VCR
and other tasks show a significant performance boost compared with the
state-of-the-art methods and prove the efficacy of each proposed component.
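The abstract does not spell out how the multihop graph transformer restricts attention across hops, so the following is only a minimal illustrative sketch of one plausible reading: compute pairwise hop distances over the visual scene graph of detected objects and use them to mask object-to-object attention up to a chosen hop. The function names (hop_distances, hop_attention_mask), the toy graph, and the specific masking rule are assumptions for illustration, not the SGEITL implementation.

```python
import numpy as np

def hop_distances(adj: np.ndarray, max_hops: int) -> np.ndarray:
    """All-pairs hop counts (capped at max_hops) over an undirected scene graph."""
    n = adj.shape[0]
    dist = np.where(np.eye(n, dtype=bool), 0.0, np.inf)
    walk = (adj > 0).astype(int)                # walks of length exactly 1
    for k in range(1, max_hops + 1):
        newly = (walk > 0) & np.isinf(dist)     # pairs first reached at hop k
        dist[newly] = k
        walk = walk @ (adj > 0).astype(int)     # walks of length exactly k + 1
    return dist

def hop_attention_mask(adj: np.ndarray, max_hop: int) -> np.ndarray:
    """Boolean mask: True where two object tokens may attend to each other."""
    return hop_distances(adj, max_hop) <= max_hop

# Toy scene graph over 4 detected objects (symmetric adjacency).
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]])
mask = hop_attention_mask(adj, max_hop=2)       # e.g. one mask per hop level
bias = np.where(mask, 0.0, -1e9)                # additive bias for attention logits
```

Such a boolean mask can be turned into an additive bias (zero where attention is allowed, a large negative value elsewhere) and added to the attention logits of each hop-restricted layer.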
Related papers
- VLLMs Provide Better Context for Emotion Understanding Through Common Sense Reasoning [66.23296689828152]
We leverage the capabilities of Vision-and-Large-Language Models to enhance in-context emotion classification.
In the first stage, we propose prompting VLLMs to generate descriptions in natural language of the subject's apparent emotion.
In the second stage, the descriptions serve as contextual information and, along with the image input, are used to train a transformer-based architecture.
arXiv Detail & Related papers (2024-04-10T15:09:15Z)
- From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models [81.92098140232638]
Scene graph generation (SGG) aims to parse a visual scene into an intermediate graph representation for downstream reasoning tasks.
Existing methods struggle to generate scene graphs with novel visual relation concepts.
We introduce a new open-vocabulary SGG framework based on sequence generation.
arXiv Detail & Related papers (2024-04-01T04:21:01Z)
- SelfGraphVQA: A Self-Supervised Graph Neural Network for Scene-based Question Answering [0.0]
Scene graphs have emerged as a useful tool for multimodal image analysis.
Current methods that utilize idealized annotated scene graphs struggle to generalize when using predicted scene graphs extracted from images.
Our approach extracts a scene graph from an input image using a pre-trained scene graph generator.
arXiv Detail & Related papers (2023-10-03T07:14:53Z)
- Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z)
- Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks.
Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts.
We introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
arXiv Detail & Related papers (2022-08-04T07:51:48Z)
- Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization [81.26077816854449]
We first explore the use of constituency parse trees for encoding structured input.
Second, we augment the structured input with commonsense information and study the impact of this external knowledge on visual story generation.
Third, we incorporate visual structure via bounding boxes and dense captioning to provide feedback about the characters/objects in generated images.
arXiv Detail & Related papers (2021-10-21T00:16:02Z)
- Graphhopper: Multi-Hop Scene Graph Reasoning for Visual Question Answering [13.886692497676659]
Graphhopper is a novel method that approaches the task by integrating knowledge graph reasoning, computer vision, and natural language processing techniques.
We derive a scene graph that describes the objects in the image, as well as their attributes and their mutual relationships.
A reinforcement learning agent is trained to autonomously navigate in a multi-hop manner over the extracted scene graph to generate reasoning paths (a minimal traversal sketch appears after this list).
arXiv Detail & Related papers (2021-07-13T18:33:04Z)
- A Deep Local and Global Scene-Graph Matching for Image-Text Retrieval [4.159666152160874]
Scene graph representation is a suitable method for the image-text matching challenge.
We introduce the Local and Global Scene Graph Matching (LGSGM) model that enhances the state-of-the-art method.
Our enhancement with the combination of levels can improve the performance of the baseline method by increasing the recall by more than 10% on the Flickr30k dataset.
arXiv Detail & Related papers (2021-06-04T10:33:14Z)
- Understanding the Role of Scene Graphs in Visual Question Answering [26.02889386248289]
We conduct experiments on the GQA dataset which presents a challenging set of questions requiring counting, compositionality and advanced reasoning capability.
We adopt image + question architectures for use with scene graphs, evaluate various scene graph generation techniques for unseen images, and propose a training curriculum to leverage human-annotated and auto-generated scene graphs.
We present a multi-faceted study into the use of scene graphs for Visual Question Answering, making this work the first of its kind.
arXiv Detail & Related papers (2021-01-14T07:27:37Z)
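Graphhopper, listed above, walks the extracted scene graph hop by hop to build a reasoning path. The sketch below illustrates only that path structure, not the method itself: the real system trains a reinforcement-learning policy conditioned on the question, whereas here a random edge choice stands in for the policy, and the toy graph, node names, and rollout_path helper are all hypothetical.

```python
import random

# Toy scene graph: node -> list of (relation, neighbor) edges.
scene_graph = {
    "man":      [("holding", "umbrella"), ("wearing", "coat"), ("next to", "dog")],
    "dog":      [("on", "leash"), ("next to", "man")],
    "leash":    [("held by", "man")],
    "umbrella": [("over", "man")],
    "coat":     [],
}

def rollout_path(graph, start, num_hops):
    """Walk num_hops edges from start, recording (node, relation, node) triples."""
    path, node = [], start
    for _ in range(num_hops):
        edges = graph.get(node, [])
        if not edges:
            break                              # dead end: stop early
        relation, nxt = random.choice(edges)   # placeholder for the learned policy
        path.append((node, relation, nxt))
        node = nxt
    return path

print(rollout_path(scene_graph, "man", num_hops=3))
```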
This list is automatically generated from the titles and abstracts of the papers on this site.