SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense
Reasoning
- URL: http://arxiv.org/abs/2112.08587v1
- Date: Thu, 16 Dec 2021 03:16:30 GMT
- Title: SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense
Reasoning
- Authors: Zhecan Wang, Haoxuan You, Liunian Harold Li, Alireza Zareian, Suji
Park, Yiqing Liang, Kai-Wei Chang, Shih-Fu Chang
- Abstract summary: Multimodal Transformers have made great progress in the task of Visual Commonsense Reasoning.
We propose a Scene Graph Enhanced Image-Text Learning framework to incorporate visual scene graphs into commonsense reasoning.
- Score: 61.57887011165744
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Answering complex questions about images is an ambitious goal for machine
intelligence, which requires a joint understanding of images, text, and
commonsense knowledge, as well as a strong reasoning ability. Recently,
multimodal Transformers have made great progress in the task of Visual
Commonsense Reasoning (VCR), by jointly understanding visual objects and text
tokens through layers of cross-modality attention. However, these approaches do
not exploit the rich structure of the scene and the interactions between
objects, which are essential for answering complex commonsense questions. We
propose a Scene Graph Enhanced Image-Text Learning (SGEITL) framework to
incorporate visual scene graphs into commonsense reasoning. To exploit the
scene graph structure, at the model level we propose a multi-hop graph
transformer that regularizes attention interactions among hops. For
pre-training, we propose a scene-graph-aware method that leverages structural
knowledge extracted from the visual scene graph. Moreover, we introduce a
method to train and generate domain-relevant visual scene graphs from textual
annotations in a weakly-supervised manner. Extensive experiments on VCR and
other tasks show a significant performance boost over state-of-the-art methods
and demonstrate the efficacy of each proposed component.
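
The abstract describes the multi-hop graph transformer only at a high level. Below is a minimal sketch, assuming the scene graph is available as an adjacency matrix over detected objects and that each attention head is restricted to object pairs reachable within a fixed number of hops; the function and parameter names (build_hop_masks, multi_hop_attention, w_q, ...) are hypothetical illustrations, not taken from the paper.

import torch
import torch.nn.functional as F

def build_hop_masks(adj: torch.Tensor, num_hops: int) -> torch.Tensor:
    # masks[k][i][j] is True iff node j is reachable from node i in
    # at most k+1 hops; self-loops are added so every softmax row
    # stays well defined.
    n = adj.size(0)
    reach = adj.bool() | torch.eye(n, dtype=torch.bool)
    masks = []
    for _ in range(num_hops):
        masks.append(reach.clone())
        reach = reach | ((reach.float() @ adj.float()) > 0)  # one more hop
    return torch.stack(masks)

def multi_hop_attention(x, adj, w_q, w_k, w_v):
    # x: (N, d) object features; w_q/w_k/w_v: (H, d, d_head), one head
    # per hop level. Head h may only attend within h+1 hops on the
    # graph, regularizing the attention interactions among hops.
    num_hops = w_q.size(0)
    masks = build_hop_masks(adj, num_hops)
    heads = []
    for h in range(num_hops):
        q, k, v = x @ w_q[h], x @ w_k[h], x @ w_v[h]
        scores = (q @ k.t()) / k.size(-1) ** 0.5
        scores = scores.masked_fill(~masks[h], float("-inf"))
        heads.append(F.softmax(scores, dim=-1) @ v)
    return torch.cat(heads, dim=-1)  # (N, H * d_head)

For example, with w_q = torch.randn(3, 64, 16) the three heads attend within 1, 2, and 3 hops respectively; a full model would additionally attend over text tokens, which this sketch omits.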
Related papers
- Generative Visual Commonsense Answering and Explaining with Generative Scene Graph Constructing [46.701439459096235]
We propose a novel visual commonsense reasoning generation method named G2.
It first utilizes image patches and LLMs to construct a location-free scene graph, and then answers and explains based on the scene graph's information.
We also propose automatic scene graph filtering and selection strategies to absorb valuable scene graph information during training.
arXiv Detail & Related papers (2025-01-15T04:00:36Z)
- VLLMs Provide Better Context for Emotion Understanding Through Common Sense Reasoning [66.23296689828152]
We leverage the capabilities of Vision-and-Large-Language Models to enhance in-context emotion classification.
In the first stage, we propose prompting VLLMs to generate natural-language descriptions of the subject's apparent emotion.
In the second stage, the descriptions serve as contextual information and, along with the image input, are used to train a transformer-based architecture.
arXiv Detail & Related papers (2024-04-10T15:09:15Z)
- From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models [81.92098140232638]
Scene graph generation (SGG) aims to parse a visual scene into an intermediate graph representation for downstream reasoning tasks.
Existing methods struggle to generate scene graphs with novel visual relation concepts.
We introduce a new open-vocabulary SGG framework based on sequence generation.
arXiv Detail & Related papers (2024-04-01T04:21:01Z)
- SelfGraphVQA: A Self-Supervised Graph Neural Network for Scene-based Question Answering [0.0]
Scene graphs have emerged as a useful tool for multimodal image analysis.
Current methods that utilize idealized annotated scene graphs struggle to generalize when using predicted scene graphs extracted from images.
Our approach extracts a scene graph from an input image using a pre-trained scene graph generator.
arXiv Detail & Related papers (2023-10-03T07:14:53Z)
- Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z)
- Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks.
Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts.
We introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
arXiv Detail & Related papers (2022-08-04T07:51:48Z)
- Graphhopper: Multi-Hop Scene Graph Reasoning for Visual Question Answering [13.886692497676659]
Graphhopper is a novel method that approaches the task by integrating knowledge graph reasoning, computer vision, and natural language processing techniques.
We derive a scene graph that describes the objects in the image, as well as their attributes and their mutual relationships.
A reinforcement learning agent is trained to autonomously navigate the extracted scene graph in a multi-hop manner to generate reasoning paths (a minimal traversal sketch appears after this list).
arXiv Detail & Related papers (2021-07-13T18:33:04Z)
- Understanding the Role of Scene Graphs in Visual Question Answering [26.02889386248289]
We conduct experiments on the GQA dataset, which presents a challenging set of questions requiring counting, compositionality, and advanced reasoning capability.
We adopt image + question architectures for use with scene graphs, evaluate various scene graph generation techniques for unseen images, and propose a training curriculum to leverage human-annotated and auto-generated scene graphs.
We present a multi-faceted study of the use of scene graphs for Visual Question Answering, making this work the first of its kind.
arXiv Detail & Related papers (2021-01-14T07:27:37Z)
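
As referenced in the Graphhopper entry above, here is a minimal sketch of multi-hop traversal over an extracted scene graph. It substitutes exhaustive breadth-first path enumeration for Graphhopper's learned reinforcement-learning policy, and the triples and names (TRIPLES, reasoning_paths, ...) are illustrative assumptions, not taken from the paper.

from collections import deque

# A toy scene graph as (subject, relation, object) triples, standing in
# for the output of a pre-trained scene graph generator.
TRIPLES = [
    ("man", "holding", "umbrella"),
    ("man", "wearing", "coat"),
    ("umbrella", "above", "dog"),
]

def neighbors(node):
    # Outgoing (relation, object) edges of a node.
    return [(r, o) for s, r, o in TRIPLES if s == node]

def reasoning_paths(start, max_hops=2):
    # Enumerate every relation path of length <= max_hops from start.
    # Graphhopper would instead let an RL agent pick one edge per hop.
    paths, queue = [], deque([(start, [])])
    while queue:
        node, path = queue.popleft()
        if path:
            paths.append(path)
        if len(path) < max_hops:
            for rel, obj in neighbors(node):
                queue.append((obj, path + [(node, rel, obj)]))
    return paths

# reasoning_paths("man") includes the 2-hop path
# [("man", "holding", "umbrella"), ("umbrella", "above", "dog")],
# the kind of path a QA model could score as supporting evidence.
print(reasoning_paths("man"))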