Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text
- URL: http://arxiv.org/abs/2003.13962v1
- Date: Tue, 31 Mar 2020 05:56:59 GMT
- Title: Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text
- Authors: Difei Gao, Ke Li, Ruiping Wang, Shiguang Shan, Xilin Chen
- Abstract summary: We propose a novel VQA approach, Multi-Modal Graph Neural Network (MM-GNN).
It first represents an image as a graph consisting of three sub-graphs, depicting visual, semantic, and numeric modalities respectively.
It then introduces three aggregators which guide the message passing from one graph to another to utilize the contexts in various modalities.
- Score: 93.08109196909763
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Answering questions that require reading texts in an image is challenging for
current models. One key difficulty of this task is that rare, polysemous, and
ambiguous words frequently appear in images, e.g., names of places, products,
and sports teams. To overcome this difficulty, resorting only to pre-trained
word embedding models is far from enough. A desired model should utilize the
rich information in multiple modalities of the image to help understand the
meaning of scene texts, e.g., the prominent text on a bottle is most likely to
be the brand. Following this idea, we propose a novel VQA approach, Multi-Modal
Graph Neural Network (MM-GNN). It first represents an image as a graph
consisting of three sub-graphs, depicting visual, semantic, and numeric
modalities respectively. Then, we introduce three aggregators which guide the
message passing from one graph to another to utilize the contexts in various
modalities, so as to refine the features of nodes. The updated nodes have
better features for the downstream question answering module. Experimental
evaluations show that our MM-GNN represents scene texts better and clearly
improves performance on two VQA tasks that require reading scene texts.
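The abstract describes the architecture only at a high level. As a rough illustration, below is a minimal PyTorch sketch of cross-modal aggregation in the spirit of MM-GNN: three attention-based aggregators pass messages between the visual, semantic (scene-text), and numeric node sets to refine the scene-text features. The attention form, the residual update rule, and the particular source/target pairing of the three aggregators are assumptions made for illustration, not details taken from the paper.

```python
# Minimal sketch of cross-modal graph aggregation in the spirit of MM-GNN.
# Assumptions (not taken from the paper): scaled dot-product attention,
# a residual update, and the specific pairing of source/target sub-graphs.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossGraphAggregator(nn.Module):
    """Refines target-node features by attending over source-graph nodes."""

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # target: (N_t, dim) nodes to refine; source: (N_s, dim) context nodes.
        scores = self.query(target) @ self.key(source).t() / target.size(-1) ** 0.5
        weights = F.softmax(scores, dim=-1)          # (N_t, N_s) attention weights
        context = weights @ self.value(source)       # messages passed across graphs
        fused = torch.cat([target, context], dim=-1)
        return target + torch.relu(self.update(fused))


class MMGNNSketch(nn.Module):
    """Three aggregators refine scene-text features with multi-modal context."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.vis_to_sem = CrossGraphAggregator(dim)  # visual context -> scene-text nodes
        self.sem_to_sem = CrossGraphAggregator(dim)  # scene-text nodes refine each other
        self.num_to_sem = CrossGraphAggregator(dim)  # numeric context -> scene-text nodes

    def forward(self, visual, semantic, numeric):
        semantic = self.vis_to_sem(semantic, visual)
        semantic = self.sem_to_sem(semantic, semantic)
        semantic = self.num_to_sem(semantic, numeric)
        return semantic  # refined scene-text features for the answering module


if __name__ == "__main__":
    model = MMGNNSketch(dim=256)
    vis = torch.randn(36, 256)   # e.g., detected object features
    sem = torch.randn(12, 256)   # e.g., word embeddings of OCR tokens
    num = torch.randn(12, 256)   # e.g., embeddings of numeric token values
    print(model(vis, sem, num).shape)  # torch.Size([12, 256])
```

Routing visual and numeric context into the scene-text nodes mirrors the abstract's emphasis on disambiguating scene texts; the actual model may combine the three sub-graphs differently.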
Related papers
- Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks [62.758680527838436]
Leopard is a vision-language model designed for tasks involving multiple text-rich images.
First, we curated about one million high-quality multimodal instruction-tuning examples, tailored to text-rich, multi-image scenarios.
Second, we developed an adaptive high-resolution multi-image encoding module to dynamically optimize the allocation of visual sequence length.
arXiv Detail & Related papers (2024-10-02T16:55:01Z)
- Can visual language models resolve textual ambiguity with visual cues? Let visual puns tell you! [14.84123301554462]
We present UNPIE, a novel benchmark designed to assess the impact of multimodal inputs in resolving lexical ambiguities.
Our dataset includes 1,000 puns, each accompanied by an image that explains both meanings.
The results indicate that various Socratic Models and Visual-Language Models improve over the text-only models when given visual context.
arXiv Detail & Related papers (2024-10-01T19:32:57Z)
- Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of Scene Knowledge-guided Visual Grounding.
arXiv Detail & Related papers (2023-07-21T13:06:02Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- Visually-Augmented Language Modeling [137.36789885105642]
We propose a novel pre-training framework, named VaLM, to Visually-augment text tokens with retrieved relevant images for Language Modeling.
With the visually-augmented context, VaLM uses a visual knowledge fusion layer to enable multimodal grounded language modeling.
We evaluate the proposed model on various multimodal commonsense reasoning tasks, which require visual information to excel.
arXiv Detail & Related papers (2022-05-20T13:41:12Z)
- A Picture May Be Worth a Hundred Words for Visual Question Answering [26.83504716672634]
In image understanding, it is essential to use concise but detailed image representations.
Deep visual features extracted by vision models, such as Faster R-CNN, are widely used in multiple tasks.
We propose to take description-question pairs as input, instead of deep visual features, and feed them into a language-only Transformer model.
arXiv Detail & Related papers (2021-06-25T06:13:14Z)
- Multi-Modal Reasoning Graph for Scene-Text Based Fine-Grained Image Classification and Retrieval [8.317191999275536]
This paper focuses on leveraging multi-modal content in the form of visual and textual cues to tackle the task of fine-grained image classification and retrieval.
We employ a Graph Convolutional Network to perform multi-modal reasoning and obtain relationship-enhanced features by learning a common semantic space between salient objects and text found in an image.
arXiv Detail & Related papers (2020-09-21T12:31:42Z)
- Text as Neural Operator: Image Manipulation by Text Instruction [68.53181621741632]
In this paper, we study a setting that allows users to edit an image with multiple objects using complex text instructions to add, remove, or change the objects.
The inputs of the task are multimodal including (1) a reference image and (2) an instruction in natural language that describes desired modifications to the image.
We show that the proposed model performs favorably against recent strong baselines on three public datasets.
arXiv Detail & Related papers (2020-08-11T07:07:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.