A Picture May Be Worth a Hundred Words for Visual Question Answering
- URL: http://arxiv.org/abs/2106.13445v1
- Date: Fri, 25 Jun 2021 06:13:14 GMT
- Title: A Picture May Be Worth a Hundred Words for Visual Question Answering
- Authors: Yusuke Hirota, Noa Garcia, Mayu Otani, Chenhui Chu, Yuta Nakashima,
Ittetsu Taniguchi, Takao Onoye
- Abstract summary: In image understanding, it is essential to use concise but detailed image representations.
Deep visual features extracted by vision models, such as Faster R-CNN, are prevailing used in multiple tasks.
We propose to take description-question pairs as input, instead of deep visual features, and fed them into a language-only Transformer model.
- Score: 26.83504716672634
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: How far can we go with textual representations for understanding pictures? In
image understanding, it is essential to use concise but detailed image
representations. Deep visual features extracted by vision models, such as
Faster R-CNN, are prevailing used in multiple tasks, and especially in visual
question answering (VQA). However, conventional deep visual features may
struggle to convey all the details in an image as we humans do. Meanwhile, with
recent language models' progress, descriptive text may be an alternative to
this problem. This paper delves into the effectiveness of textual
representations for image understanding in the specific context of VQA. We
propose to take description-question pairs as input, instead of deep visual
features, and fed them into a language-only Transformer model, simplifying the
process and the computational cost. We also experiment with data augmentation
techniques to increase the diversity in the training set and avoid learning
statistical bias. Extensive evaluations have shown that textual representations
require only about a hundred words to compete with deep visual features on both
VQA 2.0 and VQA-CP v2.
Related papers
- Making the V in Text-VQA Matter [1.2962828085662563]
Text-based VQA aims at answering questions by reading the text present in the images.
Recent studies have shown that the question-answer pairs in the dataset are more focused on the text present in the image.
The models trained on this dataset predict biased answers due to the lack of understanding of visual context.
arXiv Detail & Related papers (2023-08-01T05:28:13Z) - Learning the Visualness of Text Using Large Vision-Language Models [42.75864384249245]
Visual text evokes an image in a person's mind, while non-visual text fails to do so.
A method to automatically detect visualness in text will enable text-to-image retrieval and generation models to augment text with relevant images.
We curate a dataset of 3,620 English sentences and their visualness scores provided by multiple human annotators.
arXiv Detail & Related papers (2023-05-11T17:45:16Z) - Text-Aware Dual Routing Network for Visual Question Answering [11.015339851906287]
Existing approaches often fail in cases that require reading and understanding text in images to answer questions.
We propose a Text-Aware Dual Routing Network (TDR) which simultaneously handles the VQA cases with and without understanding text information in the input images.
In the branch that involves text understanding, we incorporate the Optical Character Recognition (OCR) features into the model to help understand the text in the images.
arXiv Detail & Related papers (2022-11-17T02:02:11Z) - TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation [55.83319599681002]
Text-VQA aims at answering questions that require understanding the textual cues in an image.
We develop a new method to generate high-quality and diverse QA pairs by explicitly utilizing the existing rich text available in the scene context of each image.
arXiv Detail & Related papers (2022-08-03T02:18:09Z) - NewsStories: Illustrating articles with visual summaries [49.924916589209374]
We introduce a large-scale multimodal dataset containing over 31M articles, 22M images and 1M videos.
We show that state-of-the-art image-text alignment methods are not robust to longer narratives with multiple images.
We introduce an intuitive baseline that outperforms these methods on zero-shot image-set retrieval by 10% on the GoodNews dataset.
arXiv Detail & Related papers (2022-07-26T17:34:11Z) - Visually-Augmented Language Modeling [137.36789885105642]
We propose a novel pre-training framework, named VaLM, to Visually-augment text tokens with retrieved relevant images for Language Modeling.
With the visually-augmented context, VaLM uses a visual knowledge fusion layer to enable multimodal grounded language modeling.
We evaluate the proposed model on various multimodal commonsense reasoning tasks, which require visual information to excel.
arXiv Detail & Related papers (2022-05-20T13:41:12Z) - Image Retrieval from Contextual Descriptions [22.084939474881796]
Image Retrieval from Contextual Descriptions (ImageCoDe)
Models tasked with retrieving the correct image from a set of 10 minimally contrastive candidates based on a contextual description.
Best variant achieves an accuracy of 20.9 on video frames and 59.4 on static pictures, compared with 90.8 in humans.
arXiv Detail & Related papers (2022-03-29T19:18:12Z) - Language Matters: A Weakly Supervised Pre-training Approach for Scene
Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves F-score by +2.5% and +4.8% while transferring its weights to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z) - Scaling Up Visual and Vision-Language Representation Learning With Noisy
Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z) - Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene
Text [93.08109196909763]
We propose a novel VQA approach, Multi-Modal Graph Neural Network (MM-GNN)
It first represents an image as a graph consisting of three sub-graphs, depicting visual, semantic, and numeric modalities respectively.
It then introduces three aggregators which guide the message passing from one graph to another to utilize the contexts in various modalities.
arXiv Detail & Related papers (2020-03-31T05:56:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.