Discovering the Unknown Knowns: Turning Implicit Knowledge in the
Dataset into Explicit Training Examples for Visual Question Answering
- URL: http://arxiv.org/abs/2109.06122v1
- Date: Mon, 13 Sep 2021 16:56:43 GMT
- Title: Discovering the Unknown Knowns: Turning Implicit Knowledge in the
Dataset into Explicit Training Examples for Visual Question Answering
- Authors: Jihyung Kil, Cheng Zhang, Dong Xuan, Wei-Lun Chao
- Abstract summary: We find that many of the "unknowns" to the learned VQA model are indeed "known" in the dataset implicitly.
We present a simple data augmentation pipeline SimpleAug to turn this "known" knowledge into training examples for VQA.
- Score: 18.33311267792116
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual question answering (VQA) is challenging not only because the model has
to handle multi-modal information, but also because it is just so hard to
collect sufficient training examples -- there are too many questions one can
ask about an image. As a result, a VQA model trained solely on human-annotated
examples could easily over-fit specific question styles or image contents that
are being asked, leaving the model largely ignorant about the sheer diversity
of questions. Existing methods address this issue primarily by introducing an
auxiliary task such as visual grounding, cycle consistency, or debiasing. In
this paper, we take a drastically different approach. We found that many of the
"unknowns" to the learned VQA model are indeed "known" in the dataset
implicitly. For instance, questions asking about the same object in different
images are likely paraphrases; the number of detected or annotated objects in
an image already provides the answer to the "how many" question, even if the
question has not been annotated for that image. Building upon these insights,
we present a simple data augmentation pipeline SimpleAug to turn this "known"
knowledge into training examples for VQA. We show that these augmented examples
can notably improve the learned VQA models' performance, not only on the VQA-CP
dataset with language prior shifts but also on the VQA v2 dataset without such
shifts. Our method further opens up the door to leverage weakly-labeled or
unlabeled images in a principled way to enhance VQA models. Our code and data
are publicly available at https://github.com/heendung/simpleAUG.
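
To make the counting insight from the abstract concrete, the sketch below shows how per-image object annotations could be turned into "how many" question-answer pairs. This is only a minimal illustration of the idea under assumed inputs, not the released SimpleAug pipeline; the function name `generate_count_qa`, the `detections` input format, and the question template are hypothetical.

```python
# Minimal sketch (not the authors' SimpleAug code): turn per-image object
# labels into "how many ...?" training examples, as described in the abstract.
# The input format and function name are assumptions for this illustration.
from collections import Counter
from typing import Dict, List


def generate_count_qa(detections: Dict[str, List[str]]) -> List[dict]:
    """Build counting QA pairs from detected or annotated object labels.

    `detections` maps an image id to the list of object labels found in that
    image, e.g. {"img_01": ["dog", "dog", "ball"]}.
    """
    examples = []
    for image_id, labels in detections.items():
        for obj, count in Counter(labels).items():
            examples.append({
                "image_id": image_id,
                # Naive pluralization; a real pipeline would handle this more carefully.
                "question": f"How many {obj}s are in the image?",
                "answer": str(count),
            })
    return examples


if __name__ == "__main__":
    toy_detections = {"img_01": ["dog", "dog", "ball"], "img_02": ["cat"]}
    for example in generate_count_qa(toy_detections):
        print(example)
```

The same spirit applies to the paraphrase observation: questions asking about the same annotated object in different images can be paired with those images to form additional training examples.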
Related papers
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
- Open-Set Knowledge-Based Visual Question Answering with Inference Paths [79.55742631375063]
The purpose of Knowledge-Based Visual Question Answering (KB-VQA) is to provide a correct answer to the question with the aid of external knowledge bases.
We propose a new retriever-ranker paradigm for KB-VQA, Graph pATH rankER (GATHER for brevity).
Specifically, it comprises graph construction, pruning, and path-level ranking, which not only retrieves accurate answers but also provides inference paths that explain the reasoning process.
arXiv Detail & Related papers (2023-10-12T09:12:50Z)
- Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images! [103.09776737512077]
SelTDA (Self-Taught Data Augmentation) is a strategy for finetuning large vision-language models on small-scale VQA datasets.
It generates question-answer pseudolabels directly conditioned on an image, allowing us to pseudolabel unlabeled images.
We describe a series of experiments showing that our self-taught data augmentation increases robustness to adversarially searched questions.
arXiv Detail & Related papers (2023-06-06T18:00:47Z)
- OpenViVQA: Task, Dataset, and Multimodal Fusion Models for Visual Question Answering in Vietnamese [2.7528170226206443]
We introduce the OpenViVQA dataset, the first large-scale dataset for visual question answering in Vietnamese.
The dataset consists of 11,000+ images associated with 37,000+ question-answer pairs (QAs).
Our proposed methods achieve results that are competitive with SOTA models such as SAAA, MCAN, LORA, and M4C.
arXiv Detail & Related papers (2023-05-07T03:59:31Z)
- Human-Adversarial Visual Question Answering [62.30715496829321]
We benchmark state-of-the-art VQA models against human-adversarial examples.
We find that a wide range of state-of-the-art models perform poorly when evaluated on these examples.
arXiv Detail & Related papers (2021-06-04T06:25:32Z)
- Knowledge-Routed Visual Question Reasoning: Challenges for Deep Representation Embedding [140.5911760063681]
We propose a novel dataset named Knowledge-Routed Visual Question Reasoning for VQA model evaluation.
We generate the question-answer pair based on both the Visual Genome scene graph and an external knowledge base with controlled programs.
arXiv Detail & Related papers (2020-12-14T00:33:44Z)
- Self-Supervised VQA: Answering Visual Questions using Images and Captions [38.05223339919346]
VQA models assume the availability of datasets with human-annotated Image-Question-Answer (I-Q-A) triplets for training.
We study whether models can be trained without any human-annotated Q-A pairs, but only with images and associated text captions.
arXiv Detail & Related papers (2020-12-04T01:22:05Z)
- Semantic Equivalent Adversarial Data Augmentation for Visual Question Answering [65.54116210742511]
Visual Question Answering (VQA) has achieved great success thanks to the rapid development of deep neural networks (DNNs).
In this paper, instead of directly manipulating images and questions, we use generated adversarial examples for both images and questions as the augmented data.
We find that we not only improve the overall performance on VQAv2, but can also withstand adversarial attacks effectively.
arXiv Detail & Related papers (2020-07-19T05:01:01Z)