Dealing with Missing Modalities in the Visual Question Answer-Difference
Prediction Task through Knowledge Distillation
- URL: http://arxiv.org/abs/2104.05965v1
- Date: Tue, 13 Apr 2021 06:41:11 GMT
- Title: Dealing with Missing Modalities in the Visual Question Answer-Difference
Prediction Task through Knowledge Distillation
- Authors: Jae Won Cho, Dong-Jin Kim, Jinsoo Choi, Yunjae Jung, In So Kweon
- Abstract summary: We address the issue of missing modalities in the Visual Question Answer-Difference prediction task.
We introduce a model, the "Big" Teacher, that takes the image/question/answer triplet as its input and outperforms the baseline.
- Score: 75.1682163844354
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we address the issue of missing modalities in the
Visual Question Answer-Difference prediction task and propose a novel method
to solve it. The missing modality, the ground-truth answers, is not available
at test time, so we use a privileged knowledge distillation scheme to
compensate for its absence. To do so efficiently, we first introduce a model,
the "Big" Teacher, that takes the image/question/answer triplet as its input
and outperforms the baseline, and then use a combination of models to distill
knowledge into a target network (student) that takes only the image/question
pair as its input. We evaluate our models on the VizWiz and VQA-V2 Answer
Difference datasets and show, through extensive experiments and ablations, the
performance of our method and diverse possibilities for future research.
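The privileged distillation setup described in the abstract, where a teacher trained with the answer modality transfers knowledge to a student that only sees the image/question pair, can be sketched roughly as below. This is a minimal PyTorch sketch: the module names (BigTeacher, Student, distillation_step), feature dimensions, fusion layers, and loss weighting are illustrative assumptions, not the authors' actual architecture.

```python
# Minimal sketch of privileged knowledge distillation for a missing modality.
# All names, dimensions, and hyperparameters below are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BigTeacher(nn.Module):
    """Teacher with privileged access to the ground-truth answers (I/Q/A triplet)."""
    def __init__(self, img_dim=2048, qst_dim=768, ans_dim=768, hidden=1024, num_classes=10):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(img_dim + qst_dim + ans_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, img_feat, qst_feat, ans_feat):
        return self.fuse(torch.cat([img_feat, qst_feat, ans_feat], dim=-1))

class Student(nn.Module):
    """Student sees only the image/question pair, matching the test-time setting."""
    def __init__(self, img_dim=2048, qst_dim=768, hidden=1024, num_classes=10):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(img_dim + qst_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, img_feat, qst_feat):
        return self.fuse(torch.cat([img_feat, qst_feat], dim=-1))

def distillation_step(student, teacher, img, qst, ans, target, T=2.0, alpha=0.5):
    """One training step: supervised loss on labels plus KL to the teacher's soft targets."""
    with torch.no_grad():
        teacher_logits = teacher(img, qst, ans)   # privileged forward pass (answers available)
    student_logits = student(img, qst)            # answer modality is missing here
    task_loss = F.cross_entropy(student_logits, target)
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * task_loss + (1 - alpha) * kd_loss
```

The key design point is that the privileged signal only shapes training: the teacher is frozen during distillation, and inference uses the student alone, so no answer input is ever required at test time.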
Related papers
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
- Task Formulation Matters When Learning Continually: A Case Study in Visual Question Answering [58.82325933356066]
Continual learning aims to train a model incrementally on a sequence of tasks without forgetting previous knowledge.
We present a detailed study of how different settings affect performance for Visual Question Answering.
arXiv Detail & Related papers (2022-09-30T19:12:58Z) - On Modality Bias Recognition and Reduction [70.69194431713825]
We study the modality bias problem in the context of multi-modal classification.
We propose a plug-and-play loss function method, whereby the feature space for each label is adaptively learned.
Our method yields remarkable performance improvements compared with the baselines.
arXiv Detail & Related papers (2022-02-25T13:47:09Z)
- Multi-Image Visual Question Answering [0.0]
We present an empirical study of different feature extraction methods with different loss functions.
We propose a new dataset for the task of Visual Question Answering with multiple image inputs and a single ground-truth answer.
Our final model, which utilises ResNet + RCNN image features and BERT embeddings and is inspired by the stacked attention network, achieves 39% word accuracy and 99% image accuracy on the CLEVER+TinyImagenet dataset.
arXiv Detail & Related papers (2021-12-27T14:28:04Z)
- Glimpse-Attend-and-Explore: Self-Attention for Active Visual Exploration [47.01485765231528]
Active visual exploration aims to assist an agent with a limited field of view to understand its environment based on partial observations.
We propose the Glimpse-Attend-and-Explore model which employs self-attention to guide the visual exploration instead of task-specific uncertainty maps.
Our model provides encouraging results while being less dependent on dataset bias in driving the exploration.
arXiv Detail & Related papers (2021-08-26T11:41:03Z)
- Knowledge-Routed Visual Question Reasoning: Challenges for Deep Representation Embedding [140.5911760063681]
We propose a novel dataset named Knowledge-Routed Visual Question Reasoning for VQA model evaluation.
We generate the question-answer pair based on both the Visual Genome scene graph and an external knowledge base with controlled programs.
arXiv Detail & Related papers (2020-12-14T00:33:44Z)
- Assisting Scene Graph Generation with Self-Supervision [21.89909688056478]
We propose a set of three novel yet simple self-supervision tasks and train them as auxiliary multi-tasks to the main model.
When the base model is trained from scratch with these self-supervision tasks, we achieve state-of-the-art results across all metrics and recall settings.
arXiv Detail & Related papers (2020-08-08T16:38:03Z)
- Mucko: Multi-Layer Cross-Modal Knowledge Reasoning for Fact-based Visual Question Answering [26.21870452615222]
FVQA requires external knowledge beyond visible content to answer questions about an image.
How to capture the question-oriented and information-complementary evidence remains a key challenge to solve the problem.
We propose a modality-aware heterogeneous graph convolutional network to capture evidence from different layers that is most relevant to the given question.
arXiv Detail & Related papers (2020-06-16T11:03:37Z)
- ManyModalQA: Modality Disambiguation and QA over Diverse Inputs [73.93607719921945]
We present a new multimodal question answering challenge, ManyModalQA, in which an agent must answer a question by considering three distinct modalities.
We collect our data by scraping Wikipedia and then utilize crowdsourcing to collect question-answer pairs.
arXiv Detail & Related papers (2020-01-22T14:39:28Z)