Towards Solving Multimodal Comprehension
- URL: http://arxiv.org/abs/2104.10139v1
- Date: Tue, 20 Apr 2021 17:30:27 GMT
- Title: Towards Solving Multimodal Comprehension
- Authors: Pritish Sahu, Karan Sikka, and Ajay Divakaran
- Abstract summary: This paper targets the problem of procedural multimodal machine comprehension (M3C).
This task requires an AI to comprehend given steps of multimodal instructions and then answer questions.
- Score: 12.90382979353427
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper targets the problem of procedural multimodal machine comprehension
(M3C). This task requires an AI to comprehend given steps of multimodal
instructions and then answer questions. Compared to vanilla machine
comprehension tasks where an AI is required only to understand a textual input,
procedural M3C is more challenging as the AI needs to comprehend both the
temporal and causal factors along with multimodal inputs. Recently, Yagcioglu
et al. [35] introduced the RecipeQA dataset to evaluate M3C. Our first
contribution is the introduction of two new M3C datasets, WoodworkQA and
DecorationQA, with 16K and 10K instructional procedures, respectively. We then
evaluate M3C using a textual cloze style question-answering task and highlight
an inherent bias in the question-answer generation method from [35] that
enables a naive baseline to cheat by learning from only the answer choices.
This naive baseline performs similarly to a popular question-answering method,
the Impatient Reader [6], which uses attention over both the context and the
query. We hypothesize that this naturally occurring bias in the dataset
affects even the best performing model. We verify this hypothesis and propose
an algorithm that modifies the given dataset to remove the biased elements.
Finally, we report the performance of several strong baselines on the debiased
dataset. We observe that the performance of all methods falls by 8%-16%
after correcting for the bias. We hope these datasets and the analysis will
provide valuable benchmarks and encourage further research in this area.
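
The answer-only exploit described above can be sketched concretely: a baseline that never reads the instructional context or the query, and scores the candidate answers only against each other. The heuristic below (pick the most distinctive choice by token overlap) and the example answer strings are illustrative assumptions for a minimal sketch, not the trained baseline or data used in the paper.

```python
from typing import List


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the token sets of two answer strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def answer_only_predict(choices: List[str]) -> int:
    """Pick the choice least similar to the other candidates, without ever
    reading the context or the question (a hypothetical answer-only heuristic)."""
    avg_overlap = []
    for i, c in enumerate(choices):
        others = [o for j, o in enumerate(choices) if j != i]
        avg_overlap.append(sum(token_overlap(c, o) for o in others) / len(others))
    # Lower average overlap with the other choices -> more "distinct" candidate.
    return min(range(len(choices)), key=lambda i: avg_overlap[i])


# Hypothetical cloze-style answer choices; the baseline guesses without
# seeing any of the instructional steps.
choices = [
    "Whisk the eggs until fluffy.",
    "Whisk the eggs until smooth.",
    "Whisk the eggs until combined.",
    "Preheat the oven to 180 C.",
]
print(answer_only_predict(choices))  # prints 3, the odd one out
```

If a context-free scorer of this kind matches a full context-and-query attention model such as the Impatient Reader, the answer choices themselves are leaking the label, which is the kind of bias the proposed dataset-modification algorithm is meant to remove.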
Related papers
- Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation [65.16137964758612]
We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books.
Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text.
arXiv Detail & Related papers (2024-05-31T20:15:10Z)
- Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering [25.577314828249897]
We propose a novel dataset, MUSIC-AVQA-R, crafted in two steps: rephrasing questions within the test split of a public dataset (MUSIC-AVQA) and introducing distribution shifts to split questions.
Experimental results show that this architecture achieves state-of-the-art performance on MUSIC-AVQA-R, notably obtaining a significant improvement of 9.32%.
arXiv Detail & Related papers (2024-04-18T09:16:02Z)
- Clarify When Necessary: Resolving Ambiguity Through Interaction with LMs [58.620269228776294]
We propose a task-agnostic framework for resolving ambiguity by asking users clarifying questions.
We evaluate systems across three NLP applications: question answering, machine translation and natural language inference.
We find that intent-sim is robust, demonstrating improvements across a wide range of NLP tasks and LMs.
arXiv Detail & Related papers (2023-11-16T00:18:50Z)
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
- Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering [124.16250115608604]
We present Science Question Answering (SQA), a new benchmark that consists of 21k multimodal multiple choice questions with a diverse set of science topics and annotations of their answers with corresponding lectures and explanations.
We show that SQA improves the question answering performance by 1.20% in few-shot GPT-3 and 3.99% in fine-tuned UnifiedQA.
Our analysis further shows that language models, similar to humans, benefit from explanations to learn from fewer data and achieve the same performance with just 40% of the data.
arXiv Detail & Related papers (2022-09-20T07:04:24Z)
- Unsupervised Question Answering via Answer Diversifying [44.319944418802095]
We propose a novel unsupervised method by diversifying answers, named DiverseQA.
The proposed method is composed of three modules: data construction, data augmentation and denoising filter.
Extensive experiments show that the proposed method outperforms previous unsupervised models on five benchmark datasets.
arXiv Detail & Related papers (2022-08-23T08:57:00Z)
- Challenges in Procedural Multimodal Machine Comprehension: A Novel Way To Benchmark [14.50261153230204]
We focus on Procedural Multimodal Machine Comprehension (M3C), where a model is expected to answer questions based on a given passage (or context).
We identify three critical biases stemming from the question-answer generation process and memorization capabilities of large deep models.
We propose a systematic framework to address these biases through three Control-Knobs.
arXiv Detail & Related papers (2021-10-22T16:33:57Z)
- Dealing with Missing Modalities in the Visual Question Answer-Difference Prediction Task through Knowledge Distillation [75.1682163844354]
We address the issue of missing modalities that arises in the Visual Question Answer-Difference prediction task.
We introduce a model, the "Big" Teacher, that takes the image/question/answer triplet as its input and outperforms the baseline.
arXiv Detail & Related papers (2021-04-13T06:41:11Z)
- ManyModalQA: Modality Disambiguation and QA over Diverse Inputs [73.93607719921945]
We present a new multimodal question answering challenge, ManyModalQA, in which an agent must answer a question by considering three distinct modalities.
We collect our data by scraping Wikipedia and then utilize crowdsourcing to collect question-answer pairs.
arXiv Detail & Related papers (2020-01-22T14:39:28Z)