Challenges in Procedural Multimodal Machine Comprehension: A Novel Way To Benchmark
- URL: http://arxiv.org/abs/2110.11899v1
- Date: Fri, 22 Oct 2021 16:33:57 GMT
- Title: Challenges in Procedural Multimodal Machine Comprehension: A Novel Way To Benchmark
- Authors: Pritish Sahu, Karan Sikka, Ajay Divakaran
- Abstract summary: We focus on Multimodal Machine Reading Comprehension (M3C), where a model is expected to answer questions based on a given passage (or context).
We identify three critical biases stemming from the question-answer generation process and memorization capabilities of large deep models.
We propose a systematic framework to address these biases through three Control-Knobs.
- Score: 14.50261153230204
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We focus on Multimodal Machine Reading Comprehension (M3C) where a model is
expected to answer questions based on a given passage (or context), and the
context and the questions can be in different modalities. Previous works such
as RecipeQA have proposed datasets and cloze-style tasks for evaluation.
However, we identify three critical biases stemming from the question-answer
generation process and memorization capabilities of large deep models. These
biases make it easier for a model to overfit by relying on spurious
correlations or naive data patterns. We propose a systematic framework to
address these biases through three Control-Knobs that enable us to generate a
test bed of datasets of progressive difficulty levels. We believe that our
benchmark (referred to as Meta-RecipeQA) will provide, for the first time, a
fine-grained estimate of a model's generalization capabilities. We also propose
a general M3C model that is used to realize several prior SOTA models and
motivate a novel hierarchical transformer-based reasoning network (HTRN). We
perform a detailed evaluation of these models with different language and
visual features on our benchmark. We observe a consistent improvement with HTRN
over SOTA (~18% in the Visual Cloze task and ~13% on average over all tasks).
We also observe a drop in performance across all the models when testing on
RecipeQA and the proposed Meta-RecipeQA (e.g., 83.6% versus 67.1% for HTRN), which
shows that the proposed dataset is relatively less biased. We conclude by
highlighting the impact of the control knobs with some quantitative results.
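The abstract describes three Control-Knobs that steer question-answer generation toward harder, less biased evaluation splits. As a rough mental model only, here is a minimal Python sketch in which three hypothetical knobs (names and settings invented here, not taken from the paper) enumerate a test bed of progressively harder splits.

```python
# Hypothetical illustration of the Control-Knob idea: each knob controls one
# source of bias, and sweeping their settings yields evaluation splits of
# progressive difficulty. The knob names and values below are invented for
# illustration; they are not the paper's actual definitions.
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ControlKnobs:
    distractor_pool: str   # where negative answer candidates are drawn from
    context_overlap: str   # how much test contexts overlap with training recipes
    answer_position: str   # how the blanked step is chosen within a recipe


def generate_testbed():
    """Enumerate knob settings; each combination defines one evaluation split."""
    settings = product(
        ["same_recipe", "same_domain", "open"],   # distractor_pool
        ["high", "low", "none"],                  # context_overlap
        ["fixed", "random", "adversarial"],       # answer_position
    )
    return [ControlKnobs(*s) for s in settings]


if __name__ == "__main__":
    for level, knobs in enumerate(generate_testbed()):
        print(f"split {level:02d}: {knobs}")
```

The intended effect is that a model exploiting spurious correlations or naive data patterns should degrade as the knobs move toward harder settings, which is consistent with the RecipeQA-versus-Meta-RecipeQA drop reported above.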
Related papers
- Model ensemble instead of prompt fusion: a sample-specific knowledge transfer method for few-shot prompt tuning [85.55727213502402]
We focus on improving the few-shot performance of prompt tuning by transferring knowledge from soft prompts of source tasks.
We propose Sample-specific Ensemble of Source Models (SESoM).
SESoM learns to adjust the contribution of each source model for each target sample separately when ensembling the source models' outputs (see the sketch below).
arXiv Detail & Related papers (2022-10-23T01:33:16Z)
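A minimal sketch (assuming PyTorch) of sample-specific output ensembling in the spirit of the summary above: a small gating network scores each source model per target example, and the prediction is the weighted sum of the source models' output probabilities. The class name, gating design, and shapes are illustrative assumptions, not the SESoM architecture.

```python
# Illustrative per-sample ensemble: a gating layer produces one weight per
# source model for each target example, then mixes the source predictions.
import torch
import torch.nn as nn


class SampleSpecificEnsemble(nn.Module):
    def __init__(self, num_sources: int, hidden_dim: int):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, num_sources)  # per-sample source weights

    def forward(self, target_repr, source_probs):
        # target_repr: (batch, hidden_dim) encoding of the target example
        # source_probs: (batch, num_sources, num_classes) source model predictions
        weights = torch.softmax(self.gate(target_repr), dim=-1)   # (batch, num_sources)
        return (weights.unsqueeze(-1) * source_probs).sum(dim=1)  # (batch, num_classes)


if __name__ == "__main__":
    ens = SampleSpecificEnsemble(num_sources=3, hidden_dim=16)
    out = ens(torch.randn(2, 16), torch.softmax(torch.randn(2, 3, 5), dim=-1))
    print(out.shape)  # torch.Size([2, 5])
```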
- Reassessing Evaluation Practices in Visual Question Answering: A Case Study on Out-of-Distribution Generalization [27.437077941786768]
Vision-and-language (V&L) models pretrained on large-scale multimodal data have demonstrated strong performance on various tasks.
We evaluate two pretrained V&L models under different settings by conducting cross-dataset evaluations.
We find that these models tend to learn to solve the benchmark, rather than learning the high-level skills required by the VQA task.
arXiv Detail & Related papers (2022-05-24T16:44:45Z)
- Beyond Accuracy: A Consolidated Tool for Visual Question Answering Benchmarking [30.155625852894797]
We propose a browser-based benchmarking tool for researchers and challenge organizers.
Our tool helps test generalization capabilities of models across multiple datasets.
Interactive filtering facilitates discovery of problematic behavior.
arXiv Detail & Related papers (2021-10-11T11:08:35Z)
- Comparing Test Sets with Item Response Theory [53.755064720563]
We evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples.
We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models.
We also observe that the span-selection task format, used by QA datasets such as QAMR and SQuAD 2.0, is effective in differentiating between strong and weak models (a sketch of the item-response curve follows).
arXiv Detail & Related papers (2021-06-01T22:33:53Z)
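For context on the item-response-theory comparison above, a minimal sketch of the standard two-parameter logistic (2PL) response curve; the parameter values are made up for illustration, and the paper's exact fitting procedure is not reproduced here.

```python
# Two-parameter logistic (2PL) IRT model: the probability that a "subject"
# (here, a pretrained model) with ability theta answers a test item correctly,
# given the item's discrimination a and difficulty b. Values are illustrative.
import math


def p_correct(theta: float, a: float, b: float) -> float:
    """2PL item response curve: sigmoid(a * (theta - b))."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


if __name__ == "__main__":
    # A highly discriminative item (a=2.0) separates strong and weak models sharply.
    for theta in (-1.0, 0.0, 1.0):
        print(theta, round(p_correct(theta, a=2.0, b=0.0), 3))
```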
- When Liebig's Barrel Meets Facial Landmark Detection: A Practical Model [87.25037167380522]
We propose a model that is accurate, robust, efficient, generalizable, and end-to-end trainable.
To achieve better accuracy, we propose two lightweight modules.
DQInit dynamically initializes the decoder queries from the inputs, enabling the model to match the accuracy of models that use multiple decoder layers.
QAMem is designed to enhance the discriminative ability of queries on low-resolution feature maps by assigning separate memory values to each query rather than a shared one.
arXiv Detail & Related papers (2021-05-27T13:51:42Z)
- Towards Solving Multimodal Comprehension [12.90382979353427]
This paper targets the problem of procedural multimodal machine comprehension (M3C).
This task requires an AI to comprehend given steps of multimodal instructions and then answer questions.
arXiv Detail & Related papers (2021-04-20T17:30:27Z)
- SRQA: Synthetic Reader for Factoid Question Answering [21.28441702154528]
We introduce a new model called SRQA, which stands for Synthetic Reader for Factoid Question Answering.
This model enhances the question answering system in the multi-document scenario from three aspects.
We perform SRQA on the WebQA dataset, and experiments show that our model outperforms the state-of-the-art models.
arXiv Detail & Related papers (2020-09-02T13:16:24Z)
- Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets (see the sketch of the prototype update below).
arXiv Detail & Related papers (2020-02-27T10:22:17Z)
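A minimal sketch (assuming PyTorch) of the transductive prototype update summarized above: prototypes are refined with a confidence-weighted mean of query embeddings. In the paper the per-query confidences are meta-learned; here a softmax over negative distances stands in for them, so this illustrates the update rule, not the proposed model.

```python
# Transductive prototype refinement: blend each class prototype with a
# confidence-weighted sum of unlabeled query embeddings. The softmax-based
# confidence below is a stand-in for the meta-learned confidence in the paper.
import torch


def refine_prototypes(prototypes, query_emb, temperature: float = 1.0):
    # prototypes: (num_classes, dim), query_emb: (num_queries, dim)
    dists = torch.cdist(query_emb, prototypes)            # (num_queries, num_classes)
    conf = torch.softmax(-dists / temperature, dim=-1)    # stand-in confidence weights
    weighted_sum = conf.t() @ query_emb                   # (num_classes, dim)
    counts = conf.sum(dim=0, keepdim=True).t()            # (num_classes, 1)
    # Treat the original prototype as one unit-weight sample and average it in.
    return (prototypes + weighted_sum) / (1.0 + counts)


if __name__ == "__main__":
    protos = refine_prototypes(torch.randn(5, 64), torch.randn(20, 64))
    print(protos.shape)  # torch.Size([5, 64])
```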
- ManyModalQA: Modality Disambiguation and QA over Diverse Inputs [73.93607719921945]
We present a new multimodal question answering challenge, ManyModalQA, in which an agent must answer a question by considering three distinct modalities.
We collect our data by scraping Wikipedia and then utilize crowdsourcing to collect question-answer pairs.
arXiv Detail & Related papers (2020-01-22T14:39:28Z)
- AvgOut: A Simple Output-Probability Measure to Eliminate Dull Responses [97.50616524350123]
We build dialogue models that are dynamically aware of which utterances or tokens are dull, without any feature engineering.
The first model, MinAvgOut, directly maximizes the diversity score through the output distributions of each batch.
The second model, Label Fine-Tuning (LFT), prepends to the source sequence a label continuously scaled by the diversity score to control the diversity level.
The third model, RL, adopts Reinforcement Learning and treats the diversity score as a reward signal.
arXiv Detail & Related papers (2020-01-15T18:32:06Z)
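As a rough illustration of the output-probability measure that the three models above build on, the sketch below (assuming PyTorch) averages per-step output distributions and scores a response as dull when it stays close to the batch-average distribution; the paper's exact score and usage differ, and the function name and normalization are invented here.

```python
# Rough, illustrative dullness/diversity measure: responses whose averaged
# per-step token distributions overlap heavily with the batch-average
# distribution are scored as dull (low diversity).
import torch


def diversity_scores(step_probs: torch.Tensor) -> torch.Tensor:
    # step_probs: (batch, seq_len, vocab) decoder output distributions
    per_seq = step_probs.mean(dim=1)                  # (batch, vocab) average per response
    batch_avg = per_seq.mean(dim=0, keepdim=True)     # (1, vocab) batch-level average
    overlap = (per_seq * batch_avg).sum(dim=-1)       # high overlap -> dull response
    return 1.0 - overlap / overlap.max()              # crude normalization to [0, 1]


if __name__ == "__main__":
    probs = torch.softmax(torch.randn(4, 7, 100), dim=-1)
    print(diversity_scores(probs))
```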