Reassessing Evaluation Practices in Visual Question Answering: A Case
Study on Out-of-Distribution Generalization
- URL: http://arxiv.org/abs/2205.12191v2
- Date: Sat, 1 Apr 2023 07:07:44 GMT
- Title: Reassessing Evaluation Practices in Visual Question Answering: A Case
Study on Out-of-Distribution Generalization
- Authors: Aishwarya Agrawal, Ivana Kajić, Emanuele Bugliarello, Elnaz Davoodi,
Anita Gergely, Phil Blunsom, Aida Nematzadeh
- Abstract summary: Vision-and-language (V&L) models pretrained on large-scale multimodal data have demonstrated strong performance on various tasks.
We evaluate two pretrained V&L models under different settings by conducting cross-dataset evaluations.
We find that these models tend to learn to solve the benchmark, rather than learning the high-level skills required by the VQA task.
- Score: 27.437077941786768
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-and-language (V&L) models pretrained on large-scale multimodal data
have demonstrated strong performance on various tasks such as image captioning
and visual question answering (VQA). The quality of such models is commonly
assessed by measuring their performance on unseen data that typically comes
from the same distribution as the training data. However, when evaluated under
out-of-distribution (out-of-dataset) settings for VQA, we observe that these
models exhibit poor generalization. We comprehensively evaluate two pretrained
V&L models under different settings (i.e. classification and open-ended text
generation) by conducting cross-dataset evaluations. We find that these models
tend to learn to solve the benchmark, rather than learning the high-level
skills required by the VQA task. We also find that in most cases generative
models are less susceptible to shifts in data distribution compared to
discriminative ones, and that multimodal pretraining is generally helpful for
OOD generalization. Finally, we revisit assumptions underlying the use of
automatic VQA evaluation metrics, and empirically show that their stringent
nature repeatedly penalizes models for correct responses.
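For context on the last point: the standard automatic VQA accuracy metric scores a prediction by exact string match against the ten human-provided answers, giving credit in proportion to annotator agreement. Below is a minimal sketch of that formula, min(#matching annotators / 3, 1); the official implementation additionally normalizes answers (articles, punctuation, contractions) and averages over annotator subsets, which this sketch omits. It illustrates how a correct but differently phrased answer, as a generative model might produce, can receive zero credit.

```python
# Minimal sketch of the standard VQA accuracy metric (VQA v1/v2 style).
# Assumption: each question has 10 human answers; a prediction earns full
# credit when at least 3 annotators gave exactly that answer.

def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Return min(#annotators whose answer exactly matches the prediction / 3, 1)."""
    pred = prediction.strip().lower()
    matches = sum(1 for ans in human_answers if ans.strip().lower() == pred)
    return min(matches / 3.0, 1.0)

# Exact matches score well, but correct paraphrases score zero:
answers = ["2", "2", "two", "2", "2", "2", "2", "two", "2", "2"]
print(vqa_accuracy("2", answers))           # 1.0
print(vqa_accuracy("two", answers))         # ~0.67 (only two exact matches)
print(vqa_accuracy("two people", answers))  # 0.0, despite being arguably correct
```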
Related papers
- Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions [75.45274978665684]
Vision-Language Understanding (VLU) benchmarks contain samples where answers rely on assumptions unsupported by the provided context.
We collect contextual data for each sample whenever available and train a context selection module to facilitate evidence-based model predictions.
We develop a general-purpose Context-AwaRe Abstention detector to identify samples lacking sufficient context and enhance model accuracy.
arXiv Detail & Related papers (2024-05-18T02:21:32Z) - BloomVQA: Assessing Hierarchical Multi-modal Comprehension [18.21961616174999]
We collect multiple-choice samples based on picture stories that reflect different levels of comprehension.
Our data maps to a novel hierarchical graph representation which enables automatic data augmentation and novel measures characterizing model consistency.
In comparison to earlier models, GPT-4V demonstrates improved accuracy across all comprehension levels and shows a tendency to bypass visual inputs, especially for higher-level tasks.
arXiv Detail & Related papers (2023-12-20T02:22:49Z) - Think Twice: Measuring the Efficiency of Eliminating Prediction
Shortcuts of Question Answering Models [3.9052860539161918]
We propose a simple method for measuring the scale of a model's reliance on any identified spurious feature.
We assess the robustness towards a large set of known and newly found prediction biases for various pre-trained models and debiasing methods in Question Answering (QA).
We find that while existing debiasing methods can mitigate reliance on a chosen spurious feature, the OOD performance gains of these methods cannot be explained by mitigated reliance on biased features.
arXiv Detail & Related papers (2023-05-11T14:35:00Z) - Towards Robust Visual Question Answering: Making the Most of Biased
Samples via Contrastive Learning [54.61762276179205]
We propose a novel contrastive learning approach, MMBS, for building robust VQA models by Making the Most of Biased Samples.
Specifically, we construct positive samples for contrastive learning by eliminating the information related to spurious correlation from the original training samples.
We validate our contributions by achieving competitive performance on the OOD dataset VQA-CP v2 while preserving robust performance on the ID dataset VQA v2.
arXiv Detail & Related papers (2022-10-10T11:05:21Z) - CONVIQT: Contrastive Video Quality Estimator [63.749184706461826]
Perceptual video quality assessment (VQA) is an integral component of many streaming and video sharing platforms.
Here we consider the problem of learning perceptually relevant video quality representations in a self-supervised manner.
Our results indicate that compelling representations with perceptual bearing can be obtained using self-supervised learning.
arXiv Detail & Related papers (2022-06-29T15:22:01Z) - General Greedy De-bias Learning [163.65789778416172]
We propose a General Greedy De-bias learning framework (GGD), which greedily trains the biased models and the base model, analogous to gradient descent in functional space.
GGD can learn a more robust base model under both settings: task-specific biased models with prior knowledge, and a self-ensemble biased model without prior knowledge.
arXiv Detail & Related papers (2021-12-20T14:47:32Z) - MUTANT: A Training Paradigm for Out-of-Distribution Generalization in
Visual Question Answering [58.30291671877342]
We present MUTANT, a training paradigm that exposes the model to perceptually similar, yet semantically distinct mutations of the input.
MUTANT establishes a new state-of-the-art accuracy on VQA-CP with a 10.57% improvement.
arXiv Detail & Related papers (2020-09-18T00:22:54Z) - Counterfactual Samples Synthesizing for Robust Visual Question Answering [104.72828511083519]
We propose a model-agnostic Counterfactual Samples Synthesizing (CSS) training scheme.
CSS generates numerous counterfactual training samples by masking critical objects in images or words in questions.
We achieve a record-breaking performance of 58.95% on VQA-CP v2, with 6.5% gains.
arXiv Detail & Related papers (2020-03-14T08:34:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.