The curse of language biases in remote sensing VQA: the role of spatial
attributes, language diversity, and the need for clear evaluation
- URL: http://arxiv.org/abs/2311.16782v1
- Date: Tue, 28 Nov 2023 13:45:15 GMT
- Title: The curse of language biases in remote sensing VQA: the role of spatial
attributes, language diversity, and the need for clear evaluation
- Authors: Christel Chappuis and Eliot Walt and Vincent Mendez and Sylvain Lobry
and Bertrand Le Saux and Devis Tuia
- Abstract summary: The goal of RSVQA is to answer a question formulated in natural language about a remote sensing image.
The problem of language biases is often overlooked in the remote sensing community.
The present work aims at highlighting the problem of language biases in RSVQA with a threefold analysis strategy.
- Score: 32.7348470366509
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Remote sensing visual question answering (RSVQA) opens new opportunities for
the use of overhead imagery by the general public, by enabling human-machine
interaction with natural language. Building on the recent advances in natural
language processing and computer vision, the goal of RSVQA is to answer a
question formulated in natural language about a remote sensing image. Language
understanding is essential to the success of the task, but has not yet been
thoroughly examined in RSVQA. In particular, the problem of language biases is
often overlooked in the remote sensing community, which can impact model
robustness and lead to wrong conclusions about the performance of the model.
Thus, the present work aims at highlighting the problem of language biases in
RSVQA with a threefold analysis strategy: visual blind models, adversarial
testing and dataset analysis. This analysis focuses on both the models and the data.
Moreover, we motivate the use of more informative and complementary evaluation
metrics sensitive to the issue. The severity of language biases in RSVQA is
then exposed across all of these methods by training models that discard the
image data and by manipulating the visual input during inference. Finally,
a detailed analysis of question-answer distribution demonstrates the root of
the problem in the data itself. Thanks to this analytical study, we observed
that biases in remote sensing are more severe than in standard VQA, likely due
to the specifics of existing remote sensing datasets for the task, e.g.
geographical similarities and sparsity, as well as a simpler vocabulary and
question generation strategies. While new, improved and less-biased datasets
appear necessary for the development of the promising field of RSVQA, we
demonstrate that more informed, relative evaluation metrics remain much needed
to transparently communicate results of future RSVQA methods.
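As a rough illustration of how the first two prongs of this analysis can be run (question-only "blind" models and adversarial manipulation of the visual input during inference), the sketch below assumes a generic predict(image, question) callable and a list of {image, question, answer} samples; these names are hypothetical and not taken from the paper's code. If accuracy barely drops in the blind or adversarial settings, the model is likely answering from language priors rather than from the image.

```python
import random

def accuracy(predict, samples, images):
    """predict(image, question) -> answer; images[i] is the image used for samples[i]."""
    correct = sum(
        predict(img, s["question"]) == s["answer"]
        for img, s in zip(images, samples)
    )
    return correct / len(samples)

def bias_probe(predict, samples, blank_image):
    """Accuracy under normal, visual-blind, and adversarial (shuffled-image) settings."""
    true_images = [s["image"] for s in samples]
    # Adversarial testing: pair each question with a randomly drawn, unrelated image.
    shuffled_images = random.sample(true_images, len(true_images))
    return {
        "normal": accuracy(predict, samples, true_images),
        # Visual-blind setting: every image replaced by an uninformative blank.
        "blind": accuracy(predict, samples, [blank_image] * len(samples)),
        "adversarial": accuracy(predict, samples, shuffled_images),
    }
```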
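In the same spirit, the "more informed, relative evaluation metrics" argued for above can be approximated by reporting how much of the headroom above a prior-only baseline a model actually recovers. This is a minimal sketch under that assumption, not the specific metrics proposed in the paper; the majority-answer baseline and function names are illustrative.

```python
from collections import Counter

def prior_accuracy(samples):
    """Accuracy obtained by always predicting the most frequent answer in the data."""
    counts = Counter(s["answer"] for s in samples)
    return counts.most_common(1)[0][1] / len(samples)

def relative_gain(model_accuracy, samples):
    """Fraction of the headroom above the prior-only baseline that the model recovers."""
    prior = prior_accuracy(samples)
    return (model_accuracy - prior) / (1.0 - prior) if prior < 1.0 else 0.0
```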
Related papers
- Large Vision-Language Models for Remote Sensing Visual Question Answering [0.0]
Remote Sensing Visual Question Answering (RSVQA) is a challenging task that involves interpreting complex satellite imagery to answer natural language questions.
Traditional approaches often rely on separate visual feature extractors and language processing models, which can be computationally intensive and limited in their ability to handle open-ended questions.
We propose a novel method that leverages a generative Large Vision-Language Model (LVLM) to streamline the RSVQA process.
arXiv Detail & Related papers (2024-11-16T18:32:38Z)
- Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions [75.45274978665684]
Vision-Language Understanding (VLU) benchmarks contain samples where answers rely on assumptions unsupported by the provided context.
We collect contextual data for each sample whenever available and train a context selection module to facilitate evidence-based model predictions.
We develop a general-purpose Context-AwaRe Abstention detector to identify samples lacking sufficient context and enhance model accuracy.
arXiv Detail & Related papers (2024-05-18T02:21:32Z)
- Robust Visual Question Answering: Datasets, Methods, and Future Challenges [23.59923999144776]
Visual question answering requires a system to provide an accurate natural language answer given an image and a natural language question.
Previous generic VQA methods often exhibit a tendency to memorize biases present in the training data rather than learning proper behaviors, such as grounding images before predicting answers.
Various datasets and debiasing methods have been proposed to evaluate and enhance VQA robustness, respectively.
arXiv Detail & Related papers (2023-07-21T10:12:09Z)
- Bring Your Own Data! Self-Supervised Evaluation for Large Language Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs).
We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence.
We find strong correlations between self-supervised and human-supervised evaluations.
arXiv Detail & Related papers (2023-06-23T17:59:09Z)
- Overcoming Language Bias in Remote Sensing Visual Question Answering via Adversarial Training [22.473676537463607]
Visual Question Answering (VQA) models commonly face the challenge of language bias.
We present a novel framework to reduce the language bias of the VQA for remote sensing data.
arXiv Detail & Related papers (2023-06-01T09:32:45Z)
- Multilingual Augmentation for Robust Visual Question Answering in Remote Sensing Images [19.99615698375829]
We propose a contrastive learning strategy for training robust RSVQA models against diverse question templates and words.
Experimental results demonstrate that the proposed augmented dataset is effective in improving the robustness of the RSVQA model.
arXiv Detail & Related papers (2023-04-07T21:06:58Z)
- Overcoming Language Priors with Self-supervised Learning for Visual Question Answering [62.88124382512111]
Most Visual Question Answering (VQA) models suffer from the language prior problem.
We introduce a self-supervised learning framework to solve this problem.
Our method can significantly outperform the state-of-the-art.
arXiv Detail & Related papers (2020-12-17T12:30:12Z)
- Learning from Lexical Perturbations for Consistent Visual Question Answering [78.21912474223926]
Existing Visual Question Answering (VQA) models are often fragile and sensitive to input variations.
We propose a novel approach to address this issue based on modular networks, which creates two questions related by linguistic perturbations.
We also present VQA Perturbed Pairings (VQA P2), a new, low-cost benchmark and augmentation pipeline to create controllable linguistic variations.
arXiv Detail & Related papers (2020-11-26T17:38:03Z)
- Loss re-scaling VQA: Revisiting the Language Prior Problem from a Class-imbalance View [129.392671317356]
We propose to interpret the language prior problem in VQA from a class-imbalance view.
It explicitly reveals why the VQA model tends to produce a frequent yet obviously wrong answer.
We also justify the validity of the class imbalance interpretation scheme on other computer vision tasks, such as face recognition and image classification.
arXiv Detail & Related papers (2020-10-30T00:57:17Z)
- On the General Value of Evidence, and Bilingual Scene-Text Visual Question Answering [120.64104995052189]
We present a dataset that takes a step towards addressing this problem in that it contains questions expressed in two languages.
Measuring reasoning directly encourages generalization by penalizing answers that are coincidentally correct.
The dataset reflects the scene-text version of the VQA problem, and the reasoning evaluation can be seen as a text-based version of a referring expression challenge.
arXiv Detail & Related papers (2020-02-24T13:02:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.