Unleashing the Potential of Large Language Model: Zero-shot VQA for
Flood Disaster Scenario
- URL: http://arxiv.org/abs/2312.01882v1
- Date: Mon, 4 Dec 2023 13:25:16 GMT
- Title: Unleashing the Potential of Large Language Model: Zero-shot VQA for
Flood Disaster Scenario
- Authors: Yimin Sun, Chao Wang, Yan Peng
- Abstract summary: We propose a zero-shot VQA model named Zero-shot VQA for Flood Disaster Damage Assessment (ZFDDA)
With flood disaster as the main research object, we build a Freestyle Flood Disaster Image Question Answering dataset (FFD-IQA)
This new dataset expands the question types to include free-form, multiple-choice, and yes-no questions.
Our model uses well-designed chain of thought (CoT) demonstrations to unlock the potential of the large language model.
- Score: 6.820160182829294
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual question answering (VQA) is a fundamental and essential AI task, and
VQA-based disaster scenario understanding is a hot research topic. For
instance, we can ask a VQA model questions about a disaster image, and the
answer can help identify whether anyone or anything is affected by the
disaster. However, previous VQA models for disaster damage assessment have some
shortcomings, such as limited candidate answer space, monotonous question
types, and limited answering capability of existing models. In this paper, we
propose a zero-shot VQA model named Zero-shot VQA for Flood Disaster Damage
Assessment (ZFDDA). It is a VQA model for damage assessment without
pre-training. Also, with flood disaster as the main research object, we build a
Freestyle Flood Disaster Image Question Answering dataset (FFD-IQA) to evaluate
our VQA model. This new dataset expands the question types to include
free-form, multiple-choice, and yes-no questions. At the same time, we expand
the size of the previous dataset to contain a total of 2,058 images and 22,422
question-meta ground truth pairs. Most importantly, our model uses
well-designed chain of thought (CoT) demonstrations to unlock the potential of
the large language model, allowing zero-shot VQA to show better performance in
disaster scenarios. The experimental results show that the accuracy in
answering complex questions is greatly improved with CoT prompts. Our study
provides a basis for subsequent research on VQA for other disaster scenarios.
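As a loose illustration of the CoT-prompting idea (the paper does not publish its prompts here, so the demonstration text, caption handling, and `ask_llm` interface below are assumptions rather than the authors' implementation), a zero-shot flood-damage query might be assembled like this:

```python
# Minimal sketch of CoT-style prompting for zero-shot flood-damage VQA.
# The demonstration text, image caption handling, and `ask_llm` callable are
# hypothetical; they are not the authors' published implementation.
from typing import Callable

COT_DEMO = (
    "Q: Is the road in the image passable?\n"
    "Reasoning: The road surface is covered by brown water up to the wheel "
    "level of the parked cars, so vehicles cannot drive through.\n"
    "A: no\n"
)

def build_cot_prompt(image_caption: str, question: str) -> str:
    """Combine one CoT demonstration, a textual image description, and the question."""
    return (
        f"{COT_DEMO}\n"
        f"Image description: {image_caption}\n"
        f"Q: {question}\n"
        "Reasoning:"
    )

def answer(image_caption: str, question: str, ask_llm: Callable[[str], str]) -> str:
    # The LLM is expected to continue with its reasoning followed by "A: <answer>",
    # which works for free-form, multiple-choice, and yes-no questions alike.
    completion = ask_llm(build_cot_prompt(image_caption, question))
    return completion.rsplit("A:", 1)[-1].strip()
```

In this sketch the image is assumed to have already been turned into a caption by some upstream captioning step, which the code does not show.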
Related papers
- Reducing Hallucinations: Enhancing VQA for Flood Disaster Damage
Assessment with Visual Contexts [6.820160182829294]
We propose a zero-shot VQA model named Flood Disaster VQA with Two-Stage Prompt (VQA-TSP).
The model generates the thought process in the first stage and then uses the thought process to generate the final answer in the second stage.
Overall, our method exceeds the performance of state-of-the-art zero-shot VQA models for flood disaster scenarios.
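A minimal sketch of that two-stage idea, assuming a generic `ask_llm` completion function; the prompt wording is illustrative and not taken from VQA-TSP:

```python
# Hypothetical two-stage prompting sketch: stage one elicits a reasoning trace,
# stage two conditions the final answer on that trace. `ask_llm` is a stand-in
# for whatever LLM client the actual system uses; the wording is illustrative.
from typing import Callable

def two_stage_answer(image_caption: str, question: str,
                     ask_llm: Callable[[str], str]) -> str:
    # Stage 1: generate the thought process from the image description and question.
    thought = ask_llm(
        f"Image description: {image_caption}\n"
        f"Question: {question}\n"
        "Describe, step by step, what the image shows that is relevant to the question:"
    )
    # Stage 2: feed the thought process back in and ask only for the final answer.
    return ask_llm(
        f"Image description: {image_caption}\n"
        f"Question: {question}\n"
        f"Reasoning: {thought}\n"
        "Based on the reasoning above, give the final answer only:"
    ).strip()
```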
arXiv Detail & Related papers (2023-12-21T13:45:02Z) - Exploring Question Decomposition for Zero-Shot VQA [99.32466439254821]
We investigate a question decomposition strategy for visual question answering.
We show that naive application of model-written decompositions can hurt performance.
We introduce a model-driven selective decomposition approach for second-guessing predictions and correcting errors.
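One plausible reading of "selective" decomposition, sketched under assumed interfaces (the confidence-threshold trigger and the `answer_with_confidence` and `decompose` helpers are hypothetical, not the paper's implementation):

```python
# Illustrative selective-decomposition loop with assumed interfaces.
from typing import Callable, List, Tuple

def selective_decompose(
    question: str,
    answer_with_confidence: Callable[[str], Tuple[str, float]],  # direct answer + confidence
    decompose: Callable[[str], List[str]],                       # model-written sub-questions
    threshold: float = 0.5,
) -> str:
    answer, confidence = answer_with_confidence(question)
    if confidence >= threshold:
        # Confident direct prediction: keep it, avoiding decompositions that may hurt.
        return answer
    # Low confidence: answer the sub-questions and re-ask the original question
    # with those intermediate answers as extra context.
    context = " ".join(
        f"{sub} {answer_with_confidence(sub)[0]}." for sub in decompose(question)
    )
    revised, _ = answer_with_confidence(f"{context} {question}")
    return revised
```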
arXiv Detail & Related papers (2023-10-25T23:23:57Z) - UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z) - Toward Unsupervised Realistic Visual Question Answering [70.67698100148414]
We study the problem of realistic VQA (RVQA), where a model has to reject unanswerable questions (UQs) and answer answerable ones (AQs).
We first point out 2 drawbacks in current RVQA research, where (1) datasets contain too many unchallenging UQs and (2) a large number of annotated UQs are required for training.
We propose a new testing dataset, RGQA, which combines AQs from an existing VQA dataset with around 29K human-annotated UQs.
This combines pseudo UQs obtained by randomly pairing images and questions, with an …
arXiv Detail & Related papers (2023-03-09T06:58:29Z) - Continual VQA for Disaster Response Systems [0.0]
Visual Question Answering (VQA) is a multi-modal task that involves answering questions about an input image.
The main challenge is the delay caused by generating labels for the assessment of the affected areas.
We deploy a pre-trained CLIP model, which is trained on image-text pairs.
We surpass previous state-of-the-art results on the FloodNet dataset.
arXiv Detail & Related papers (2022-09-21T12:45:51Z) - Reliable Visual Question Answering: Abstain Rather Than Answer
Incorrectly [100.60560477391732]
We promote a problem formulation for reliable visual question answering (VQA).
We analyze both their coverage, the portion of questions answered, and risk, the error on that portion.
We find that although the best-performing models achieve over 71% accuracy on the VQA v2 dataset, introducing the option to abstain limits them to answering less than 8% of the questions to achieve a low risk of error (i.e., 1%).
This motivates us to utilize a multimodal selection function to directly estimate the correctness of the predicted answers, which we show can triple the coverage from, for example, 5.0% to 16.7% at …
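For concreteness, coverage and risk under an abstention option can be computed as sketched below; the plain confidence threshold is a simplified stand-in for the learned multimodal selection function described in the paper:

```python
# Coverage = fraction of questions the model chooses to answer;
# risk = error rate on that answered subset. The confidence threshold is a
# simplified stand-in for a learned multimodal selection function.
from typing import List, Tuple

def coverage_and_risk(confidences: List[float], correct: List[bool],
                      threshold: float) -> Tuple[float, float]:
    answered = [c >= threshold for c in confidences]
    n_answered = sum(answered)
    coverage = n_answered / len(confidences)
    if n_answered == 0:
        return 0.0, 0.0
    errors = sum(1 for a, ok in zip(answered, correct) if a and not ok)
    return coverage, errors / n_answered
```

Raising the threshold lowers risk but also lowers coverage, which is exactly the trade-off the numbers above quantify.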
arXiv Detail & Related papers (2022-04-28T16:51:27Z) - VQA-Aid: Visual Question Answering for Post-Disaster Damage Assessment
and Analysis [0.7614628596146599]
A Visual Question Answering system integrated with an Unmanned Aerial Vehicle (UAV) has great potential to advance post-disaster damage assessment.
We present our recently developed VQA dataset, HurMic-VQA, collected during Hurricane Michael.
arXiv Detail & Related papers (2021-06-19T18:28:16Z) - Loss re-scaling VQA: Revisiting the LanguagePrior Problem from a
Class-imbalance View [129.392671317356]
We propose to interpret the language prior problem in VQA from a class-imbalance view.
It explicitly reveals why the VQA model tends to produce a frequent yet obviously wrong answer.
We also justify the validity of the class imbalance interpretation scheme on other computer vision tasks, such as face recognition and image classification.
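As a rough sketch of the class-imbalance reading (not the paper's exact loss), answer-frequency-based weights can be folded into the training loss so that errors on rare answers cost more than defaulting to frequent ones:

```python
# Hypothetical frequency-based loss re-scaling: rare answers get larger weights,
# so the model is penalized more for defaulting to frequent answers.
from collections import Counter
from typing import Dict, List

def answer_weights(train_answers: List[str], smoothing: float = 1.0) -> Dict[str, float]:
    counts = Counter(train_answers)
    total = sum(counts.values())
    # Inverse-frequency weighting with additive smoothing.
    return {ans: total / (counts[ans] + smoothing) for ans in counts}

def rescaled_loss(per_example_loss: List[float], answers: List[str],
                  weights: Dict[str, float]) -> float:
    # Weight each example's loss by its ground-truth answer's weight, then average.
    weighted = [l * weights.get(a, 1.0) for l, a in zip(per_example_loss, answers)]
    return sum(weighted) / len(weighted)
```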
arXiv Detail & Related papers (2020-10-30T00:57:17Z) - SQuINTing at VQA Models: Introspecting VQA Models with Sub-Questions [66.86887670416193]
We show that state-of-the-art VQA models have comparable performance in answering perception and reasoning questions, but suffer from consistency problems.
To address this shortcoming, we propose an approach called Sub-Question-aware Network Tuning (SQuINT)
We show that SQuINT improves model consistency by 5%, marginally improving performance on the Reasoning questions in VQA, while also displaying better attention maps.
arXiv Detail & Related papers (2020-01-20T01:02:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.