Answer-Driven Visual State Estimator for Goal-Oriented Visual Dialogue
- URL: http://arxiv.org/abs/2010.00361v2
- Date: Thu, 24 Mar 2022 12:55:15 GMT
- Title: Answer-Driven Visual State Estimator for Goal-Oriented Visual Dialogue
- Authors: Zipeng Xu, Fangxiang Feng, Xiaojie Wang, Yushu Yang, Huixing Jiang,
Zhongyuan Wang
- Abstract summary: We propose an Answer-Driven Visual State Estimator (ADVSE) to impose the effects of different answers on visual states.
First, we propose an Answer-Driven Focusing Attention (ADFA) to capture the answer-driven effect on visual attention.
Then, based on the focusing attention, we obtain the visual state estimation by Conditional Visual Information Fusion (CVIF).
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A goal-oriented visual dialogue involves multi-turn interactions between two
agents, a Questioner and an Oracle. At each turn, the answer given by the Oracle is of
great significance, as it provides a golden response to what the Questioner is asking
about. Based on the answer, the Questioner updates its belief about the target visual
content and raises a further question. Notably, different answers lead to different
visual beliefs and future questions. However, existing methods typically encode answers
indiscriminately together with the much longer questions, which results in weak
utilization of the answers. In this paper, we propose an Answer-Driven Visual State
Estimator (ADVSE) to impose the effects of different answers on visual states. First, we
propose Answer-Driven Focusing Attention (ADFA), which captures the answer-driven effect
on visual attention by sharpening question-related attention and adjusting it with an
answer-based logical operation at each turn. Then, based on the focusing attention, we
obtain the visual state estimate via Conditional Visual Information Fusion (CVIF), where
overall information and difference information are fused conditioned on the
question-answer state. We evaluate the proposed ADVSE on both the question-generator and
guesser tasks of the large-scale GuessWhat?! dataset and achieve state-of-the-art
performance on both. Qualitative results indicate that ADVSE leads the agent to generate
highly efficient questions and to obtain reliable visual attention throughout the
question-generation and guessing processes.
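
The abstract names two mechanisms, ADFA and CVIF, without spelling out the computation. The following is a minimal NumPy sketch of what an answer-driven attention update and a conditional fusion of overall and difference information could look like; all names (adfa_step, cvif, qa_gate), shapes, and the specific update rules (temperature-sharpened softmax, multiplicative yes/no adjustment, gated difference fusion) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the ADVSE idea (ADFA + CVIF) using NumPy only.
# The update rules below are illustrative assumptions, not the paper's code.
import numpy as np

def softmax(x, temperature=1.0):
    z = x / temperature
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def adfa_step(prev_attention, question_logits, answer, temperature=0.5):
    """Answer-Driven Focusing Attention (sketch).

    prev_attention:  attention over K image regions from the previous turn
    question_logits: question-conditioned relevance scores for the K regions
    answer:          'yes' or 'no' from the Oracle
    """
    # Sharpen the question-related attention with a low-temperature softmax.
    sharpened = softmax(question_logits, temperature)
    if answer == 'yes':
        # Keep focusing on the regions the question points to.
        updated = prev_attention * sharpened
    else:
        # Logical negation: shift focus away from the questioned regions.
        updated = prev_attention * (1.0 - sharpened)
    return updated / updated.sum()

def cvif(attention, region_feats, prev_state, qa_gate):
    """Conditional Visual Information Fusion (sketch).

    region_feats: (K, D) visual features; qa_gate in [0, 1] stands in for the
    question-answer state and decides how much 'difference' information is
    mixed with the 'overall' information.
    """
    overall = attention @ region_feats      # attended summary of the scene
    difference = overall - prev_state       # change w.r.t. the previous turn
    return overall + qa_gate * difference   # fused visual state estimate

# Toy usage with 4 regions and 8-dimensional features.
rng = np.random.default_rng(0)
regions = rng.normal(size=(4, 8))
attn = np.full(4, 0.25)
attn = adfa_step(attn, question_logits=np.array([2.0, 0.1, -1.0, 0.3]),
                 answer='no')
state = cvif(attn, regions, prev_state=np.zeros(8), qa_gate=0.5)
print(attn.round(3), state.shape)
```

In this reading, a "no" answer redirects attention away from the regions the question asked about, while the gate controls how strongly the turn-to-turn difference updates the running visual state.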
Related papers
- Ask Questions with Double Hints: Visual Question Generation with Answer-awareness and Region-reference [107.53380946417003]
We propose a novel learning paradigm to generate visual questions with answer-awareness and region-reference.
We develop a simple methodology to self-learn the visual hints without introducing any additional human annotations.
arXiv Detail & Related papers (2024-07-06T15:07:32Z)
- Weakly Supervised Visual Question Answer Generation [2.7605547688813172]
We present a weakly supervised method that synthetically generates question-answer pairs procedurally from visual information and captions.
We perform an exhaustive experimental analysis on the VQA dataset and show that our model significantly outperforms SOTA methods on BLEU scores.
arXiv Detail & Related papers (2023-06-11T08:46:42Z)
- SC-ML: Self-supervised Counterfactual Metric Learning for Debiased Visual Question Answering [10.749155815447127]
We propose a self-supervised counterfactual metric learning (SC-ML) method to focus the image features better.
SC-ML can adaptively select the question-relevant visual features to answer the question, reducing the negative influence of question-irrelevant visual features on inferring answers.
arXiv Detail & Related papers (2023-04-04T09:05:11Z)
- Equivariant and Invariant Grounding for Video Question Answering [68.33688981540998]
Most leading VideoQA models work as black boxes, which makes the visual-linguistic alignment behind the answering process obscure.
We devise a self-interpretable framework, Equivariant and Invariant Grounding for Interpretable VideoQA (EIGV).
EIGV is able to distinguish the causal scene from the environment information, and explicitly present the visual-linguistic alignment.
arXiv Detail & Related papers (2022-07-26T10:01:02Z)
- Grounding Answers for Visual Questions Asked by Visually Impaired People [16.978747012406266]
VizWiz-VQA-Grounding is the first dataset that visually grounds answers to visual questions asked by people with visual impairments.
We analyze our dataset and compare it with five VQA-Grounding datasets to demonstrate what makes it similar and different.
arXiv Detail & Related papers (2022-02-04T06:47:16Z)
- Enhancing Visual Dialog Questioner with Entity-based Strategy Learning and Augmented Guesser [43.42833961578857]
We propose a Related entity enhanced Questioner (ReeQ) that generates questions under the guidance of related entities and learns entity-based questioning strategy from human dialogs.
We also propose an Augmented Guesser (AugG) that is strong and optimized especially for the VD setting.
Experimental results on the VisDial v1.0 dataset show that our approach achieves state-of-the-art performance on both the image-guessing task and question diversity.
arXiv Detail & Related papers (2021-09-06T08:58:43Z)
- Check It Again: Progressive Visual Question Answering via Visual Entailment [12.065178204539693]
We propose a select-and-rerank (SAR) progressive framework based on Visual Entailment.
We first select the candidate answers relevant to the question or the image, then we rerank the candidate answers by a visual entailment task.
Experimental results show the effectiveness of our proposed framework, which establishes a new state-of-the-art accuracy on VQA-CP v2 with a 7.55% improvement.
arXiv Detail & Related papers (2021-06-08T18:00:38Z)
- Knowledge-Routed Visual Question Reasoning: Challenges for Deep Representation Embedding [140.5911760063681]
We propose a novel dataset named Knowledge-Routed Visual Question Reasoning for VQA model evaluation.
We generate the question-answer pair based on both the Visual Genome scene graph and an external knowledge base with controlled programs.
arXiv Detail & Related papers (2020-12-14T00:33:44Z)
- On the General Value of Evidence, and Bilingual Scene-Text Visual Question Answering [120.64104995052189]
We present a dataset that takes a step towards addressing this problem in that it contains questions expressed in two languages.
Measuring reasoning directly encourages generalization by penalizing answers that are coincidentally correct.
The dataset reflects the scene-text version of the VQA problem, and the reasoning evaluation can be seen as a text-based version of a referring expression challenge.
arXiv Detail & Related papers (2020-02-24T13:02:31Z)
- SQuINTing at VQA Models: Introspecting VQA Models with Sub-Questions [66.86887670416193]
We show that state-of-the-art VQA models have comparable performance in answering perception and reasoning questions, but suffer from consistency problems.
To address this shortcoming, we propose an approach called Sub-Question-aware Network Tuning (SQuINT).
We show that SQuINT improves model consistency by 5% and marginally improves performance on the Reasoning questions in VQA, while also producing better attention maps.
arXiv Detail & Related papers (2020-01-20T01:02:36Z)