Enhancing Visual Dialog Questioner with Entity-based Strategy Learning
and Augmented Guesser
- URL: http://arxiv.org/abs/2109.02297v1
- Date: Mon, 6 Sep 2021 08:58:43 GMT
- Title: Enhancing Visual Dialog Questioner with Entity-based Strategy Learning
and Augmented Guesser
- Authors: Duo Zheng, Zipeng Xu, Fandong Meng, Xiaojie Wang, Jiaan Wang, Jie Zhou
- Abstract summary: We propose a Related entity enhanced Questioner (ReeQ) that generates questions under the guidance of related entities and learns entity-based questioning strategy from human dialogs.
We also propose an Augmented Guesser (AugG) that is strong and especially optimized for the VD setting.
Experimental results on the VisDial v1.0 dataset show that our approach achieves state-of-the-art performance on both the image-guessing task and question diversity.
- Score: 43.42833961578857
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Considering the importance of building a good Visual Dialog (VD) Questioner,
many researchers study the topic under a Q-Bot-A-Bot image-guessing game
setting, where the Questioner needs to raise a series of questions to collect
information about an undisclosed image. Although progress has been made in
Supervised Learning (SL) and Reinforcement Learning (RL), issues still exist.
Firstly, previous methods do not provide explicit and effective guidance for
Questioner to generate visually related and informative questions. Secondly,
the effect of RL is hampered by an incompetent component, i.e., the Guesser,
who makes image predictions based on the generated dialogs and assigns rewards
accordingly. To enhance VD Questioner: 1) we propose a Related entity enhanced
Questioner (ReeQ) that generates questions under the guidance of related
entities and learns entity-based questioning strategy from human dialogs; 2) we
propose an Augmented Guesser (AugG) that is strong and especially optimized for
the VD setting. Experimental results on the VisDial v1.0 dataset show that
our approach achieves state-of-the-art performance on both the image-guessing task
and question diversity. A human study further shows that our model generates
more visually related, informative, and coherent questions.
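For readers unfamiliar with the setting, the sketch below illustrates the Q-Bot-A-Bot image-guessing game loop described above. The Questioner/Answerer/Guesser interfaces, the 10-round budget, and the candidate-pool ranking are illustrative assumptions, not the paper's actual ReeQ/AugG implementation.

```python
# Minimal sketch of the Q-Bot-A-Bot image-guessing game loop (illustrative only).
# The Questioner/Answerer/Guesser interfaces and the 10-round budget are assumptions;
# they are not the actual ReeQ/AugG implementation.

def image_guessing_game(questioner, answerer, guesser, caption,
                        secret_image, candidate_images, rounds=10):
    """Run one dialog: Q-Bot asks, A-Bot answers about the hidden image,
    and the Guesser ranks candidate images from the resulting dialog."""
    dialog = [caption]  # the Questioner only sees the caption at the start
    for _ in range(rounds):
        question = questioner.ask(dialog)                  # next question from dialog history
        answer = answerer.answer(secret_image, question)   # A-Bot sees the hidden image
        dialog.extend([question, answer])
    # The Guesser scores every candidate image against the collected dialog;
    # in RL training, its score on the secret image can serve as the reward signal.
    scores = [guesser.score(dialog, img) for img in candidate_images]
    ranked = sorted(zip(candidate_images, scores), key=lambda p: p[1], reverse=True)
    return dialog, ranked
```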
Related papers
- Ask Questions with Double Hints: Visual Question Generation with Answer-awareness and Region-reference [107.53380946417003]
We propose a novel learning paradigm to generate visual questions with answer-awareness and region-reference.
We develop a simple methodology to self-learn the visual hints without introducing any additional human annotations.
arXiv Detail & Related papers (2024-07-06T15:07:32Z) - Q&A Prompts: Discovering Rich Visual Clues through Mining Question-Answer Prompts for VQA requiring Diverse World Knowledge [10.074327344317116]
We propose Q&A Prompts to equip AI models with robust cross-modality reasoning ability.
We first use the image-answer pairs and the corresponding questions in a training set as inputs and outputs to train a visual question generation model.
We then use an image tagging model to identify various instances and send packaged image-tag pairs into the visual question generation model to generate relevant questions with the extracted image tags as answers.
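A rough sketch of that two-stage pipeline, as it reads from the abstract, is given below; the helper names (train_vqg, image_tagger) and the data layout are hypothetical stand-ins for illustration only.

```python
# Illustrative sketch of the Q&A Prompts generation pipeline described above.
# train_vqg(), image_tagger, and the data layout are hypothetical stand-ins,
# not the authors' actual code.

def build_qa_prompts(train_set, test_images, train_vqg, image_tagger):
    # Stage 1: learn a visual question generation (VQG) model that maps
    # (image, answer) -> question, using the training set's QA annotations.
    vqg = train_vqg([(ex["image"], ex["answer"], ex["question"]) for ex in train_set])

    # Stage 2: tag each image to identify instances, then ask the VQG model
    # to produce a question whose answer is each extracted tag.
    qa_prompts = []
    for image in test_images:
        for tag in image_tagger.tag(image):
            question = vqg.generate(image, answer=tag)
            qa_prompts.append({"image": image, "question": question, "answer": tag})
    return qa_prompts  # rich QA prompts later fed to the reasoning model
```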
arXiv Detail & Related papers (2024-01-19T14:22:29Z) - Weakly Supervised Visual Question Answer Generation [2.7605547688813172]
We present a weakly supervised method that synthetically generates question-answer pairs procedurally from visual information and captions.
We perform an exhaustive experimental analysis on VQA dataset and see that our model significantly outperforms SOTA methods on BLEU scores.
arXiv Detail & Related papers (2023-06-11T08:46:42Z) - MGA-VQA: Multi-Granularity Alignment for Visual Question Answering [75.55108621064726]
Learning to answer visual questions is a challenging task because the multi-modal inputs lie in two different feature spaces.
We propose a Multi-Granularity Alignment architecture for the Visual Question Answering task (MGA-VQA).
Our model splits alignment into different levels to learn better correlations without needing additional data or annotations.
arXiv Detail & Related papers (2022-01-25T22:30:54Z) - Achieving Human Parity on Visual Question Answering [67.22500027651509]
The Visual Question Answering (VQA) task utilizes both visual image and language analysis to answer a textual question with respect to an image.
This paper describes our recent research on AliceMind-MMU, which obtains results similar to, or even slightly better than, those of human beings on VQA.
This is achieved by systematically improving the VQA pipeline, including: (1) pre-training with comprehensive visual and textual feature representation; (2) effective cross-modal interaction with learning to attend; and (3) a novel knowledge mining framework with specialized expert modules for the complex VQA task.
arXiv Detail & Related papers (2021-11-17T04:25:11Z) - Knowledge-Routed Visual Question Reasoning: Challenges for Deep
Representation Embedding [140.5911760063681]
We propose a novel dataset named Knowledge-Routed Visual Question Reasoning for VQA model evaluation.
We generate the question-answer pair based on both the Visual Genome scene graph and an external knowledge base with controlled programs.
arXiv Detail & Related papers (2020-12-14T00:33:44Z) - Answer-Driven Visual State Estimator for Goal-Oriented Visual Dialogue [42.563261906213455]
We propose an Answer-Driven Visual State Estimator (ADVSE) to impose the effects of different answers on visual states.
First, we propose an Answer-Driven Focusing Attention (ADFA) to capture the answer-driven effect on visual attention.
Then, based on the focusing attention, we obtain the visual state estimation via Conditional Visual Information Fusion (CVIF).
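A toy rendering of that two-step flow is sketched below, purely as a reading aid: the tensor shapes, the way the answer re-weights attention, and the gated fusion are assumptions, not the ADVSE architecture itself.

```python
# Toy sketch of an answer-driven visual state update (illustrative assumptions only;
# tensor shapes and the gating scheme are not taken from the ADVSE paper).
import torch
import torch.nn.functional as F

def answer_driven_update(region_feats, question_emb, answer_emb, prev_state, W_q, W_a, W_g):
    """region_feats: (num_regions, d); question_emb, answer_emb, prev_state: (d,);
    W_q, W_a, W_g: (d, d) learned projection matrices."""
    # Answer-driven focusing attention: question-guided attention over image regions,
    # shifted according to the latest answer.
    q_logits = region_feats @ (W_q @ question_emb)   # (num_regions,)
    a_logits = region_feats @ (W_a @ answer_emb)     # answer-driven adjustment
    attention = F.softmax(q_logits + a_logits, dim=0)
    attended = attention @ region_feats              # (d,) focused visual summary

    # Conditional visual information fusion: gate between the previous visual state
    # and the newly attended information, conditioned on the answer.
    gate = torch.sigmoid(W_g @ answer_emb)           # (d,)
    return gate * attended + (1 - gate) * prev_state
```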
arXiv Detail & Related papers (2020-10-01T12:46:38Z) - Knowledgeable Dialogue Reading Comprehension on Key Turns [84.1784903043884]
Multi-choice machine reading comprehension (MRC) requires models to choose the correct answer from candidate options given a passage and a question.
Our research focuses on dialogue-based MRC, where the passages are multi-turn dialogues.
It suffers from two challenges: the answer selection decision is made without the support of latently helpful commonsense, and the multi-turn context may contain considerable irrelevant information.
arXiv Detail & Related papers (2020-04-29T07:04:43Z)