Unified Questioner Transformer for Descriptive Question Generation in
Goal-Oriented Visual Dialogue
- URL: http://arxiv.org/abs/2106.15550v1
- Date: Tue, 29 Jun 2021 16:36:34 GMT
- Title: Unified Questioner Transformer for Descriptive Question Generation in
Goal-Oriented Visual Dialogue
- Authors: Shoya Matsumori, Kosuke Shingyouchi, Yuki Abe, Yosuke Fukuchi, Komei
Sugiura, and Michita Imai
- Abstract summary: Building an interactive artificial intelligence that can ask questions about the real world is one of the biggest challenges for vision and language problems.
We propose a novel Questioner architecture, called Unified Questioner Transformer (UniQer), for descriptive question generation with referring expressions.
We build a goal-oriented visual dialogue task called CLEVR Ask. It synthesizes complex scenes that require the Questioner to generate descriptive questions.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Building an interactive artificial intelligence that can ask questions about
the real world is one of the biggest challenges for vision and language
problems. In particular, goal-oriented visual dialogue, where the aim of the
agent is to seek information by asking questions during a turn-taking dialogue,
has been gaining scholarly attention recently. While several existing models
based on the GuessWhat?! dataset have been proposed, the Questioner typically
asks simple category-based questions or absolute spatial questions. This might
be problematic for complex scenes where the objects share attributes or in
cases where descriptive questions are required to distinguish objects. In this
paper, we propose a novel Questioner architecture, called Unified Questioner
Transformer (UniQer), for descriptive question generation with referring
expressions. In addition, we build a goal-oriented visual dialogue task called
CLEVR Ask. It synthesizes complex scenes that require the Questioner to
generate descriptive questions. We train our model with two variants of the
CLEVR Ask dataset. The results of the quantitative and qualitative evaluations show
that UniQer outperforms the baseline.
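The abstract above describes the GuessWhat?!-style setting in which a Questioner narrows down a hidden target object over a turn-taking dialogue. As a rough illustration of that task structure only (not the authors' UniQer model: the object attributes, the symbolic question policy, and all names below are hypothetical stand-ins for the learned transformer components), the sketch shows how yes/no answers prune a candidate set until one object remains:

```python
# Minimal sketch of a goal-oriented visual dialogue loop (GuessWhat?!/CLEVR Ask style).
# Purely illustrative: a symbolic Questioner replaces the learned question generator.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Obj:
    color: str
    shape: str
    material: str


def oracle_answer(attr: str, value: str, target: Obj) -> bool:
    """The Oracle truthfully answers a yes/no question about the hidden target."""
    return getattr(target, attr) == value


def pick_question(candidates: List[Obj]) -> Tuple[str, str]:
    """Toy Questioner policy: ask about the attribute/value pair that splits the
    remaining candidates most evenly (a stand-in for a learned question generator)."""
    best, best_gap = ("color", getattr(candidates[0], "color")), float("inf")
    for attr in ("color", "shape", "material"):
        for value in {getattr(o, attr) for o in candidates}:
            k = sum(getattr(o, attr) == value for o in candidates)
            gap = abs(2 * k - len(candidates))  # distance from an even split
            if gap < best_gap:
                best, best_gap = (attr, value), gap
    return best


def play(scene: List[Obj], target: Obj, max_turns: int = 5) -> List[Obj]:
    """Turn-taking loop: ask a question, receive an answer, prune the candidates."""
    candidates = list(scene)
    for _ in range(max_turns):
        if len(candidates) <= 1:
            break
        attr, value = pick_question(candidates)
        print(f"Q: Does the target have {attr} == {value}?")
        ans = oracle_answer(attr, value, target)
        print("A:", "yes" if ans else "no")
        candidates = [o for o in candidates if (getattr(o, attr) == value) == ans]
    return candidates


if __name__ == "__main__":
    scene = [Obj("red", "cube", "metal"), Obj("red", "sphere", "rubber"),
             Obj("blue", "cube", "rubber"), Obj("blue", "cylinder", "metal")]
    print("Final guess:", play(scene, target=scene[2]))
```

In CLEVR Ask-like scenes where many objects share attributes, a policy restricted to single categorical attributes like this one needs many turns to isolate the target, which is the motivation for generating richer descriptive questions with referring expressions.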
Related papers
- Ask Questions with Double Hints: Visual Question Generation with Answer-awareness and Region-reference [107.53380946417003]
We propose a novel learning paradigm to generate visual questions with answer-awareness and region-reference.
We develop a simple methodology to self-learn the visual hints without introducing any additional human annotations.
arXiv Detail & Related papers (2024-07-06T15:07:32Z)
- CommVQA: Situating Visual Question Answering in Communicative Contexts [16.180130883242672]
We introduce CommVQA, a dataset consisting of images, image descriptions, and real-world communicative scenarios where the image might appear.
We show that access to contextual information is essential for solving CommVQA, and that providing it yields the highest-performing VQA model.
arXiv Detail & Related papers (2024-02-22T22:31:39Z)
- Qsnail: A Questionnaire Dataset for Sequential Question Generation [76.616068047362]
We present the first dataset specifically constructed for the questionnaire generation task, which comprises 13,168 human-written questionnaires.
We conduct experiments on Qsnail, and the results reveal that retrieval models and traditional generative models do not fully align with the given research topic and intents.
Despite enhancements through the chain-of-thought prompt and finetuning, questionnaires generated by language models still fall short of human-written questionnaires.
arXiv Detail & Related papers (2024-02-22T04:14:10Z)
- Keeping the Questions Conversational: Using Structured Representations to Resolve Dependency in Conversational Question Answering [26.997542897342164]
We propose a novel framework, CONVSR (CONVQA using Structured Representations) for capturing and generating intermediate representations as conversational cues.
We test our model on the QuAC and CANARD datasets and show through experimental results that our proposed framework achieves a better F1 score than the standard question rewriting model.
arXiv Detail & Related papers (2023-04-14T13:42:32Z)
- Equivariant and Invariant Grounding for Video Question Answering [68.33688981540998]
Most leading VideoQA models work as black boxes, which obscures the visual-linguistic alignment behind the answering process.
We devise a self-interpretable framework, Equivariant and Invariant Grounding for Interpretable VideoQA (EIGV).
EIGV is able to distinguish the causal scene from the environment information, and explicitly present the visual-linguistic alignment.
arXiv Detail & Related papers (2022-07-26T10:01:02Z)
- Video Dialog as Conversation about Objects Living in Space-Time [35.54055886856042]
We present COST, a new object-centric framework for video dialog that supports neural reasoning.
COST maintains and tracks object-associated dialog states, which are updated upon receiving new questions.
We evaluate COST on the DSTC7 and DSTC8 benchmarks, demonstrating its competitiveness against state-of-the-art methods.
arXiv Detail & Related papers (2022-07-08T02:34:38Z)
- A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge [39.788346536244504]
A-OKVQA is a crowdsourced dataset composed of about 25K questions.
We demonstrate the potential of this new dataset through a detailed analysis of its contents.
arXiv Detail & Related papers (2022-06-03T17:52:27Z)
- Evaluating Mixed-initiative Conversational Search Systems via User Simulation [9.066817876491053]
We propose a conversational User Simulator, called USi, for automatic evaluation of such search systems.
We show that responses generated by USi are both in line with the underlying information need and comparable to human-generated answers.
arXiv Detail & Related papers (2022-04-17T16:27:33Z)
- MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media Knowledge Extraction and Grounding [131.8797942031366]
We present a new QA evaluation benchmark with 1,384 questions over news articles that require cross-media grounding of objects in images onto text.
Specifically, the task involves multi-hop questions that require reasoning over image-caption pairs to identify the grounded visual object being referred to and then predicting a span from the news body text to answer the question.
We introduce a novel multimedia data augmentation framework, based on cross-media knowledge extraction and synthetic question-answer generation, to automatically augment data that can provide weak supervision for this task.
arXiv Detail & Related papers (2021-12-20T18:23:30Z)
- Inquisitive Question Generation for High Level Text Comprehension [60.21497846332531]
We introduce INQUISITIVE, a dataset of 19K questions that are elicited while a person is reading through a document.
We show that readers engage in a series of pragmatic strategies to seek information.
We evaluate question generation models based on GPT-2 and show that our model is able to generate reasonable questions.
arXiv Detail & Related papers (2020-10-04T19:03:39Z)
- ConvAI3: Generating Clarifying Questions for Open-Domain Dialogue Systems (ClariQ) [64.60303062063663]
This document presents a detailed description of the challenge on clarifying questions for dialogue systems (ClariQ).
The challenge is organized as part of the Conversational AI challenge series (ConvAI3) at Search Oriented Conversational AI (SCAI) EMNLP workshop in 2020.
arXiv Detail & Related papers (2020-09-23T19:48:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.