Spot the Difference: A Cooperative Object-Referring Game in
Non-Perfectly Co-Observable Scene
- URL: http://arxiv.org/abs/2203.08362v1
- Date: Wed, 16 Mar 2022 02:55:33 GMT
- Title: Spot the Difference: A Cooperative Object-Referring Game in
Non-Perfectly Co-Observable Scene
- Authors: Duo Zheng, Fandong Meng, Qingyi Si, Hairun Fan, Zipeng Xu, Jie Zhou,
Fangxiang Feng, Xiaojie Wang
- Abstract summary: This paper proposes an object-referring game in a non-perfectly co-observable visual scene.
The goal is to spot the differences between two similar visual scenes by conversing in natural language.
We construct a large-scale multimodal dataset, named SpotDiff, which contains 87k Virtual Reality images and 97k dialogs generated by self-play.
- Score: 47.7861036048079
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual dialog has made great progress since various
vision-oriented goals were introduced into the conversation, notably in
GuessWhich and GuessWhat, where the single image is visible to only one of
the two agents or to both the questioner and the answerer, respectively.
Research has concentrated on such single- or perfectly co-observable visual
scenes, while largely neglecting tasks in non-perfectly co-observable visual
scenes, where the images accessed by the two agents may not be exactly the
same, a situation that often occurs in practice. Although building common
ground in a non-perfectly co-observable visual scene through conversation is
important for advanced dialog agents, the lack of such a dialog task and a
corresponding large-scale dataset has prevented in-depth research. To break
this limitation, we propose an object-referring game in a non-perfectly
co-observable visual scene, where the goal is to spot the differences
between two similar visual scenes by conversing in natural language. The
task poses challenges in dialog strategy under non-perfect co-observability
and in categorizing objects. Correspondingly, we construct a large-scale
multimodal dataset, named SpotDiff, which contains 87k Virtual Reality
images and 97k dialogs generated by self-play. Finally, we provide benchmark
models for this task, and conduct extensive experiments to evaluate their
performance and analyze the main challenges.
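The abstract describes the game protocol only at a high level. As a purely illustrative aid, the following minimal Python sketch shows how such a cooperative spot-the-difference game could be driven in self-play; the Scene and DialogTurn classes, the self_play routine, and the toy category-matching logic are assumptions made for this sketch and are not taken from the SpotDiff paper or its dataset format.

```python
# Minimal illustrative sketch of a cooperative "spot the difference" dialog game.
# All interfaces below are hypothetical; they do NOT reproduce the SpotDiff
# paper's actual agents, data format, or evaluation protocol.

from dataclasses import dataclass


@dataclass
class Scene:
    """A toy stand-in for one agent's view: object id -> category name."""
    objects: dict


@dataclass
class DialogTurn:
    question: str
    answer: str


def self_play(scene_a: Scene, scene_b: Scene, max_turns: int = 5):
    """Agent A asks about its own objects; Agent B answers from its own,
    possibly different, scene. A then flags objects that B could not confirm."""
    dialog, suspects = [], []
    for obj_id, category in list(scene_a.objects.items())[:max_turns]:
        question = f"Do you see a {category}?"                            # A's turn
        answer = "yes" if category in scene_b.objects.values() else "no"  # B's turn
        dialog.append(DialogTurn(question, answer))
        if answer == "no":
            suspects.append(obj_id)  # candidate difference between the two scenes
    return dialog, suspects


if __name__ == "__main__":
    a = Scene({1: "chair", 2: "lamp", 3: "plant"})
    b = Scene({1: "chair", 2: "lamp", 4: "sofa"})  # the scenes differ in one object
    dialog, guessed = self_play(a, b)
    for turn in dialog:
        print(turn.question, "->", turn.answer)
    print("objects guessed to differ:", guessed)
```

In the actual task, both roles would be learned dialog agents operating on VR images rather than symbolic object lists, but the turn-taking structure is the same.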
Related papers
- Semantic-Based Active Perception for Humanoid Visual Tasks with Foveal Sensors [49.99728312519117]
The aim of this work is to establish how accurately a recent semantic-based active perception model is able to complete visual tasks that are regularly performed by humans.
This model exploits the ability of current object detectors to localize and classify a large number of object classes and to update a semantic description of a scene across multiple fixations.
In the task of scene exploration, the semantic-based method demonstrates superior performance compared to the traditional saliency-based model.
arXiv Detail & Related papers (2024-04-16T18:15:57Z)
- Unsupervised Object-Centric Learning from Multiple Unspecified Viewpoints [45.88397367354284]
We consider a novel problem of learning compositional scene representations from multiple unspecified viewpoints without using any supervision.
We propose a deep generative model which separates latent representations into a viewpoint-independent part and a viewpoint-dependent part to solve this problem.
Experiments on several specifically designed synthetic datasets have shown that the proposed method can effectively learn from multiple unspecified viewpoints.
arXiv Detail & Related papers (2024-01-03T15:09:25Z)
- Supplementing Missing Visions via Dialog for Scene Graph Generations [14.714122626081064]
We investigate a computer vision task setting with incomplete visual input data.
We propose to supplement the missing visions via natural language dialog interactions to better accomplish the task objective.
We demonstrate the feasibility of such a task setting with missing visual input and the effectiveness of our proposed dialog module as the supplementary information source.
arXiv Detail & Related papers (2022-04-23T21:46:17Z)
- Modeling Coreference Relations in Visual Dialog [18.926582410644375]
The occurrence of coreference relations in the dialog makes it a more challenging task than visual question answering.
We propose two soft constraints that can improve the model's ability to resolve coreferences in dialog in an unsupervised way.
arXiv Detail & Related papers (2022-03-06T15:22:24Z)
- Unsupervised Learning of Compositional Scene Representations from Multiple Unspecified Viewpoints [41.07379505694274]
We consider a novel problem of learning compositional scene representations from multiple unspecified viewpoints without using any supervision.
We propose a deep generative model which separates latent representations into a viewpoint-independent part and a viewpoint-dependent part to solve this problem.
Experiments on several specifically designed synthetic datasets have shown that the proposed method can effectively learn from multiple unspecified viewpoints (a toy sketch of this viewpoint-factorized latent appears after this list).
arXiv Detail & Related papers (2021-12-07T08:45:21Z)
- Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation [25.57530524167637]
Visual dialogue requires answering a series of coherent questions based on an understanding of the visual environment.
Visual grounding aims to explicitly locate related objects in the image guided by textual entities.
The multimodal incremental transformer encodes the multi-turn dialogue history together with the visual scene step by step, following the order of the dialogue, and then generates a contextually and visually coherent response.
arXiv Detail & Related papers (2021-09-17T11:39:29Z)
- Visiting the Invisible: Layer-by-Layer Completed Scene Decomposition [57.088328223220934]
Existing scene understanding systems mainly focus on recognizing the visible parts of a scene, ignoring the intact appearance of physical objects in the real world.
In this work, we propose a higher-level scene understanding system to tackle both visible and invisible parts of objects and backgrounds in a given scene.
arXiv Detail & Related papers (2021-04-12T11:37:23Z)
- ORD: Object Relationship Discovery for Visual Dialogue Generation [60.471670447176656]
We propose an object relationship discovery (ORD) framework to preserve the object interactions for visual dialogue generation.
A hierarchical graph convolutional network (HierGCN) is proposed to retain the object nodes and neighbour relationships locally, and then refine the object-object connections globally.
Experiments have proved that the proposed method can significantly improve the quality of dialogue by utilising the contextual information of visual relationships.
arXiv Detail & Related papers (2020-06-15T12:25:40Z)
- Modality-Balanced Models for Visual Dialogue [102.35406085738325]
The Visual Dialog task requires a model to exploit both image and conversational context information to generate the next response to the dialogue.
We show that previous joint-modality (history and image) models over-rely on the dialogue history and are more prone to memorizing it.
We present methods for integrating the two models via ensemble and consensus dropout fusion with shared parameters.
arXiv Detail & Related papers (2020-01-17T14:57:12Z)
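The two object-centric learning entries above both describe a generative model whose latents are split into a viewpoint-independent part and a viewpoint-dependent part. The toy sketch below (referenced from those entries) only illustrates the general idea of such a factorization; the array shapes, the fixed linear decoder, and every name in it are assumptions for illustration, not the cited authors' architecture.

```python
# Illustrative sketch of a viewpoint-factorized scene representation:
# each object gets a viewpoint-INDEPENDENT latent shared across views, and
# each view gets a viewpoint-DEPENDENT latent. This is a toy stand-in, not
# the model from the cited papers.

import numpy as np

rng = np.random.default_rng(0)

NUM_OBJECTS, NUM_VIEWS = 3, 4
OBJ_DIM, VIEW_DIM, FEATURE_DIM = 8, 4, 16

# Viewpoint-independent object latents (one row per object, shared by all views).
z_objects = rng.normal(size=(NUM_OBJECTS, OBJ_DIM))
# Viewpoint-dependent latents (one row per view, shared by all objects in that view).
z_views = rng.normal(size=(NUM_VIEWS, VIEW_DIM))

# A toy "decoder": a fixed linear map from [object latent ; view latent]
# to a per-object feature for that view. A real model would learn this mapping.
W = rng.normal(size=(OBJ_DIM + VIEW_DIM, FEATURE_DIM))


def decode(view_idx: int) -> np.ndarray:
    """Render per-object features for one viewpoint by combining the shared
    object latents with that view's latent."""
    view = np.repeat(z_views[view_idx][None, :], NUM_OBJECTS, axis=0)
    return np.concatenate([z_objects, view], axis=1) @ W


if __name__ == "__main__":
    features_view0 = decode(0)
    features_view1 = decode(1)
    # Same objects, different viewpoints: object identity comes from z_objects,
    # while the change between views comes only from z_views.
    print(features_view0.shape, features_view1.shape)  # (3, 16) (3, 16)
```

The point of the split is that object identity is carried by latents shared across views, while everything that changes with the camera is isolated in the per-view latents.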