SIMMC 2.0: A Task-oriented Dialog Dataset for Immersive Multimodal
Conversations
- URL: http://arxiv.org/abs/2104.08667v1
- Date: Sun, 18 Apr 2021 00:14:29 GMT
- Title: SIMMC 2.0: A Task-oriented Dialog Dataset for Immersive Multimodal
Conversations
- Authors: Satwik Kottur, Seungwhan Moon, Alborz Geramifard, Babak Damavandi
- Abstract summary: We present a new corpus for the Situated and Interactive Multimodal Conversations, SIMMC 2.0, aimed at building a successful multimodal assistant agent.
The dataset features 11K task-oriented dialogs (117K utterances) between a user and a virtual assistant in the shopping domain.
- Score: 9.626560177660634
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a new corpus for the Situated and Interactive Multimodal
Conversations, SIMMC 2.0, aimed at building a successful multimodal assistant
agent. Specifically, the dataset features 11K task-oriented dialogs (117K
utterances) between a user and a virtual assistant in the shopping domain
(fashion and furniture), grounded in situated and photo-realistic VR scenes.
The dialogs are collected using a two-phase pipeline, which first generates
simulated dialog flows via a novel multimodal dialog simulator we propose,
followed by manual paraphrasing of the generated utterances. In this paper, we
provide an in-depth analysis of the collected dataset, and describe in detail
the four main benchmark tasks we propose for SIMMC 2.0. The preliminary
analysis with a baseline model highlights the new challenges that the SIMMC 2.0
dataset brings, suggesting new directions for future research. Our dataset and
code will be made publicly available.
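As a concrete illustration of the two-phase collection pipeline described above, the sketch below simulates a dialog flow grounded in a toy scene and leaves the paraphrasing step as a placeholder for the manual annotation phase. The class names, intent labels, and fields are illustrative assumptions, not the released SIMMC 2.0 simulator or annotation schema.

```python
# Minimal sketch of the two-phase collection pipeline described in the abstract.
# All class names, intents, and fields below are illustrative assumptions,
# not the authors' released simulator or annotation schema.
import random
from dataclasses import dataclass

@dataclass
class SceneObject:
    object_id: int
    attributes: dict  # e.g. {"type": "jacket", "color": "black"} for a fashion scene

@dataclass
class Turn:
    speaker: str               # "user" or "assistant"
    intent: str                # hypothetical dialog-act label
    referenced_objects: list   # object ids grounded in the VR scene
    template_utterance: str    # phase-1 output, later paraphrased by annotators

def simulate_dialog_flow(scene, num_turns=6):
    """Phase 1: a toy multimodal dialog simulator that emits a grounded dialog flow."""
    intents = ["REQUEST:GET", "ASK:GET", "REQUEST:COMPARE", "REQUEST:ADD_TO_CART"]
    turns = []
    for t in range(num_turns):
        target = random.choice(scene)
        intent = random.choice(intents)
        turns.append(Turn(
            speaker="user" if t % 2 == 0 else "assistant",
            intent=intent,
            referenced_objects=[target.object_id],
            template_utterance=f"<{intent}> object {target.object_id}",
        ))
    return turns

def paraphrase(turns):
    """Phase 2 placeholder: in SIMMC 2.0 this step is performed by human annotators."""
    return [t.template_utterance for t in turns]

if __name__ == "__main__":
    scene = [SceneObject(i, {"type": "jacket", "color": "black"}) for i in range(3)]
    print(paraphrase(simulate_dialog_flow(scene)))
```

In the paper's pipeline, the second phase replaces the simulator's templated utterances with human paraphrases while the underlying dialog annotations are retained.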
Related papers
- MM-Conv: A Multi-modal Conversational Dataset for Virtual Humans [4.098892268127572]
We present a novel dataset captured using a VR headset to record conversations between participants within a physics simulator (AI2-THOR).
Our primary objective is to extend the field of co-speech gesture generation by incorporating rich contextual information within referential settings.
arXiv Detail & Related papers (2024-09-30T21:51:30Z)
- ChatterBox: Multi-round Multimodal Referring and Grounding [108.9673313949746]
We present a new benchmark and an efficient vision-language model for multi-round multimodal referring and grounding (MRG).
The proposed model, named ChatterBox, utilizes a two-branch architecture to collaboratively handle vision and language tasks.
Experiments show that ChatterBox outperforms existing models in MRG both quantitatively and qualitatively.
arXiv Detail & Related papers (2024-01-24T09:02:00Z) - Which One Are You Referring To? Multimodal Object Identification in
Situated Dialogue [50.279206765971125]
We explore three methods to tackle the problem of interpreting multimodal inputs from conversational and situational contexts.
Our best method, scene-dialogue alignment, improves performance by 20% in F1-score over the SIMMC 2.1 baselines.
arXiv Detail & Related papers (2023-02-28T15:45:20Z)
- Navigating Connected Memories with a Task-oriented Dialog System [13.117491508194242]
We propose dialogs for connected memories as a tool that empowers users to search their media collections through multi-turn, interactive conversation.
We use a new task-oriented dialog dataset, COMET, which contains 11.5K user-assistant dialogs (totaling 103K utterances) grounded in simulated personal memory graphs.
We analyze COMET, formulate four main tasks to benchmark meaningful progress, and adopt state-of-the-art language models as strong baselines.
arXiv Detail & Related papers (2022-11-15T19:31:57Z)
- Information Extraction and Human-Robot Dialogue towards Real-life Tasks: A Baseline Study with the MobileCS Dataset [52.22314870976088]
The SereTOD challenge releases the MobileCS dataset, which consists of real-world dialog transcripts between real users and customer-service staff from China Mobile.
Based on the MobileCS dataset, the SereTOD challenge defines two tasks: not only evaluating the construction of the dialogue system itself, but also examining information extraction from dialog transcripts.
This paper mainly presents a baseline study of the two tasks with the MobileCS dataset.
arXiv Detail & Related papers (2022-09-27T15:30:43Z)
- Multimodal Dialogue State Tracking [97.25466640240619]
The Video-Dialogue Transformer Network (VDTN) combines object-level and segment-level visual features and learns contextual dependencies between videos and dialogues to generate multimodal dialogue states.
arXiv Detail & Related papers (2022-06-16T03:18:42Z)
- HybriDialogue: An Information-Seeking Dialogue Dataset Grounded on Tabular and Textual Data [87.67278915655712]
We present a new dialogue dataset, HybriDialogue, which consists of crowdsourced natural conversations grounded on both Wikipedia text and tables.
The conversations are created by decomposing complex multi-hop questions into simple, realistic multi-turn dialogue interactions.
arXiv Detail & Related papers (2022-04-28T00:52:16Z)
- DialogVED: A Pre-trained Latent Variable Encoder-Decoder Model for Dialog Response Generation [80.45816053153722]
DialogVED introduces continuous latent variables into the enhanced encoder-decoder pre-training framework to increase the relevance and diversity of responses.
We conduct experiments on PersonaChat, DailyDialog, and DSTC7-AVSD benchmarks for response generation.
arXiv Detail & Related papers (2022-04-27T16:18:15Z)
- Multimodal Interactions Using Pretrained Unimodal Models for SIMMC 2.0 [1.599072005190786]
This paper presents our work on the Situated Interactive MultiModal Conversations 2.0 challenge held at Dialog State Tracking Challenge 10.
We introduce our multimodal approaches for subtasks #1 and #2 and for the generation part of subtask #5.
We achieve the third-best performance in subtasks #1 and #2, and runner-up in the generation part of subtask #5.
arXiv Detail & Related papers (2021-12-10T04:20:08Z)
- Building Goal-Oriented Dialogue Systems with Situated Visual Context [12.014793558784955]
With the surge of virtual assistants with screens, the next generation of agents is required to understand screen context.
We propose a novel multimodal conversational framework where the dialogue agent's next action and its arguments are derived jointly, conditioned on both the conversational and the visual context.
Our model can recognize visual features such as color and shape, as well as metadata-based features such as price or star rating associated with a visual entity.
arXiv Detail & Related papers (2021-11-22T23:30:52Z)
- Situated and Interactive Multimodal Conversations [21.391260370502224]
We introduce Situated Interactive MultiModal Conversations (SIMMC) as a new direction aimed at training agents.
We provide two SIMMC datasets totalling 13K human-human dialogs (169K utterances) using a multimodal Wizard-of-Oz (WoZ) setup.
We present several tasks within SIMMC as objective evaluation protocols, such as Structural API Prediction and Response Generation.
arXiv Detail & Related papers (2020-06-02T09:02:23Z)