Situated and Interactive Multimodal Conversations
- URL: http://arxiv.org/abs/2006.01460v2
- Date: Tue, 10 Nov 2020 20:21:19 GMT
- Title: Situated and Interactive Multimodal Conversations
- Authors: Seungwhan Moon, Satwik Kottur, Paul A. Crook, Ankita De, Shivani
Poddar, Theodore Levin, David Whitney, Daniel Difranco, Ahmad Beirami,
Eunjoon Cho, Rajen Subba, Alborz Geramifard
- Abstract summary: We introduce Situated Interactive MultiModal Conversations (SIMMC) as a new direction aimed at training agents that take multimodal actions grounded in a co-evolving multimodal input context in addition to the dialog history.
We provide two SIMMC datasets totalling 13K human-human dialogs (169K utterances) using a multimodal Wizard-of-Oz (WoZ) setup.
We present several tasks within SIMMC as objective evaluation protocols, such as Structural API Prediction and Response Generation.
- Score: 21.391260370502224
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Next generation virtual assistants are envisioned to handle multimodal inputs
(e.g., vision, memories of previous interactions, in addition to the user's
utterances), and perform multimodal actions (e.g., displaying a route in
addition to generating the system's utterance). We introduce Situated
Interactive MultiModal Conversations (SIMMC) as a new direction aimed at
training agents that take multimodal actions grounded in a co-evolving
multimodal input context in addition to the dialog history. We provide two
SIMMC datasets totalling ~13K human-human dialogs (~169K utterances) using a
multimodal Wizard-of-Oz (WoZ) setup, on two shopping domains: (a) furniture
(grounded in a shared virtual environment) and, (b) fashion (grounded in an
evolving set of images). We also provide logs of the items appearing in each
scene, and contextual NLU and coreference annotations, using a novel and
unified framework of SIMMC conversational acts for both user and assistant
utterances. Finally, we present several tasks within SIMMC as objective
evaluation protocols, such as Structural API Prediction and Response
Generation. We benchmark a collection of existing models on these SIMMC tasks
as strong baselines, and demonstrate rich multimodal conversational
interactions. Our data, annotations, code, and models are publicly available.
Related papers
- MEIA: Multimodal Embodied Perception and Interaction in Unknown Environments [82.67236400004826]
We introduce the Multimodal Embodied Interactive Agent (MEIA), capable of translating high-level tasks expressed in natural language into a sequence of executable actions.
Its Multimodal Environment Memory (MEM) module enables MEIA to generate executable action plans based on diverse requirements and the robot's capabilities.
arXiv Detail & Related papers (2024-02-01T02:43:20Z)
- ChatterBox: Multi-round Multimodal Referring and Grounding [108.9673313949746]
We present a new benchmark and an efficient vision-language model for multi-round multimodal referring and grounding (MRG).
The proposed model, named ChatterBox, utilizes a two-branch architecture to collaboratively handle vision and language tasks.
Experiments show that ChatterBox outperforms existing models in MRG both quantitatively and qualitatively.
arXiv Detail & Related papers (2024-01-24T09:02:00Z)
- DialCLIP: Empowering CLIP as Multi-Modal Dialog Retriever [83.33209603041013]
We propose a parameter-efficient prompt-tuning method named DialCLIP for multi-modal dialog retrieval.
Our approach introduces a multi-modal context generator to learn context features which are distilled into prompts within the pre-trained vision-language model CLIP.
To facilitate various types of retrieval, we also design multiple experts to learn mappings from CLIP outputs to the multi-modal representation space.
arXiv Detail & Related papers (2024-01-02T07:40:12Z)
- Multitask Multimodal Prompted Training for Interactive Embodied Task Completion [48.69347134411864]
Embodied MultiModal Agent (EMMA) is a unified encoder-decoder model that reasons over images and trajectories.
By unifying all tasks as text generation, EMMA learns a language of actions which facilitates transfer across tasks.
arXiv Detail & Related papers (2023-11-07T15:27:52Z)
- Revisiting Disentanglement and Fusion on Modality and Context in Conversational Multimodal Emotion Recognition [81.2011058113579]
We argue that both the feature multimodality and conversational contextualization should be properly modeled simultaneously during the feature disentanglement and fusion steps.
We propose a Contribution-aware Fusion Mechanism (CFM) and a Context Refusion Mechanism (CRM) for multimodal and context integration.
Our system achieves new state-of-the-art performance consistently.
arXiv Detail & Related papers (2023-08-08T18:11:27Z)
- Which One Are You Referring To? Multimodal Object Identification in Situated Dialogue [50.279206765971125]
We explore three methods to tackle the problem of interpreting multimodal inputs from conversational and situational contexts.
Our best method, scene-dialogue alignment, improves performance by 20% in F1-score over the SIMMC 2.1 baselines.
arXiv Detail & Related papers (2023-02-28T15:45:20Z)
- Building Goal-Oriented Dialogue Systems with Situated Visual Context [12.014793558784955]
With the surge of virtual assistants equipped with screens, the next generation of agents is required to understand screen context.
We propose a novel multimodal conversational framework in which the dialogue agent's next action and its arguments are derived jointly, conditioned on both the conversational and the visual context.
Our model can recognize visual features such as color and shape, as well as metadata-based features such as price or star rating associated with a visual entity.
arXiv Detail & Related papers (2021-11-22T23:30:52Z)
- SIMMC 2.0: A Task-oriented Dialog Dataset for Immersive Multimodal Conversations [9.626560177660634]
We present a new corpus for Situated and Interactive Multimodal Conversations, SIMMC 2.0, aimed at building a successful multimodal assistant agent.
The dataset features 11K task-oriented dialogs (117K utterances) between a user and a virtual assistant in the shopping domain.
arXiv Detail & Related papers (2021-04-18T00:14:29Z)