Building Goal-Oriented Dialogue Systems with Situated Visual Context
- URL: http://arxiv.org/abs/2111.11576v1
- Date: Mon, 22 Nov 2021 23:30:52 GMT
- Title: Building Goal-Oriented Dialogue Systems with Situated Visual Context
- Authors: Sanchit Agarwal, Jan Jezabek, Arijit Biswas, Emre Barut, Shuyang Gao,
Tagyoung Chung
- Abstract summary: With the surge of screen-equipped virtual assistants, the next generation of agents is required to understand the screen context.
We propose a novel multimodal conversational framework, where the dialogue agent's next action and its arguments are derived jointly, conditioned on both the conversational and the visual context.
Our model can recognize visual features such as color and shape, as well as metadata-based features such as price or star rating associated with a visual entity.
- Score: 12.014793558784955
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most popular goal-oriented dialogue agents are capable of understanding the
conversational context. However, with the surge of screen-equipped virtual
assistants, the next generation of agents must also understand the screen
context in order to provide a proper interactive experience and to better
understand users' goals. In this paper, we propose a novel multimodal
conversational framework in which the dialogue agent's next action and its
arguments are derived jointly, conditioned on both the conversational and the
visual context. Specifically, we propose a new model that can reason over the
visual context within a conversation and populate API arguments with visual
entities given the user query. Our model can recognize visual features such as
color and shape, as well as metadata-based features such as price or star
rating associated with a visual entity. Because suitable multimodal
conversational datasets are lacking, we also propose a novel multimodal dialog
simulator to generate synthetic training data, and we collect realistic user
data from MTurk to improve model robustness. The proposed model achieves a
reasonable 85% accuracy without high inference latency. We also demonstrate
the proposed approach in a prototypical furniture shopping experience for a
multimodal virtual assistant.
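The paper does not include an implementation, so the sketch below is only an illustration of the described joint conditioning: a single module that scores the agent's next action and, for argument filling, scores each on-screen visual entity (its visual features plus metadata such as price or star rating) against the dialogue state. All module names, dimensions, and the simple MLP/bilinear design are assumptions for illustration, not the authors' actual architecture.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: names and dimensions are illustrative assumptions only.
class MultimodalActionPredictor(nn.Module):
    def __init__(self, text_dim=256, visual_dim=64, meta_dim=16,
                 hidden_dim=256, num_actions=10):
        super().__init__()
        # Projects the dialogue-history encoding (e.g. from a sentence encoder).
        self.dialogue_proj = nn.Linear(text_dim, hidden_dim)
        # Encodes each on-screen entity: visual features (color, shape, ...)
        # concatenated with metadata features (price, star rating, ...).
        self.entity_enc = nn.Sequential(
            nn.Linear(visual_dim + meta_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # Next-action classifier conditioned on dialogue + pooled visual context.
        self.action_head = nn.Linear(2 * hidden_dim, num_actions)
        # Scores each entity as a candidate API argument given the dialogue state.
        self.arg_head = nn.Bilinear(hidden_dim, hidden_dim, 1)

    def forward(self, dialogue_vec, entity_feats):
        # dialogue_vec: (B, text_dim); entity_feats: (B, N, visual_dim + meta_dim)
        d = torch.relu(self.dialogue_proj(dialogue_vec))       # (B, H)
        e = self.entity_enc(entity_feats)                       # (B, N, H)
        visual_ctx = e.mean(dim=1)                              # pooled screen context
        action_logits = self.action_head(torch.cat([d, visual_ctx], dim=-1))
        # Score every entity against the dialogue state to populate API arguments.
        d_exp = d.unsqueeze(1).expand_as(e).contiguous()
        arg_logits = self.arg_head(d_exp, e).squeeze(-1)        # (B, N)
        return action_logits, arg_logits


# Toy usage: one dialogue, three on-screen furniture items.
model = MultimodalActionPredictor()
action_logits, arg_logits = model(torch.randn(1, 256), torch.randn(1, 3, 64 + 16))
print(action_logits.shape, arg_logits.shape)  # torch.Size([1, 10]) torch.Size([1, 3])
```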
Related papers
- ChatterBox: Multi-round Multimodal Referring and Grounding [108.9673313949746]
We present a new benchmark and an efficient vision-language model for multi-round multimodal referring and grounding (MRG).
The proposed model, named ChatterBox, utilizes a two-branch architecture to collaboratively handle vision and language tasks.
Experiments show that ChatterBox outperforms existing models in MRG both quantitatively and qualitatively.
arXiv Detail & Related papers (2024-01-24T09:02:00Z)
- DialCLIP: Empowering CLIP as Multi-Modal Dialog Retriever [83.33209603041013]
We propose a parameter-efficient prompt-tuning method named DialCLIP for multi-modal dialog retrieval.
Our approach introduces a multi-modal context generator to learn context features which are distilled into prompts within the pre-trained vision-language model CLIP.
To facilitate various types of retrieval, we also design multiple experts to learn mappings from CLIP outputs to multi-modal representation space.
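As a rough illustration of this prompt-tuning setup, the sketch below uses a frozen placeholder encoder standing in for CLIP: a trainable context generator turns dialog-context features into prompt vectors prepended to the frozen backbone's input, and a set of expert heads map the output into a retrieval space. Every name and dimension here is an assumption; this is not DialCLIP's actual implementation.

```python
import torch
import torch.nn as nn

# Schematic sketch only: the real method builds on pretrained CLIP; here a
# frozen Transformer placeholder stands in for it.
class PromptTunedRetriever(nn.Module):
    def __init__(self, ctx_dim=512, embed_dim=512, n_prompts=8, n_experts=4):
        super().__init__()
        # Frozen backbone standing in for CLIP's text tower.
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True),
            num_layers=2)
        for p in self.backbone.parameters():
            p.requires_grad = False
        # Trainable context generator: distills dialog context into prompt vectors.
        self.context_gen = nn.Linear(ctx_dim, n_prompts * embed_dim)
        self.n_prompts, self.embed_dim = n_prompts, embed_dim
        # Multiple experts mapping backbone outputs into the retrieval space.
        self.experts = nn.ModuleList(
            [nn.Linear(embed_dim, embed_dim) for _ in range(n_experts)])

    def forward(self, context_feat, token_embeds, expert_id=0):
        # context_feat: (B, ctx_dim); token_embeds: (B, T, embed_dim), kept frozen.
        prompts = self.context_gen(context_feat).view(-1, self.n_prompts, self.embed_dim)
        h = self.backbone(torch.cat([prompts, token_embeds], dim=1))
        pooled = h[:, 0]  # use the first prompt position as the summary vector
        return self.experts[expert_id](pooled)

retriever = PromptTunedRetriever()
out = retriever(torch.randn(2, 512), torch.randn(2, 16, 512))
print(out.shape)  # torch.Size([2, 512])
```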
arXiv Detail & Related papers (2024-01-02T07:40:12Z)
- 'What are you referring to?' Evaluating the Ability of Multi-Modal Dialogue Models to Process Clarificational Exchanges [65.03196674816772]
Referential ambiguities arise in dialogue when a referring expression does not uniquely identify the intended referent for the addressee.
Addressees usually detect such ambiguities immediately and work with the speaker to repair them using meta-communicative Clarification Exchanges (CEs): a Clarification Request (CR) and a response.
Here, we argue that the ability to generate and respond to CRs imposes specific constraints on the architecture and objective functions of multi-modal, visually grounded dialogue models.
arXiv Detail & Related papers (2023-07-28T13:44:33Z)
- IMAD: IMage-Augmented multi-modal Dialogue [0.043847653914745384]
This paper presents a novel perspective on multi-modal dialogue systems, which interprets the image in the context of the dialogue.
We propose a two-stage approach to automatically construct a multi-modal dialogue dataset.
In the first stage, we utilize text-to-image similarity and sentence similarity to identify which utterances could be replaced with an image.
In the second stage, we replace those utterances by selecting a subset of relevant images and filtering them with a visual question answering model.
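A minimal sketch of how such a two-stage construction could look is given below; the similarity and VQA scoring functions are hypothetical placeholders (a CLIP-style text-image scorer, a sentence-similarity model, and a VQA consistency check would be plugged in), and the thresholds are purely illustrative.

```python
from typing import Callable, List, Tuple

def build_image_augmented_dialogue(
    utterances: List[str],
    image_pool: List[str],                           # image ids / paths
    text_image_score: Callable[[str, str], float],   # stage 1 + 2: text-image relevance
    sentence_sim: Callable[[str, str], float],       # stage 1: coherence with context
    vqa_consistent: Callable[[str, str], bool],      # stage 2: VQA-based filter
    img_threshold: float = 0.3,
    sim_threshold: float = 0.5,
) -> List[Tuple[str, str]]:
    """Return (utterance_or_empty, image_or_empty) pairs forming the new dialogue."""
    result = []
    for i, utt in enumerate(utterances):
        prev = utterances[i - 1] if i > 0 else ""
        # Stage 1: is this utterance "imageable" and still tied to the context?
        best_img = max(image_pool, key=lambda img: text_image_score(utt, img))
        replaceable = (text_image_score(utt, best_img) >= img_threshold
                       and sentence_sim(utt, prev) >= sim_threshold)
        # Stage 2: keep the replacement only if the VQA model agrees the image
        # actually conveys the utterance's content.
        if replaceable and vqa_consistent(utt, best_img):
            result.append(("", best_img))   # utterance replaced by an image turn
        else:
            result.append((utt, ""))        # keep the original text turn
    return result

# Toy usage with dummy scorers (real ones would wrap CLIP / a sentence encoder / a VQA model).
dialog = ["hi", "show me a brown leather sofa", "sure, here are some options"]
pairs = build_image_augmented_dialogue(
    dialog, ["sofa.jpg", "lamp.jpg"],
    text_image_score=lambda t, i: 0.9 if "sofa" in t and "sofa" in i else 0.1,
    sentence_sim=lambda a, b: 0.8,
    vqa_consistent=lambda t, i: True)
print(pairs)  # the second utterance is replaced by "sofa.jpg"
```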
arXiv Detail & Related papers (2023-05-17T18:38:10Z)
- MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action [96.33509740612486]
MM-REACT is a system paradigm that integrates ChatGPT with a pool of vision experts to achieve multimodal reasoning and action.
MM-REACT's prompt design allows language models to accept, associate, and process multimodal information.
arXiv Detail & Related papers (2023-03-20T18:31:47Z)
- Modeling Coreference Relations in Visual Dialog [18.926582410644375]
The occurrence of coreference relations in the dialog makes it a more challenging task than visual question answering.
We propose two soft constraints that can improve the model's ability to resolve coreferences in dialog in an unsupervised way.
arXiv Detail & Related papers (2022-03-06T15:22:24Z)
- Representation Learning for Conversational Data using Discourse Mutual Information Maximization [9.017156603976915]
We argue that the structure-unaware word-by-word generation is not suitable for effective conversation modeling.
We propose a structure-aware Mutual Information based loss function, DMI, for training dialog-representation models.
Our models show the most promising performance on the dialog evaluation task DailyDialog++, in both random and adversarial negative scenarios.
arXiv Detail & Related papers (2021-12-04T13:17:07Z)
- Modeling Text-visual Mutual Dependency for Multi-modal Dialog Generation [35.45552689723718]
We propose frameworks to resolve a specific case of multi-modal dialog generation in the real world.
Specifically, we propose to model the mutual dependency between text-visual features.
We observe significant performance boosts over vanilla models when the mutual dependency between text and visual features is modeled.
arXiv Detail & Related papers (2021-05-30T07:20:28Z)
- Variational Hierarchical Dialog Autoencoder for Dialog State Tracking Data Augmentation [59.174903564894954]
In this work, we extend this approach to the task of dialog state tracking for goal-oriented dialogs.
We propose the Variational Hierarchical Dialog Autoencoder (VHDA) for modeling the complete aspects of goal-oriented dialogs.
Experiments on various dialog datasets show that our model improves the downstream dialog trackers' robustness via generative data augmentation.
arXiv Detail & Related papers (2020-01-23T15:34:56Z)
- Modality-Balanced Models for Visual Dialogue [102.35406085738325]
The Visual Dialog task requires a model to exploit both image and conversational context information to generate the next response to the dialogue.
We show that previous joint-modality (history and image) models over-rely on the dialogue history and are more prone to memorizing it.
We present methods for integrating a joint-modality model with an image-only model, via ensemble and consensus dropout fusion with shared parameters.
arXiv Detail & Related papers (2020-01-17T14:57:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.