Multi-Modal Open-Domain Dialogue
- URL: http://arxiv.org/abs/2010.01082v1
- Date: Fri, 2 Oct 2020 16:20:39 GMT
- Title: Multi-Modal Open-Domain Dialogue
- Authors: Kurt Shuster, Eric Michael Smith, Da Ju, Jason Weston
- Abstract summary: Recent work in open-domain conversational agents has demonstrated that significant improvements in model engagingness and humanness metrics can be achieved via massive scaling.
We investigate combining components from state-of-the-art open-domain dialogue agents with those from state-of-the-art vision models.
We show that our best resulting model outperforms strong existing models in multi-modal dialogue while performing as well as its text-only predecessor, BlenderBot, in text-based conversation.
- Score: 28.69395893943413
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent work in open-domain conversational agents has demonstrated that
significant improvements in model engagingness and humanness metrics can be
achieved via massive scaling in both pre-training data and model size
(Adiwardana et al., 2020; Roller et al., 2020). However, if we want to build
agents with human-like abilities, we must expand beyond handling just text. A
particularly important topic is the ability to see images and communicate about
what is perceived. With the goal of engaging humans in multi-modal dialogue, we
investigate combining components from state-of-the-art open-domain dialogue
agents with those from state-of-the-art vision models. We study incorporating
different image fusion schemes and domain-adaptive pre-training and fine-tuning
strategies, and show that our best resulting model outperforms strong existing
models in multi-modal dialogue while simultaneously performing as well as its
predecessor (text-only) BlenderBot (Roller et al., 2020) in text-based
conversation. We additionally investigate and incorporate safety components in
our final model, and show that such efforts do not diminish model performance
with respect to engagingness metrics.
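The abstract above mentions studying different image fusion schemes for combining a vision encoder with a Transformer dialogue model. Below is a minimal late-fusion sketch in PyTorch: image features (assumed here to be pre-computed by a separate vision encoder) are projected into the token embedding space and prepended to the text sequence before a shared Transformer encoder. The class name, dimensions, and hyperparameters are illustrative assumptions, not the authors' implementation; the abstract indicates the paper compares several fusion choices empirically.

```python
# A minimal late-fusion sketch, assuming a pre-computed image feature vector
# (e.g., from a separate ResNet/ViT-style encoder) and a small Transformer text
# encoder. Module names and dimensions are illustrative, not the paper's code.
import torch
import torch.nn as nn


class LateFusionDialogueEncoder(nn.Module):
    def __init__(self, vocab_size: int = 8000, d_model: int = 256,
                 image_dim: int = 2048, n_layers: int = 2, n_heads: int = 4):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # Project image features into the token embedding space.
        self.image_proj = nn.Linear(image_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, token_ids: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len); image_feats: (batch, image_dim)
        text = self.token_emb(token_ids)                 # (batch, seq_len, d_model)
        img = self.image_proj(image_feats).unsqueeze(1)  # (batch, 1, d_model)
        fused = torch.cat([img, text], dim=1)            # prepend the image "token"
        return self.encoder(fused)                       # (batch, seq_len + 1, d_model)


# Usage sketch with random inputs.
model = LateFusionDialogueEncoder()
tokens = torch.randint(0, 8000, (2, 16))
image_features = torch.randn(2, 2048)
print(model(tokens, image_features).shape)  # torch.Size([2, 17, 256])
```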
Related papers
- Body of Her: A Preliminary Study on End-to-End Humanoid Agent [0.8702432681310401]
We propose a real-time, duplex, interactive end-to-end network capable of modeling realistic agent behaviors.
This work performs a preliminary exploration of the end-to-end approach in this field, aiming to inspire further research towards scaling up.
arXiv Detail & Related papers (2024-08-06T01:13:09Z)
- Using Game Play to Investigate Multimodal and Conversational Grounding in Large Multimodal Models [14.878276985702685]
In this paper, we bring a recently developed evaluation paradigm from text models to multimodal models.
We define games that challenge a model's capability to represent a situation from visual information and align such representations through dialogue.
We find that the largest closed models perform rather well on the games that we define, while even the best open-weight models struggle with them.
arXiv Detail & Related papers (2024-06-20T06:56:19Z)
- An Interactive Agent Foundation Model [49.77861810045509]
We propose an Interactive Agent Foundation Model that uses a novel multi-task agent training paradigm for training AI agents.
Our training paradigm unifies diverse pre-training strategies, including visual masked auto-encoders, language modeling, and next-action prediction.
We demonstrate the performance of our framework across three separate domains -- Robotics, Gaming AI, and Healthcare.
arXiv Detail & Related papers (2024-02-08T18:58:02Z)
- TOD-Flow: Modeling the Structure of Task-Oriented Dialogues [77.15457469745364]
We propose a novel approach focusing on inferring the TOD-Flow graph from dialogue data annotated with dialog acts.
The inferred TOD-Flow graph can be easily integrated with any dialogue model to improve its prediction performance, transparency, and controllability.
arXiv Detail & Related papers (2023-12-07T20:06:23Z)
- Transferring Foundation Models for Generalizable Robotic Manipulation [82.12754319808197]
We propose a novel paradigm that effectively leverages the language-reasoning segmentation masks generated by internet-scale foundation models.
Our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning.
Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
arXiv Detail & Related papers (2023-06-09T07:22:12Z)
- PaLM-E: An Embodied Multimodal Language Model [101.29116156731762]
We propose embodied language models to incorporate real-world continuous sensor modalities into language models.
We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks.
Our largest model, PaLM-E-562B with 562B parameters, is a visual-language generalist with state-of-the-art performance on OK-VQA.
arXiv Detail & Related papers (2023-03-06T18:58:06Z)
- A Probabilistic Model Of Interaction Dynamics for Dyadic Face-to-Face Settings [1.9544213396776275]
We develop a probabilistic model to capture the interaction dynamics between pairs of participants in a face-to-face setting.
This interaction encoding is then used to influence the generation when predicting one agent's future dynamics.
We show that our model successfully delineates between the modes, based on their interacting dynamics.
arXiv Detail & Related papers (2022-07-10T23:31:27Z)
- DIME: Fine-grained Interpretations of Multimodal Models via Disentangled Local Explanations [119.1953397679783]
We focus on advancing the state-of-the-art in interpreting multimodal models.
Our proposed approach, DIME, enables accurate and fine-grained analysis of multimodal models.
arXiv Detail & Related papers (2022-03-03T20:52:47Z)
- Building Goal-Oriented Dialogue Systems with Situated Visual Context [12.014793558784955]
With the surge of virtual assistants with screens, the next generation of agents is required to understand screen context.
We propose a novel multimodal conversational framework in which the dialogue agent's next action and its arguments are derived jointly, conditioned on both the conversational and the visual context (a hedged sketch of this joint conditioning appears after this list).
Our model can recognize visual features such as color and shape, as well as metadata-based features such as the price or star rating associated with a visual entity.
arXiv Detail & Related papers (2021-11-22T23:30:52Z)
- Modality-Balanced Models for Visual Dialogue [102.35406085738325]
The Visual Dialog task requires a model to exploit both image and conversational context to generate the next response in the dialogue.
We show that previous joint-modality (history and image) models over-rely on the dialogue history and are more prone to memorizing it.
We present methods for this integration of the two models, via ensemble and consensus dropout fusion with shared parameters.
arXiv Detail & Related papers (2020-01-17T14:57:12Z)
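As a companion to the situated-visual-context entry above, here is a hypothetical sketch of deriving a dialogue agent's next action jointly from conversational and visual context: a text encoder summarizes the dialogue, entity features (e.g., color, shape, price, star rating) are projected into the same space, and a joint head predicts action logits. All module names, feature choices, and dimensions are assumptions for illustration, not that paper's implementation.

```python
# A hypothetical sketch of jointly conditioning next-action prediction on
# conversational and visual context. Names, features, and sizes are assumptions.
import torch
import torch.nn as nn


class SituatedActionPredictor(nn.Module):
    def __init__(self, vocab_size: int = 8000, d_model: int = 128,
                 n_visual_feats: int = 6, n_actions: int = 12):
        super().__init__()
        # Dialogue context encoder (a simple GRU over token ids).
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.text_enc = nn.GRU(d_model, d_model, batch_first=True)
        # Visual/metadata features of the on-screen entity
        # (e.g., color id, shape id, price, star rating), projected to d_model.
        self.visual_proj = nn.Linear(n_visual_feats, d_model)
        # Joint head over [text; visual] for the next dialogue action.
        self.action_head = nn.Linear(2 * d_model, n_actions)

    def forward(self, token_ids: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        emb = self.token_emb(token_ids)            # (batch, seq, d_model)
        _, h = self.text_enc(emb)                  # h: (1, batch, d_model)
        text_vec = h.squeeze(0)                    # (batch, d_model)
        vis_vec = self.visual_proj(visual_feats)   # (batch, d_model)
        joint = torch.cat([text_vec, vis_vec], dim=-1)
        return self.action_head(joint)             # (batch, n_actions) action logits


# Usage sketch with random inputs.
model = SituatedActionPredictor()
tokens = torch.randint(0, 8000, (2, 10))
visual = torch.randn(2, 6)
print(model(tokens, visual).shape)  # torch.Size([2, 12])
```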