Modeling Text-visual Mutual Dependency for Multi-modal Dialog Generation
- URL: http://arxiv.org/abs/2105.14445v1
- Date: Sun, 30 May 2021 07:20:28 GMT
- Title: Modeling Text-visual Mutual Dependency for Multi-modal Dialog Generation
- Authors: Shuhe Wang, Yuxian Meng, Xiaofei Sun, Fei Wu, Rongbin Ouyang, Rui Yan, Tianwei Zhang, Jiwei Li
- Abstract summary: We propose frameworks to resolve a specific case of multi-modal dialog generation in the real world.
Specifically, we propose to model the mutual dependency between text-visual features.
We observe significant performance boosts over vanilla models when the mutual dependency between text and visual features is modeled.
- Score: 35.45552689723718
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-modal dialog modeling is of growing interest. In this work, we propose
frameworks to resolve a specific case of multi-modal dialog generation that
better mimics multi-modal dialog generation in the real world, where each
dialog turn is associated with the visual context in which it takes place.
Specifically, we propose to model the mutual dependency between text-visual
features, where the model not only needs to learn the probability of generating
the next dialog utterance given preceding dialog utterances and visual
contexts, but also the probability of predicting the visual features in which a
dialog utterance takes place, which makes the generated dialog utterance specific
to the visual context. We observe significant performance boosts over vanilla
models when the mutual dependency between text and visual features is modeled.
Code is available at https://github.com/ShannonAI/OpenViDial.
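The two probabilities described in the abstract can be illustrated with a small reranking sketch. The abstract does not say how the forward probability (utterance given preceding utterances and visual context) and the backward probability (visual features given the utterance) are combined, so the linear interpolation, the `weight` parameter, the function names, and the candidate scores below are all assumptions made for illustration, not the authors' implementation.

```python
def mutual_dependency_score(log_p_forward, log_p_backward, weight=0.5):
    """Interpolate two log-probabilities (hypothetical scoring rule).

    log_p_forward:  log p(utterance | preceding utterances, visual context)
    log_p_backward: log p(visual features | utterance)
    weight:         assumed trade-off; 0.0 ignores the backward term
    """
    return (1.0 - weight) * log_p_forward + weight * log_p_backward


def rerank(candidates, weight=0.5):
    """Return candidates sorted from highest to lowest combined score."""
    return sorted(
        candidates,
        key=lambda c: mutual_dependency_score(
            c["log_p_forward"], c["log_p_backward"], weight
        ),
        reverse=True,
    )


# Made-up scores: a generic reply is likely under the forward model but says
# little about the scene; a scene-specific reply predicts the visual context well.
candidates = [
    {"text": "I don't know.", "log_p_forward": -1.0, "log_p_backward": -6.0},
    {"text": "Pass me the red umbrella.", "log_p_forward": -2.0, "log_p_backward": -1.5},
]

best_with_backward = rerank(candidates, weight=0.5)[0]["text"]
best_forward_only = rerank(candidates, weight=0.0)[0]["text"]
```

With these made-up numbers, the forward-only ranking prefers the generic reply, while adding the backward term prefers the reply tied to the visual context, which is the effect the abstract attributes to modeling the mutual dependency.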
Related papers
- Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation [55.043492250775294]
We introduce a novel Face-to-Face spoken dialogue model.
It processes audio-visual speech from user input and generates audio-visual speech as the response.
We also introduce MultiDialog, the first large-scale multimodal spoken dialogue corpus.
arXiv Detail & Related papers (2024-06-12T04:48:36Z)
- Contextual Data Augmentation for Task-Oriented Dialog Systems [8.085645180329417]
We develop a novel dialog augmentation model that generates a user turn, conditioning on full dialog context.
With a new prompt design for language model, and output re-ranking, the dialogs generated from our model can be directly used to train downstream dialog systems.
arXiv Detail & Related papers (2023-10-16T13:22:34Z)
- SPACE-3: Unified Dialog Model Pre-training for Task-Oriented Dialog Understanding and Generation [123.37377363355363]
SPACE-3 is a novel unified semi-supervised pre-trained conversation model learning from large-scale dialog corpora.
It can be effectively fine-tuned on a wide range of downstream dialog tasks.
Results show that SPACE-3 achieves state-of-the-art performance on eight downstream dialog benchmarks.
arXiv Detail & Related papers (2022-09-14T14:17:57Z)
- Modeling Coreference Relations in Visual Dialog [18.926582410644375]
The occurrence of coreference relations in the dialog makes it a more challenging task than visual question-answering.
We propose two soft constraints that can improve the model's ability to resolve coreferences in dialog in an unsupervised way.
arXiv Detail & Related papers (2022-03-06T15:22:24Z)
- Building Goal-Oriented Dialogue Systems with Situated Visual Context [12.014793558784955]
With the surge of virtual assistants with screens, the next generation of agents is required to understand screen context.
We propose a novel multimodal conversational framework, where the dialogue agent's next action and its arguments are derived jointly, conditioned on both the conversational and the visual context.
Our model can recognize visual features such as color and shape as well as metadata-based features such as price or star rating associated with a visual entity.
arXiv Detail & Related papers (2021-11-22T23:30:52Z)
- OpenViDial 2.0: A Larger-Scale, Open-Domain Dialogue Generation Dataset with Visual Contexts [20.37658842432543]
We release OpenViDial 2.0, a larger-scale open-domain multi-modal dialogue dataset.
OpenViDial 2.0 contains a total of 5.6 million dialogue turns extracted from either movies or TV series.
arXiv Detail & Related papers (2021-09-27T02:10:29Z)
- OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual Contexts [35.57757367869986]
We release OpenViDial, a large-scale multi-module dialogue dataset.
OpenViDial contains a total of 1.1 million dialogue turns.
We propose a family of encoder-decoder models leveraging both textual and visual contexts.
arXiv Detail & Related papers (2020-12-30T03:02:50Z)
- VD-BERT: A Unified Vision and Dialog Transformer with BERT [161.0016161052714]
We propose VD-BERT, a simple yet effective framework of unified vision-dialog Transformer.
We adapt BERT for the effective fusion of vision and dialog contents via visually grounded training.
Our model yields a new state of the art, achieving the top position in both single-model and ensemble settings.
arXiv Detail & Related papers (2020-04-28T04:08:46Z)
- Paraphrase Augmented Task-Oriented Dialog Generation [68.1790912977053]
We propose a paraphrase augmented response generation (PARG) framework that jointly trains a paraphrase model and a response generation model.
We also design a method to automatically construct paraphrase training data set based on dialog state and dialog act labels.
arXiv Detail & Related papers (2020-04-16T05:12:36Z)
- Conversation Learner -- A Machine Teaching Tool for Building Dialog Managers for Task-Oriented Dialog Systems [57.082447660944965]
Conversation Learner is a machine teaching tool for building dialog managers.
It enables dialog authors to create a dialog flow using familiar tools, converting the dialog flow into a parametric model.
It allows dialog authors to improve the dialog manager over time by leveraging user-system dialog logs as training data.
arXiv Detail & Related papers (2020-04-09T00:10:54Z)
- Variational Hierarchical Dialog Autoencoder for Dialog State Tracking Data Augmentation [59.174903564894954]
In this work, we extend this approach to the task of dialog state tracking for goal-oriented dialogs.
We propose the Variational Hierarchical Dialog Autoencoder (VHDA) for modeling the complete aspects of goal-oriented dialogs.
Experiments on various dialog datasets show that our model improves the downstream dialog trackers' robustness via generative data augmentation.
arXiv Detail & Related papers (2020-01-23T15:34:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.