VD-BERT: A Unified Vision and Dialog Transformer with BERT
- URL: http://arxiv.org/abs/2004.13278v3
- Date: Mon, 2 Nov 2020 09:07:41 GMT
- Title: VD-BERT: A Unified Vision and Dialog Transformer with BERT
- Authors: Yue Wang, Shafiq Joty, Michael R. Lyu, Irwin King, Caiming Xiong,
Steven C.H. Hoi
- Abstract summary: We propose VD-BERT, a simple yet effective framework of unified vision-dialog Transformer.
We adapt BERT for the effective fusion of vision and dialog contents via visually grounded training.
Our model yields new state of the art, achieving the top position in both single-model and ensemble settings.
- Score: 161.0016161052714
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Visual dialog is a challenging vision-language task, where a dialog agent
needs to answer a series of questions through reasoning on the image content
and dialog history. Prior work has mostly focused on various attention
mechanisms to model such intricate interactions. By contrast, in this work, we
propose VD-BERT, a simple yet effective framework of unified vision-dialog
Transformer that leverages the pretrained BERT language models for Visual
Dialog tasks. The model is unified in that (1) it captures all the interactions
between the image and the multi-turn dialog using a single-stream Transformer
encoder, and (2) it supports both answer ranking and answer generation
seamlessly through the same architecture. More crucially, we adapt BERT for the
effective fusion of vision and dialog contents via visually grounded training.
Without the need of pretraining on external vision-language data, our model
yields new state of the art, achieving the top position in both single-model
and ensemble settings (74.54 and 75.35 NDCG scores) on the visual dialog
leaderboard. Our code and pretrained models are released at
https://github.com/salesforce/VD-BERT.
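The abstract describes a single-stream encoder that consumes image region features and the multi-turn dialog together, with the same network serving both answer ranking and answer generation. Below is a minimal, hypothetical sketch of that input layout and the two heads; the sizes, module names, and region-feature format are illustrative assumptions, not the released implementation (see the GitHub link above for the real code).
```python
# Hypothetical sketch of a single-stream vision-dialog encoder in the spirit
# of VD-BERT. Sizes, module names, and the region-feature format are
# illustrative assumptions; the released code is at the GitHub link above.
import torch
import torch.nn as nn

class UnifiedVisionDialogEncoder(nn.Module):
    def __init__(self, vocab_size=30522, hidden=256, n_layers=4, n_heads=4, img_feat_dim=2048):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden)        # dialog/question/answer tokens
        self.img_proj = nn.Linear(img_feat_dim, hidden)        # detected-region features -> hidden size
        self.seg_emb = nn.Embedding(2, hidden)                  # segment 0 = image, segment 1 = text
        layer = nn.TransformerEncoderLayer(hidden, n_heads, dim_feedforward=4 * hidden, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)   # one stream attends over both modalities
        self.token_head = nn.Linear(hidden, vocab_size)         # masked-token prediction / generation
        self.rank_head = nn.Linear(hidden, 2)                   # NSP-style head scoring a candidate answer

    def forward(self, img_feats, token_ids):
        # img_feats: (B, regions, img_feat_dim); token_ids: (B, T) encoding
        # "[CLS] caption [SEP] Q1 A1 ... Qt [SEP] candidate answer [SEP]"
        img = self.img_proj(img_feats) + self.seg_emb.weight[0]
        txt = self.tok_emb(token_ids) + self.seg_emb.weight[1]
        h = self.encoder(torch.cat([img, txt], dim=1))
        answer_score = self.rank_head(h[:, img.size(1)])        # state at the [CLS] position
        token_logits = self.token_head(h[:, img.size(1):])      # logits for every text position
        return answer_score, token_logits

model = UnifiedVisionDialogEncoder()
score, logits = model(torch.randn(2, 36, 2048), torch.randint(0, 30522, (2, 40)))
print(score.shape, logits.shape)  # torch.Size([2, 2]) torch.Size([2, 40, 30522])
```
In this sketch the ranking path scores a candidate answer from the [CLS]-position state, while generation would reuse the token-prediction head with a causal attention mask over the answer span; positional embeddings and that masking are omitted for brevity.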
Related papers
- Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models [60.81438804824749]
Multimodal instruction-following models extend the capabilities of large language models by integrating both text and images.
Existing models such as MiniGPT-4 and LLaVA face challenges in maintaining dialogue coherence in scenarios involving multiple images.
We introduce SparklesDialogue, the first machine-generated dialogue dataset tailored for word-level interleaved multi-image and text interactions.
We then present SparklesChat, a multimodal instruction-following model for open-ended dialogues across multiple images.
arXiv Detail & Related papers (2023-08-31T05:15:27Z)
- IMAD: IMage-Augmented multi-modal Dialogue [0.043847653914745384]
This paper presents a novel perspective on multi-modal dialogue systems, which interprets the image in the context of the dialogue.
We propose a two-stage approach to automatically construct a multi-modal dialogue dataset.
In the first stage, we utilize text-to-image similarity and sentence similarity to identify which utterances could be replaced with an image.
In the second stage, we replace those utterances by selecting a subset of relevant images and filtering them with a visual question answering model.
arXiv Detail & Related papers (2023-05-17T18:38:10Z)
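The two-stage construction summarized in the IMAD entry above lends itself to a small illustration. In the sketch below the sentence/image encoders and the VQA filter are random or trivial placeholders, and the threshold and file names are made up; a real pipeline would plug in trained models.
```python
# Toy illustration of a two-stage utterance-to-image replacement. The encoders
# and the VQA filter are placeholder stand-ins; the threshold and file names
# are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def text_embed(utterance): return rng.standard_normal(16)   # stand-in sentence encoder
def image_embed(path):     return rng.standard_normal(16)   # stand-in image encoder
def cos(a, b): return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def vqa_consistent(image_path, utterance):
    # Stage-2 placeholder: a real system would ask a VQA model whether the
    # image actually supports the utterance and keep only consistent pairs.
    return cos(image_embed(image_path), text_embed(utterance)) > 0.0

def augment_dialogue(dialogue, image_bank, t2i_threshold=0.3):
    augmented = []
    for utt in dialogue:
        # Stage 1: is this utterance similar enough to some image to be replaced?
        score, best_img = max((cos(text_embed(utt), image_embed(img)), img) for img in image_bank)
        # Stage 2: keep the replacement only if the VQA-style filter agrees.
        if score > t2i_threshold and vqa_consistent(best_img, utt):
            augmented.append(("<image>", best_img))
        else:
            augmented.append(("<text>", utt))
    return augmented

print(augment_dialogue(["look at my new puppy!", "how old is she?"],
                       ["img_001.jpg", "img_002.jpg"]))
```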
- DialogVED: A Pre-trained Latent Variable Encoder-Decoder Model for Dialog Response Generation [80.45816053153722]
DialogVED introduces continuous latent variables into the enhanced encoder-decoder pre-training framework to increase the relevance and diversity of responses.
We conduct experiments on PersonaChat, DailyDialog, and DSTC7-AVSD benchmarks for response generation.
arXiv Detail & Related papers (2022-04-27T16:18:15Z)
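As a rough illustration of the "continuous latent variables in an encoder-decoder" idea in the DialogVED entry above, the sketch below encodes the dialog context, samples a latent vector with the reparameterization trick, and lets it seed the decoder. DialogVED itself is a pre-trained Transformer encoder-decoder; the GRUs, sizes, and loss handling here are simplifying assumptions.
```python
# Toy latent-variable encoder-decoder: encode the context, sample a latent
# vector with the reparameterization trick, and let it initialize the decoder.
# GRUs and sizes are simplifying assumptions, not DialogVED's architecture.
import torch
import torch.nn as nn

class LatentSeq2Seq(nn.Module):
    def __init__(self, vocab=1000, hidden=64, z_dim=16):
        super().__init__()
        self.emb = nn.Embedding(vocab, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.to_mu, self.to_logvar = nn.Linear(hidden, z_dim), nn.Linear(hidden, z_dim)
        self.z_to_h = nn.Linear(z_dim, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, context_ids, response_ids):
        _, h = self.encoder(self.emb(context_ids))                  # encode the dialog context
        mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)     # reparameterization trick
        dec_out, _ = self.decoder(self.emb(response_ids),
                                  torch.tanh(self.z_to_h(z)).unsqueeze(0))  # latent seeds the decoder
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())       # KL regularizer on z
        return self.out(dec_out), kl

logits, kl = LatentSeq2Seq()(torch.randint(0, 1000, (2, 12)), torch.randint(0, 1000, (2, 8)))
print(logits.shape, float(kl))   # torch.Size([2, 8, 1000]) and a non-negative KL value
```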
- Modeling Coreference Relations in Visual Dialog [18.926582410644375]
The occurrence of coreference relations in the dialog makes it a more challenging task than visual question answering.
We propose two soft constraints that can improve the model's ability to resolve coreferences in dialog in an unsupervised way.
arXiv Detail & Related papers (2022-03-06T15:22:24Z)
- VU-BERT: A Unified framework for Visual Dialog [34.4815433301286]
We propose a unified framework for image-text joint embedding, named VU-BERT, and apply patch projection to obtain vision embedding in visual dialog tasks.
The model is trained over two tasks: masked language modeling and next utterance retrieval.
arXiv Detail & Related papers (2022-02-22T10:20:14Z)
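The "patch projection" mentioned in the VU-BERT entry above can be illustrated in a few lines: the image is split into fixed-size patches and each flattened patch is linearly projected into the same space as the text embeddings. The patch size, image size, and widths below are assumptions, not the paper's settings.
```python
# Patch projection in a few lines: split the image into fixed-size patches and
# linearly project each flattened patch into the text embedding space.
# Patch size, image size, and widths are assumptions.
import torch
import torch.nn as nn

patch, dim = 16, 256
to_patches = nn.Unfold(kernel_size=patch, stride=patch)       # (B, 3, H, W) -> (B, 3*patch*patch, N)
project = nn.Linear(3 * patch * patch, dim)

image = torch.randn(1, 3, 224, 224)
patches = to_patches(image).transpose(1, 2)                   # (1, 196, 768): one row per patch
vision_tokens = project(patches)                              # (1, 196, 256)
text_tokens = nn.Embedding(30522, dim)(torch.randint(0, 30522, (1, 40)))
joint_input = torch.cat([vision_tokens, text_tokens], dim=1)  # single sequence for the Transformer
print(joint_input.shape)                                      # torch.Size([1, 236, 256])
```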
- Modeling Text-visual Mutual Dependency for Multi-modal Dialog Generation [35.45552689723718]
We propose frameworks to resolve a specific case of multi-modal dialog generation in the real world.
Specifically, we propose to model the mutual dependency between text-visual features.
We observe significant performance boosts over vanilla models when the mutual dependency between text and visual features is modeled.
arXiv Detail & Related papers (2021-05-30T07:20:28Z)
- The Adapter-Bot: All-In-One Controllable Conversational Model [66.48164003532484]
We propose a dialogue model that uses a fixed backbone model such as DialoGPT and triggers on-demand dialogue skills via different adapters.
Depending on the skills, the model is able to process multiple knowledge types, such as text and tables, and to produce empathetic responses.
We evaluate our model with automatic metrics, comparing it against existing state-of-the-art conversational models.
arXiv Detail & Related papers (2020-08-28T10:59:31Z)
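The fixed-backbone-plus-adapters idea in the Adapter-Bot entry above can be sketched with a residual bottleneck adapter per skill. The backbone below is a stand-in linear layer and the skill names are made up; only the chosen adapter's parameters would be trained for a given skill.
```python
# Residual bottleneck adapters on top of a frozen backbone. The backbone here
# is a stand-in linear layer and the skill names are hypothetical.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden=256, bottleneck=32):
        super().__init__()
        self.down, self.up = nn.Linear(hidden, bottleneck), nn.Linear(bottleneck, hidden)

    def forward(self, h):
        return h + self.up(torch.relu(self.down(h)))   # small residual correction to the backbone

backbone = nn.Linear(256, 256)                          # stands in for a frozen DialoGPT block
for p in backbone.parameters():
    p.requires_grad = False                             # the backbone stays fixed

adapters = nn.ModuleDict({"weather": Adapter(), "empathy": Adapter()})  # one adapter per skill

def respond(hidden_states, skill):
    # A dialogue manager would pick the skill on demand at inference time.
    return adapters[skill](backbone(hidden_states))

print(respond(torch.randn(1, 10, 256), "empathy").shape)  # torch.Size([1, 10, 256])
```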
- TOD-BERT: Pre-trained Natural Language Understanding for Task-Oriented Dialogue [113.45485470103762]
In this work, we unify nine human-human and multi-turn task-oriented dialogue datasets for language modeling.
To better model dialogue behavior during pre-training, we incorporate user and system tokens into the masked language modeling.
arXiv Detail & Related papers (2020-04-15T04:09:05Z)
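A small sketch of what "incorporating user and system tokens into masked language modeling", as in the TOD-BERT entry above, can look like at the preprocessing level: each turn is prefixed with a speaker token before the flattened dialogue is masked. The token strings and 15% masking rate follow common BERT practice and are assumptions here, not details taken from the paper.
```python
# Prepare masked-LM inputs with speaker tokens: every turn is prefixed with a
# [USR] or [SYS] token before the flattened dialogue is masked.
import random
random.seed(0)

def flatten_dialogue(turns):
    # turns: list of (speaker, utterance) with speaker "user" or "system"
    tokens = []
    for speaker, utterance in turns:
        tokens.append("[USR]" if speaker == "user" else "[SYS]")
        tokens.extend(utterance.lower().split())
    return tokens

def mask_tokens(tokens, rate=0.15):
    inputs, labels = [], []
    for tok in tokens:
        if not tok.startswith("[") and random.random() < rate:
            inputs.append("[MASK]")
            labels.append(tok)       # the model must recover the original word here
        else:
            inputs.append(tok)
            labels.append("-")       # "-" marks positions that are not predicted
    return inputs, labels

dialogue = [("user", "i need a cheap hotel in the centre"),
            ("system", "alexander b&b is a cheap guesthouse in the centre")]
inputs, labels = mask_tokens(flatten_dialogue(dialogue))
print(inputs)
print(labels)
```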
- Modality-Balanced Models for Visual Dialogue [102.35406085738325]
The Visual Dialog task requires a model to exploit both image and conversational context information to generate the next response to the dialogue.
We show that previous joint-modality (history and image) models over-rely on and are more prone to memorizing the dialogue history.
We present methods for integrating an image-only model with the joint model, via ensemble and consensus dropout fusion with shared parameters.
arXiv Detail & Related papers (2020-01-17T14:57:12Z)
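As a loose illustration of the fusion mentioned in the entry above, the sketch below averages the answer logits of a joint (image + history) model and an image-only model, and occasionally drops one stream during training. The drop rate, which stream is dropped, and the averaging scheme are assumptions rather than the paper's exact consensus dropout fusion.
```python
# Loose sketch of fusing two answer scorers' logits with an occasional dropout
# of one stream during training. Details are assumptions, not the paper's
# exact consensus dropout fusion.
import torch
import torch.nn as nn

class ConsensusDropoutFusion(nn.Module):
    def __init__(self, p_drop=0.3):
        super().__init__()
        self.p_drop = p_drop

    def forward(self, joint_logits, image_only_logits):
        if self.training and torch.rand(()) < self.p_drop:
            return joint_logits                           # sometimes silence the image-only stream
        return (joint_logits + image_only_logits) / 2     # otherwise average the two opinions

fusion = ConsensusDropoutFusion()
fusion.train()
joint = torch.randn(4, 100)        # scores over 100 candidate answers from the joint model
image_only = torch.randn(4, 100)   # scores from the image-only model
print(fusion(joint, image_only).shape)  # torch.Size([4, 100])
```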
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.