Teaching Text-to-Image Models to Communicate in Dialog
- URL: http://arxiv.org/abs/2309.15516v2
- Date: Thu, 8 Feb 2024 04:06:33 GMT
- Title: Teaching Text-to-Image Models to Communicate in Dialog
- Authors: Xiaowen Sun, Jiazhan Feng, Yuxuan Wang, Yuxuan Lai, Xingyu Shen,
Dongyan Zhao
- Abstract summary: In this paper, we focus on the innovative dialog-to-image generation task.
To tackle this problem, we design a tailored fine-tuning approach on top of state-of-the-art text-to-image generation models.
Our approach brings consistent and remarkable improvements across three state-of-the-art pre-trained text-to-image generation backbones.
- Score: 44.76942024105259
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A picture is worth a thousand words, thus, it is crucial for conversational
agents to understand, perceive, and effectively respond with pictures. However,
we find that directly employing conventional image generation techniques is
inadequate for conversational agents to produce image responses effectively. In
this paper, we focus on the innovative dialog-to-image generation task, where
the model synthesizes a high-resolution image aligned with the given dialog
context as a response. To tackle this problem, we design a tailored fine-tuning
approach on top of state-of-the-art text-to-image generation models to
fully exploit the structural and semantic features in the dialog context during
image generation. Concretely, we linearize the dialog context with specific
indicators to maintain the dialog structure, and employ in-domain data to
alleviate the style mismatch between dialog-to-image and conventional image
generation tasks. Empirical results on the PhotoChat and MMDialog corpora show that
our approach brings consistent and remarkable improvements across three
state-of-the-art pre-trained text-to-image generation backbones.
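The linearization step described in the abstract can be illustrated with a minimal sketch. The speaker tags and the `[SEP]` separator below are illustrative assumptions, not the paper's actual indicator tokens:

```python
# Minimal sketch: linearize a multi-turn dialog into a single text prompt
# for a text-to-image backbone, marking each turn with its speaker so the
# dialog structure survives flattening.
def linearize_dialog(turns):
    """Join (speaker, utterance) pairs into one prompt string."""
    parts = [f"[{speaker}] {utterance}" for speaker, utterance in turns]
    return " [SEP] ".join(parts)

dialog = [
    ("A", "We went hiking this weekend."),
    ("B", "Nice! Did you take any photos?"),
    ("A", "Yes, here is one of the mountain lake."),
]
prompt = linearize_dialog(dialog)
```

The resulting single string can then be fed to any text-to-image model as a conventional prompt, which is what makes fine-tuning on top of existing backbones possible.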
Related papers
- BI-MDRG: Bridging Image History in Multimodal Dialogue Response Generation [21.052101309555464]
Multimodal Dialogue Response Generation (MDRG) is a recently proposed task where the model needs to generate responses in texts, images, or a blend of both.
Previous work relies on the text modality as an intermediary step for both the image input and output of the model rather than adopting an end-to-end approach.
We propose BI-MDRG that bridges the response generation path such that the image history information is utilized for enhanced relevance of text responses to the image content.
arXiv Detail & Related papers (2024-08-12T05:22:42Z)
- Training-Free Consistent Text-to-Image Generation [80.4814768762066]
Text-to-image models struggle to consistently portray the same subject across diverse prompts.
Existing approaches fine-tune the model to teach it new words that describe specific user-provided subjects.
We present ConsiStory, a training-free approach that enables consistent subject generation by sharing the internal activations of the pretrained model.
arXiv Detail & Related papers (2024-02-05T18:42:34Z)
- Revisiting Conversation Discourse for Dialogue Disentanglement [88.3386821205896]
We propose enhancing dialogue disentanglement by taking full advantage of the dialogue discourse characteristics.
We develop a structure-aware framework to integrate the rich structural features for better modeling the conversational semantic context.
Our work has great potential to facilitate broader multi-party multi-thread dialogue applications.
arXiv Detail & Related papers (2023-06-06T19:17:47Z)
- IMAD: IMage-Augmented multi-modal Dialogue [0.043847653914745384]
This paper presents a novel perspective on multi-modal dialogue systems, which interprets the image in the context of the dialogue.
We propose a two-stage approach to automatically construct a multi-modal dialogue dataset.
In the first stage, we utilize text-to-image similarity and sentence similarity to identify which utterances could be replaced with an image.
In the second stage, we replace those utterances by selecting a subset of relevant images and filtering them with a visual question answering model.
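The first stage of IMAD's pipeline, flagging utterances that could be replaced with an image, can be sketched as a similarity threshold test. The `cosine` helper, the toy vectors, and the threshold value are assumptions standing in for real sentence-embedding and text-to-image similarity models:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def replaceable(utterance_vec, image_vec, threshold=0.8):
    """Stage 1: flag an utterance as image-replaceable when its embedding
    is sufficiently similar to a candidate image's embedding."""
    return cosine(utterance_vec, image_vec) >= threshold
```

Stage 2 then narrows the flagged candidates by selecting relevant images and filtering them with a visual question answering model, a step that requires an actual VQA system and is not sketched here.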
arXiv Detail & Related papers (2023-05-17T18:38:10Z)
- Channel-aware Decoupling Network for Multi-turn Dialogue Comprehension [81.47133615169203]
We propose compositional learning for holistic interaction across utterances beyond the sequential contextualization from PrLMs.
We employ domain-adaptive training strategies to help the model adapt to the dialogue domains.
Experimental results show that our method substantially boosts strong PrLM baselines on four public benchmark datasets.
arXiv Detail & Related papers (2023-01-10T13:18:25Z)
- Modeling Coreference Relations in Visual Dialog [18.926582410644375]
The occurrence of coreference relations in dialog makes it a more challenging task than visual question answering.
We propose two soft constraints that can improve the model's ability of resolving coreferences in dialog in an unsupervised way.
arXiv Detail & Related papers (2022-03-06T15:22:24Z)
- Multimodal Dialogue Response Generation [27.611204319057393]
We present a multimodal dialogue generation model, which takes the dialogue history as input, then generates a textual sequence or an image as response.
We consider multimodal dialogue generation under a natural assumption that only limited training examples are available.
In such a low-resource setting, we devise a novel conversational agent, Divter, in order to isolate parameters that depend on multimodal dialogues from the entire model.
arXiv Detail & Related papers (2021-10-16T08:52:26Z)
- Stylized Dialogue Response Generation Using Stylized Unpaired Texts [63.69880979112312]
This paper proposes a stylized dialogue generation method that can capture stylistic features embedded in unpaired texts.
Our method can produce dialogue responses that are both coherent to the given context and conform to the target style.
arXiv Detail & Related papers (2020-09-27T01:04:06Z)
- Controlling Dialogue Generation with Semantic Exemplars [55.460082747572734]
We present an Exemplar-based Dialogue Generation model, EDGE, that uses the semantic frames present in exemplar responses to guide generation.
We show that controlling dialogue generation based on the semantic frames of exemplars, rather than words in the exemplar itself, improves the coherence of generated responses.
arXiv Detail & Related papers (2020-08-20T17:02:37Z)
- Open Domain Dialogue Generation with Latent Images [43.78366219197779]
We propose learning a response generation model with both image-grounded dialogues and textual dialogues.
In the first scenario, image-grounded dialogues can be effectively augmented by textual dialogues with latent images.
In the second scenario, latent images can enrich the content of responses and at the same time keep them relevant to contexts.
arXiv Detail & Related papers (2020-04-04T17:32:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.