An End-to-End Model for Photo-Sharing Multi-modal Dialogue Generation
- URL: http://arxiv.org/abs/2408.08650v1
- Date: Fri, 16 Aug 2024 10:33:19 GMT
- Title: An End-to-End Model for Photo-Sharing Multi-modal Dialogue Generation
- Authors: Peiming Guo, Sinuo Liu, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Meishan Zhang, Min Zhang,
- Abstract summary: Photo-sharing multi-modal dialogue generation requires a dialogue agent not only to generate text responses but also to share photos at the proper moment.
A pipeline model integrates an image caption model, a text generation model, and an image generation model to handle this complex multi-modal task.
We propose the first end-to-end model for photo-sharing multi-modal dialogue generation, which integrates an image perceptron and an image generator with a large language model.
- Score: 43.139415423751615
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Photo-Sharing Multi-modal dialogue generation requires a dialogue agent not only to generate text responses but also to share photos at the proper moment. Using image text caption as the bridge, a pipeline model integrates an image caption model, a text generation model, and an image generation model to handle this complex multi-modal task. However, representing the images with text captions may loss important visual details and information and cause error propagation in the complex dialogue system. Besides, the pipeline model isolates the three models separately because discrete image text captions hinder end-to-end gradient propagation. We propose the first end-to-end model for photo-sharing multi-modal dialogue generation, which integrates an image perceptron and an image generator with a large language model. The large language model employs the Q-Former to perceive visual images in the input end. For image generation in the output end, we propose a dynamic vocabulary transformation matrix and use straight-through and gumbel-softmax techniques to align the large language model and stable diffusion model and achieve end-to-end gradient propagation. We perform experiments on PhotoChat and DialogCC datasets to evaluate our end-to-end model. Compared with pipeline models, the end-to-end model gains state-of-the-art performances on various metrics of text and image generation. More analysis experiments also verify the effectiveness of the end-to-end model for photo-sharing multi-modal dialogue generation.
Related papers
- Anywhere: A Multi-Agent Framework for Reliable and Diverse Foreground-Conditioned Image Inpainting [28.65445105418749]
We introduce Anywhere, a pioneering multi-agent framework designed to address challenges in inpainting foreground images.
Anywhere employs various agents such as Visual Language Model, Large Language Model, and image generation models.
It excels in foreground-conditioned image inpainting, mitigating "over-imagination", resolving foreground-background discrepancies, and enhancing diversity.
arXiv Detail & Related papers (2024-04-29T11:13:37Z) - MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer [106.79844459065828]
This paper presents MM-Interleaved, an end-to-end generative model for interleaved image-text data.
It introduces a multi-scale and multi-image feature synchronizer module, allowing direct access to fine-grained image features in the previous context.
Experiments demonstrate the versatility of MM-Interleaved in recognizing visual details following multi-modal instructions and generating consistent images following both textual and visual conditions.
arXiv Detail & Related papers (2024-01-18T18:50:16Z) - Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z) - Grounding Language Models to Images for Multimodal Inputs and Outputs [89.30027812161686]
We propose an efficient method to ground pretrained text-only language models to the visual domain.
We process arbitrarily interleaved image-and-text data, and generate text interleaved with retrieved images.
arXiv Detail & Related papers (2023-01-31T18:33:44Z) - On Advances in Text Generation from Images Beyond Captioning: A Case
Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z) - Multimodal Dialogue Response Generation [27.611204319057393]
We present a multimodal dialogue generation model, which takes the dialogue history as input, then generates a textual sequence or an image as response.
We consider multimodal dialogue generation under a natural assumption that only limited training examples are available.
In such a low-resource setting, we devise a novel conversational agent, Divter, in order to isolate parameters that depend on multimodal dialogues from the entire model.
arXiv Detail & Related papers (2021-10-16T08:52:26Z) - Modality-Balanced Models for Visual Dialogue [102.35406085738325]
The Visual Dialog task requires a model to exploit both image and conversational context information to generate the next response to the dialogue.
We show that previous joint-modality (history and image) models over-rely on and are more prone to memorizing the dialogue history.
We present methods for this integration of the two models, via ensemble and consensus dropout fusion with shared parameters.
arXiv Detail & Related papers (2020-01-17T14:57:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.