M-VADER: A Model for Diffusion with Multimodal Context
- URL: http://arxiv.org/abs/2212.02936v2
- Date: Wed, 7 Dec 2022 09:11:18 GMT
- Title: M-VADER: A Model for Diffusion with Multimodal Context
- Authors: Samuel Weinbach, Marco Bellagente, Constantin Eichenberg, Andrew Dai,
Robert Baldock, Souradeep Nanda, Björn Deiseroth, Koen Oostermeijer, Hannah
Teufel, Andres Felipe Cruz-Salinas
- Abstract summary: We show how M-VADER enables the generation of images specified using combinations of image and text.
We introduce an embedding model closely related to a vision-language model.
- Score: 0.786460153386845
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce M-VADER: a diffusion model (DM) for image generation where the
output can be specified using arbitrary combinations of images and text. We
show how M-VADER enables the generation of images specified using combinations
of image and text, and combinations of multiple images. Previously, a number of
successful DM image generation algorithms have been introduced that make it
possible to specify the output image using a text prompt. Inspired by the
success of those models, and led by the notion that language was already
developed to describe the elements of visual contexts that humans find most
important, we introduce an embedding model closely related to a vision-language
model. Specifically, we introduce the embedding model S-MAGMA: a 13 billion
parameter multimodal decoder combining components from an autoregressive
vision-language model MAGMA and biases finetuned for semantic search.
Related papers
- MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model [49.931663904599205]
MaVEn is an innovative framework designed to enhance the capabilities of Multimodal Large Language Models (MLLMs) in multi-image reasoning.
We show that MaVEn significantly enhances MLLMs' understanding in complex multi-image scenarios, while also improving performance in single-image contexts.
(2024-08-22T11:57:16Z)
- An End-to-End Model for Photo-Sharing Multi-modal Dialogue Generation [43.139415423751615]
Photo-sharing multi-modal dialogue generation requires a dialogue agent not only to generate text responses but also to share photos at the proper moment.
A pipeline model integrates an image caption model, a text generation model, and an image generation model to handle this complex multi-modal task.
We propose the first end-to-end model for photo-sharing multi-modal dialogue generation, which integrates an image perceptron and an image generator with a large language model.
(2024-08-16T10:33:19Z)
- Many-to-many Image Generation with Auto-regressive Diffusion Models [59.5041405824704]
This paper introduces a domain-general framework for many-to-many image generation, capable of producing interrelated image series from a given set of images.
We present MIS, a novel large-scale multi-image dataset, containing 12M synthetic multi-image samples, each with 25 interconnected images.
We learn M2M, an autoregressive model for many-to-many generation, where each image is modeled within a diffusion framework.
(2024-04-03T23:20:40Z)
- MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer [106.79844459065828]
This paper presents MM-Interleaved, an end-to-end generative model for interleaved image-text data.
It introduces a multi-scale and multi-image feature synchronizer module, allowing direct access to fine-grained image features in the previous context.
Experiments demonstrate the versatility of MM-Interleaved in recognizing visual details following multi-modal instructions and generating consistent images following both textual and visual conditions.
(2024-01-18T18:50:16Z)
- Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
(2023-05-26T19:22:03Z)
- SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models [56.88192537044364]
We propose a simple-yet-effective parameter-efficient fine-tuning approach called the Semantic Understanding and Reasoning adapter (SUR-adapter) for pre-trained diffusion models.
Our approach can make text-to-image diffusion models easier to use with better user experience.
(2023-05-09T05:48:38Z)
- Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation [63.061871048769596]
We present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint texts and images containing specified subjects as input sequences.
To be more specific, both input texts and images are encoded into one unified multi-modal latent space.
Our method is able to generate high-quality images with complex semantics from both aspects of input texts and images.
(2023-03-16T13:50:20Z)
- 3M: Multi-style image caption generation using Multi-modality features under Multi-UPDOWN model [8.069209836624495]
We propose the 3M model, a Multi-UPDOWN caption model that encodes multi-modality features and decodes them to captions.
We demonstrate the effectiveness of our model on generating human-like captions by examining its performance on two datasets.
(2021-03-20T14:12:13Z)
- Fusion Models for Improved Visual Captioning [18.016295296424413]
This paper proposes a generic multimodal model fusion framework for caption generation and emendation.
We employ the same fusion strategies to integrate a pretrained Masked Language Model (MLM) with a visual captioning model, viz. Show, Attend, and Tell.
Our caption emendation experiments on three benchmark image captioning datasets, viz. Flickr8k, Flickr30k, and MSCOCO, show improvements over the baseline.
(2020-10-28T21:55:25Z)