De-Diffusion Makes Text a Strong Cross-Modal Interface
- URL: http://arxiv.org/abs/2311.00618v1
- Date: Wed, 1 Nov 2023 16:12:40 GMT
- Title: De-Diffusion Makes Text a Strong Cross-Modal Interface
- Authors: Chen Wei, Chenxi Liu, Siyuan Qiao, Zhishuai Zhang, Alan Yuille, Jiahui Yu
- Abstract summary: We employ an autoencoder that uses a pre-trained text-to-image diffusion model for decoding.
Experiments validate the precision and comprehensiveness of De-Diffusion text representing images.
A single De-Diffusion model can generalize to provide transferable prompts for different text-to-image tools.
- Score: 33.90004746543745
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We demonstrate text as a strong cross-modal interface. Rather than relying on
deep embeddings to connect image and language as the interface representation,
our approach represents an image as text, from which we enjoy the
interpretability and flexibility inherent to natural language. We employ an
autoencoder that uses a pre-trained text-to-image diffusion model for decoding.
The encoder is trained to transform an input image into text, which is then fed
into the fixed text-to-image diffusion decoder to reconstruct the original
input -- a process we term De-Diffusion. Experiments validate both the
precision and comprehensiveness of De-Diffusion text representing images, such
that it can be readily ingested by off-the-shelf text-to-image tools and LLMs
for diverse multi-modal tasks. For example, a single De-Diffusion model can
generalize to provide transferable prompts for different text-to-image tools,
and also achieves a new state of the art on open-ended vision-language tasks by
simply prompting large language models with few-shot examples.
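To make the autoencoding loop concrete, below is a minimal PyTorch-style sketch of one De-Diffusion training step. Only the overall structure follows the abstract (a trainable image-to-text encoder, a frozen text-to-image diffusion decoder, and a denoising reconstruction objective); the Gumbel-softmax relaxation of the discrete tokens and the decoder's add_noise helper and num_timesteps attribute are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of one De-Diffusion training step (hypothetical names and
# interfaces; the relaxation and decoder helpers are assumptions for illustration).
import torch
import torch.nn.functional as F

def de_diffusion_step(image, image_to_text_encoder, frozen_diffusion_decoder,
                      token_embedding_table, optimizer, temperature=1.0):
    """Image -> text tokens -> frozen text-to-image decoder -> reconstruct the input."""
    # 1) Encode the image into logits over the text vocabulary, one per prompt position.
    token_logits = image_to_text_encoder(image)            # [B, num_tokens, vocab_size]

    # 2) Relax the discrete token choice so gradients can reach the encoder
    #    (a Gumbel-softmax-style trick, assumed here for illustration).
    soft_tokens = F.gumbel_softmax(token_logits, tau=temperature, hard=True)

    # 3) Map the (soft) tokens to the text embeddings the frozen decoder consumes.
    text_embeds = soft_tokens @ token_embedding_table      # [B, num_tokens, embed_dim]

    # 4) Standard denoising objective: the frozen decoder must reconstruct the
    #    noised input conditioned on the encoder's text, which pushes the text
    #    to describe the image precisely and comprehensively.
    noise = torch.randn_like(image)
    t = torch.randint(0, frozen_diffusion_decoder.num_timesteps,
                      (image.shape[0],), device=image.device)
    noisy_image = frozen_diffusion_decoder.add_noise(image, noise, t)
    pred_noise = frozen_diffusion_decoder(noisy_image, t, text_embeds)

    loss = F.mse_loss(pred_noise, noise)
    optimizer.zero_grad()
    loss.backward()        # decoder weights stay frozen; only the encoder is updated
    optimizer.step()
    return loss.item()
```

The key design choice is that the decoder stays fixed: all of the information needed to reconstruct the image must pass through the text, which is what makes the resulting descriptions transferable to other text-to-image tools and usable as prompts for LLMs.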
Related papers
- TextCraftor: Your Text Encoder Can be Image Quality Controller [65.27457900325462]
Diffusion-based text-to-image generative models, e.g., Stable Diffusion, have revolutionized the field of content generation.
We propose a fine-tuning approach, TextCraftor, to enhance the performance of text-to-image diffusion models.
arXiv Detail & Related papers (2024-03-27T19:52:55Z)
- Seek for Incantations: Towards Accurate Text-to-Image Diffusion Synthesis through Prompt Engineering [118.53208190209517]
We propose a framework to learn the proper textual descriptions for diffusion models through prompt learning.
Our method can effectively learn the prompts to improve the matches between the input text and the generated images.
arXiv Detail & Related papers (2024-01-12T03:46:29Z)
- UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models [25.219960711604728]
This paper proposes a novel approach for text image generation, utilizing a pre-trained diffusion model.
Our approach involves designing and training a lightweight character-level text encoder that replaces the original CLIP encoder.
By employing an inference stage refinement process, we achieve a notably high sequence accuracy when synthesizing text in arbitrarily given images.
arXiv Detail & Related papers (2023-12-08T07:47:46Z)
- SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models [56.88192537044364]
We propose a simple-yet-effective parameter-efficient fine-tuning approach called the Semantic Understanding and Reasoning adapter (SUR-adapter) for pre-trained diffusion models.
Our approach makes text-to-image diffusion models easier to use and improves the user experience.
arXiv Detail & Related papers (2023-05-09T05:48:38Z)
- Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation [63.061871048769596]
We present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint texts and images containing specified subjects as input sequences.
To be more specific, both input texts and images are encoded into one unified multi-modal latent space.
Our method is able to generate high-quality images with complex semantics drawn from both the input texts and images.
arXiv Detail & Related papers (2023-03-16T13:50:20Z)
- LDEdit: Towards Generalized Text Guided Image Manipulation via Latent Diffusion Models [12.06277444740134]
Generic image manipulation using a single model with flexible text inputs is highly desirable.
Recent work addresses this task by guiding generative models trained on generic images with pretrained vision-language encoders.
We propose an optimization-free method for the task of generic image manipulation from text prompts.
arXiv Detail & Related papers (2022-10-05T13:26:15Z)
- Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model.
During the training phase, the modality transition network is optimised by the proposed modality loss.
Experiments conducted on the MS-COCO dataset demonstrate the effectiveness of the proposed framework.
arXiv Detail & Related papers (2021-02-23T07:20:12Z)
- VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs [103.99315770490163]
We present a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio.
Experiments demonstrate that our approach based on a single architecture outperforms the state-of-the-art on three video-based text-generation tasks.
arXiv Detail & Related papers (2021-01-28T15:22:36Z)