UniCMs: A Unified Consistency Model For Efficient Multimodal Generation and Understanding
- URL: http://arxiv.org/abs/2502.05415v2
- Date: Sun, 18 May 2025 14:59:21 GMT
- Title: UniCMs: A Unified Consistency Model For Efficient Multimodal Generation and Understanding
- Authors: Chenkai Xu, Xu Wang, Zhenyi Liao, Yishun Li, Tianqi Hou, Zhijie Deng,
- Abstract summary: Consistency models (CMs) have shown promise in the efficient generation of both image and text. The key challenge is establishing a unified denoising perspective for both image and text generation. In text-to-image generation, UniCMs outperform SD3 on the GenEval, Image Reward, and CLIP Score metrics. In image-to-text generation, UniCMs surpass Show-o on the MMMU benchmark while being $1.5\times$ faster in long-sequence generation.
- Score: 12.34529497235534
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Consistency models (CMs) have shown promise in the efficient generation of both image and text. This raises the natural question of whether we can learn a unified CM for efficient multimodal generation (e.g., text-to-image) and understanding (e.g., image-to-text). Intuitively, such a model could be acquired by applying consistency distillation (CD) to existing unified multimodal models. However, the key challenge is establishing a unified denoising perspective for both image and text generation, which is essential for defining the consistency mapping. To tackle this, at the representation level, we advocate for discrete tokens for both modalities to best preserve language modeling capabilities. Critically, instead of defining the text denoising trajectory via recent discrete diffusion language modeling principles, we specify it using the parallel decoding trace of an autoregressive language model, benefiting from the latter's superior performance in general text generation tasks. The denoising trajectory of image tokens adheres to standard discrete diffusion. We train our unified consistency models (UniCMs) on these combined multimodal trajectories simultaneously with a unified objective. We introduce a trajectory segmentation strategy to further improve training convergence. Empirically, in text-to-image generation, UniCMs outperform SD3 on the GenEval, Image Reward, and CLIP Score metrics, while requiring only approximately $1/8$ of the sampling time. Meanwhile, in image-to-text generation, UniCMs surpass Show-o on the MMMU benchmark while being $1.5\times$ faster in long-sequence generation. The code is available at https://github.com/zhijie-group/UniCMs.
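The abstract's text-side recipe is concrete enough to sketch. Below is a minimal, hypothetical PyTorch illustration, assuming a toy `TinyLM` stand-in model (all names, shapes, and the greedy update rule are our assumptions, not the authors' code): the text denoising trajectory is collected as the parallel-decoding (Jacobi-style) trace of an autoregressive teacher, and a consistency objective trains a student to jump from any intermediate state on that trace to its endpoint. The image-side discrete-diffusion trajectory and the trajectory segmentation strategy are omitted for brevity.

```python
# Hypothetical sketch of UniCMs-style text-side consistency distillation.
# TinyLM, all names, and all shapes are our assumptions, not the paper's code.
import torch
import torch.nn.functional as F

class TinyLM(torch.nn.Module):
    """Toy stand-in for an autoregressive LM (no attention: illustration only)."""
    def __init__(self, vocab_size=256, dim=64):
        super().__init__()
        self.vocab_size = vocab_size
        self.emb = torch.nn.Embedding(vocab_size, dim)
        self.head = torch.nn.Linear(dim, vocab_size)

    def forward(self, ids):                          # (B, L) -> (B, L, V)
        return self.head(self.emb(ids))

def jacobi_trajectory(teacher, prompt, num_new, max_iters=10):
    """Text 'denoising' trajectory: start from random tokens, then repeatedly
    replace every draft position with the teacher's greedy prediction given
    the current draft (parallel decoding). The fixed point matches greedy AR
    decoding, so the trace plays the role of a denoising trajectory."""
    draft = torch.randint(teacher.vocab_size, (prompt.size(0), num_new))
    traj = [draft.clone()]
    with torch.no_grad():
        for _ in range(max_iters):
            logits = teacher(torch.cat([prompt, draft], dim=1))
            # Draft position i is predicted from the logits at position i - 1.
            draft = logits[:, prompt.size(1) - 1 : -1].argmax(-1)
            traj.append(draft.clone())
            if torch.equal(traj[-1], traj[-2]):      # converged to a fixed point
                break
    return traj

def consistency_loss(student, prompt, traj):
    """From a randomly chosen intermediate state, the student must predict the
    trajectory endpoint in one shot (the consistency mapping)."""
    t = torch.randint(len(traj) - 1, (1,)).item()
    logits = student(torch.cat([prompt, traj[t]], dim=1))
    logits = logits[:, prompt.size(1) - 1 : -1]      # (B, num_new, V)
    target = traj[-1]                                # trajectory endpoint
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           target.reshape(-1))

teacher, student = TinyLM(), TinyLM()
prompt = torch.randint(256, (2, 8))
loss = consistency_loss(student, prompt, jacobi_trajectory(teacher, prompt, 16))
loss.backward()
```

A real implementation would distill from a pretrained unified multimodal model and interleave image-token trajectories under the same objective; the toy model here merely makes the trajectory collection and loss shapes explicit.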
Related papers
- Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations [33.11867433769496]
This paper presents a framework that attempts to unify visual understanding and generation within a shared semantic representation.
At its core is the Text-Aligned Tokenizer (TA-Tok), which converts images into discrete tokens using a text-aligned codebook projected from a large language model's (LLM) vocabulary.
Experiments across benchmarks show that Tar matches or surpasses existing multimodal LLM methods, achieving faster convergence and greater training efficiency.
arXiv Detail & Related papers (2025-06-23T17:59:14Z)
- Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models [92.18057318458528]
Token-Shuffle is a novel method that reduces the number of image tokens processed by the Transformer.
Our strategy requires no additional pretrained text-encoder and enables MLLMs to support extremely high-resolution image synthesis.
On the GenAI benchmark, our 2.7B model achieves a 0.77 overall score on hard prompts, outperforming the AR model LlamaGen by 0.18 and the diffusion model LDM by 0.15.
arXiv Detail & Related papers (2025-04-24T17:59:56Z)
- Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens [66.02261367232256]
Multimodal Large Language Models (MLLMs) aim to unify visual comprehension and generation.
Existing approaches rely on spatial visual tokens, where image patches are encoded and arranged according to a spatial order.
In this paper, we build a proper visual language by reconstructing diffusion timesteps to learn discrete visual tokens.
arXiv Detail & Related papers (2025-04-20T16:14:28Z)
- Unified Multimodal Discrete Diffusion [78.48930545306654]
Multimodal generative models that can understand and generate across multiple modalities are dominated by autoregressive (AR) approaches.
We explore discrete diffusion models as a unified generative formulation in the joint text and image domain.
We present the first Unified Multimodal Discrete Diffusion (UniDisc) model, which is capable of jointly understanding and generating text and images.
arXiv Detail & Related papers (2025-03-26T17:59:51Z)
- Unified Autoregressive Visual Generation and Understanding with Continuous Tokens [52.21981295470491]
We present UniFluid, a unified autoregressive framework for joint visual generation and understanding.
Our unified autoregressive architecture processes multimodal image and text inputs, generating discrete tokens for text and continuous tokens for images.
We find that although there is an inherent trade-off between the image generation and understanding tasks, a carefully tuned training recipe enables them to improve each other.
arXiv Detail & Related papers (2025-03-17T17:58:30Z)
- Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think [38.258453761376586]
We propose Dream Engine, an efficient framework designed for arbitrary text-image interleaved control in image generation models.
Our approach utilizes a two-stage training paradigm, consisting of joint text-image alignment and multimodal interleaved instruction tuning.
Our experiments demonstrate that this training method is effective, achieving a 0.69 overall score on the GenEval benchmark.
arXiv Detail & Related papers (2025-02-27T15:08:39Z)
- DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation [46.5013105017258]
Diffusion models are trained to denoise a Markovian process that gradually adds noise to the input.
We propose DART, a transformer-based model that unifies autoregressive (AR) and diffusion modeling within a non-Markovian framework.
arXiv Detail & Related papers (2024-10-10T17:41:54Z)
- Multi-modal Generation via Cross-Modal In-Context Learning [50.45304937804883]
We propose a Multi-modal Generation via Cross-Modal In-Context Learning (MGCC) method that generates novel images from complex multimodal prompt sequences.
Our MGCC demonstrates a diverse range of multimodal capabilities, such as novel image generation, multimodal dialogue, and text generation.
arXiv Detail & Related papers (2024-05-28T15:58:31Z)
- FIFO-Diffusion: Generating Infinite Videos from Text without Training [44.65468310143439]
FIFO-Diffusion is conceptually capable of generating infinitely long videos without additional training.
Our method dequeues a fully denoised frame at the head while enqueuing a new random noise frame at the tail.
The proposed method demonstrates promising results on existing text-to-video generation baselines (a toy sketch of this queue mechanism appears after this list).
arXiv Detail & Related papers (2024-05-19T07:48:41Z)
- TheaterGen: Character Management with LLM for Consistent Multi-turn Image Generation [44.740794326596664]
TheaterGen is a training-free framework that integrates large language models (LLMs) and text-to-image (T2I) models.
Within this framework, LLMs, acting as "Screenwriter", engage in multi-turn interaction, generating and managing a standardized prompt book.
With the effective management of prompt books and character images, TheaterGen significantly improves semantic and contextual consistency in synthesized images.
arXiv Detail & Related papers (2024-04-29T17:58:14Z)
- UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion [36.06457895469353]
UNIMO-G is a conditional diffusion framework that operates on multimodal prompts with interleaved textual and visual inputs.
It excels in both text-to-image generation and zero-shot subject-driven synthesis.
arXiv Detail & Related papers (2024-01-24T11:36:44Z)
- Learning from Mistakes: Iterative Prompt Relabeling for Text-to-Image Diffusion Model Training [33.51524424536508]
Iterative Prompt Relabeling (IPR) is a novel algorithm that aligns images to text through iterative image sampling and prompt relabeling with feedback.
We conduct thorough experiments on SDv2 and SDXL, testing their capability to follow instructions on spatial relations.
arXiv Detail & Related papers (2023-12-23T11:10:43Z)
- Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning [115.50132185963139]
CM3Leon is a decoder-only multi-modal language model capable of generating and infilling both text and images.
It is the first multi-modal model trained with a recipe adapted from text-only language models.
CM3Leon achieves state-of-the-art performance in text-to-image generation with 5x less training compute than comparable methods.
arXiv Detail & Related papers (2023-09-05T21:27:27Z)
- AR-Diffusion: Auto-Regressive Diffusion Model for Text Generation [138.98095392584693]
We introduce Auto-Regressive Diffusion (AR-Diffusion) to account for the inherent sequential characteristic of natural language.
AR-Diffusion ensures that the generation of tokens on the right depends on the tokens already generated on the left, achieved by assigning each position a dynamic number of denoising steps (a toy schedule appears after this list).
In a series of experiments on various text generation tasks, AR-Diffusion clearly demonstrated its superiority over existing diffusion language models.
arXiv Detail & Related papers (2023-05-16T15:10:22Z)
- Shifted Diffusion for Text-to-image Generation [65.53758187995744]
Corgi is based on our proposed shifted diffusion model, which achieves better image embedding generation from input text.
Corgi also achieves new state-of-the-art results across different datasets on downstream language-free text-to-image generation tasks.
arXiv Detail & Related papers (2022-11-24T03:25:04Z)
- eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models, each specialized for a different stage of the synthesis process.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
arXiv Detail & Related papers (2022-11-02T17:43:04Z)
- Semantic Image Synthesis via Diffusion Models [174.24523061460704]
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks.
Recent work on semantic image synthesis mainly follows the de facto GAN-based approaches.
We propose a novel framework based on DDPMs for semantic image synthesis.
arXiv Detail & Related papers (2022-06-30T18:31:51Z)
- Align before Fuse: Vision and Language Representation Learning with Momentum Distillation [52.40490994871753]
We introduce a contrastive loss to ALign image and text representations BEfore Fusing (ALBEF) them through cross-modal attention.
We propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model.
ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-07-16T00:19:22Z)
- TIME: Text and Image Mutual-Translation Adversarial Networks [55.1298552773457]
We propose Text and Image Mutual-Translation Adversarial Networks (TIME).
TIME learns a T2I generator G and an image captioning discriminator D under the Generative Adversarial Network framework.
In experiments, TIME achieves state-of-the-art (SOTA) performance on the CUB and MS-COCO datasets.
arXiv Detail & Related papers (2020-05-27T06:40:12Z)
- Non-Autoregressive Image Captioning with Counterfactuals-Critical Multi-Agent Learning [46.060954649681385]
We propose a Non-Autoregressive Image Captioning (NAIC) model with a novel training paradigm: Counterfactuals-critical Multi-Agent Learning (CMAL).
Our NAIC model achieves performance comparable to state-of-the-art autoregressive models while bringing a 13.9x decoding speedup.
arXiv Detail & Related papers (2020-05-10T15:09:44Z)
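Two mechanisms summarized in the list above lend themselves to short illustrations. First, the queue operation described in the FIFO-Diffusion entry, as a toy sketch under our own assumptions: `denoise_one_step` is a placeholder for a real video-diffusion denoiser, and all shapes are illustrative. Frames wait in a queue at staggered noise levels; each step denoises every queued frame by one level, dequeues the now-clean frame at the head, and enqueues fresh noise at the tail, so frames can be emitted indefinitely without retraining.

```python
# Toy illustration of the FIFO queue described in the FIFO-Diffusion entry.
# denoise_one_step is a placeholder, not the paper's model.
from collections import deque
import torch

def denoise_one_step(frames, levels):
    """Placeholder denoiser that nudges frames toward zero; a real model would
    remove one noise level conditioned on text and neighboring frames."""
    return frames * 0.9

def fifo_generate(num_frames, queue_len=8, shape=(3, 64, 64)):
    # Seed the queue with frames at staggered noise levels: the head is one
    # step from clean, the tail is fully noisy.
    queue = deque(torch.randn(shape) for _ in range(queue_len))
    levels = deque(range(1, queue_len + 1))
    out = []
    while len(out) < num_frames:
        denoised = denoise_one_step(torch.stack(tuple(queue)), list(levels))
        queue = deque(denoised)               # every frame moves one level down
        levels = deque(l - 1 for l in levels)
        if levels[0] == 0:                    # head frame is fully denoised:
            out.append(queue.popleft())       # dequeue it ...
            levels.popleft()
            queue.append(torch.randn(shape))  # ... and enqueue fresh noise
            levels.append(queue_len)
    return torch.stack(out)

video = fifo_generate(num_frames=24)          # (24, 3, 64, 64)
```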
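Second, the "dynamic number of denoising steps" in the AR-Diffusion entry can be pictured with a position-dependent schedule. The linear rule below is our own guess at one concrete instantiation, not the paper's exact schedule: positions farther to the right keep more remaining steps, so left tokens finish earlier and right tokens continue to condition on them.

```python
# Illustrative (assumed) position-dependent step schedule in the spirit of
# AR-Diffusion: later positions retain more denoising steps.
def remaining_steps(position: int, seq_len: int, total_steps: int) -> int:
    return total_steps * (position + 1) // seq_len

schedule = [remaining_steps(i, seq_len=8, total_steps=20) for i in range(8)]
print(schedule)  # [2, 5, 7, 10, 12, 15, 17, 20] -- leftmost token finishes first
```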