TBAC-UniImage: Unified Understanding and Generation by Ladder-Side Diffusion Tuning
- URL: http://arxiv.org/abs/2508.08098v2
- Date: Thu, 14 Aug 2025 17:38:47 GMT
- Authors: Junzhe Xu, Yuyang Yin, Xi Chen,
- Abstract summary: This paper introduces TBAC-UniImage, a novel unified model for multimodal understanding and generation. We achieve this by deeply integrating a pre-trained Diffusion Model, acting as a generative ladder, with a Multimodal Large Language Model (MLLM).
- Score: 4.055271388591777
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces TBAC-UniImage, a novel unified model for multimodal understanding and generation. We achieve this by deeply integrating a pre-trained Diffusion Model, acting as a generative ladder, with a Multimodal Large Language Model (MLLM). Previous diffusion-based unified models face two primary limitations. One approach uses only the MLLM's final hidden state as the generative condition. This creates a shallow connection, as the generator is isolated from the rich, hierarchical representations within the MLLM's intermediate layers. The other approach, pretraining a unified generative architecture from scratch, is computationally expensive and prohibitive for many researchers. To overcome these issues, our work explores a new paradigm. Instead of relying on a single output, we use representations from multiple, diverse layers of the MLLM as generative conditions for the diffusion model. This method treats the pre-trained generator as a ladder, receiving guidance from various depths of the MLLM's understanding process. Consequently, TBAC-UniImage achieves a much deeper and more fine-grained unification of understanding and generation.
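The contrast the abstract draws, conditioning on the final hidden state only versus conditioning on states tapped from several depths, can be illustrated with a toy sketch. Everything here is a hypothetical stand-in rather than the authors' implementation: `make_toy_layer` replaces real MLLM layers with a simple shift, and `ladder_assignment` is an assumed even mapping from generator blocks to tapped layers, which the abstract does not specify.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_toy_layer(shift: float) -> Layer:
    """Hypothetical stand-in for one MLLM layer: adds `shift` to every element."""
    def layer(h: Vector) -> Vector:
        return [v + shift for v in h]
    return layer


def collect_hidden_states(
    x: Vector, layers: Sequence[Layer], tap_indices: Sequence[int]
) -> List[Vector]:
    """Run `x` through `layers` in order and return the states after the tapped layers."""
    taps = set(tap_indices)
    collected: List[Vector] = []
    h = list(x)
    for index, layer in enumerate(layers):
        h = layer(h)
        if index in taps:
            collected.append(list(h))
    return collected


def ladder_assignment(num_taps: int, num_blocks: int) -> List[int]:
    """Assumed mapping: generator block b reads tap (b * num_taps) // num_blocks."""
    return [(b * num_taps) // num_blocks for b in range(num_blocks)]


# Four toy layers; layer k adds k + 1 to every element.
layers = [make_toy_layer(float(k + 1)) for k in range(4)]
x = [0.0, 0.0]

# "Shallow" connection: only the last layer's state conditions the generator.
final_only = collect_hidden_states(x, layers, [3])

# "Ladder" connection: states from several depths condition the generator.
multi_layer = collect_hidden_states(x, layers, [1, 3])

# Spread the tapped states over four hypothetical generator blocks.
block_to_tap = ladder_assignment(len(multi_layer), 4)
conditions_per_block = [multi_layer[t] for t in block_to_tap]
```

In this sketch the early generator blocks receive the shallower state and the later blocks the deeper one. That ordering is only one plausible design choice; the abstract says only that guidance comes from various depths.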
Related papers
- PlaM: Training-Free Plateau-Guided Model Merging for Better Visual Grounding in MLLMs [59.78917775399492]
Multimodal instruction fine-tuning paradoxically degrades the model's text reasoning capability. We propose a training-free framework to mitigate this degradation.
arXiv Detail & Related papers (2026-01-12T15:27:51Z)
- ThinkGen: Generalized Thinking for Visual Generation [97.19923474851987]
ThinkGen is a think-driven visual generation framework that explicitly leverages Chain-of-Thought (CoT) reasoning in various generation scenarios. We propose a separable GRPO-based training paradigm, alternating reinforcement learning between the MLLM and DiT modules. Experiments demonstrate that ThinkGen achieves robust, state-of-the-art performance across multiple generation benchmarks.
arXiv Detail & Related papers (2025-12-29T16:08:50Z)
- Think Then Embed: Generative Context Improves Multimodal Embedding [51.76690812535934]
We propose a Think-Then-Embed (TTE) framework for Universal Multimodal Embeddings (UME), composed of a reasoner and an embedder. By leveraging a powerful MLLM reasoner, we achieve state-of-the-art performance on the MMEB-V2 benchmark, surpassing proprietary models trained on massive in-house datasets.
arXiv Detail & Related papers (2025-10-06T16:53:56Z)
- Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents [55.82787697101274]
Bifrost-1 is a unified framework that bridges pretrained multimodal LLMs (MLLMs) and diffusion models. By seamlessly integrating pretrained MLLMs and diffusion models with patch-level CLIP latents, our framework enables high-fidelity controllable image generation. Our experiments demonstrate that Bifrost-1 achieves comparable or better performance than previous methods in terms of visual fidelity and multimodal understanding.
arXiv Detail & Related papers (2025-08-08T02:38:47Z)
- Discrete Diffusion in Large Language and Multimodal Models: A Survey [56.31088116526825]
We provide a systematic survey of Discrete Diffusion Language Models (dLLMs) and Discrete Diffusion Multimodal Language Models (dMLLMs). Unlike autoregressive (AR) models, dLLMs and dMLLMs adopt a multi-token, parallel decoding paradigm. We trace the historical development of dLLMs and dMLLMs, formalize the underlying mathematical frameworks, and categorize representative models.
arXiv Detail & Related papers (2025-06-16T17:59:08Z)
- Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens [66.02261367232256]
Multimodal Large Language Models (MLLMs) aim to unify visual comprehension and generation. Existing approaches rely on spatial visual tokens, where image patches are encoded and arranged according to a spatial order. In this paper, we build a proper visual language by reconstructing diffusion timesteps to learn discrete visual tokens.
arXiv Detail & Related papers (2025-04-20T16:14:28Z)
- Unifying Autoregressive and Diffusion-Based Sequence Generation [3.1853022872760186]
We present significant extensions to diffusion-based sequence generation models, blurring the line with autoregressive language models. First, we introduce hyperschedules, which assign distinct noise schedules to individual token positions. Second, we propose two hybrid token-wise noising processes that interpolate between absorbing and uniform processes, enabling the model to fix past mistakes.
arXiv Detail & Related papers (2025-04-08T20:32:10Z)
- ARMOR: Empowering Multimodal Understanding Model with Interleaved Multimodal Generation Capability [14.703591553247948]
ARMOR is a resource-efficient and pure autoregressive framework for multimodal large language models. It achieves both understanding and generation by fine-tuning existing MLLMs. We show that ARMOR upgrades existing MLLMs to UniMs with promising image generation capabilities, using limited training resources.
arXiv Detail & Related papers (2025-03-09T10:15:39Z)
- Multimodal Latent Language Modeling with Next-Token Diffusion [111.93906046452125]
Multimodal generative models require a unified approach to handle both discrete data (e.g., text and code) and continuous data (e.g., image, audio, video). We propose Latent Language Modeling (LatentLM), which seamlessly integrates continuous and discrete data using causal Transformers.
arXiv Detail & Related papers (2024-12-11T18:57:32Z)
- Streamlining Redundant Layers to Compress Large Language Models [21.27944103424621]
This paper introduces LLM-Streamline, a pioneering work on layer pruning for large language models (LLMs). It is based on the observation that different layers have varying impacts on hidden states, enabling the identification of less important layers to be pruned. Experiments show that LLM-Streamline outperforms both previous and concurrent state-of-the-art pruning methods in terms of both performance and training efficiency.
arXiv Detail & Related papers (2024-03-28T04:12:13Z)
- Learning Joint Latent Space EBM Prior Model for Multi-layer Generator [44.4434704520236]
We study the fundamental problem of learning multi-layer generator models.
We propose an energy-based model (EBM) on the joint latent space over all layers of latent variables.
Our experiments demonstrate that the learned model can be expressive in generating high-quality images.
arXiv Detail & Related papers (2023-06-10T00:27:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.