FuguReport

Representation Forcing for Bottleneck-Free Unified Multimodal Models

Authors Yuqing Wang, Zhijie Lin, Ceyuan Yang, Yang Zhao, Fei Xiao, Hao He, Qi Zhao, Zihan Ding, Fuyun Wang, Shuai Wang, Youliang Zhang, Haoqi Fan, Xihui Liu
Affiliations The Chinese University of Hong Kong / Tsinghua University / Nanjing University / The University of Hong Kong / ByteDance
Categories Method / Representation Learning / Native representation prediction, Application / Multimodal Models / Unified perception and generation, Evaluation / Model Bottleneck Analysis / Structural bottleneck in multimodal models
License CC BY 4.0

Abstract Overview

This paper studies unified multimodal models that perform both visual understanding and image generation, and argues that existing systems retain a structural bottleneck because image generation depends on a frozen, separately pretrained VAE. The authors propose Representation Forcing (RF), which trains the decoder to autoregressively predict discrete visual representation tokens derived from the model’s own understanding encoder before generating pixels. These predicted representation tokens remain in the sequence and guide pixel-space diffusion within the same transformer backbone, removing the need for an external generative latent space. Experiments compare pixel-space and VAE-based variants under matched architecture, data, and training settings. They show that RF makes pixel-space text-to-image generation competitive with VAE-based baselines and also improves image understanding.

Novelty

The distinctive idea is to make representation prediction a native decoder capability in a unified multimodal model, using discretized features from the jointly trained understanding encoder as intermediate generation targets. This replaces externally pretrained VAE latents with in-context representation tokens that connect perception and generation inside a single end-to-end framework.
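
To make "discretized features" concrete, the following is a minimal sketch of one common way to turn continuous encoder features into discrete token ids: nearest-neighbor lookup in a codebook. The summary does not say how the paper performs discretization, so the quantization scheme, the toy 2-dimensional features, and the names `quantize`, `codebook`, and `patch_features` are all assumptions made for illustration.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: List[Sequence[float]],
             codebook: List[Sequence[float]]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry.

    Assumption: nearest-neighbor vector quantization stands in for whatever
    discretization the paper actually uses.
    """
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]


# Hypothetical toy data: three codebook entries and three patch features.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
patch_features = [(0.9, 0.1), (0.1, 0.8), (0.05, 0.05)]

rep_tokens = quantize(patch_features, codebook)
print(rep_tokens)  # [1, 2, 0]
```

In this toy setup, the resulting ids play the role of the intermediate generation targets that the decoder learns to predict.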

Results

On text-to-image generation, the pixel-space RF model matches strong VAE-based unified baselines, reaching 0.84 GenEval and 84.15 DPG-Bench without an LLM rewriter, and 0.88 GenEval with a rewriter. On understanding, RF improves both pixel-space and VAE-based settings, with the pixel-space RF variant outperforming the VAE+RF variant on 6 of 8 reported benchmarks. Ablations show that RF is critical for pixel-space generation: it raises GenEval from 0.25 to 0.76. Discrete representation prediction also outperforms both continuous regression and auxiliary alignment.

Key Points

  1. Representation Forcing inserts autoregressively predicted visual representation tokens between text and pixels, so pixel generation is guided by the model’s own understanding features rather than an external VAE latent space.
  2. Under controlled comparisons, the pixel-space RF model achieves generation quality comparable to state-of-the-art VAE-based unified models while operating directly in pixel space at up to 1024×1024 resolution.
  3. RF also improves multimodal understanding, especially on benchmarks tied to high-level visual semantics, and ablations indicate that discrete representation tokens are much more effective than continuous regression or auxiliary feature alignment.
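
The two-stage flow described in the first key point can be sketched as a short generation loop. This is only an illustration of the ordering (text, then predicted representation tokens, then pixels), not the authors' implementation. The function `rf_generate`, the stub callables `next_rep_token` and `denoise`, and the four-value pixel placeholder are hypothetical, and a real system would run both stages through one transformer backbone.

```python
from typing import Callable, List, Tuple

Token = Tuple[str, int]  # (segment kind, token id)


def rf_generate(
    text_ids: List[int],
    next_rep_token: Callable[[List[Token]], int],
    denoise: Callable[[List[Token], List[float], int], List[float]],
    num_rep_tokens: int,
    num_steps: int,
) -> Tuple[List[Token], List[float]]:
    """Hypothetical sketch: predict representation tokens, then denoise pixels."""
    # Stage 1: autoregressively append discrete representation tokens.
    context: List[Token] = [("text", t) for t in text_ids]
    for _ in range(num_rep_tokens):
        context.append(("rep", next_rep_token(context)))

    # Stage 2: pixel-space diffusion conditioned on the full context.
    # The representation tokens stay in the sequence; no VAE latent is used.
    pixels = [0.0] * 4  # placeholder standing in for noise-initialized pixels
    for step in range(num_steps):
        pixels = denoise(context, pixels, step)
    return context, pixels


# Stub models, for illustration only.
def toy_next_rep(context: List[Token]) -> int:
    return len(context) % 8


def toy_denoise(context: List[Token], pixels: List[float], step: int) -> List[float]:
    return [p + 1.0 for p in pixels]


context, pixels = rf_generate([5, 9], toy_next_rep, toy_denoise,
                              num_rep_tokens=3, num_steps=2)
print([kind for kind, _ in context])  # ['text', 'text', 'rep', 'rep', 'rep']
```

The point of the sketch is the data flow: the denoiser receives the same context that holds the predicted representation tokens, which is what lets them guide pixel generation in place of an external latent space.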
