Representation Forcing for Bottleneck-Free Unified Multimodal Models
Abstract Overview
This paper studies unified multimodal models that perform both visual understanding and image generation, and argues that existing systems retain a structural bottleneck because image generation depends on a frozen, separately pretrained VAE. The authors propose Representation Forcing (RF), which trains the decoder to autoregressively predict discrete visual representation tokens derived from the model’s own understanding encoder before generating pixels. These predicted representation tokens remain in the sequence and guide pixel-space diffusion within the same transformer backbone, removing the need for an external generative latent space. Experiments compare pixel-space and VAE-based variants under matched architecture, data, and training settings, and measure the effect of RF on both text-to-image generation and image understanding.
Novelty
The distinctive idea is to make representation prediction a native decoder capability in a unified multimodal model, using discretized features from the jointly trained understanding encoder as intermediate generation targets. This replaces externally pretrained VAE latents with in-context representation tokens that connect perception and generation inside a single end-to-end framework.
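To make "discretized features" concrete, here is a minimal sketch of one common way to turn continuous encoder features into discrete tokens: nearest-neighbor lookup in a codebook. The summary does not say how the paper discretizes its features, so the quantizer, the `quantize` and `squared_distance` helpers, the 2-dimensional features, and the 4-entry codebook are all illustrative assumptions rather than the authors' method.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry.

    Assumption: nearest-neighbor quantization stands in for whatever
    discretization the paper actually uses.
    """
    tokens = []
    for feat in features:
        distances = [squared_distance(feat, code) for code in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


# Hypothetical toy codebook and per-patch encoder features.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
patch_features = [[0.1, -0.2], [0.9, 0.8], [0.2, 1.1]]

rep_tokens = quantize(patch_features, codebook)
print(rep_tokens)  # -> [0, 3, 2]
```

Under this assumption, the resulting integer indices are what the decoder would be trained to predict with an ordinary next-token objective, which is what lets representation prediction share the machinery already used for text.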
Results
On text-to-image generation, the pixel-space RF model matches strong VAE-based unified baselines, reaching 0.84 GenEval and 84.15 DPG-Bench without an LLM rewriter, and 0.88 GenEval with a rewriter. On understanding, RF improves both pixel-space and VAE-based settings, with the pixel-space RF variant outperforming the VAE+RF variant on 6 of 8 reported benchmarks. Ablations also show that RF is critical for pixel-space generation, raising GenEval from 0.25 to 0.76, and that discrete representation prediction outperforms both continuous regression and auxiliary feature alignment.
Key Points
- Representation Forcing inserts autoregressively predicted visual representation tokens between text and pixels, so pixel generation is guided by the model’s own understanding features rather than an external VAE latent space.
- Under controlled comparisons, the pixel-space RF model achieves generation quality comparable to state-of-the-art VAE-based unified models while operating directly in pixel space at up to 1024×1024 resolution.
- RF also improves multimodal understanding, especially on benchmarks tied to high-level visual semantics, and ablations indicate that discrete representation tokens are much more effective than continuous regression or auxiliary feature alignment.
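
The generation order described in the points above (text, then representation tokens, then pixels) can be sketched as a toy pipeline. This is only an illustration of the sequence layout: the function names, the vocabulary size of 16, the scalar "patches", and the arithmetic standing in for the transformer and the diffusion sampler are all hypothetical placeholders, not the paper's model.

```python
import random


def predict_representation_tokens(prefix, num_tokens, vocab_size, rng):
    """Stand-in for autoregressive decoding of representation tokens.

    Each new token depends on the text prefix plus the tokens predicted
    so far; the arithmetic is a placeholder for sampling from a decoder.
    """
    tokens = []
    for _ in range(num_tokens):
        context = prefix + tokens
        next_token = (sum(context) + rng.randrange(vocab_size)) % vocab_size
        tokens.append(next_token)
    return tokens


def denoise_pixels(context, num_patches, num_steps, rng):
    """Stand-in for pixel-space diffusion conditioned on the full prefix.

    Starts from Gaussian noise and moves each scalar "patch" toward a
    target derived from the context (a toy conditioning signal).
    """
    patches = [rng.gauss(0.0, 1.0) for _ in range(num_patches)]
    target = (sum(context) % 7) / 7.0
    for step in range(num_steps):
        remaining = num_steps - step
        patches = [p + (target - p) / remaining for p in patches]
    return patches


def generate(text_tokens, num_rep_tokens=4, num_patches=6, seed=0):
    rng = random.Random(seed)
    rep_tokens = predict_representation_tokens(text_tokens, num_rep_tokens, 16, rng)
    # The predicted representation tokens stay in the sequence and
    # condition the pixel stage; no external VAE latent is involved.
    context = text_tokens + rep_tokens
    pixels = denoise_pixels(context, num_patches, num_steps=5, rng=rng)
    layout = ["text"] * len(text_tokens) + ["rep"] * len(rep_tokens) + ["pixel"] * len(pixels)
    return rep_tokens, pixels, layout


rep_tokens, pixels, layout = generate([3, 1, 4])
print(layout)
```

The point of the sketch is the ordering and the shared context: the pixel stage sees both the text and the predicted representation tokens, which is the role the summary attributes to RF.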