FuguReport

Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

Authors Zhiheng Liu, Weiming Ren, Xiaoke Huang, Shoufa Chen, Tianhong Li, Mengzhao Chen, Yatai Ji, Sen He, Jonas Schult, Belinda Zeng, Tao Xiang, Wenhu Chen, Ping Luo, Luke Zettlemoyer, Yuren Cong
Affiliations Meta / The University of Hong Kong / University of Waterloo
Categories Application / Multimodal Understanding / Visual understanding and generation, Method / Vision Models / Pixel embedding-based modeling, Evaluation / Multimodal Model Evaluation / Benchmark performance analysis
License CC BY 4.0

Abstract Overview

Tuna-2 is a native unified multimodal model that performs both visual understanding and visual generation directly in pixel space, without relying on pretrained vision encoders such as VAEs or representation encoders. Instead, it uses simple patch embedding layers and a single transformer decoder to jointly process image and text tokens. The paper also introduces a masking-based visual feature learning scheme to stabilize end-to-end training in high-dimensional pixel space and encourage more robust representations. The model is evaluated on multimodal understanding, text-to-image generation, image editing, and image reconstruction benchmarks, with results showing that the encoder-free pixel-space approach achieves state-of-the-art performance among 7B-scale native unified multimodal models on understanding tasks while remaining competitive on generation tasks.

Novelty

The main novelty is an encoder-free unified multimodal architecture that removes both the VAE and the representation encoder, replacing them with direct pixel patch embeddings inside a single transformer decoder. The work also contributes a masking-based training strategy tailored to pixel-space unified multimodal learning and provides a controlled comparison against an encoder-based pixel-space variant (Tuna-R), revealing that the encoder-free design surpasses the encoder-based variant on understanding after sufficient pretraining while converging more slowly in early training.
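
To make the idea of "direct pixel patch embeddings" concrete, the following is a minimal sketch of how an image can be cut into non-overlapping patches and linearly projected into token vectors. It is an illustration only, not the paper's implementation: the function names `patchify` and `embed`, the toy sizes (8x8 image, patch size 4, embedding width 16), and the random initialization are all assumptions, and the summary does not state the actual patch size or embedding width used by Tuna-2.

```python
import random


def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patch vectors."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            vec = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    vec.extend(image[y][x])  # append the C channel values
            patches.append(vec)
    return patches


def embed(patches, weight, bias):
    """Linear projection: one token of len(bias) dimensions per patch."""
    return [
        [sum(p * w for p, w in zip(vec, row)) + b for row, b in zip(weight, bias)]
        for vec in patches
    ]


# Hypothetical toy sizes, chosen only to keep the example small.
rng = random.Random(0)
H, W, C, P, D = 8, 8, 3, 4, 16
image = [[[rng.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
weight = [[rng.gauss(0.0, 0.02) for _ in range(P * P * C)] for _ in range(D)]
bias = [0.0] * D

patches = patchify(image, P)           # (H/P) * (W/P) patches of P*P*C values
tokens = embed(patches, weight, bias)  # one D-dimensional token per patch
```

In a design of this kind, the resulting tokens would be fed to the transformer decoder alongside text tokens, with no pretrained VAE or representation encoder in between.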

Results

On multimodal understanding benchmarks, Tuna-2 achieves state-of-the-art results among 7B-scale native unified models, outperforming both Tuna-R and prior latent-space models, with particular gains on fine-grained, pixel-centric tasks such as OCRBench, CountBench, and VisuLogic. On generation benchmarks (GenEval overall 0.87, DPG-Bench overall 86.54), Tuna-2 remains competitive with state-of-the-art unified models, although Tuna-R scores slightly higher. In LLM-judge evaluations, Tuna-2 is notably preferred for diversity (48.4% by GPT-5.4, 41.9% by Claude Opus 4.7) while maintaining competitive quality. In image reconstruction, Tuna-2 ranks first among unified tokenizers (rFID 0.15, PSNR 32.80, SSIM 0.93) and approaches specialized tokenizers.

Key Points

  1. Tuna-2 removes modular vision encoders entirely (both the VAE and the representation encoder) and performs multimodal understanding and generation directly from raw pixels using patch embeddings and a unified transformer decoder, achieving state-of-the-art results among 7B-scale native unified multimodal models on understanding benchmarks.
  2. A masking-based visual feature learning scheme, applied during the final 40% of pretraining, improves understanding and generation performance for both the encoder-free (Tuna-2) and encoder-based (Tuna-R) variants, with Tuna-2 benefiting more substantially.
  3. Controlled comparisons between Tuna-2 and Tuna-R reveal that the encoder-based variant converges faster in early pretraining due to pretrained semantic priors, but the encoder-free Tuna-2 eventually surpasses it on understanding tasks at scale, while generation performance remains competitive between the two variants.
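
The masking schedule in point 2 can be sketched as follows. The only detail taken from the summary is that masking is applied during the final 40% of pretraining; everything else is an assumption made for illustration, including the helper names `masking_active` and `mask_tokens`, the 25% mask ratio, and the choice to replace masked patch tokens with a placeholder mask token. The paper's actual masking mechanism may differ.

```python
import random


def masking_active(step, total_steps, final_fraction=0.4):
    """Return True once training enters the final `final_fraction` of steps."""
    return step >= round(total_steps * (1.0 - final_fraction))


def mask_tokens(tokens, mask_token, ratio, rng):
    """Replace a random `ratio` of tokens with `mask_token`.

    Returns the masked sequence and the sorted indices that were masked.
    The replacement strategy and ratio are assumptions, not from the paper.
    """
    count = int(len(tokens) * ratio)
    masked = set(rng.sample(range(len(tokens)), count))
    out = [mask_token if i in masked else tok for i, tok in enumerate(tokens)]
    return out, sorted(masked)


rng = random.Random(0)
tokens = [f"patch_{i}" for i in range(8)]  # stand-ins for patch embeddings
total_steps = 1000                          # hypothetical training length

early, _ = (mask_tokens(tokens, "<mask>", 0.25, rng)
            if masking_active(100, total_steps) else (tokens, []))
late, late_idx = (mask_tokens(tokens, "<mask>", 0.25, rng)
                  if masking_active(800, total_steps) else (tokens, []))
```

With these toy settings, the sequence at step 100 is left untouched, while the sequence at step 800 has two of its eight tokens replaced by the placeholder.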

This page was created using generative AI such as GPT-5, Claude Opus 4, Gemini 3, Gemini 3.1 Flash Image, and their higher-end successor versions. No guarantee can be made regarding its contents.