Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation
Abstract Overview
Tuna-2 is a native unified multimodal model that performs both visual understanding and visual generation directly in pixel space, without relying on pretrained vision encoders such as VAEs or representation encoders. Instead, it uses simple patch embedding layers and a single transformer decoder to jointly process image and text tokens. The paper also introduces a masking-based visual feature learning scheme to stabilize end-to-end training in high-dimensional pixel space and encourage more robust representations. The model is evaluated on multimodal understanding, text-to-image generation, image editing, and image reconstruction benchmarks. The results show that the encoder-free pixel-space approach achieves state-of-the-art understanding performance among 7B-scale native unified multimodal models while remaining competitive on generation tasks.
Novelty
The main novelty is an encoder-free unified multimodal architecture that removes both the VAE and the representation encoder, replacing them with direct pixel patch embeddings inside a single transformer decoder. The work also contributes a masking-based training strategy tailored to pixel-space unified multimodal learning and provides a controlled comparison against an encoder-based pixel-space variant (Tuna-R), revealing that the encoder-free design surpasses the encoder-based variant on understanding after sufficient pretraining while converging more slowly in early training.
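To make the "direct pixel patch embeddings" idea concrete, here is a minimal sketch of how an image can be cut into non-overlapping patches and linearly projected into token vectors. It is an illustration only, not the paper's implementation: the function names (`patchify`, `make_projection`, `embed_patches`), the patch size, the embedding width, and the random weights are all assumptions, and a real model would learn the projection and feed the resulting tokens to the transformer decoder alongside text tokens.

```python
import random

def patchify(image, patch):
    """Split an H x W x C nested-list image into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    flat.extend(image[y][x])  # append the C channel values of one pixel
            patches.append(flat)
    return patches

def make_projection(in_dim, out_dim, seed=0):
    """Hypothetical stand-in for a learned linear layer: random in_dim x out_dim weights."""
    rng = random.Random(seed)
    scale = in_dim ** -0.5
    return [[rng.uniform(-scale, scale) for _ in range(out_dim)] for _ in range(in_dim)]

def embed_patches(patches, weight):
    """Project each flattened patch to an embedding vector (plain matrix multiply)."""
    out_dim = len(weight[0])
    return [
        [sum(p[i] * weight[i][j] for i in range(len(p))) for j in range(out_dim)]
        for p in patches
    ]

# Toy 4x4 RGB image whose pixel value encodes its position (assumed data).
image = [[[float(y * 4 + x)] * 3 for x in range(4)] for y in range(4)]
patches = patchify(image, patch=2)                      # 4 patches of 2*2*3 = 12 values
weight = make_projection(in_dim=2 * 2 * 3, out_dim=8)   # assumed embedding width of 8
tokens = embed_patches(patches, weight)
print(len(tokens), len(tokens[0]))  # 4 8
```

The point of the sketch is that nothing pretrained sits between the pixels and the decoder: the only image-specific component is a single projection applied per patch.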
Results
On multimodal understanding benchmarks, Tuna-2 achieves state-of-the-art results among 7B-scale native unified models, outperforming both Tuna-R and prior latent-space models, with particular gains on fine-grained, pixel-centric tasks such as OCRBench, CountBench, and VisuLogic. On generation benchmarks (GenEval overall 0.87, DPG-Bench overall 86.54), Tuna-2 remains competitive with state-of-the-art unified models, although Tuna-R is slightly stronger on benchmark scores. In LLM-judge evaluations, Tuna-2 is notably preferred for diversity (48.4% by GPT-5.4, 41.9% by Claude Opus 4.7) while maintaining competitive quality. Image reconstruction results rank first among unified tokenizers (rFID 0.15, PSNR 32.80, SSIM 0.93), approaching specialized tokenizers.
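For readers who want a feel for the reconstruction numbers, the standard definition PSNR = 10 * log10(MAX^2 / MSE) can be inverted to get the mean squared error that a given PSNR implies. The sketch below does that for the reported 32.80 dB. It assumes pixel values normalized to [0, 1] and treats the reported figure as if it came from a single image; benchmark PSNR is typically averaged over many images, so the result is only a rough indication, and the helper names are hypothetical.

```python
import math

def psnr(mse, max_value=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    if mse <= 0:
        raise ValueError("mse must be positive")
    return 10.0 * math.log10(max_value ** 2 / mse)

def mse_from_psnr(psnr_db, max_value=1.0):
    """Invert the PSNR definition to recover the implied mean squared error."""
    return max_value ** 2 / (10.0 ** (psnr_db / 10.0))

# Reported PSNR from the summary; the [0, 1] pixel range is an assumption.
implied_mse = mse_from_psnr(32.80)
implied_rmse = math.sqrt(implied_mse)
print(f"implied MSE {implied_mse:.2e}, implied RMSE {implied_rmse:.4f}")
```

Under these assumptions the typical per-pixel error is a little over 2% of the full intensity range.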
Key Points
- Tuna-2 removes modular vision encoders entirely (both the VAE and the representation encoder) and performs multimodal understanding and generation directly from raw pixels using patch embeddings and a unified transformer decoder, achieving state-of-the-art results among 7B-scale native unified multimodal models on understanding benchmarks.
- A masking-based visual feature learning scheme applied during the final 40% of pretraining improves both understanding and generation performance for both the encoder-free (Tuna-2) and encoder-based (Tuna-R) variants, with Tuna-2 benefiting more substantially from this strategy.
- Controlled comparisons between Tuna-2 and Tuna-R reveal that the encoder-based variant converges faster in early pretraining due to pretrained semantic priors, but the encoder-free Tuna-2 eventually surpasses it on understanding tasks at scale, while generation performance remains competitive between the two variants.
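The masking scheme described above is active only during the final 40% of pretraining. A minimal sketch of that kind of schedule, combined with random masking of patch tokens, might look as follows. Only the 40% figure comes from the summary; the mask ratio, the zero placeholder, the uniform random selection, and the function names are assumptions made for illustration, and the paper's actual masking procedure may differ.

```python
import random

def masking_active(step, total_steps, final_fraction=0.4):
    """Return True once training has entered the final `final_fraction` of steps."""
    start = total_steps - int(round(total_steps * final_fraction))
    return step >= start

def mask_patches(tokens, mask_ratio, rng, mask_value=0.0):
    """Replace a random subset of patch tokens with a constant placeholder.

    Returns the masked copy and the sorted indices that were masked.
    The ratio and placeholder value are hypothetical choices.
    """
    count = int(len(tokens) * mask_ratio)
    masked = set(rng.sample(range(len(tokens)), count))
    out = [
        [mask_value] * len(tok) if i in masked else list(tok)
        for i, tok in enumerate(tokens)
    ]
    return out, sorted(masked)

# Toy run: 8 patch tokens of width 4, 100 training steps (assumed values).
rng = random.Random(0)
tokens = [[float(i + 1)] * 4 for i in range(8)]
for step in (0, 59, 60, 99):
    if masking_active(step, total_steps=100):
        batch, masked_idx = mask_patches(tokens, mask_ratio=0.25, rng=rng)
    else:
        batch, masked_idx = tokens, []
    print(step, len(masked_idx))
```

With 100 steps the masking switches on at step 60, so the first two printed steps report no masked tokens and the last two report two masked tokens each.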
References
- arXiv: https://arxiv.org/abs/2604.24763v1
- Fugu-MT: https://fugumt.com/fugumt/paper_check/2604.24763v1
- Hugging Face Papers: https://huggingface.co/papers/2604.24763
- Project: https://tuna-ai.org/tuna-2